Miscellaneous
Misc features
| Feature | Supported |
| --- | --- |
| Information schema | ✅ |
| Views | ✅ |
| Common table expressions (CTEs) | ✅ |
| Cursors | ✅ |
| Triggers | ✅ |
Client Compatibility
Some MySQL features are client features, not server features. Dolt ships with a client (i.e. dolt sql) and a server (dolt sql-server). The Dolt client is not as sophisticated as the mysql client. To access these features, you can use the mysql client that ships with MySQL.
| Feature | Supported | Notes |
| --- | --- | --- |
| SOURCE | ❌ | Works with Dolt via the mysql client. |
| LOAD DATA LOCAL INFILE | ❌ | LOAD DATA INFILE works with the Dolt client. The LOCAL option only works with Dolt via the mysql client. |
Join hints
Dolt supports the following join hints:
| Hint | Supported | Description |
| --- | --- | --- |
| JOIN_ORDER(<table1>,...) | ✅ | Join tree in scope should use the following join execution order. Must include all table names. |
| LOOKUP_JOIN(<table1>,<table2>) | ✅ | Use LOOKUP strategy joining two tables. |
| MERGE_JOIN(<table1>,<table2>) | ✅ | Use MERGE strategy joining two tables. |
| HASH_JOIN(<table1>,<table2>) | ✅ | Use HASH strategy joining two tables. |
| INNER_JOIN(<table1>,<table2>) | ✅ | Use INNER strategy joining two tables. |
| SEMI_JOIN(<table1>,<table2>) | ✅ | Use SEMI strategy joining two tables (for EXISTS or IN queries). |
| ANTI_JOIN(<table1>,<table2>) | ✅ | Use ANTI strategy joining two tables (for NOT EXISTS or NOT IN queries). |
| JOIN_FIXED_ORDER | ❌ | Join tree uses in-place table order for execution. |
| NO_ICP | ❌ | Disable indexed range scans on index using filters. |
Join hints are placed immediately after a SELECT token in a special comment format, /*+ */. Multiple hints should be separated by spaces.
Join hints currently require a full set of valid hints for any of them to be applied. For example, in a three-table join we can enforce JOIN_ORDER on its own, join strategies on their own, or both order and strategy.
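The hint syntax described above can be sketched as follows. The tables xy, uv, and ab and their columns are hypothetical, and whether a given hint is actually applied depends on the execution options available for the query.

```sql
-- Hypothetical tables used only for illustration.
CREATE TABLE xy (x INT PRIMARY KEY, y INT);
CREATE TABLE uv (u INT PRIMARY KEY, v INT);
CREATE TABLE ab (a INT PRIMARY KEY, b INT);

-- Join order only: JOIN_ORDER must name every table in the join.
SELECT /*+ JOIN_ORDER(xy,uv,ab) */ *
FROM xy
JOIN uv ON x = u
JOIN ab ON a = u;

-- Join strategies only: hints are separated by spaces.
SELECT /*+ LOOKUP_JOIN(xy,uv) HASH_JOIN(uv,ab) */ *
FROM xy
JOIN uv ON x = u
JOIN ab ON a = u;

-- Both order and strategy in one hint comment.
SELECT /*+ JOIN_ORDER(xy,uv,ab) LOOKUP_JOIN(xy,uv) HASH_JOIN(uv,ab) */ *
FROM xy
JOIN uv ON x = u
JOIN ab ON a = u;
```

Running EXPLAIN on each query is one way to check whether the hinted plan was chosen or the engine fell back to default costing.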
Additional notes:
If one hint is invalid given the execution options, no hints are applied and the engine falls back to default costing.
Join operator hints are order-insensitive.
Join operator hints apply as long as the indicated tables are subsets of the join's left and right sides.
Table Statistics
ANALYZE table
Dolt currently supports table statistics for index and join costing.
Statistics are auto-collected by default for servers, but can be manually collected by running ANALYZE TABLE <table, ...>.
Here is an example of how to initialize and observe statistics:
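A minimal sketch, assuming a hypothetical table named stats_demo. The columns returned by dolt_statistics are not shown here because they may differ between Dolt versions.

```sql
-- Hypothetical table with a primary key and one secondary index.
CREATE TABLE stats_demo (
  id INT PRIMARY KEY,
  name VARCHAR(10),
  KEY (name)
);

INSERT INTO stats_demo VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Manually collect statistics for the table.
ANALYZE TABLE stats_demo;

-- Inspect the statistics stored for the current database.
SELECT * FROM dolt_statistics;
```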
Statistics are persisted in the database's chunk store in a refs/stats ref stored separately from the commit graph. Each database has its own statistics store. The contents of refs/stats reflect a single point in time for a single branch and are un-versioned. The contents of this ref in the current database can be inspected with the dolt_statistics system table.
Auto-Refresh
Static statistics become stale quickly for tables that change frequently. Users can choose to run ANALYZE statements manually, or use some form of auto-refresh.
Auto-refresh statistic updates work the same way as partial ANALYZE updates. A table's "former" and "new" chunk sets will 1) share common chunks that already exist in "former", 2) differ by deleted chunks found only in "former", and 3) differ by new chunks found only in "new". This mirrors Dolt's inherent structural sharing. Rather than forcing an update on every refresh interval, we can configure how many changes trigger an update.
When the auto-refresh threshold is 0%, the auto-refresh thread behaves like a cron job that runs ANALYZE periodically.
Setting a non-zero threshold defers updates until after a certain fraction of chunks are edited. For example, a 100% difference threshold updates stats when:
The table was previously empty and now contains data.
The table grew or shrank such that the tree height grew or shrank, and therefore the target fanout level changed.
Inserts doubled the number of chunks (as many new chunks were added as previously existed).
Deletes removed 100% of the preexisting chunks.
50% of the chunks were edited (an in-place edit deletes one chunk and adds one chunk, for a total of two changes relative to the original chunk).
Any combination of edits/inserts/deletes that exceeds the trigger threshold will also update stats.
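The cases above can be checked with plain arithmetic. This is only a sketch of the ratio (new + deleted) / previous chunks for a hypothetical table that had 100 chunks before the change, compared against a threshold of 1.0 (100%). It does not query Dolt's internal chunk counts, and the use of >= rather than > is an assumption about how the comparison is made.

```sql
-- Inserts add 100 new chunks: (100 + 0) / 100 = 1, threshold met.
SELECT (100 + 0) / 100 >= 1.0 AS refresh_triggered;

-- Deletes remove all 100 preexisting chunks: (0 + 100) / 100 = 1, threshold met.
SELECT (0 + 100) / 100 >= 1.0 AS refresh_triggered;

-- In-place edits to 50 chunks: 50 new + 50 deleted, (50 + 50) / 100 = 1, threshold met.
SELECT (50 + 50) / 100 >= 1.0 AS refresh_triggered;

-- In-place edits to 10 chunks: (10 + 10) / 100 = 0.2, below the threshold.
SELECT (10 + 10) / 100 >= 1.0 AS refresh_triggered;
```

The first three queries evaluate to 1 (true) and the last to 0 (false).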
We enable refresh with one mandatory and two optional system variables:
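As a sketch, the three settings might be persisted like this before starting the server. The variable name dolt_stats_auto_refresh_enabled and the use of @@PERSIST are assumptions not stated in this text; the interval and threshold variable names are the ones used in this section, and the values are illustrative.

```sql
-- Assumed name for the mandatory switch that enables auto-refresh.
SET @@PERSIST.dolt_stats_auto_refresh_enabled = 1;

-- Optional: check stats freshness every 600 seconds (10 minutes).
SET @@PERSIST.dolt_stats_auto_refresh_interval = 600;

-- Optional: refresh once changed chunks reach 50% of the previous chunk count.
SET @@PERSIST.dolt_stats_auto_refresh_threshold = 0.5;
```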
The first enables auto-refresh. It is a global variable that must be set during dolt sql-server startup and affects all databases in a server context. Databases added to or dropped from a running server are automatically included in or removed from statistics refresh if it is enabled.
The other two variables configure 1) how often a timer wakes up to check stats freshness (in seconds), and 2) the threshold for updating a table's active statistics ((new + deleted) / previous chunks, expressed as a fraction between 0 and 1). For example, dolt_stats_auto_refresh_interval = 600 means the server only attempts to update stats every 10 minutes, regardless of how much a table has changed. Setting dolt_stats_auto_refresh_threshold = 0 forces stats to update in response to any table change.
A last variable blocks statistics from loading from disk on startup, or writing to disk on ANALYZE:
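A minimal sketch, assuming the variable is named dolt_stats_memory_only and can be persisted with @@PERSIST. Neither detail is confirmed by this text, so check the system variable reference for your Dolt version.

```sql
-- Assumed variable name: keep statistics in memory only,
-- skipping loads from disk on startup and writes to disk on ANALYZE.
SET @@PERSIST.dolt_stats_memory_only = 1;
```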
Stats Controller Functions
Dolt exposes a set of helper functions for managing statistics collection and use:
dolt_stats_drop(): Deletes the stats ref on disk and wipes the database stats held in memory for the current database.
dolt_stats_stop(): Cancels active auto-refresh threads for the current database.
dolt_stats_restart(): Stops and restarts a refresh thread for the current database with the current session's interval and threshold variables.
dolt_stats_status(): Returns the latest update to statistics for the current database.
dolt_stats_prune(): Garbage collects the statistics cache storage, retaining only the most recent statistic updates.
dolt_stats_purge(): Deletes the old statistics cache from the filesystem. This can be used to silence warnings from backwards incompatible upgrades. Statistics will need to be recollected, which can be time consuming.
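A short sketch of how these helpers might be used from a SQL session. It assumes they are invoked as stored procedures with CALL, which this text does not state explicitly.

```sql
-- Check the most recent statistics update for the current database.
CALL dolt_stats_status();

-- Pause auto-refresh for the current database.
CALL dolt_stats_stop();

-- Resume auto-refresh using the session's current interval and threshold.
CALL dolt_stats_restart();
```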
Performance
Lowering check intervals and update thresholds increases the refresh read and write load. Refreshing statistics uses shortcuts to avoid reading from disk when possible, but in most cases it must at least read the target fanout level of the tree from disk to compare the previous and current chunk sets. Exceeding the refresh threshold reads all data from disk associated with the new chunk ranges, which is the most expensive part of auto-refresh. Dolt uses ordinal offsets to avoid reading unnecessary data, but the tree growing or shrinking by a level forces a full table scan.
For example, setting the check interval to 0 seconds (constant) and the update threshold to 0 (any change triggers a refresh) reduces the oltp_read_write sysbench benchmark's throughput by 15%. Increasing the update threshold with a 0-second interval reduces throughput even more. On the other hand, basically any non-zero interval reduces the fraction of time spent performing stats updates to a negligible level:
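| interval (s) | threshold | throughput change |
| --- | --- | --- |
| 0 | 0 | -15% |
| 0 | 1 | -46% |
| 0 | 10 | -45% |
| 1 | 0 | -0.1% |
| 1 | 1 | 0% |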
A small TPC-C run with one thread shows a similar pattern relative to the baseline values, this time comparing queries per second (qps):
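| interval (s) | threshold | qps change |
| --- | --- | --- |
| 0 | 0 | -15% |
| 0 | 1 | -26% |
| 0 | 10 | -10% |
| 1 | 0 | -4% |
| 1 | 1 | 0% |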
Statistics' usefulness is rarely improved by immediate updates. Updating every minute or hour is probably fine for most workloads. If you do need quick statistics updates, performing them immediately instead of in batches appears to be preferable with the current implementation tradeoffs.
Statistics also have read performance implications, spending more compute cycles to obtain better join cost estimates. Histograms with the maximum bucket fanout will be the most expensive to use. That said, at the time of writing, sysbench read benchmarks are not impacted by stats estimate overhead. Behavior for custom workloads will depend on read/write/freshness trade-offs.