SQL Sync

Background and Motivation

Dolt is a relational database that uses a commit graph as its underlying data store. It emulates Git's model of versioning by creating a commit in that commit graph when prompted via either command line interface, SQL, or some other API.
This is distinct from existing relational database solutions, such as MySQL, Postgresql, Oracle, that have a "last write wins" model of versioning. A cell takes on the value set by the last query that modified it. Some of these databases, such as Microsoft SQL Sever and Maria DB, also support AS OF functionality that stores the history of values and exposes a syntax for querying them. This requires explicit configuration. Additionally, these relational database implementations offer various sorts of backup and recovery mechanisms that, at a very high level, take periodic snapshots which can be restored.
This document shows how to use Dolt to create version history of an existing relational database without having to actually modify that database. We use our Python API, Doltpy, to achieve this. The setup can visualized as follows:
Sync to Dolt
The left hand side shows a schematic for the how each sync corresponds to a commit in Dolt, and the right hand side the query interface presented to the user via Dolt SQL. We can now "time travel" through our database history.

Scope and Prerequisites

This guide will explain how to sync data to and from Dolt, and either Postgres or MySQL. Currently automated schema sync is supported only when syncing from another database to Dolt.
This guide assumes that you have both Dolt and Doltpy installed, and that you are somewhat familiar with both Dolt and Python. Checkout the installation guide if you haven't install Dolt or Doltpy.

MySQL

Our current support for syncing MySQL and Dolt allows users to sync data from MySQL to Dolt, or from Dolt to MySQL. We also support syncing a schema from MySQL into Dolt.

MySQL to Dolt

Setup

In order to perform the sync we need some objects that provide connections to the relevant databases, and we need to start any database servers needed. Let's assume we are running MySQL on mysql_host and mysql_port. We can create the required database engine object, and an object to represent our Dolt database as follows:
1
from doltpy.cli import Dolt
2
from doltpy.sql import DoltSQLServerContext, ServerConfig
3
import sqlalchemy as sa
4
5
# Setup objects to represents source and target databases, start Dolt SQL Server
6
dolt = Dolt.clone('my-org/myback')
7
dssc = DoltSQLServerContext(dolt, ServerConfig())
8
dssc.start_server()
9
mysql_engine = sa.create_engine(
10
'{dialect}://{user}:{password}@{host}:{port}/{database}'.format(
11
dialect='mysql+mysqlconnector',
12
user=mysql_user,
13
password=mysql_password,
14
host=mysql_host,
15
port=mysql_port,
16
database=mysql_database
17
)
18
)
Copied!

Schema

The code for syncing the schema to Dolt once the database engine object and Dolt object are created is straight forward. We simply execute a generic function with engine appropriate parameters that take care of mapping types that Dolt does not support, which for MySQL is currently only JSON:
1
from doltpy.sql.sync import sync_schema_to_dolt, MYSQL_TO_DOLT_TYPE_MAPPINGS
2
3
sync_schema_to_dolt(mysql_engine,
4
dssc.engine,
5
{'revenue_estimates': 'revenue_estimates'},
6
MYSQL_TO_DOLT_TYPE_MAPPINGS)
Copied!

Data

Syncing the data to Dolt is similarly straight forward, and we use the same design pattern: pass implementation specific parameters, in this case reader and writer functions, to a generic procedure for syncing to Dolt:
1
from doltpy.sql.sync import sync_to_dolt, get_dolt_target_writer, get_mysql_source_reader
2
3
# Execute the sync
4
sync_to_dolt(get_mysql_source_reader(mysql_engine),
5
get_dolt_target_writer(dssc),
6
{'revenue_estimates': 'revenue_estimates'})
Copied!
Finally, we might want to stop the Dolt SQL server running:
1
dssc.stop_server()
Copied!
Relational databases have a straightforward concept of state, that is they simply capture the last value written to a given cell. We provide a simple reader that just captures the state of the table, and passes it to a simple writer for creating a commit. Note, Dolt will handle discerning what has changed, and commit only the changes which will allow users to see diffs across syncs.

Dolt to MySQL

We now provide an example of going from Dolt to a MySQL instance. We again assume that the MySQL instance is running at mysql_host on port mysql_port, and we assume the existence of mysql_engine and dolt objects that we created in the Setup section above to keep the example succinct.

Data

Since we do not currently support copying a Dolt schema to MySQL, we jump straight to syncing data. We use the same design pattern, passing implementation specific functions to a generic procedure for syncing from Dolt:
1
from doltpy.sql.sync import sync_from_dolt, get_mysql_target_writer, get_dolt_source_reader
2
3
sync_from_dolt(get_dolt_source_reader(dssc, get_dolt_table_reader()),
4
get_mysql_target_writer(mysql_engine),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!
In order to facilitate user defined behavior of the actual database interactions, the sync function takes functions for reading and writing. Here we use the library's default readers and writers for Dolt and MySQL respectively. The Dolt source reader just reads the database at the latest commit, defaulting to the tip of master, and the target writer just updates that state in the target MySQL instance. Below we dive deeper into how we might go about implementing custom behavior.
Recall that Dolt databases are a commit graph, and so each commit is essentially a database state, thus we replicate precisely that state, including dropping primary keys that might have been deleted in that commit.

Postgres

Our current support for syncing Postgresql and Dolt allows users to sync data from Postgresql to Dolt, or from Dolt to Postgresql. We also support syncing a schema from Postgresql into Dolt.

Postgres to Dolt

Syncing to Postgres is similar. In the previous section we showed code snippets that use function parameters to specify implementation specific behavior:
1
from doltpy.sql.sync import sync_to_dolt, get_DB_target_writer, get_dolt_source_reader
2
3
sync_from_dolt(get_dolt_source_reader(dolt, get_dolt_table_reader()),
4
get_DB_target_writer(mysql_engine),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!
We simply replace use postgres instead of the placeholder DB in order to sync to Postgres.

Setup

As in the MySQL section, we need some objects to represent the Postgres connection and the Dolt database. We assume Postgres is running postgres_host on port postgres_port:
1
from doltpy.cli import Dolt
2
from doltpy.sql import DoltSQLServerContext, ServerConfig
3
import sqlalchemy as sa
4
5
dolt_repo = Dolt.clone('my-org/analyst-estimates')
6
dssc = DoltSQLServerContext(dolt, ServerConfig())
7
dssc.start_server()
8
postgres_engine = sa.create_engine(
9
'{dialect}://{user}:{password}@{host}:{port}/{database}'.format(
10
dialect='postgresql',
11
user=postgres_user,
12
password=postgres_password,
13
host=postgres_host,
14
port=postgres_port,
15
database=postgres_database
16
)
17
)
Copied!

Schema

We can now use the same generic sync_schema_to_dolt function with implementation specific parameters to sync the Postgres schema to Dolt:
1
from doltpy.sql.sync import sync_schema_to_dolt, POSTGRES_TO_DOLT_TYPE_MAPPINGS
2
3
sync_schema_to_dolt(postgres_engine,
4
dssc.engine,
5
{'revenue_estimates': 'revenue_estimates'},
6
POSTGRES_TO_DOLT_TYPE_MAPPINGS)
Copied!

Data

To sync data we again use sync_to_dolt, this time with Postgres specific function parameters to get the correct behavior:
1
from doltpy.sql.sync import sync_to_dolt, get_postgres_target_writer, get_dolt_source_reader
2
3
sync_to_dolt(get_postgres_source_reader(postgres_engine),
4
get_dolt_target_writer(dssc),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!

Dolt to Postgres

We again assume that Postgres is running on postgres_host at port postgres_port, and reuse the objects we defined in the setup section. Because we don't yet support syncing a Postgres schema to Dolt, we jump straight into getting data from Dolt into Postgres.

Data

As with syncing to Dolt from Postgres, we employ a generic method parameterized with database implementation specific parameters:
1
from doltpy.sql.sync import sync_from_dolt, get_postgres_target_writer, get_dolt_source_reader
2
3
sync_from_dolt(get_dolt_source_reader(dssc, get_dolt_table_reader()),
4
get_postgres_target_writer(postgres_engine),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!

Oracle

Our current support for Oracle is data-only, which is to say we support syncing only data, and any schema must be manually created before schema sync can occur.

Oracle to Dolt

We again employ the same pattern for syncing from Oracle to Dolt that we used in the MySQL and Postgres sections.

Setup

As in the Postgres and MySQL sections, we need some objects to represent the Oracle connection and the Dolt database. We assume Oracle is running oracle_host on port oracle_port:
1
from doltpy.cli import Dolt
2
from doltpy.sql import DoltSQLServerContext, ServerConfig
3
import sqlalchemy as sa
4
import cx_Oracle
5
6
dolt = Dolt.clone('my-org/analyst-estimates')
7
dssc = DoltSQLServerContext(dolt, ServerConfig())
8
dssc.start_server()
9
10
def _oracle_connection_helper:
11
return cx_Oracle.connect('oracle_user', 'oracle_pwd', '{}:{}/{}'.format('oracle_host', 1521, 'oracle_db'))
12
13
engine = create_engine('oracle+cx_oracle://', creator=_oracle_connection_helper)
Copied!

Data

To sync data we again use sync_to_dolt, this time with Oracle specific function parameters to get the correct behavior:
1
from doltpy.sql.sync import sync_to_dolt, get_oracle_source_reader, get_dolt_target_writer
2
3
sync_to_dolt(get_oracle_source_reader(oracle_engine),
4
get_dolt_target_writer(dssc),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!

Dolt to Oracle

We again assume that Oracle is running on oracle_host at port oracle_port, and reuse the objects we defined in the setup section. Because we don't yet support syncing an Oracle schema to Dolt, we jump straight into getting data from Dolt into Oracle.

Data

As with syncing to Dolt from Postgres, we employ a generic method parameterized with database implementation specific parameters:
1
from doltpy.sql.sync import sync_from_dolt, get_oracle_target_writer, get_dolt_source_reader
2
3
sync_from_dolt(get_dolt_source_reader(dssc, get_dolt_table_reader()),
4
get_oracle_target_writer(oracle_engine),
5
{'revenue_estimates': 'revenue_estimates'})
Copied!

Customizing Behavior

Diving into the code that does the actual sync, one thing that jumps out is how simple it is (doc strings omitted for clarity):
1
# Types that reflect the different nature of the syncs
2
DoltTableUpdate = Tuple[Iterable[dict], Iterable[dict]]
3
TableUpdate = Iterable[dict]
4
5
# For using Dolt as the target
6
DoltAsTargetUpdate = Mapping[str, TableUpdate]
7
DoltAsTargetReader = Callable[[List[str]], DoltAsTargetUpdate]
8
DoltAsTargetWriter = Callable[[DoltAsTargetUpdate], None]
9
10
# For using Dolt as the source
11
DoltAsSourceUpdate = Mapping[str, DoltTableUpdate]
12
DoltAsSourceReader = Callable[[List[str]], DoltAsSourceUpdate]
13
DoltAsSourceWriter = Callable[[DoltAsSourceUpdate], None]
14
15
16
def sync_to_dolt(source_reader: DoltAsTargetReader, target_writer: DoltAsTargetWriter, table_map: Mapping[str, str]):
17
_sync_helper(source_reader, target_writer, table_map)
18
19
20
def sync_from_dolt(source_reader: DoltAsSourceReader, target_writer: DoltAsSourceWriter, table_map: Mapping[str, str]):
21
_sync_helper(source_reader, target_writer, table_map)
22
23
24
def _sync_helper(source_reader, target_writer, table_map: Mapping[str, str]):
25
to_sync = source_reader(list(table_map.keys()))
26
remapped = {table_map[source_table]: source_data for source_table, source_data in to_sync.items()}
27
target_writer(remapped)
Copied!
The "interface" that the two databases communicate over is strikingly simple. Mapping[str, TableUpdate] is a mapping from table name to a list of dict instances, each one representing a row. Users are free to customize behavior by providing a function that reads from their database (Dolt, MySQL, Postgres), and produces a list of dict values that match the schema of the target database.

Future Work

We are excited by the possibilities a database with Dolt's unique features creates for data engineering workflows. Next up the sync is expanding to more relational database implementations (MS SQL Server and Oracle), and supporting syncing a schema from Dolt to any of the supported database implementations.
Last modified 23d ago