Importing Data
You can load the following data formats into Dolt:
CSV, JSON, Parquet
MySQL databases
Postgres
Spreadsheets
Pandas dataframes
We support several commands and approaches for this, including dolt table import, the SQL LOAD DATA statement, and loading SQL dumps through a standard MySQL client.
CSV, JSON, and Parquet Files
The easiest sources of data to work with are CSV, JSON, and Parquet files. These pair best with the custom dolt table import command.
Importing with no schema
Dolt supports importing CSV files without a defined SQL schema. Consider the following CSV file:
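A small, illustrative file will do; the file name file.csv, the column names, and the sample rows below are all assumptions for this example:

```csv
id,last_name,first_name
0,Sehn,Tim
1,Hendriks,Brian
2,Son,Aaron
```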
We can import and create a table as follows:
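A typical invocation, assuming the file above: -c creates the table, --pk names the primary key column, and Dolt infers the rest of the schema from the file.

```bash
dolt table import -c --pk=id employees file.csv
```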
We can query the table and see the new schema and data:
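For example, with dolt sql (the table and column names follow the illustrative file above):

```bash
dolt sql -q "SHOW CREATE TABLE employees"
dolt sql -q "SELECT * FROM employees"
```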
You can reference the dolt table import documentation for additional ways to modify your database, such as updating or replacing your existing data.
Importing with a schema
In the case of JSON or Parquet files, we require you to provide a schema in the form of a CREATE TABLE SQL statement. You can also specify a schema for a CSV file. Let's walk through the following file.
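For illustration, assume a file named employees.json with its records nested under a rows key:

```json
{
  "rows": [
    { "id": 0, "last_name": "Sehn", "first_name": "Tim" },
    { "id": 1, "last_name": "Hendriks", "first_name": "Brian" },
    { "id": 2, "last_name": "Son", "first_name": "Aaron" }
  ]
}
```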
Our SQL schema will look like this:
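A matching schema for the illustrative file above, saved here as employees-schema.sql (the file name is just a convention for this example):

```sql
CREATE TABLE employees (
  id INT NOT NULL,
  last_name VARCHAR(120),
  first_name VARCHAR(120),
  PRIMARY KEY (id)
);
```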
Putting it all together:
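With the data file and schema file above, the import looks something like this; the -s flag points dolt table import at the schema file:

```bash
dolt table import -c -s employees-schema.sql employees employees.json
```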
You can also use the MySQL LOAD DATA command to work with data that is compatible with the LOAD DATA API. For example, you can load the above file.csv as follows:
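A sketch, assuming a SQL session against the employees table created earlier and that the server can read file.csv from its working directory:

```sql
LOAD DATA INFILE 'file.csv'
INTO TABLE employees
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```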
Selecting from the table above, you get:
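With the illustrative rows used in this section, the result would look something like:

```sql
SELECT * FROM employees;
+----+-----------+------------+
| id | last_name | first_name |
+----+-----------+------------+
| 0  | Sehn      | Tim        |
| 1  | Hendriks  | Brian      |
| 2  | Son       | Aaron      |
+----+-----------+------------+
```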
MySQL Databases
Migrating your MySQL database to Dolt is easy. We recommend the mysqldump program. Let's say you have a MySQL server with the following configuration:
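For concreteness, assume a local server reachable with these illustrative settings:

```text
host: 127.0.0.1
port: 3306
user: root
```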
You can dump the database test as follows:
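Using the illustrative connection settings above; adjust the host, port, and user for your server:

```bash
mysqldump --databases test -h 127.0.0.1 -P 3306 -u root -p > dump.sql
```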
Note: Using the --databases flag will cause mysqldump to include a command to create the database for you if it doesn't exist yet. Without this flag, you will need to ensure you have already created your database (e.g. create database test;) before you can load in the mysqldump file.
To load the dump into Dolt:
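Assuming the dump was written to dump.sql as above, one straightforward option is to pipe it into dolt sql from inside an initialized Dolt database directory:

```bash
dolt sql < dump.sql
```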
Hosted Dolt Deployments
Importing data into a Hosted Dolt deployment can be done using the deployment's default configuration. Let's say you've dumped an existing MySQL database using mysqldump and want to import that dump into your Hosted Dolt deployment.
Navigate to the "Configuration" tab of your Hosted deployment dashboard and make sure that the behavior_auto_commit box is checked. This automatically issues a SQL COMMIT after every statement, ensuring the data is persisted as each import statement executes.
Also, make sure that the behavior_disable_multistatements box is unchecked. This is required for importing dumps, as they often combine multiple statements into single strings.
New Hosted deployments will have these correct server configuration settings by default. Please note that altering them prior to performing an import has been known to cause some imports to fail.
With the above settings in place, you can import the SQL dump using a standard MySQL client or database tool:
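For example, with the mysql command-line client; the host and user are placeholders for the values shown on your deployment's dashboard:

```bash
mysql -h <your-deployment-host> -P 3306 -u <user> -p < dump.sql
```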
Postgres
You can load a Postgres database into Dolt with our custom pg2mysql tool. If you have a Postgres database, you can export a Postgres dump file with the pg_dump utility.
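For example, where mydb is a placeholder database name:

```bash
pg_dump mydb > file.pgdump
```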
With a Postgres dump file file.pgdump, you can convert it into a MySQL dump as follows:
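One plausible invocation, assuming pg2mysql is checked out locally and reads the dump on standard input; check the pg2mysql README for the exact usage:

```bash
./pg2mysql.sh < file.pgdump > mysql.sql
```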
Finally, you can load the MySQL file into Dolt:
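Again assuming an initialized Dolt database directory and the mysql.sql output from the previous step:

```bash
dolt sql < mysql.sql
```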
Spreadsheets
There are multiple ways to get spreadsheet data into Dolt. The first is with dolt table import. Consider the following Excel file: employees.xlsx
Just like a CSV file, we run the command dolt table import -c --pk=id employees employees.xlsx to load the Excel file into our Dolt database. Be sure to give the table the same name as the spreadsheet.
The other way to work with spreadsheet data is with DoltHub's "edit like a spreadsheet" feature. You can create a SQL table from scratch by just typing values into a spreadsheet.
Pandas Dataframe
We recommend standard MySQL + Python techniques when integrating Dolt with Pandas. First, you want to connect to your Dolt database with the SQLAlchemy ORM. Here's some sample code:
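A minimal sketch, assuming a Dolt SQL server is already running locally (for example via dolt sql-server) with a database named employees_db; the connection URL, table name, and sample rows are all illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# Connect to the running Dolt SQL server; adjust user, host, port, and database.
engine = create_engine("mysql+pymysql://root@127.0.0.1:3306/employees_db")

# Build a small dataframe of employees and write it to Dolt.
employees = pd.DataFrame({
    "id": [0, 1, 2],
    "last_name": ["Sehn", "Hendriks", "Son"],
    "first_name": ["Tim", "Brian", "Aaron"],
})
employees.to_sql("employees", con=engine, if_exists="replace", index=False)

# Read the data back out of Dolt into a new dataframe.
df = pd.read_sql("SELECT * FROM employees", con=engine)
print(df)
```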
In the above example, we are creating a dataframe of employees and writing it to our Dolt database with the to_sql function. We can then use the read_sql function to read data back from Dolt into a dataframe.
The second way to do this is by exporting your pandas dataframe as a CSV file, which can then be imported with dolt table import.
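For example, reusing the employees dataframe from above, write it out with pandas and then import the file from the shell:

```python
employees.to_csv("employees.csv", index=False)
```

```bash
dolt table import -c --pk=id employees employees.csv
```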
Import Best Practices
There are some best practices to keep in mind in order to make importing external data into Dolt as fast as possible. These performance differences are especially relevant with large databases (~50GB+).
Avoid adding foreign keys or unique indexes until after the import is completed. These substantially increase import time.
Minimize your use of blob types. These are expensive to create.
If running multiple import jobs back to back, be sure to garbage collect the database (see the examples after this list). Imports can generate a substantial amount of garbage.
If you're writing your own .sql file instead of using mysqldump, consider the following:
Import one table at a time.
Prefer fewer INSERT statements that each insert multiple values over many shorter statements that each insert a single value.
Sort your inserts by primary key, as shown in the example after this list. If a table doesn't have a primary key, consider adding one.
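For example, garbage collection between jobs is a single command:

```bash
dolt gc
```

And a batched insert ordered by primary key looks like this (the table and values are the illustrative ones used earlier):

```sql
INSERT INTO employees (id, last_name, first_name) VALUES
  (0, 'Sehn', 'Tim'),
  (1, 'Hendriks', 'Brian'),
  (2, 'Son', 'Aaron');
```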