Importing Data

You can load the following data formats into Dolt:

  • CSV, JSON, Parquet

  • MySQL databases

  • Postgres

  • Spreadsheets

  • Pandas dataframes

We support several commands and techniques for this, described in the sections below.

CSV, JSON, and Parquet Files

The easiest sources of data to work with are CSV, JSON, and Parquet files. These pair best with the custom dolt table import command.

  1. Importing with no schema

Dolt supports importing CSV files without a defined SQL schema. Consider the following CSV file:

pk,val
1,2
2,3

We can import and create a table as follows:

dolt table import -c --pk=pk mytable file.csv

Rows Processed: 2, Additions: 2, Modifications: 0, Had No Effect: 0
Import completed successfully.

We can query the table and see the new schema and data:

> dolt sql -q "describe mytable"
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| pk    | int unsigned | NO   | PRI |         |       |
| val   | int unsigned | NO   |     |         |       |
+-------+--------------+------+-----+---------+-------+

> dolt sql -q "select * from mytable"
+----+-----+
| pk | val |
+----+-----+
| 1  | 2   |
| 2  | 3   |
+----+-----+

You can reference the dolt table import documentation for additional ways to modify your database, such as updating or replacing your existing data.
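For example, once the table exists you can update it in place with the -u flag instead of recreating it (a quick sketch; new_rows.csv is a hypothetical file with the same pk,val columns):

dolt table import -u mytable new_rows.csv

The -r flag works similarly, but replaces the table's existing data with the file's contents.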

  2. Importing with a schema

For JSON and Parquet files, we require that you provide a schema in the form of a CREATE TABLE SQL statement. You can also specify a schema for a CSV file. Let's walk through the following JSON file:

{
  "rows": [
    {
      "id": 0,
      "first name": "tim",
      "last name": "sehn",
      "title": "ceo"
    },
    {
      "id": 1,
      "first name": "aaron",
      "last name": "son",
      "title": "founder"
    },
    {
      "id": 2,
      "first name": "brian",
      "last name": "hendricks",
      "title": "founder"
    }
  ]
}

Our SQL schema, saved in a file called schema.sql, will look like this:

CREATE TABLE employees (
    `id` LONGTEXT NOT NULL,
    `first name` LONGTEXT,
    `last name` LONGTEXT,
    `title` LONGTEXT,
    PRIMARY KEY (`id`)
);

Putting it all together:

> dolt table import -c -s schema.sql employees file.json
Rows Processed: 3, Additions: 3, Modifications: 0, Had No Effect: 0
Import completed successfully.

> dolt sql -q "select * from employees"
+----+------------+-----------+---------+
| id | first name | last name | title   |
+----+------------+-----------+---------+
| 0  | tim        | sehn      | ceo     |
| 1  | aaron      | son       | founder |
| 2  | brian      | hendricks | founder |
+----+------------+-----------+---------+
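The -s flag works the same way for CSV files. For instance, if the employee data above were in a CSV file (call it employees.csv, hypothetically), you could create the table from it with the same schema file, assuming the table doesn't already exist:

dolt table import -c -s schema.sql employees employees.csv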
  3. You can also use the MySQL LOAD DATA command to work with data that is compatible with the LOAD DATA API. For example, you can load the above file.csv as follows:

create table test(pk int, val int);
LOAD DATA INFILE '/Users/vinairachakonda/misc/test/file.csv' INTO table test FIELDS TERMINATED BY ',' IGNORE 1 LINES;

Selecting from the new table gives:

test> select * from test;
+----+-----+
| pk | val |
+----+-----+
| 2  | 3   |
| 1  | 2   |
+----+-----+

MySQL Databases

Migrating your MySQL database to Dolt is easy. We recommend the mysqldump program. Let's say you have a MySQL server with the following configuration:

user: root
password:
host: 0.0.0.0
port: 3306
database: test

You can dump the database test as follows:

mysqldump --databases test -P 3306 -h 0.0.0.0 -u root -p > dump.sql

Note: Using the --databases flag will cause mysqldump to include a command to create the database for you if it doesn't exist yet. Without this flag, you will need to ensure you have already created your database (e.g. create database test;) before you can load in the mysqldump file.

To load into dolt:

dolt sql < dump.sql
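If you'd rather load the dump over the wire, a running dolt sql-server accepts standard MySQL clients, so something like the following should also work (a sketch, assuming a local server on the default port with a root user and a database named test):

mysql -h 127.0.0.1 -P 3306 -u root test < dump.sql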

Hosted Dolt Deployments

Importing data into a Hosted Dolt deployment can be done using the deployment's default configuration. Let's say you've dumped an existing MySQL database using mysqldump and want to import that dump into your Hosted Dolt deployment.

Navigate to the "Configuration" tab of your Hosted deployment dashboard and make sure that the behavior_auto_commit box is checked. This automatically issues a SQL COMMIT after every statement, ensuring the data is persisted as each import statement executes.

Also, make sure that the box behavior_disable_multistatements is unchecked. This is required for importing dumps, as they often combine multiple statements into single strings.

New Hosted deployments have these server configuration settings by default. Please note that changing them before performing an import has been known to cause some imports to fail.

With the above settings in place, you can import the SQL dump using a standard MySQL client or database tool:

mysql -h my-deployment.dbs.hosted.doltdb.com -u <user name> -p<password> database < dump.sql

Postgres

You can load a Postgres database into Dolt with our custom pg2mysql tool. If you have a Postgres database, you can export a dump file with the pg_dump utility.

Given a Postgres dump file named file.pgdump, you can convert it into a MySQL dump as follows:

./pg2mysql.pl < file.pgdump > mysql.sql

Finally, you can load the MySQL dump into Dolt:

dolt sql < mysql.sql

Spreadsheets

There are multiple ways to get spreadsheet data into Dolt. The first is with dolt table import. Consider the following Excel file, employees.xlsx:

id,first name,last name,title
0,tim,sehn,ceo
1,aaron,son,founder
2,brian,hendricks,founder

Just like a CSV file, we run the command dolt table import -c --pk=id employees employees.xlsx to load the Excel file into Dolt. Be sure that the table name matches the name of the sheet in the spreadsheet.

The other way to work with spreadsheet data is with DoltHub's "edit like a spreadsheet" feature. You can create a SQL table from scratch just by typing values into a spreadsheet.

Pandas Dataframe

We recommend standard MySQL + Python techniques when integrating Dolt with Pandas. First, connect to your Dolt database with SQLAlchemy. Here's some sample code:

from sqlalchemy import create_engine
import pymysql
import pandas as pd

# Define a data frame
employees = {
    "id": [0, 1, 2],
    "FirstName": ["Tim","Aaron","Brian"],
    "LastName": ["Sehn", "Son", "Hendriks"],
    "Title": ["CEO", "Founder", "Founder"],
}

tableName = "employees"
dataFrame = pd.DataFrame(data=employees)

# Create an engine that connects to our dolt sql-server.
sqlEngine = create_engine("mysql+pymysql://root:@127.0.0.1/test", pool_recycle=3600)
dbConnection = sqlEngine.connect()

try:
    dataFrame.to_sql(tableName, dbConnection, if_exists="fail")
    print("Table %s created successfully." % tableName)
except Exception as ex:
    print(ex)
finally:
    dbConnection.close()

In the above example we create a dataframe of employees and write it to our Dolt database with the to_sql function. We can then use the read_sql function to read the data back from Dolt into a dataframe.

frame = pd.read_sql('SELECT * from employees', dbConnection)
print(frame)

#    index  id FirstName  LastName    Title
# 0      0   0       Tim      Sehn      CEO
# 1      2   2     Brian  Hendriks  Founder
# 2      1   1     Aaron       Son  Founder

The second way is to export your pandas dataframe as a CSV file, which can then be imported with dolt table import.

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv('out.csv', index=False)
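The resulting CSV can then be loaded with the same import command used earlier (a sketch; turtles is a hypothetical table name, keyed on the name column):

dolt table import -c --pk=name turtles out.csv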

Import Best Practices

There are some best practices to keep in mind in order to make importing external data into Dolt as fast as possible. These performance differences are especially relevant with large databases (~50GB+).

  • Avoid adding foreign keys or unique indexes until after the import is completed. These substantially increase import time.

  • Minimize your use of blob types. These are expensive to create.

  • If running multiple import jobs back to back, be sure to garbage collect the database. Imports can generate a substantial amount of garbage.

  • If you're writing your own .sql file instead of using mysqldump, consider the following:

    • Import one table at a time.

    • Prefer fewer INSERT statements that each insert multiple rows over many short statements that insert a single row each (see the sketch after this list).

    • Sort your inserts by primary key. If a table doesn't have a primary key, consider adding one.
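As a small illustration of the last two points, a hand-written import for the mytable example above is better expressed as one multi-row INSERT sorted by primary key than as several single-row statements (a sketch with hypothetical values):

dolt sql -q "INSERT INTO mytable (pk, val) VALUES (3, 4), (4, 5), (5, 6);"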
