Importing Data

You can load the following data formats into Dolt:

  • CSV, JSON, Parquet

  • MySQL databases

  • Postgres

  • Spreadsheets

  • Pandas dataframes

We support several commands for this, including dolt table import, LOAD DATA, and dolt sql.

CSV, JSON, and Parquet Files

The easiest sources of data to work with are CSV, JSON, and Parquet files. These pair best with the custom dolt table import command.

  1. Importing with no schema

Dolt supports importing CSV files without a defined SQL schema. Consider the following CSV file:

pk,val
1,2
2,3

We can import and create a table as follows:

dolt table import -c --pk=pk mytable file.csv

Rows Processed: 2, Additions: 2, Modifications: 0, Had No Effect: 0
Import completed successfully.

We can query the table and see the new schema and data:

> dolt sql -q "describe mytable"
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| pk    | int unsigned | NO   | PRI |         |       |
| val   | int unsigned | NO   |     |         |       |
+-------+--------------+------+-----+---------+-------+

> dolt sql -q "select * from mytable"
+----+-----+
| pk | val |
+----+-----+
| 1  | 2   |
| 2  | 3   |
+----+-----+
  2. Importing with a schema

For JSON or Parquet files, we require that you provide a schema in the form of a CREATE TABLE SQL statement. You can also specify a schema for a CSV file. Let's walk through the following file:

{
  "rows": [
    {
      "id": 0,
      "first name": "tim",
      "last name": "sehn",
      "title": "ceo"
    },
    {
      "id": 1,
      "first name": "aaron",
      "last name": "son",
      "title": "founder"
    },
    {
      "id": 2,
      "first name": "brian",
      "last name": "hendricks",
      "title": "founder"
    }
  ]
}

Our SQL schema will look like this:

CREATE TABLE employees (
    `id` LONGTEXT NOT NULL,
    `first name` LONGTEXT,
    `last name` LONGTEXT,
    `title` LONGTEXT,
    PRIMARY KEY (`id`)
);

Putting it all together:

> dolt table import -c -s schema.sql employees file.json
Rows Processed: 3, Additions: 3, Modifications: 0, Had No Effect: 0
Import completed successfully.

> dolt sql -q "select * from employees"
+----+------------+-----------+---------+
| id | first name | last name | title   |
+----+------------+-----------+---------+
| 0  | tim        | sehn      | ceo     |
| 1  | aaron      | son       | founder |
| 2  | brian      | hendricks | founder |
+----+------------+-----------+---------+

You can also use the MySQL LOAD DATA command to work with data that is compatible with the LOAD DATA API. For example, you can load the above file.csv as follows:

create table test(pk int, val int);
LOAD DATA INFILE '/Users/vinairachakonda/misc/test/file.csv' INTO TABLE test FIELDS TERMINATED BY ',' IGNORE 1 LINES;

Selecting from the table above, you get:

test> select * from test;
+----+-----+
| pk | val |
+----+-----+
| 2  | 3   |
| 1  | 2   |
+----+-----+

MySQL Databases

Migrating your MySQL database to Dolt is easy. We recommend the mysqldump program. Let's say you have a MySQL server with the following configuration:

user: root
password:
host: 0.0.0.0
port: 3306
database: test

You can dump the database test as follows:

mysqldump --databases test -P 3306 -h 0.0.0.0 -u root -p > dump.sql

Note: Using the --databases flag will cause mysqldump to include a command to create the database for you if it doesn't exist yet. Without this flag, you will need to ensure you have already created your database (e.g. create database test;) before you can load in the mysqldump file.

To load the dump into Dolt:

dolt sql < dump.sql
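
Once the load completes, you may want to create a Dolt commit so the imported data is versioned from the start; a minimal sketch (the commit message is just an example):

dolt add .
dolt commit -m "Import test database from mysqldump"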

Hosted Dolt Deployments

Importing data into a Hosted Dolt deployment can be done using the deployment's default configuration. Let's say you've dumped an existing MySQL database using mysqldump and want to import that dump into your Hosted Dolt deployment.

Navigate to the "Configuration" tab of your Hosted deployment dashboard and make sure that the behavior_auto_commit box is checked. This will automatically create a SQL COMMIT after every statement, ensuring the data is persisted as each import statement executes.

Also, make sure that the behavior_disable_multistatements box is unchecked. This is required for importing dumps, as they often combine multiple statements into a single string.

New Hosted deployments have these correct server configuration settings by default. Note that altering them before performing an import has been known to cause some imports to fail.

With the above settings in place, you can import the SQL dump using a standard MySQL client or database tool:

mysql -h my-deployment.dbs.hosted.doltdb.com -u <user name> -p<password> database < dump.sql
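
Once the import finishes, you can sanity-check it with the same client, for example:

mysql -h my-deployment.dbs.hosted.doltdb.com -u <user name> -p<password> database -e "SHOW TABLES"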

Postgres

You can load a Postgres database into Dolt with our custom pg2mysql tool. If you have a Postgres database, you can export a dump file with the pg_dump utility.
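
For example, a minimal pg_dump invocation might look like this (the database name mydb is a placeholder):

pg_dump --no-owner --no-acl mydb > file.pgdump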

With a Postgres dump file named file.pgdump, you can convert it into a MySQL dump as follows:

./pg2mysql.pl < file.pgdump > mysql.sql

Finally, you can load the MySQL dump file into Dolt:

dolt sql < mysql.sql

Spreadsheets

There are multiple ways to get spreadsheet data into Dolt. The first is with dolt table import. Consider the following Excel file: employees.xlsx

0,tim,sehn,ceo
1,aaron,son,founder
2,brian,hendricks,founder

Just like a CSV file, we run the command dolt table import -c --pk=id employees employees.xlsx to load the Excel file into Dolt. Be sure to give the table the same name as the spreadsheet.

The other way to work with spreadsheet data is DoltHub's "edit like a spreadsheet" feature. You can create a SQL table from scratch by just typing values into a spreadsheet.

Pandas Dataframe

We recommend standard MySQL + Python techniques when integrating Dolt with Pandas. First you want to connect to your Dolt database with the SQLAlchemy ORM. Here's some sample code:

from sqlalchemy import create_engine
import pymysql  # MySQL driver used by the connection string below
import pandas as pd

# Define a data frame
employees = {
    "id": [0, 1, 2],
    "FirstName": ["Tim", "Aaron", "Brian"],
    "LastName": ["Sehn", "Son", "Hendriks"],
    "Title": ["CEO", "Founder", "Founder"],
}

tableName = "employees"
dataFrame = pd.DataFrame(data=employees)

# Create an engine that connects to our dolt sql-server.
sqlEngine = create_engine("mysql+pymysql://root:@127.0.0.1/test", pool_recycle=3600)
dbConnection = sqlEngine.connect()

try:
    # Write the data frame to a new table, failing if the table already exists.
    dataFrame.to_sql(tableName, dbConnection, if_exists="fail")
    print("Table %s created successfully." % tableName)

    # Read the data back out of Dolt before the connection is closed.
    frame = pd.read_sql("SELECT * FROM employees", dbConnection)
    print(frame)
except Exception as ex:
    print(ex)
finally:
    dbConnection.close()

#    index  id FirstName  LastName    Title
# 0      0   0       Tim      Sehn      CEO
# 1      2   2     Brian  Hendriks  Founder
# 2      1   1     Aaron       Son  Founder

In the above example we create a data frame of employees and write it to our Dolt database with the to_sql function. We then use the read_sql function to read the data back out of Dolt.

The second way to do this is to export your Pandas dataframe as a CSV file, which can then be imported with dolt table import.

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv('out.csv', index=False)
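
The resulting out.csv can then be imported like any other CSV file; for example (the table name turtles is just a placeholder):

dolt table import -c --pk=name turtles out.csv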

Import Best Practices

There are some best practices to keep in mind in order to make importing external data into Dolt as fast as possible. These performance differences are especially relevant with large databases (~50GB+).

  • Avoid adding foreign keys or unique indexes until after the import is completed. These substantially increase import time.

  • Minimize your use of blob types. These are expensive to create.

  • If you're writing your own .sql file instead of using mysqldump, consider the following:

    • Import one table at a time.

    • Prefer fewer INSERT statements that each insert many rows over many short statements that insert a single row each, as in the sketch after this list.

    • Sort your inserts by primary key. If a table doesn't have a primary key, consider adding one.
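
For example, a sketch of an import-friendly statement for the employees table above, batching all rows into one INSERT sorted by primary key:

INSERT INTO employees (`id`, `first name`, `last name`, `title`) VALUES
    (0, 'tim', 'sehn', 'ceo'),
    (1, 'aaron', 'son', 'founder'),
    (2, 'brian', 'hendricks', 'founder');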

You can reference the documentation for additional ways to modify your database, such as updating or replacing your existing data.

If running multiple import jobs back to back, be sure to garbage collect the database. Imports can generate a substantial amount of garbage.
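
Garbage collection is a single command:

dolt gc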
