DoltHub

Introduction

DoltHub is a collaboration platform for managing Dolt databases. Dolt is the format the data travels in, and provides the SQL query interface. DoltHub builds collaboration tools on top of Dolt to make acquiring, publishing, and collaboratively building datasets using Dolt a joy.

Signing Up

This tutorial assumes that you have signed up for DoltHub, though you don't need to for cloning public datasets. You can sign up for DoltHub easily here. To follow along with the commands you need to have Dolt installed. See the installation guide, but it's as easy as brew install dolt, and we publish .msi files for Windows users.

Data Acquisition

We show several ways to explore data on DoltHub, starting with the most user friendly. Let's say we are interested in the dolthub/ip-to-country mappings that DoltHub publishes. Throughout this section we will demo the various ways to acquire data against this data set.

Web

The simplest way to access data that is stored in Dolt and hosted on DoltHub is to navigate to the database homepage:
DoltHub database page
Click on the SQL console to start writing a query against that Dolt database, using the left hand menu to browse the schema:
SQL editor open
Now let's write a query for all the IPv4 codes in Australia:
Sample query
Executing the query renders the results:
Query results
While this is a useful data exploration interface, it doesn't offer the kind of interactivity that a local copy of the data might. Let's switch gears to the command line, where we will see the integration point between Dolt and DoltHub.

Dolt CLI

The integration point between Dolt and DoltHub is the concept of a "remote." Just like Git, Dolt supports remotes, which are stored in the metadata of the Dolt database. The Dolt command line tool can clone from remotes and fetch their updates using the familiar commands clone and pull. The data arrives as a SQL database, and your Dolt binary allows you to run SQL queries across the data. Let's say we are interested in pulling our example dataset into our local environment, either for use in some production system or for exploration. The database path is dolthub/ip-to-country. We can acquire it in a single command:
$ dolt clone dolthub/ip-to-country && cd ip-to-country
cloning https://doltremoteapi.dolthub.com/dolthub/ip-to-country
23,716 of 23,716 chunks complete. 0 chunks being downloaded currently.
Now we can launch a SQL interpreter and start executing queries:
$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
ip_to_country> select country, count(*) from IPv4ToCountry group by country order by count(*) desc limit 10;
+--------------------+----------+
| Country            | COUNT(*) |
+--------------------+----------+
| United States      | 56127    |
| Russian Federation | 10757    |
| Brazil             | 10607    |
| Germany            | 9226     |
| China              | 8485     |
| Canada             | 8090     |
| United Kingdom     | 8006     |
| Australia          | 7938     |
| India              | 6093     |
| Netherlands        | 5199     |
+--------------------+----------+
We also acquired a full history of values for each cell in the database, which we can immediately inspect via the history tables. There are other interesting datasets on DoltHub, and you should head over to our discover page and check them out, and clone them if they are interesting.
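As a non-interactive alternative to the shell session above, the same kind of query can be run with `dolt sql -q`. The sketch below assumes it is run inside the clone from the previous step, and relies only on the `IPv4ToCountry` table and its `Country` column as they appear in the output above. The helper names `country_query` and `history_table` and the row limits are illustrative assumptions. It also peeks at the history table, which Dolt exposes under the name `dolt_history_<table>`.

```shell
# Build a query for one country's rows. `country_query` is a hypothetical
# helper; only the table name and the Country column come from the example.
country_query() {
  printf "select * from IPv4ToCountry where Country = '%s' limit %s;" "$1" "${2:-10}"
}

# Name of the Dolt system table that holds historical versions of a table's rows.
history_table() {
  printf 'dolt_history_%s' "$1"
}

# Run the queries only when dolt is installed and we are inside a Dolt database.
if command -v dolt >/dev/null 2>&1 && [ -d .dolt ]; then
  # -q runs a single query and exits; -r csv prints the result as CSV.
  dolt sql -r csv -q "$(country_query Australia)"
  # Look at a few historical row versions, including their commit metadata.
  dolt sql -q "select * from $(history_table IPv4ToCountry) limit 5;"
fi
```

Building the query in a function keeps the SQL reusable, for example `country_query Brazil 3` for a different country and limit.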

Data Publishing

In the previous section we showed two ways to get data from Dolt databases hosted on DoltHub: a web-based SQL query interface, and cloning the database to run SQL against it locally with Dolt (DoltHub also offers an API, which is not shown above). In this section we switch to editing data on DoltHub and pushing data to DoltHub from Dolt.

Using DoltHub

There are a few ways to publish data without using Dolt.
Create table

File Upload

You can upload a file to create or update a table in your database. Click on "Upload a file" in the add item dropdown on the top right of a database page.
Upload a file dropdown
You'll be taken through a few steps in the upload wizard.
First, choose a base branch (commits directly to the main branch for an empty database)
Choose a branch
Next, choose a table name (create a new table or update an existing table)
Choose table name
Upload a file and optionally choose primary keys (we currently support CSV, PSV, XLSX, and JSON files)
Choose primary keys
View and commit changes
Commit changes
You can learn more from our blog here.

Spreadsheet Editor

As a part of the file upload process, you can also create and edit spreadsheets directly in DoltHub. You will see this option during step 3.
Spreadsheet editor option
If you chose to update an existing table, the spreadsheet editor will be populated with the table's existing data. If the data has more than 50 rows it will paginate on scroll.
Spreadsheet with existing data
Once you are satisfied with your changes, you can click on "Upload table", and continue with the file upload process like above.
Learn more from our blog here.

SQL Console

In addition to being able to run SELECT queries from the SQL Console to navigate data, you can also run queries that update data on DoltHub.
Use the cell buttons or SQL Console to run a write query.
Edit cell value
You'll be navigated to a "workspace", which is a temporary staging area where you can make changes without updating the main branches of the database.
Workspace
Once you are satisfied, you can create a pull request with your changes for review.
Learn more from our blog here.

Using Dolt

In addition to using DoltHub to publish data, you can also use the Dolt CLI.

Configuration

To publish data on DoltHub using Dolt, the first thing to do is configure your copy of Dolt to recognize DoltHub as a remote to your database. To do that, and then write to that database, you need to log in. Dolt provides a command for logging in that will launch a browser window, have you authenticate, and create a token which your local Dolt will use to identify itself. The steps are straightforward:
$ dolt login
Credentials created successfully.
pub key: t7oc1qsgc8isfq9po0d4kteg8lio9et9...
~/.dolt/creds/<some hash>.jwk
Opening a browser to:
https://dolthub.com/settings/credentials#t7oc1qsgc8isfq9po0d4kteg8lio9et9...
Please associate your key with your account.
Checking remote server looking for key association.
requesting update
This launches a browser window to create a token:
Dolt Login Screen
Give the token a name, click create, and you should see control returned to the prompt:
Key successfully associated with user: your-username email: [email protected]
You are now logged in and you can push data to DoltHub.
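If you want to confirm the login worked before pushing, one option is to inspect the stored credentials. This is a sketch and not part of the original walkthrough: `verify_dolt_login` is a hypothetical wrapper, and it assumes your Dolt version ships the `dolt creds ls` and `dolt creds check` subcommands.

```shell
# Hypothetical helper: list local credentials, then ask the remote whether
# the active key is accepted. Returns 1 early if dolt is not on the PATH.
verify_dolt_login() {
  if ! command -v dolt >/dev/null 2>&1; then
    echo "dolt is not installed or not on PATH" >&2
    return 1
  fi
  # -v also prints the key id of each stored credential.
  dolt creds ls -v && dolt creds check
}
```

Calling `verify_dolt_login` after `dolt login` should exit with status 0; a non-zero status suggests the key has not been associated with your account yet.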

DoltHub Remotes

Publishing data is equally easy once it's in the Dolt format. In another section we showed how to write data to Dolt. Let's assume we have some data in a local Dolt repository:
$ cat > great_players.csv
name,id
rafa,1
roger,2
novak,3
andy,4
^C
$ dolt table import -c --pk id great_players great_players.csv
Rows Processed: 4, Additions: 4, Modifications: 0, Had No Effect: 0
Import completed successfully.
$ dolt add great_players
$ dolt commit -m 'Added some initial great players'
commit ht24tetekl12hmek03e6ldl0hbqm8l93
Author: you <[email protected]>
Date: Wed May 06 23:38:45 -0700 2020

Added some initial great players
Now suppose we want to share this data with others. The model for sharing on DoltHub is similar to GitHub: we create a public (or private, see pricing) repository and add it as a remote to our local repository. Let's start by creating a repository on DoltHub, which takes just a few clicks:
Create a DoltHub repository
Earlier we ran dolt login to allow our local copy of Dolt to authenticate with our DoltHub account. We now put this to use by connecting our local Dolt repository to the repository we just created. Just like Git, we add a remote:
$ dolt remote add origin dolthub/great-players-example

$ dolt push origin master
Tree Level: 2 has 3 new chunks of which 0 already exist in the database. Buffering 3 chunks.
Tree Level: 2. 100.00% of new chunks buffered.
Tree Level: 1 has 2 new chunks of which 0 already exist in the database. Buffering 2 chunks.
Tree Level: 1. 100.00% of new chunks buffered.
Successfully uploaded 1 of 1 file(s).
And now anyone who wants to consume this data can do so with a single command. Once the consumer has run that command they can immediately stand up a Dolt SQL Server instance and start querying the data. If you're ready for collaborators you can add them as one on DoltHub.
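The single command mentioned above is a clone of the published database. A minimal sketch, assuming the `dolthub/great-players-example` path created earlier; `repo_dir` and `consume_dataset` are hypothetical helper names, and `dolt sql-server` is started with its default settings.

```shell
# Directory that `dolt clone` creates: the last path segment of owner/name.
repo_dir() {
  printf '%s' "${1##*/}"
}

# Hypothetical helper: clone a DoltHub database and serve it over the
# MySQL-compatible protocol. Blocks until the server is stopped.
consume_dataset() {
  dolt clone "$1" && cd "$(repo_dir "$1")" && dolt sql-server
}
```

For example, `consume_dataset dolthub/great-players-example` clones into `great-players-example` and starts the server there, after which a MySQL client can connect to it.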

Data Collaboration

A major motivator for building Dolt and DoltHub was to create world class tools for data collaboration. Earlier sections of this guide show how Dolt makes moving data seamless by implementing Git style version control, including clone, push, and pull operations. This section shows how to use DoltHub's collaboration tools to efficiently coordinate updates to shared datasets.

Pull Requests

In the simplest case two DoltHub users wish to make updates to the same database on DoltHub. Let's suppose that our esteemed CEO Tim has suddenly developed a passion for tennis, and would like to contribute to dolthub/great-players-example, the database we created in the previous section.
Adding a DoltHub Collaborator
Dolt has a concept of branches, almost identical to branches in Git. A branch is a named pointer to a commit. Users can create pull requests by proposing to merge one branch into another. The model looks something like this:
Pull Request Workflow
Note that in the diagram each user has their own copy of the database, and they use dolt push origin <branch> to push their branch to DoltHub.
Let's work through an example of adding some players in a new branch, pushing that branch to DoltHub, and raising a pull request:
$ dolt clone dolthub/great-players-example
cloning https://doltremoteapi.dolthub.com/dolthub/great-players-example
8 of 8 chunks complete. 0 chunks being downloaded currently.
$ cd great-players-example
$ dolt checkout -b more-great-players
Switched to branch 'more-great-players'

$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.

great_players_example> insert into great_players (name, id) values ('stan', 5);
Query OK, 1 row affected

great_players_example> ^D
Bye

$ dolt diff
diff --dolt a/great_players b/great_players
--- a/great_players @ c2tpkad9e5345sjq2h7e6d9pdp7383a6
+++ b/great_players @ jopmq0sa2lkevugong75vqgpjir8ecve
+-----+------+----+
|     | name | id |
+-----+------+----+
|  +  | stan | 5  |
+-----+------+----+

$ dolt add great_players && dolt commit -m 'Added Stan The Man'
commit 4bsqsuanjsvra3kq7tchqre2kf7qgt29
Author: oscarbatori <[email protected]>
Date: Wed Oct 07 13:53:35 -0700 2020

Added Stan The Man

$ dolt push origin more-great-players
Tree Level: 3 has 3 new chunks of which 2 already exist in the database. Buffering 1 chunks.
Tree Level: 3. 100.00% of new chunks buffered.
Tree Level: 1 has 2 new chunks of which 1 already exist in the database. Buffering 1 chunks.
Tree Level: 1. 100.00% of new chunks buffered.
Successfully uploaded 1 of 1 file(s).
Now that we have pushed the branch more-great-players to DoltHub, we can open a pull request by selecting the appropriate branch:
Creating a Pull Request from a branch
This pull request can be reviewed and merged:
Merging a Pull Request
And we are done!
Merged Pull Request

Forking a Dolt Database

While this model is fine for small numbers of collaborators with high mutual trust, it doesn't necessarily scale to the kind of mass participation that has fueled growth in open source software. The practical issue is that database owners have to vet every would-be collaborator before granting them permissions. The "fork" model exists to solve this problem. In the fork model users can copy a database into their own namespace or organization.
Dolt Fork
Let's work through an example by forking the example database we have been working with:
Forking a Database
There is now a fork in the namespace sampleorg, which we can clone and edit:
$ dolt clone sampleorg/great-players-example
cloning https://doltremoteapi.dolthub.com/sampleorg/great-players-example
14 of 14 chunks complete. 0 chunks being downloaded currently.
$ cd great-players-example

$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.

great_players_example> insert into great_players (name, id) values ('marin', 6);
Query OK, 1 row affected
great_players_example> ^D
Bye

$ dolt diff
diff --dolt a/great_players b/great_players
--- a/great_players @ c2tpkad9e5345sjq2h7e6d9pdp7383a6
+++ b/great_players @ 7m0qs6sebr61r00pi8e301tt99cg89dk
+-----+-------+----+
|     | name  | id |
+-----+-------+----+
|  +  | marin | 6  |
+-----+-------+----+

$ dolt add great_players && dolt commit -m 'Added Marin'
commit gt6c904uksachevarvp5pup1cc17pb48
Author: oscarbatori <[email protected]>
Date: Wed Oct 07 14:25:26 -0700 2020

Added Marin

$ dolt push origin master
Tree Level: 3 has 3 new chunks of which 2 already exist in the database. Buffering 1 chunks.
Tree Level: 3. 100.00% of new chunks buffered.
Tree Level: 1 has 2 new chunks of which 1 already exist in the database. Buffering 1 chunks.
Tree Level: 1. 100.00% of new chunks buffered.
Successfully uploaded 1 of 1 file(s).

Pull Requests from Forks

We can now create a pull request in a manner similar to the previous section, but instead of choosing only the from and to branches, we now choose the from repository:
Creating a Pull Request from a fork
This creates a pull request, which will be familiar from the previous section, which we can go ahead and merge!

Updating Your Fork

One of the benefits of the fork model is being able to continue to get updates from the parent while maintaining a local set of changes. This is a powerful model for data distribution where consumers can continue to receive updates from trusted distributors, while maintaining a set of changes that represent their preferences or views with minimal technical overhead.
Forking a Database
Let's continue with our example. Suppose that SampleOrg disagrees with the folks at DoltHub about whether Andy Roddick was a great player. The folks at DoltHub do not believe he was a great player, but the folks at SampleOrg do. However, SampleOrg agrees with DoltHub on most other players, and would like to continue getting updates.
We do this by adding the parent as a remote. A SampleOrg user can add the parent repository dolthub/great-players-example as a remote as follows:
$ dolt remote add upstream dolthub/great-players-example
Now let's create a branch that will be used to pull in changes from the upstream, call it vendor:
$ dolt branch vendor
We can create an entry on master to reflect SampleOrg's belief that Andy Roddick was a great player:
$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
great_players_example> insert into great_players (name, id) values ('roddick', 7);
Now suppose that the DoltHub team has added David Nalbandian to the dataset, an update SampleOrg would like to capture. We can fetch and merge as follows:
$ dolt fetch upstream
Tree Level: 5 has 3 new chunks of which 1 already exist in the database. Buffering 2 chunks.
Tree Level: 5. 100.00% of new chunks buffered.
Tree Level: 4 has 7 new chunks of which 4 already exist in the database. Buffering 3 chunks.
Tree Level: 4. 100.00% of new chunks buffered.
Tree Level: 1 has 2 new chunks of which 1 already exist in the database. Buffering 1 chunks.
Tree Level: 1. 100.00% of new chunks buffered.
Successfully uploaded 1 of 1 file(s).

$ dolt merge upstream/master
Updating gt6c904uksachevarvp5pup1cc17pb48..rqhgn8suonl9ppahntbbhtn888vjlm36
Fast-forward
$ dolt log
commit rqhgn8suonl9ppahntbbhtn888vjlm36
Author: oscarbatori <[email protected]>
Date: Wed Oct 07 19:33:22 -0700 2020

Added David Nalbandian, what a backhand!
Our diff comparing master and vendor will now reflect this:
$ dolt diff master vendor
diff --dolt a/great_players b/great_players
--- a/great_players @ 7m0qs6sebr61r00pi8e301tt99cg89dk
+++ b/great_players @ clrm54jkd6ecc0q08fh8pomptumli6c0
+-----+-------+----+
|     | name  | id |
+-----+-------+----+
|  +  | david | 8  |
+-----+-------+----+
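Putting the steps of this section in order, the update loop might be sketched as follows. This is an outline under stated assumptions rather than the exact session above: it assumes any local change on master has already been committed, and that the merge is meant to land on the vendor branch, which is consistent with the fast-forward and the diff shown. The helper names `remote_ref` and `update_vendor` are hypothetical.

```shell
# Compose the remote-tracking ref that `dolt merge` expects, e.g. upstream/master.
remote_ref() {
  printf '%s/%s' "$1" "$2"
}

# Hypothetical helper: bring the vendor branch up to date with the parent.
# Arguments (all optional): remote name, parent branch, local vendor branch.
update_vendor() {
  remote="${1:-upstream}"
  parent="${2:-master}"
  vendor="${3:-vendor}"
  # Switch to the vendor branch, download the parent's commits, merge them,
  # then show how the vendor branch now differs from local master.
  dolt checkout "$vendor" &&
    dolt fetch "$remote" &&
    dolt merge "$(remote_ref "$remote" "$parent")" &&
    dolt diff master "$vendor"
}
```

Keeping upstream changes on a separate branch means master only changes when you choose to merge from vendor, so local additions such as the Roddick row are never overwritten by a fetch.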
We have seen two models of collaboration. The first is suitable for smaller groups of collaborators with high mutual trust, say within an organization. The second is robust enough for the kind of mass participation that occurs when a community decides to maintain a shared dataset.

Conclusion

This succinctly illustrates the value of Dolt and DoltHub. Dolt provides for the seamless transfer of structured data, and ensures all data arrives ready for use in a familiar query interface. DoltHub provides a layer of collaboration tools.