Data and Model Quality Control

Problem

Are you in the business of creating data and models?
Do you want to institute human or automated review on data changes for data quality assurance?
Are you worried about model reproducibility?
Do different people or teams want to work on slightly different versions of the data?
Are long running projects hard to pull off because of parallel data changes?
Would data branches help?
Do you want the ability to query or roll back to a previous version of the data instantly?

Dolt solves this by…

Traditional databases were built for a world of transactions and reports. Modern data science tools use data to create models that behave more like software than reports. Models produce user visible outputs and define application behavior. Tuning data to get the right model can be a lot like writing code.

The version control tools we use to build software apply to modern data science. Version control for data did not exist until Dolt, the first and only database you can branch, diff, and merge just like a Git repository.

Modern data science applications require model reproducibility, data quality, and multiple versions of data to perform at their best. Dolt allows for these capabilities directly in your database, in a Git-style version control model most developers understand.

Dolt is used for model reproducibility. If you build a model from a version of the data, make a tag at that commit and refer to that tag in the model metadata. Some of our data and model quality control customers only use Dolt for this simple feature. Dolt shares storage between versions so you can store many more copies of the data using Dolt than say storing copies of the data in S3.

Dolt allows for human or automated review on data changes increasing data quality. If a bad change makes it through review simply roll the data back to a previous version. DoltHub, DoltLab, and the Hosted Dolt Workbench all implement a Pull Request workflow, the standard for human reviewing code changes. Extend that model to your data changes.

Dolt is the only database with branch and merge functionality. Branches allow for long running data projects. Want to add an additional feature to a model but don't want the new feature effecting the production model build? Make branch and run the project on that branch. Occasionally merge production data into that branch so you can stay in touch with changes there. Companies use Dolt branches to increase the number of parallel data projects by an order of magnitude.

Lastly, commits, logs, and diffs can be used for model insights. Did Thursday's model perform better than Tuesday's but had the same model weights? Inspect the data diff to see what changed. Inspect the commit log to see where that new data came from.

Dolt replaces...

Unstructured files in cloud storage

It is common practice to store copies of training data or database backups in cloud storage for model reproducibility. A full copy of the data is stored for every training run. This can become quite expensive and limit the amount of models you can reproduce. Dolt stores only the differences between stored versions decreasing the cost of data storage. Additionally, Dolt can produce diffs between versions of training data producing novel model insights.

MySQL, Postgres, or other databases

Dolt can replace any database used to store and query data. Many of our customers switch from other OLTP databases like MySQL or Postgres to improve data and model quality through versioning. Customers have also switched to Dolt from document databases like MongoDB. Dolt's additional unique features like branches, diffs, and merges allow for human review of data changes and multiple parallel data projects.

Companies Doing This

Case Studies

Turbine

PreviousData Sharing NextManual Data Curation

Last updated 1 year ago

Was this helpful?