Data and Model Quality Control


Last updated 1 year ago


Problem

  • Are you in the business of creating data and models?

  • Do you want to institute human or automated review on data changes for data quality assurance?

  • Are you worried about model reproducibility?

  • Do different people or teams want to work on slightly different versions of the data?

  • Are long running projects hard to pull off because of parallel data changes?

  • Would data branches help?

  • Do you want the ability to query or roll back to a previous version of the data instantly?

Dolt solves this by…

Traditional databases were built for a world of transactions and reports. Modern data science tools use data to create models that behave more like software than reports. Models produce user-visible outputs and define application behavior. Tuning data to get the right model can be a lot like writing code.

The version control tools we use to build software apply to modern data science. Version control for data did not exist until Dolt, the first and only database you can branch, diff, and merge just like a Git repository.

Modern data science applications require model reproducibility, data quality, and multiple versions of data to perform at their best. Dolt allows for these capabilities directly in your database, in a Git-style version control model most developers understand.

Dolt is used for model reproducibility. If you build a model from a version of the data, make a tag at that commit and refer to that tag in the model metadata. Some of our data and model quality control customers only use Dolt for this simple feature. Dolt shares storage between versions, so you can store many more copies of the data using Dolt than by, say, storing copies of the data in S3.
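The tagging workflow described above can be sketched as follows. This is a minimal illustration, not an official client: it only builds SQL strings that you would send over any MySQL-compatible connection. `DOLT_TAG` is Dolt's tagging stored procedure; the table name `training_rows`, the tag name `churn-model-v1`, and the metadata layout are hypothetical, and the sketch assumes `AS OF` accepts a tag name as the revision.

```python
import json


def tag_training_data(tag_name: str, commit_ref: str = "HEAD") -> str:
    """Return the SQL that tags the commit a model was trained on."""
    # Names are interpolated directly, so pass trusted constants only.
    return f"CALL DOLT_TAG('{tag_name}', '{commit_ref}');"


def read_training_data(table: str, tag_name: str) -> str:
    """Return a query that reads a table as it existed at the tag."""
    # Assumption: AS OF accepts a tag name as the revision.
    return f"SELECT * FROM {table} AS OF '{tag_name}';"


def model_metadata(model_name: str, tag_name: str) -> str:
    """Serialize metadata that ties a model to its data tag (hypothetical layout)."""
    return json.dumps(
        {"model": model_name, "training_data_tag": tag_name}, sort_keys=True
    )


if __name__ == "__main__":
    print(tag_training_data("churn-model-v1"))
    print(read_training_data("training_rows", "churn-model-v1"))
    print(model_metadata("churn", "churn-model-v1"))
```

Storing the tag name next to the model lets anyone rebuild the model later by reading the data at that tag instead of keeping a separate copy of the training set.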

Dolt allows for human or automated review of data changes, increasing data quality. If a bad change makes it through review, simply roll the data back to a previous version. DoltHub, DoltLab, and the Hosted Dolt Workbench all implement a Pull Request workflow, the standard for human review of code changes. Extend that model to your data changes.

Dolt is the only database with branch and merge functionality. Branches allow for long-running data projects. Want to add an additional feature to a model but don't want the new feature affecting the production model build? Make a branch and run the project on that branch. Occasionally merge production data into that branch so you stay up to date with changes there. Companies use Dolt branches to increase the number of parallel data projects by an order of magnitude.
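As a rough sketch of the branch-and-merge workflow above, the helpers below build the SQL statements for each step. `DOLT_CHECKOUT`, `DOLT_COMMIT`, and `DOLT_MERGE` are Dolt stored procedures; the branch names and commit message are hypothetical, and the production branch is assumed to be called `main`.

```python
from typing import List


def start_feature_branch(feature_branch: str) -> List[str]:
    """Statements that create a new branch and switch the session to it."""
    return [f"CALL DOLT_CHECKOUT('-b', '{feature_branch}');"]


def commit_work(message: str) -> List[str]:
    """Statements that commit all changed tables on the current branch."""
    return [f"CALL DOLT_COMMIT('-a', '-m', '{message}');"]


def sync_from_production(
    feature_branch: str, production_branch: str = "main"
) -> List[str]:
    """Statements that merge production data into the feature branch."""
    return [
        f"CALL DOLT_CHECKOUT('{feature_branch}');",
        f"CALL DOLT_MERGE('{production_branch}');",
    ]


if __name__ == "__main__":
    steps = (
        start_feature_branch("new-feature")
        + commit_work("Add candidate feature column")
        + sync_from_production("new-feature")
    )
    for statement in steps:
        print(statement)
```

Running the merge step periodically keeps the long-running branch close to production, so the final merge back is smaller and easier to review.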

Dolt replaces...

Unstructured files in cloud storage

It is common practice to store copies of training data or database backups in cloud storage for model reproducibility. A full copy of the data is stored for every training run. This can become quite expensive and limit the number of models you can reproduce. Dolt stores only the differences between versions, decreasing the cost of data storage. Additionally, Dolt can produce diffs between versions of training data, which can yield novel model insights.
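To make the diff idea concrete, here is a small sketch that builds a query summarizing what changed in a table between two training runs. It relies on Dolt's `DOLT_DIFF` table function and its `diff_type` column; the table name and the two tag names are hypothetical.

```python
def diff_summary_query(table: str, from_rev: str, to_rev: str) -> str:
    """Return a query counting added, modified, and removed rows between revisions."""
    # Revisions can be tags, branches, or commit hashes; pass trusted values only.
    return (
        "SELECT diff_type, COUNT(*) AS row_count "
        f"FROM DOLT_DIFF('{from_rev}', '{to_rev}', '{table}') "
        "GROUP BY diff_type;"
    )


if __name__ == "__main__":
    print(diff_summary_query("training_rows", "run-tuesday", "run-thursday"))
```

A summary like this is a cheap first check when two runs behave differently; dropping the `GROUP BY` and selecting all columns shows the individual changed rows.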

MySQL, Postgres, or other databases

Dolt can replace any database used to store and query data. Many of our customers switch from other OLTP databases like MySQL or Postgres to improve data and model quality through versioning. Customers have also switched to Dolt from document databases like MongoDB. Dolt's additional unique features like branches, diffs, and merges allow for human review of data changes and multiple parallel data projects.

Lastly, commits, logs, and diffs can be used for model insights. Did Thursday's model perform better than Tuesday's even though it had the same model weights? Inspect the data diff to see what changed. Inspect the commit log to see where that new data came from.

Companies Doing This

  • Turbine
  • KAPSARC
  • Flock Safety
  • Tome
  • Bosch
  • IMTF

Case Studies

  • Turbine

Other Related Articles

  • Better Data with Great Expectations + Dolt
  • Upleveling Flyte’s Data Lineage Using Dolt
  • Data Version Control and Dolt Reproducibility
  • Using Dolt to Manage Train/Test Splits
  • So you want Data Quality Control