Even if your code is versioned (e.g., via Git), you cannot reproduce an analysis unless you run that code on the same data. You therefore need a plan and tooling to retrieve the state of both your analysis and your data at given points in history. Data science without data versioning is a little bit like software engineering before Git.
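To make "retrieving the state of your data at a point in history" concrete, here is a minimal sketch of the idea using only the Go standard library. It is not Pachyderm and does not use the Pachyderm client. The names `Repo`, `Commit`, `NewRepo`, `CommitFile`, and `GetFile` are hypothetical and exist only for this illustration, and the file contents are placeholder values.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Commit is an immutable snapshot: file paths mapped to content hashes.
// (Hypothetical type for illustration only.)
type Commit struct {
	ID    string
	Files map[string]string
}

// Repo is a toy, in-memory versioned data repository.
type Repo struct {
	blobs   map[string][]byte // content hash -> file contents
	commits []Commit          // history, oldest first
}

// NewRepo returns an empty repository.
func NewRepo() *Repo {
	return &Repo{blobs: make(map[string][]byte)}
}

// CommitFile stores data under path in a new commit and returns the commit ID.
// Files from the previous commit are carried forward unchanged.
func (r *Repo) CommitFile(path string, data []byte) string {
	sum := sha256.Sum256(data)
	hash := hex.EncodeToString(sum[:])
	r.blobs[hash] = append([]byte(nil), data...) // keep a private copy

	files := make(map[string]string)
	if n := len(r.commits); n > 0 {
		for p, h := range r.commits[n-1].Files {
			files[p] = h
		}
	}
	files[path] = hash

	id := fmt.Sprintf("c%d", len(r.commits)+1)
	r.commits = append(r.commits, Commit{ID: id, Files: files})
	return id
}

// GetFile returns the contents of path as they were at commit id.
func (r *Repo) GetFile(id, path string) ([]byte, bool) {
	for _, c := range r.commits {
		if c.ID != id {
			continue
		}
		hash, ok := c.Files[path]
		if !ok {
			return nil, false
		}
		return r.blobs[hash], true
	}
	return nil, false
}

func main() {
	repo := NewRepo()

	// Placeholder contents, not real course data.
	first := repo.CommitFile("data.csv", []byte("a,b\n1,2\n"))
	second := repo.CommitFile("data.csv", []byte("a,b\n1,2\n3,4\n"))

	// Both versions remain retrievable by commit ID.
	old, _ := repo.GetFile(first, "data.csv")
	cur, _ := repo.GetFile(second, "data.csv")
	fmt.Printf("%s: %d bytes, %s: %d bytes\n", first, len(old), second, len(cur))
}
```

The point of the sketch is that a commit never overwrites earlier data: an analysis pinned to the first commit ID keeps reading the same bytes even after newer data is committed. A real system adds persistence, branching, and provenance on top of this idea.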
Pachyderm - the system we will use for data versioning
github.com/pachyderm/pachyderm/src/client docs
Connecting to a running instance of Pachyderm
Creating a data repository
Committing data into a repository
Retrieving data from a repository
Create another data repository called "diabetes". We will use it to version additional data used throughout the course.
Commit the data in diabetes.csv to the newly created "diabetes" data repository.
All material is licensed under the Apache License Version 2.0, January 2004.