- We recommend using Data Version Control (DVC) to version your data.
- DVC is a version control system for data science and machine learning projects.
- It is designed to handle large files, data sets, machine learning models, and code.
- DVC works with Git to version control data files and models.
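If DVC is not yet set up in your repository, a minimal setup sketch (assuming pip and an existing Git repository):

```bash
pip install dvc                   # install DVC
dvc init                          # creates and stages the .dvc/ directory and config
git commit -m "Initialize DVC"    # commit the files dvc init staged
```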
- If you want to try out the examples using our data, run `dvc pull` in the root directory of the repository.
- Otherwise, see the next section on how to manage your own training data with DVC.
- Make sure you request access to the data by filling out the form at https://forms.gle/XEcjEhMDqWU7q9pb6.
- Once you have access, create a service account in the Google Cloud Console.
- Create a JSON key for your service account and download it to your local machine. Note: Store this key securely.
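If you prefer the command line over the Console UI, the JSON key can also be created with the `gcloud` CLI; a sketch with placeholder names:

```bash
# Create and download a JSON key for an existing service account (names are placeholders)
gcloud iam service-accounts keys create dvc-key.json \
  --iam-account=my-dvc-sa@my-project.iam.gserviceaccount.com
```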
- Tell the `dvc` CLI where to find your downloaded JSON key file as follows:

```bash
dvc remote modify --local gdrive gdrive_service_account_json_file_path <path-to-json-key-file>
```
- Alternatively, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to the path of the JSON key file (see the sketch after the next code block).
- You are now ready to download the data by running:
```bash
dvc pull
# Note: This will not work if you don't have access to the data
# Request access to the data here: https://forms.gle/XEcjEhMDqWU7q9pb6
```
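If you went the environment-variable route instead of `dvc remote modify`, a minimal sketch (the key file path is a placeholder):

```bash
# Point DVC at the downloaded service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json
dvc pull
```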
- You can track your training data with DVC by running `dvc add <path-to-file-or-folder>`.
- For example, to track raw data in a file `data.zip`, you can run:

```bash
dvc add data.zip
# Make sure the file is not ignored by Git before running the add command
```
- If all went well, your original file will be moved to `.dvc/cache` and a small metafile `data.zip.dvc` will be created in the original location.
- You can now safely commit the `.dvc` file to Git. Each time the data file changes, run `dvc add data.zip` again to update the cache and the `.dvc` file (a sketch of the Git commands follows below).
- You can track either individual files or entire directories with DVC.
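A minimal sketch of the Git side for the `data.zip` example above (the commit message is illustrative):

```bash
# dvc add writes data.zip.dvc and adds data.zip to .gitignore
git add data.zip.dvc .gitignore
git commit -m "Track data.zip with DVC"
```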
- It is usually a good idea to save your data to remote storage to avoid losing it in case of a system crash.
- In addition, if you are working in a team, it is easier to share data with your team members.
- DVC supports various remote storage options like AWS S3, Google Cloud Storage, Azure Blob Storage, Google Drive, etc.
- For example, to save your data in Google Drive, you can run:

```bash
dvc remote add -d googledrive gdrive://<google-drive-folder-id>
# You can get the folder id from the URL of the folder in Google Drive
# e.g. https://drive.google.com/drive/folders/1md_0000MvM4AXcB6uzFBHRcZ8oA60000
# The folder id is 1md_0000MvM4AXcB6uzFBHRcZ8oA60000
```
- You can now push your data to Google Drive by running:
```bash
dvc push
```

Note: To use Google Drive as remote storage, you need to authenticate DVC with Google Drive. The recommended way is to create a service account in the Google Cloud Console. Once the service account is created, download its JSON key file and set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of the JSON key file.
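Depending on your DVC version, you may also need to tell the Google Drive remote explicitly to authenticate with the service account rather than the interactive OAuth flow; a sketch for the `googledrive` remote configured above (the key file path is a placeholder):

```bash
# Use the service account instead of per-user OAuth
dvc remote modify googledrive gdrive_use_service_account true
# Keep the key file path in the local (Git-ignored) config with --local
dvc remote modify --local googledrive gdrive_service_account_json_file_path /path/to/your-service-account-key.json
dvc push
```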
- Although it may at first seem like a good idea to store data in a shared folder on Google Drive, you will soon find that it is not a scalable solution.
- You can potentially get away with it for small projects where the data can be downloaded quickly.
- However, once you start working with gigabytes or terabytes of data, downloading the data every time you need to train becomes a huge bottleneck.
- For such cases, tools like DVC come to the rescue.
- Although DVC supports Google Drive as a remote storage, it is not recommended for large datasets. Instead, you can use cloud storage services like AWS S3, Google Cloud Storage, Azure Blob Storage, etc.
- Assume the team has decided to use Azure Blob Storage to store the data to develop a custom BERT model for classifying court cases.
- The team has created a storage account on Azure and has the following details:
  - Storage Account Name: `mydatastore`
  - Container Name: `court-cases`
  - Access Key: `my-access-key`
- The team can now add Azure Blob Storage as a DVC remote and point it at their storage account by running:
```bash
dvc remote add -d myazure azure://court-cases
dvc remote modify myazure account_name mydatastore
# This assumes the access key is available in the environment variable AZURE_STORAGE_KEY
# Refer to the DVC documentation for other ways to authenticate with Azure Blob Storage
```
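As an alternative to the environment variable, the access key can be stored in DVC's local, Git-ignored config; a sketch using the key from the example above:

```bash
# --local writes to .dvc/config.local, which is kept out of Git
dvc remote modify --local myazure account_key my-access-key
```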
- Now the team can push all the tracked files and folders to Azure Blob Storage by running:
```bash
dvc push -r myazure
# Note: you can have multiple remotes and specify which one to use with -r
```
- Let's say a new team member joins the team and wants to work on the project. They can pull the data from Azure Blob Storage by running:

```bash
export AZURE_STORAGE_KEY=my-access-key
dvc pull -r myazure
# This pulls all the data from the remote storage to the local machine
# However, in practice, you may want to pull only the data you need for your task to save time and space
# You can do this by specifying the path to the data you want to pull. For example:
# dvc pull -r myazure data/raw/court-cases-proceedings.csv
```
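Putting the onboarding steps together, the new team member's workflow might look like the following sketch (the repository URL is a placeholder):

```bash
git clone https://example.com/your-org/court-case-classifier.git   # placeholder URL
cd court-case-classifier
pip install "dvc[azure]"                  # DVC with Azure Blob Storage support
export AZURE_STORAGE_KEY=my-access-key    # or store it with `dvc remote modify --local`
dvc pull -r myazure                       # fetch the tracked data from the remote
```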