Example repository for the Data Version Control With Python and DVC article on Real Python.
To use this repo as part of the tutorial, you first need to get your own copy. Click the Fork button in the top-right corner of the screen, and select your private account in the window that pops up. GitHub will create a forked copy of the repository under your account.
Clone the forked repository to your computer with the git clone command:
git clone [email protected]:YourUsername/data-version-control.git
Make sure to replace YourUsername in the above command with your actual GitHub username.
Happy coding!
This is a fork of the tutorial's example repo. I'm adding some new features, such as a script that classifies images using the generated model. For more details, read the original tutorial.
- Tutorial: https://realpython.com/python-data-version-control/
- Original repo: https://github.com/realpython/data-version-control
git clone [email protected]:josecelano/data-version-control.git
cd data-version-control
conda create --name dvc python=3.8.2 -y
conda config --add channels conda-forge
conda install dvc scikit-learn scikit-image pandas numpy
Alternatively, you can create the conda environment with:
conda env create --file environment.yml
conda activate dvc
Generate the CSV files:
python3 src/prepare.py
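The exact contents of prepare.py are covered in the original tutorial; the core idea is to walk the raw image folders and record each file's path together with its label (the folder name) in a CSV that the later stages read. A minimal, self-contained sketch of that idea, using made-up label folders in a temporary directory (the real script's paths and columns may differ):

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical layout mirroring the tutorial's data/raw/train/<label>/ folders.
root = Path(tempfile.mkdtemp()) / "data" / "raw" / "train"
for label in ("n03445777", "n03888257"):
    folder = root / label
    folder.mkdir(parents=True)
    (folder / f"{label}_0001.JPEG").touch()

# The prepare step's core idea: pair every image path with its label
# (the folder name) and write the pairs to a CSV file.
rows = [(str(path), path.parent.name) for path in sorted(root.rglob("*.JPEG"))]
csv_path = root.parent / "train.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    writer.writerows(rows)

print(csv_path.read_text())
```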
Train the model:
python3 src/train.py
Evaluate the model with the test set:
python3 src/evaluate.py
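train.py and evaluate.py are explained in the original tutorial; roughly, the train step fits a scikit-learn classifier on the training features and serializes it with joblib, and the evaluate step reloads the model, scores it on the test set, and records the accuracy in a metrics file. A hedged sketch of that flow on synthetic features (the real scripts load flattened image data, and their exact model settings may differ):

```python
import json
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened image features used in the tutorial.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(int)  # easily separable labels for the demo
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The train step: fit a model and serialize it for the later stages.
model = SGDClassifier(max_iter=1000, random_state=0)
model.fit(X_train, y_train)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(model, model_path)

# The evaluate step: reload the model, score it on held-out data, and
# record the metric (the tutorial writes accuracy to a JSON metrics file).
accuracy = load(model_path).score(X_test, y_test)
print(json.dumps({"accuracy": accuracy}))
```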
Use the model to classify an image:
python3 src/predict.py
Sample output for the predict.py script:
(dvc) josecelano@josecelano:~/Documents/github/josecelano/data-version-control$ python src/predict.py -i /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG
Predicting for image: " /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG "
['parachute']
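A predict step like this typically loads the serialized model, turns the image passed with -i into the same feature vector used for training, and prints the predicted label. A simplified, self-contained sketch of that last part with a toy model and string class labels (the real script's image loading and feature extraction are omitted):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import SGDClassifier

# Train and save a toy model with string class labels, standing in
# for the model file produced by the train step.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.where(X[:, 0] > 0, "parachute", "golf ball")
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(SGDClassifier(max_iter=1000, random_state=0).fit(X, y), model_path)

# The predict step: load the model and classify a feature vector.
# In predict.py this vector would come from the image passed with -i.
features = rng.normal(size=(1, 20))
prediction = load(model_path).predict(features)
print(prediction)
```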
We are using act to run GitHub Actions locally.
act
Usage:
act -h
Run workflow locally:
act -j build --secret-file .env
With the -j flag you can run a single job.
Don't forget to add your Azure Blob Storage credentials to pull images from remote DVC storage. Otherwise you will get this error:
| ERROR: failed to pull data from the cloud - Authentication to Azure Blob Storage requires either account_name or connection_string.
| Learn more about configuration settings at <https://man.dvc.org/remote/modify>
[Build the model/build] ❌ Failure - Pull dataset from remote
You need to add the secrets in the .env.ci file:
AZURE_STORAGE_ACCOUNT='YOUR_STORAGE_ACCOUNT_NAME'
AZURE_STORAGE_KEY='YOUR_STORAGE_KEY'
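If you run DVC directly rather than through act, you can also store the same credentials with DVC's remote modify command (here `myremote` is a placeholder; replace it with the remote name in .dvc/config):

```shell
# --local keeps the secrets in .dvc/config.local, which is not committed.
dvc remote modify --local myremote account_name 'YOUR_STORAGE_ACCOUNT_NAME'
dvc remote modify --local myremote account_key 'YOUR_STORAGE_KEY'
```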
- Add remote storage using Azure Blob Storage
- Basic workflow: pull the dataset, train the model, evaluate the model, and make some predictions
- Write an article about the basic workflow
- Cache the DVC cache? We currently have to pull the whole dataset on every pipeline run
- Consider Docker instead of the conda setup? It might be faster
- Update the README to use the conda environment.yml for installation