This project is a work in progress.
Code is written in Python 3.7. Parts of this project also require the following:
- Docker
We highly recommend using a virtual environment. To install dependencies, from within your virtual environment run:
pip install -Ur requirements.txt
The data were scraped from the Christie's auction house website. This project
uses Google Cloud Platform infrastructure when applicable, so full replication
of this research will require familiarity with gcloud (the command line tools
for Google Cloud Platform).
- Create a python3 virtual environment and install the dependencies listed in requirements.txt.
- Create a GCP service account with storage read/write permissions via the cloud console. Download the key.
- Build a docker image based on the
, tag it, and upload it to Cloud Repository
# Set up google registry
gcloud auth login
gcloud auth configure-docker
GOOGLE_KEY=$(base64 -i [SERVICE-KEY-FILE].json)
docker build -t --build-arg GOOGLE_KEY=${GOOGLE_KEY} .
docker push
- Create a virtual machine
# List VMs
gcloud compute instances list
# Deploy VM
gcloud compute instances create-with-container paap-1 \ \
--container-restart-policy "never" \
--machine-type=n1-standard-1 \
# SSH into VM
gcloud compute --project art-auction-prices ssh paap-1
# stop VM
gcloud compute --project art-auction-prices instances stop paap-1
# delete VM
gcloud compute instances delete paap-1
- Verify that the crawl container is running
gcloud compute ssh paap-1 --command "docker container ps -a"
Follow the same process as above, but override the default container entrypoint with:
gcloud compute instances create-with-container paap-1 \ \
--container-restart-policy "never" \
--machine-type=n1-standard-1 \
--scopes=storage-rw,logging-write \
--container-command="scrapy" \
--container-arg="crawl" --container-arg="christiesImages"
Alternatively you can run it locally with:
docker run --entrypoint scrapy crawl christiesImages
To get a list of all the images in a given bucket and write that list to a file, run:
gsutil ls gs://paap/christies/data/img/full/ >> ./data/img_in_gcs.txt
To simplify local development, several scripts handle image processing. Run in order, they form a pipeline for image acquisition and resizing.
Given the raw JSON files of data scraped from Christie's website, process them into a CSV where each row represents a lot.
# python art/ -i data/raw/*.json -o data/process_christies_output.csv
Clean json data, scraped from Christies into a format that can be used for predictive analytics
optional arguments:
-h, --help show this help message and exit
Input newline delimited json files to process.
-o OUTPUT_PATH, --output-path OUTPUT_PATH
CSV to save to
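The conversion from newline-delimited JSON to a one-row-per-lot CSV can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `ndjson_to_csv` and the field names in the usage note are assumptions, and the real script performs additional cleaning.

```python
import csv
import json


def ndjson_to_csv(lines, out_file, fields):
    """Write one CSV row per JSON object found in `lines`.

    `lines` is any iterable of strings (for example an open file),
    `out_file` is a writable text file, and `fields` is the list of
    columns to keep. The column names are hypothetical; the real schema
    is defined by the scraper.
    """
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        # Keys missing from a record are written as empty cells.
        writer.writerow(json.loads(line))
        written += 1
    return written
```

For example, calling `ndjson_to_csv(open("lots.json"), open("lots.csv", "w", newline=""), ["lot_id", "title"])` would keep only those two (hypothetical) columns.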
Given a CSV of raw piece data scraped from Christie's website, return a two-column CSV where each row is an artwork with its lot_id and the corresponding image_url:
# python scripts/ -h
usage: [-h] input output
Create a two column CSV (<lot_id>,<image_url>) of image metadata from an input CSV of raw art data.
positional arguments:
input Input csv with raw art data
output Output path
optional arguments:
-h, --help show this help message and exit
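As a rough sketch of this step, the following reads a CSV with a header row and writes headerless `<lot_id>,<image_url>` rows. The input column names `lot_id` and `image_url` are assumptions about the raw data, and both function names are hypothetical.

```python
import csv


def extract_image_urls(rows, id_field="lot_id", url_field="image_url"):
    """Yield (lot_id, image_url) pairs, skipping rows that lack either value.

    The default field names are assumptions about the raw CSV header.
    """
    for row in rows:
        lot_id = (row.get(id_field) or "").strip()
        url = (row.get(url_field) or "").strip()
        if lot_id and url:
            yield lot_id, url


def write_image_url_csv(in_file, out_file):
    """Copy the two columns from `in_file` to `out_file`; return the row count."""
    reader = csv.DictReader(in_file)
    writer = csv.writer(out_file)
    count = 0
    for lot_id, url in extract_image_urls(reader):
        writer.writerow([lot_id, url])
        count += 1
    return count
```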
From a two column csv of lot id and image url, download images to a local directory.
# python scripts/ -h
usage: [-h] input output_dir
Given an input csv with image urls, download images and save them to a location
positional arguments:
input Input csv. Two columns (<lot_id>,<image_url>), no header.
output_dir Directory to save images to.
optional arguments:
-h, --help show this help message and exit
In practice, after downloading the raw images they were uploaded to cloud storage as a backup.
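The download step might look something like the sketch below. It assumes each image is saved as `<lot_id>.jpg`, which may not match the naming used by the actual script, and it takes an optional `fetch` callable (a hypothetical parameter) so the network call can be swapped out in tests.

```python
import csv
import os
import urllib.request


def download_images(csv_file, output_dir, fetch=None):
    """Download every (lot_id, image_url) row into `output_dir`.

    Returns the list of paths written. Files that already exist are
    skipped, so an interrupted run can be resumed.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()

    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for row in csv.reader(csv_file):
        if len(row) != 2:
            continue  # ignore malformed rows
        lot_id, url = row
        # Assumption: one file per lot, named after the lot id.
        path = os.path.join(output_dir, f"{lot_id}.jpg")
        if os.path.exists(path):
            continue
        try:
            data = fetch(url)
        except OSError:
            continue  # urllib's URLError is a subclass of OSError
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved
```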
Given the raw CSV data and a CSV mapping image URLs to UUIDs, clean the data and join the two to attach a UUID to each record:
# python art/scripts/ -h
usage: [-h] input_csv image_urls output_csv
Clean raw tabular data
positional arguments:
input_csv Path to input csv with raw scrapped data, header on first row
image_urls Path to a csv with image urls
output_csv Path to save output csv containing only cleaned data
optional arguments:
-h, --help show this help message and exit
Given a CSV with data, the following runs an interactive script which opens each sale in a browser and asks whether the sale contains two-dimensional works of art. This is a coarse filter used to exclude sales composed mostly of furniture, sculpture, ceramics, and books, which would otherwise degrade the model's performance.
# python art/scripts/ --help
usage: [-h] input_json output_json
Iterate over sales in a dataframe, and determine if they are exclusively 2d artwork
positional arguments:
input_json Input JSON file with sale_number/sale_url column
output_json Output JSON file to write to
optional arguments:
-h, --help show this help message and exit
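The interactive loop can be approximated as below. This is only a sketch under assumptions: sales are passed as `(sale_number, sale_url)` pairs, and the prompt wording, the `review_sales` name, and the injectable `ask` and `open_url` parameters are all hypothetical.

```python
import webbrowser


def review_sales(sales, ask=input, open_url=webbrowser.open):
    """Open each sale page and record a yes/no answer per sale.

    `sales` is an iterable of (sale_number, sale_url) pairs. Returns a
    dict mapping sale_number to True when the reviewer answered yes.
    """
    results = {}
    for sale_number, sale_url in sales:
        open_url(sale_url)
        answer = ask(f"Is sale {sale_number} 2D artwork? [y/n] ")
        results[sale_number] = answer.strip().lower().startswith("y")
    return results
```

Passing `ask` and `open_url` as parameters keeps the loop testable without a browser or a terminal.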
We filtered images further based on keywords that indicated they might not be two dimensional:
# python art/scripts/ -h
usage: [-h] input_json is_2d_json output_json
Filter out artwork that is unsuitable for analysis according to a set of rules
positional arguments:
input_json Path to input json, the output from
is_2d_json Path to is_2d json, the output from
output_json Path to save output json containing only filtered data
optional arguments:
-h, --help show this help message and exit
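A keyword rule of this kind could be sketched as follows. The keyword list here is purely illustrative and is not the list used in this project; the function name is likewise hypothetical.

```python
import re

# Hypothetical keywords; the project's actual rule set is not shown here.
NON_2D_KEYWORDS = ("sculpture", "bronze", "vase", "chair")


def looks_two_dimensional(description, keywords=NON_2D_KEYWORDS):
    """Return False if any keyword appears as a whole word in the text."""
    text = description.lower()
    return not any(
        re.search(rf"\b{re.escape(keyword)}\b", text) for keyword in keywords
    )
```

Matching on word boundaries avoids rejecting a lot because a keyword happens to appear inside a longer word.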
Resize images to a common minimum dimension (i.e. the smaller of the two image dimensions will have this pixel size):
# python scripts/ -h
usage: [-h] [--image-size IMAGE_SIZE] [--delete] images output_dir
Resize images to a common minimum dimension, retaining aspect ratio
positional arguments:
images Images to process a newline separated file of image paths. A command like the following should get you started: `find data/img/christies/raw/ -type f -name '*.jpg' > data/img/christies/raw_images.txt`
output_dir Directory to save cropped images to
optional arguments:
-h, --help show this help message and exit
--image-size IMAGE_SIZE
Images will be scaled to be this large in their minimum dimension, in pixels
--delete Delete the input photo after processing
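The size calculation behind this kind of resize is simple enough to show on its own. The helper below is a sketch (the name `scaled_size` is an assumption, not a function from this repository) that computes the target dimensions while keeping the aspect ratio.

```python
def scaled_size(width, height, min_dim):
    """Return (new_width, new_height) with the smaller side equal to min_dim."""
    if width <= 0 or height <= 0 or min_dim <= 0:
        raise ValueError("width, height and min_dim must be positive")
    scale = min_dim / min(width, height)
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With an imaging library such as Pillow, the resulting tuple could be passed to `Image.resize`.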
We provide a script for randomly sampling the data and determining the proportion of non-2d artwork.
# python -m art.scripts.sample_is_2d -h
usage: [-h] [--input INPUT] [--output OUTPUT]
Randomly sample images from a dataset and determine if they are 2d or not
optional arguments:
-h, --help show this help message and exit
--input INPUT Path to input dataframe
--output OUTPUT Where to write the output
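The sampling and the resulting estimate can be sketched as below, assuming the manual labels end up as booleans (True meaning 2D). Both function names are hypothetical.

```python
import random


def sample_items(items, k, seed=None):
    """Draw up to k items without replacement; `seed` makes the draw repeatable."""
    pool = list(items)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def proportion_non_2d(labels):
    """Return the fraction of labels that are False (i.e. not 2D)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to summarise")
    return sum(1 for is_2d in labels if not is_2d) / len(labels)
```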
To standardize prices, we calculate exchange rates and inflation relative to a given target date:
# python art/scripts/ -h
usage: [-h] [-o OUTPUT] [-d TARGET_DATE]
Get the dollar equivalent of currencies at a given date, and the inflation relative to today. Output is a csv with 5 columns: year, month, currency, dollar_equivalent, inflation. currency is the currency which we wish to convert to
dollars. dollar_equivalent is the price of that currency, in USD at that year and month. inflation is the multiple by which to multiply a dollar in that year/month to get it's equivalent worth at the target date.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
-d TARGET_DATE, --target-date TARGET_DATE
Then we join those back into the cleaned dataset:
# python art/scripts/ -h
usage: [-h] [--filtered_artwork FILTERED_ARTWORK] [--exchange_rates EXCHANGE_RATES] [--output OUTPUT]
Calculate prices in Jan 1, 2020 USD
optional arguments:
-h, --help show this help message and exit
--filtered_artwork FILTERED_ARTWORK
Path to filtered artwork json, the output from filter_artwork
--exchange_rates EXCHANGE_RATES
Path to the output of exchange rates
--output OUTPUT Output path to write the dataset to
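Based on the column descriptions in the exchange-rate output, the adjustment amounts to multiplying the sale price by `dollar_equivalent` and then by `inflation` for the matching year, month, and currency. The sketch below assumes the rates have been loaded into a dict keyed by `(year, month, currency)`; that layout and the function name are assumptions made for illustration.

```python
def price_in_target_usd(price, currency, year, month, rates):
    """Convert a sale price to USD at the target date.

    `rates` maps (year, month, currency) to a dict with the keys
    'dollar_equivalent' and 'inflation', mirroring the CSV columns.
    """
    rate = rates[(year, month, currency)]
    usd_at_sale = price * rate["dollar_equivalent"]  # local currency -> USD
    return usd_at_sale * rate["inflation"]  # USD at sale -> USD at target date
```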
Center crop images to a minimum dimension:
# python -m art.scripts.crop_images -h
usage: [-h] images output_dir
Crop images so they are square
positional arguments:
images Images to process a newline separated file of image paths. A
command like the following should get you started: `find
data/img/christies/raw/ -type f -name '*.jpg' >
output_dir Directory to save cropped images to
optional arguments:
-h, --help show this help message and exit
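A center crop to a square reduces to computing a crop box. The helper below is a sketch (the name `center_crop_box` is hypothetical) that returns the box in `(left, upper, right, lower)` order, which is the convention Pillow's `Image.crop` uses.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return left, upper, left + side, upper + side
```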
From the nn directory:
# Create a cluster
gcloud container clusters create paap-training-cluster \
--num-nodes=1 \
--zone=us-east1-c \
--accelerator="type=nvidia-tesla-t4,count=1" \
--machine-type="n1-highmem-4" \
--scopes="gke-default,storage-rw" \
# installing GPU nodes
# device drivers
# use `gcloud container get-server-config` to get the default image type
kubectl apply -f
# Resize the cluster
gcloud container clusters resize paap-training-cluster --num-nodes=0
Deploy a job to the cluster
Deploy an image to the cluster
kubectl apply -f ./job.yaml
watch --interval 10 "kubectl get jobs"
watch --interval 10 "kubectl get pods"
watch --interval 10 "kubectl describe pod <pod-id> | grep -A20 Events"
kubectl describe pod dcec-pod
kubectl logs dcec-paint-xpj9d --follow