Perform PCA (or another reduction method) using a specified number of components:
python -m app.reduction.pipeline
N_COMPONENTS=2 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.pipeline
N_COMPONENTS=3 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.pipeline
# other methods:
REDUCER_TYPE="T-SNE" N_COMPONENTS=2 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.pipeline
REDUCER_TYPE="UMAP" N_COMPONENTS=2 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.pipeline
This will save the results (plots and embeddings) to the "results/reduction" directory.
NOTE: T-SNE becomes very slow when N_COMPONENTS >= 4 (see the T-SNE docs)
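For reference, the reducer selection inside the pipeline might look roughly like the sketch below. This is a hypothetical illustration (the actual app.reduction.pipeline code, data loading, and output filenames may differ); it uses scikit-learn's PCA and TSNE plus the umap-learn package.

import os
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

REDUCER_TYPE = os.getenv("REDUCER_TYPE", "PCA")
N_COMPONENTS = int(os.getenv("N_COMPONENTS", "2"))

def get_reducer(reducer_type, n_components):
    if reducer_type == "PCA":
        return PCA(n_components=n_components)
    if reducer_type == "T-SNE":
        return TSNE(n_components=n_components)
    if reducer_type == "UMAP":
        from umap import UMAP  # requires the umap-learn package
        return UMAP(n_components=n_components)
    raise ValueError(f"unsupported reducer type: {reducer_type}")

X = np.random.rand(100, 1536)  # stand-in for the real tweet embeddings matrix
reducer = get_reducer(REDUCER_TYPE, N_COMPONENTS)
embeddings = reducer.fit_transform(X)

os.makedirs("results/reduction", exist_ok=True)
pd.DataFrame(embeddings).to_csv(
    f"results/reduction/{REDUCER_TYPE.lower()}_{N_COMPONENTS}.csv", index=False
)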
Use PCA to calculate the explained variance for each number of components, up to a specified maximum (to help identify the ideal number of components to use):
python -m app.reduction.pca_tuner
MAX_COMPONENTS=250 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.pca_tuner
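Under the hood, this calculation only needs a single PCA fit at MAX_COMPONENTS, since scikit-learn exposes the per-component explained variance ratios. A minimal sketch (hypothetical; the actual app.reduction.pca_tuner code and plotting details may differ):

import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

MAX_COMPONENTS = int(os.getenv("MAX_COMPONENTS", "250"))

X = np.random.rand(500, 1536)  # stand-in for the real tweet embeddings matrix
pca = PCA(n_components=MAX_COMPONENTS)
pca.fit(X)

# cumulative explained variance for 1..MAX_COMPONENTS components
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()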
Use the T-SNE KL divergence metric to find the optimal number of components:
MAX_COMPONENTS=10 FIG_SHOW=true FIG_SAVE=true python -m app.reduction.tsne_tuner
NOTE: T-SNE becomes very slow when N_COMPONENTS >= 4 (see the T-SNE docs)
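A sketch of the KL divergence search (hypothetical; the actual app.reduction.tsne_tuner code may differ). scikit-learn's TSNE exposes the final KL divergence via its kl_divergence_ attribute, and its default barnes_hut method only supports fewer than 4 components, which is why higher component counts are so slow:

import os
import numpy as np
from sklearn.manifold import TSNE

MAX_COMPONENTS = int(os.getenv("MAX_COMPONENTS", "10"))

X = np.random.rand(200, 50)  # stand-in for the real embeddings matrix
for n in range(1, MAX_COMPONENTS + 1):
    # barnes_hut only supports n_components < 4; the exact method is much slower
    method = "barnes_hut" if n < 4 else "exact"
    tsne = TSNE(n_components=n, method=method)
    tsne.fit_transform(X)
    print(n, tsne.kl_divergence_)  # lower KL divergence = better fit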
After performing dimensionality reduction with various combinations of methods and numbers of components, and saving the results to CSV files, combine all the results into a single file:
python -m app.reduced_dataset
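A sketch of how such a combination step might work (hypothetical; the actual app.reduced_dataset code, column naming, and output path may differ), assuming each result file holds one method/components run with rows in the same tweet order:

import glob
import os
import pandas as pd

frames = []
for path in sorted(glob.glob("results/reduction/*.csv")):
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. "pca_2"
    df = pd.read_csv(path)
    df.columns = [f"{name}_{col}" for col in df.columns]  # e.g. "pca_2_0"
    frames.append(df)

# assumes rows are aligned (same tweets, same order) across all result files
combined = pd.concat(frames, axis=1)
combined.to_csv("results/reduced_dataset.csv", index=False)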
Copy the resulting dataset from the "results" directory into the "data" directory as "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_reduced_20230825.csv.gz", and also upload it to Google Drive. When creating a new version of this CSV file, update the date in the filepath at the top of the Python script.
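For example, the dated path constant at the top of the script might look like this (the constant name here is hypothetical):

# update the date suffix when producing a new version of the file:
REDUCED_DATASET_PATH = "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_reduced_20230825.csv.gz"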