Specification Document Sentimental BB

A. WHAT THE PROGRAM DOES

A1. DATA

The first subpart of the program covers datasets in general.

A1a. Download/Scrape

Goal: Download or scrape data online and store it in data/raw.

1) Use Case Allocine:

For Allocine, the download is done directly in make_dataset_allocine (see A1b, Allocine).

2) Use Case Twitter:

Every day, X tweets mentioning each candidate must be downloaded from Twitter into data/raw/twitter/candidate_name with the command below. The tweets mentioning a candidate must be formatted as such: tbd, and named as such: tbd.

  • Called via CLI
  • Command: poetry run python -m src data --download twitter
  • Other Arguments:
    • tbd
  • Outputs: creates csvs containing nb tweets for every candidate in data/raw/twitter/candidate_name

The source code to perform this action must be written in src/data/download/download_twitter.py
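
A minimal sketch of what download_twitter.py could look like, assuming the Twitter API v2 recent-search endpoint with a bearer token read from a BEARER_TOKEN environment variable; the query, file name and column names are assumptions, not final choices:

```python
# Sketch only: download recent tweets mentioning one candidate into
# data/raw/twitter/<candidate>/ (query, file name and columns are assumptions).
import os
from datetime import date
from pathlib import Path
from typing import Optional

import pandas as pd
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def download_tweets(candidate: str, max_results: int = 100) -> Optional[Path]:
    headers = {"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"}
    params = {
        "query": f"{candidate} lang:fr",
        "max_results": max_results,
        "tweet.fields": "created_at",
    }
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    tweets = response.json().get("data", [])
    if not tweets:
        print(f"no tweets found for {candidate}")
        return None

    out_dir = Path("data/raw/twitter") / candidate.lower()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{candidate.lower()}_start_time-{date.today():%Y-%m-%d}.csv"
    pd.DataFrame(tweets)[["id", "text", "created_at"]].to_csv(out_path, index=False)
    return out_path
```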

A1b. Make Dataset

Goal: From data/raw or data/processed, make a processed csv dataset.

1) Use Case Allocine:

There are 2 types of datasets that need to be made for allocine:

1. Train
  • What it needs: Nothing
  • What it does:
    • If not already done, download the allocine train dataset via the Hugging Face Datasets library into a local cache (hopefully works at school too)
    • Reformat the dataset (see output below)
    • Export the reformatted dataset as a csv (see output below)
  • Called via CLI
  • Command: poetry run python -m src data --task make-dataset --data allocine --split train
  • Other Arguments: None
  • Outputs: creates a csv named allocine_trainset_[nbofreviews]reviews.csv in data/processed/allocine with 160,000 reviews formatted as such:
    • 'text': contains the text review
    • 'Positive': contains 1 if review is positive, 0 otherwise
    • 'Negative': contains 1 if review is negative, 0 otherwise

The source code to perform this action is written in a function in src/data/make_dataset/make_dataset_allocine.py
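
A minimal sketch of the train-split function, assuming the "allocine" dataset on the Hugging Face Hub with its 'review' and 'label' columns (label 1 = positive):

```python
# Sketch only: build data/processed/allocine/allocine_trainset_[nbofreviews]reviews.csv.
from pathlib import Path

import pandas as pd
from datasets import load_dataset


def make_allocine_trainset(out_dir: str = "data/processed/allocine") -> Path:
    dataset = load_dataset("allocine", split="train")  # cached locally after the first download
    df = pd.DataFrame({
        "text": dataset["review"],
        "Positive": [int(label == 1) for label in dataset["label"]],
        "Negative": [int(label == 0) for label in dataset["label"]],
    })
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(out_dir) / f"allocine_trainset_{len(df)}reviews.csv"
    df.to_csv(out_path, index=False)
    return out_path
```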

2. Test
  • What it needs: Nothing
  • What it does:
    • If not already done, download the allocine test dataset via the Hugging Face Datasets library into a local cache (hopefully works at school too)
    • Reformat the dataset (see output below)
    • Randomly select nb_reviews reviews (see other arguments below) from the dataset and export them in a csv (see output below)
  • Called via CLI
  • Command: poetry run python -m src data --task make-dataset --data allocine --split test
  • Other Arguments: not required
    • --nb_reviews:
      • The number of reviews we want in the test set
      • default=180
      • Must be in range [1-10000]
  • Outputs: creates a csv named allocine_testset_[nbofreviews]reviews.csv in src/tests formatted as such:
    • 'text': contains the text review
    • 'Positive': contains 1 if review is positive, 0 otherwise
    • 'Negative': contains 1 if review is negative, 0 otherwise

The source code to perform this action is written in a function in src/data/make_dataset/make_dataset_allocine.py
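
A companion sketch for the test split, under the same assumptions as the train sketch above, with the random selection of --nb_reviews reviews added:

```python
# Sketch only: build src/tests/allocine_testset_[nbofreviews]reviews.csv.
from pathlib import Path

import pandas as pd
from datasets import load_dataset


def make_allocine_testset(nb_reviews: int = 180, out_dir: str = "src/tests") -> Path:
    dataset = load_dataset("allocine", split="test")
    df = pd.DataFrame({
        "text": dataset["review"],
        "Positive": [int(label == 1) for label in dataset["label"]],
        "Negative": [int(label == 0) for label in dataset["label"]],
    })
    sample = df.sample(n=nb_reviews, random_state=42)  # nb_reviews must be in [1, 10000]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(out_dir) / f"allocine_testset_{nb_reviews}reviews.csv"
    sample.to_csv(out_path, index=False)
    return out_path
```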

2) Use Case Twitter:

There are 2 types of datasets that need to be made for twitter:

1. Predict
  • What it needs:
    • Tweets downloaded via part A1a.
    • folders in data/raw/twitter/ named as such: {"pecresse","zemmour","dupont-aignan","melenchon","lepen","lassalle","hidalgo","macron","jadot","roussel","arthaud","poutou"}
    • In each of these folders: csvs whose names contain 'start_time-yyyy-mm-dd'
    • Csvs formatted with at least the following columns:
      • 'text': text of tweet
      • 'created_at': timestamp of tweet with 'hh:mm:ss' in it
      • 'id': tweet id
  • What it does:
    • Given a date range and a candidate name (either one candidate or all candidates), takes the corresponding csvs in data/raw/twitter
    • Concatenates all the csvs for one candidate and a specific day
    • Reformats the columns of the csv
    • Deletes the retweets (RTs)
    • Creates a csv with all the tweets for a specific candidate and a specific day
    • Exports the csv
  • Called via CLI
  • Command: poetry run python -m src data --task make-dataset --data twitter --split predict
  • Other Arguments: All are required
    • --candidate: name of candidate ("all" if perform task on all of them): {"Pecresse","Zemmour","Dupont-Aignan","Melenchon","Le Pen","Lassalle","Hidalgo","Macron","Jadot","Roussel","Arthaud","Poutou","all"}
    • --start_time: first day for the date range. format: yyyy-mm-dd
    • --end_time: last day for the date range. format: yyyy-mm-dd
  • Outputs: creates csvs in data/processed/twitter/predict/[mmdd] named [candidatename]_[mmdd]_[nbtweets]tweets.csv (see 'what it needs' above for the candidate name format), formatted as such:
    • 'candidate': name of candidate (see list above for name format)
    • 'time': timestamp of tweet formatted as such hh:mm:ss
    • 'tweet_id': id of tweet
    • 'text': text of tweet
  • Prints:
    • "csv created at" if csv has successfully been created
    • "no raw twitter data..." if there was no raw csv file for the specific date and specific candidate
  • Warning: If the program must create a csv file that already exists, it will overwrite it.

The source code to perform this action is written in src/data/make_dataset/make_dataset_twitter.py
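
A minimal sketch of the predict step for one candidate and one day, assuming raw csv names containing 'start_time-yyyy-mm-dd' and retweets recognisable by a text starting with "RT @"; the helper name and details are illustrative:

```python
# Sketch only: build data/processed/twitter/predict/[mmdd]/[candidate]_[mmdd]_[nbtweets]tweets.csv.
from pathlib import Path
from typing import Optional

import pandas as pd


def make_predict_dataset(candidate: str, day: str) -> Optional[Path]:
    raw_files = list(Path(f"data/raw/twitter/{candidate}").glob(f"*start_time-{day}*.csv"))
    if not raw_files:
        print(f"no raw twitter data for {candidate} on {day}")
        return None

    raw = pd.concat((pd.read_csv(f) for f in raw_files), ignore_index=True)
    raw = raw[~raw["text"].str.startswith("RT @")]        # delete the retweets
    df = pd.DataFrame({
        "candidate": candidate,
        "time": raw["created_at"].str.extract(r"(\d{2}:\d{2}:\d{2})")[0],
        "tweet_id": raw["id"],
        "text": raw["text"],
    })

    mmdd = day[5:7] + day[8:10]                            # yyyy-mm-dd -> mmdd
    out_dir = Path("data/processed/twitter/predict") / mmdd
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{candidate}_{mmdd}_{len(df)}tweets.csv"
    df.to_csv(out_path, index=False)                       # overwrites an existing file
    print(f"csv created at {out_path}")
    return out_path
```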

2. Test
  • What it needs:
    • Csvs made by the Predict part above.
    • The csvs must be somewhere in data/processed/twitter/ with names beginning with: {"pecresse","zemmour","dupont-aignan","melenchon","lepen","lassalle","hidalgo","macron","jadot","roussel","arthaud","poutou"}
  • What it does:
    • Randomly chooses nb_tweets tweets from all the csv files in data/processed/twitter/predict so that every candidate has exactly the same number of tweets in the output test set.
    • Concatenates all those tweets in a new csv with added columns for labels
    • WARNING: Those tweets are NOT removed from the predict csv files they came from
    • Warning: the csv is stored in src/tests as this small dataset needs to be pushed to git for the unit tests.
  • Called via CLI
  • Command: poetry run python -m src data --task make-dataset --data twitter --split test
  • Other Arguments: Not Required
    • --nb_tweets:
      • Number of tweets wanted in the output test dataset.
      • Default is 120.
      • Must be in range [12-480]
      • The program actually creates a csv with nb_tweets - (nb_tweets % 12) tweets (e.g. nb_tweets=130 yields 120 tweets).
      • If the program randomly selects a csv file in predict with fewer than [nb_tweets/12] tweets, it will still output a csv, but with fewer tweets than wanted.
  • Outputs: creates a csv as src/tests/testset_[nb_tweets]tweets_unlabeled.csv formatted as such:
    • 'candidate': name of candidate
    • 'time': timestamp of tweet formatted as hh:mm:ss
      • Day unfortunately not specified
    • 'tweet_id': id of tweet
    • 'text': text of tweet
    • 'Positive': every value set to 0 as it has yet to be labelled
    • 'Negative': every value set to 0 as it has yet to be labelled

Once the csv is created, it will have to be manually labelled (via a Google Sheet, for example) and the name of the csv will have to be updated to testset_[nb_tweets]tweets_labeled.csv

The source code to perform this action is written in a function in src/data/make_dataset/make_dataset_twitter.py
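
A minimal sketch of the balanced sampling described above, assuming the twelve candidate names and the predict csvs produced by the previous step; the details are illustrative:

```python
# Sketch only: build src/tests/testset_[nb_tweets]tweets_unlabeled.csv.
from pathlib import Path

import pandas as pd

CANDIDATES = ["pecresse", "zemmour", "dupont-aignan", "melenchon", "lepen", "lassalle",
              "hidalgo", "macron", "jadot", "roussel", "arthaud", "poutou"]


def make_twitter_testset(nb_tweets: int = 120, out_dir: str = "src/tests") -> Path:
    per_candidate = nb_tweets // len(CANDIDATES)       # total is nb_tweets - (nb_tweets % 12)
    samples = []
    for candidate in CANDIDATES:
        files = list(Path("data/processed/twitter/predict").rglob(f"{candidate}*.csv"))
        df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
        n = min(per_candidate, len(df))                # may yield fewer tweets than wanted
        samples.append(df.sample(n=n, random_state=42))

    testset = pd.concat(samples, ignore_index=True)
    testset["Positive"] = 0                            # to be filled in by manual labelling
    testset["Negative"] = 0
    out_path = Path(out_dir) / f"testset_{nb_tweets}tweets_unlabeled.csv"
    testset.to_csv(out_path, index=False)
    return out_path
```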

A1c. Load Dataset

Goal: From csvs in data/processed, create Python objects such as np.array or pd.DataFrame objects to be used mostly by functions and classes in models.

Called: as a library by other parts of the code

1) Use Case Models

tbd

The source code to perform these actions must be written in src/data/load_dataset/.../.py
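
This part is still tbd; as an illustration only, a loader for the processed labeled csvs could look like this (function name and return types are assumptions):

```python
# Sketch only: load a processed csv into objects usable by src/models.
from typing import Tuple

import numpy as np
import pandas as pd


def load_labeled_csv(csv_path: str) -> Tuple[np.ndarray, np.ndarray]:
    df = pd.read_csv(csv_path)
    texts = df["text"].to_numpy()
    labels = df[["Positive", "Negative"]].to_numpy()
    return texts, labels
```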

A2. FEATURES

The second subpart of the program builds features on top of the Python data objects created by load_dataset.

  • Used as a library by other parts of the code
  • Returns: Mainly returns python data objects with new features built on top

Not clear yet if we'll have to use it.

tbd

A3. MODELS

The third subpart of the program covers everything related to model training/testing/predicting. In order to function, it needs the appropriate dataset(s) made by data/make_dataset.

Before calling the parts below via the CLI, it is good to check in advance whether the code needed to make the required dataset has already been written and run.

A3a. Train

Goal: Based on a chosen algorithm, chosen hyperparameters and a chosen train dataset stored in data/processed/train, train a model and output it in the models folder at the root. The output will mainly consist of weights.

  • Called via CLI (but also as lib?)
  • Command: tbd
  • Other Arguments:
    • tbd
  • Outputs: stores weights in a folder in models/ tbd

The source code to perform this action must be written in src/models/typeoflibraryused/nameofalgo.py
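
The algorithm and CLI are still tbd; as an illustration of the intended layout (e.g. src/models/sklearn/logistic_regression.py), a simple TF-IDF + logistic regression baseline could be trained and stored as follows. Everything here is an assumption, not the chosen method:

```python
# Sketch only: train a baseline model on a processed allocine train csv
# and store its "weights" under models/.
from pathlib import Path

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train(train_csv: str, model_name: str = "tfidf_logreg") -> Path:
    df = pd.read_csv(train_csv)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(df["text"], df["Positive"])              # 1 = positive, 0 = negative

    out_dir = Path("models") / model_name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "model.joblib"
    joblib.dump(model, out_path)
    return out_path
```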

A3b. Test

Goal: Based on a chosen trained model (weights), a chosen test set in src/tests and a chosen performance metric, output the score of the model in a file tbd in data/processed/results and output the csv used for testing with the predictions added somewhere in data/processed/results tbd.

  • Called via CLI (but also as lib?)
  • Command: tbd
  • Other Arguments:
    • tbd
  • Outputs:
    • score of the model in a file tbd in data/processed/results
    • csv used for testing with the added prediction columns 'Positive_pred', 'Negative_pred' somewhere in data/processed/results tbd

The source code to perform this action must be written in src/models/typeoflibraryused/nameofalgo.py
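
Continuing the same illustrative baseline (file names and the performance metric are still tbd), the test step could look like this:

```python
# Sketch only: score a stored model on a test csv from src/tests and write
# the score plus the scored csv to data/processed/results.
from pathlib import Path

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score


def test(model_path: str, test_csv: str) -> float:
    model = joblib.load(model_path)
    df = pd.read_csv(test_csv)
    df["Positive_pred"] = model.predict(df["text"])
    df["Negative_pred"] = 1 - df["Positive_pred"]

    score = accuracy_score(df["Positive"], df["Positive_pred"])
    out_dir = Path("data/processed/results")
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / f"{Path(test_csv).stem}_predictions.csv", index=False)
    (out_dir / f"{Path(model_path).parent.name}_score.txt").write_text(f"accuracy: {score}\n")
    return score
```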

A3c. Predict

Goal: Based on a chosen trained model (weights), a chosen predict set in data/processed/predict and a chosen performance metric, output the csv used for prediction with the predictions added.

The source code to perform this action must be written in src/models/typeoflibraryused/nameofalgo.py

Based on a csv for a specific candidate on a specific day in data/processed/predict, outputs the csv with the predictions. Possibility to handle many candidates and many dates all at once? A minimal sketch follows the output list below.

  • Called via CLI
  • Command: tbd
  • Other Arguments:
    • tbd
  • Outputs: csv as data/processed/results/modelname_timestampofcommand/day/candidate.csv formatted as such:
    • 'Candidate': candidate name
    • 'pred': 1 if positive, 0 if negative
    • 'datetime': same format as in the predict csv passed as input
    • 'text': text of tweet
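
A minimal sketch of the predict step for one candidate/day csv, under the same illustrative baseline as above; the run folder name and helper signature are assumptions:

```python
# Sketch only: write data/processed/results/[modelname_timestamp]/[day]/[candidate].csv.
from datetime import datetime
from pathlib import Path

import joblib
import pandas as pd


def predict(model_path: str, predict_csv: str, day: str, candidate: str) -> Path:
    model = joblib.load(model_path)
    df = pd.read_csv(predict_csv)

    out = pd.DataFrame({
        "Candidate": df["candidate"],
        "pred": model.predict(df["text"]),             # 1 if positive, 0 if negative
        "datetime": df["time"],                        # same format as in the predict csv
        "text": df["text"],
    })
    run_id = f"{Path(model_path).parent.name}_{datetime.now():%Y%m%d%H%M%S}"
    out_dir = Path("data/processed/results") / run_id / day
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{candidate}.csv"
    out.to_csv(out_path, index=False)
    return out_path
```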

A4. TEST

The fourth part of the program covers all the tests run in the CI/CD process. For each new block of code, a corresponding unit test needs to be added somewhere in src/tests, and the test function's name must start with "test_" for pytest to pick it up.
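
For example, a unit test for the allocine test set could live in src/tests/test_make_dataset_allocine.py (file and test names are illustrative):

```python
# Sketch only: pytest collects this function because its name starts with "test_".
import pandas as pd


def test_allocine_testset_columns():
    df = pd.read_csv("src/tests/allocine_testset_180reviews.csv")
    assert list(df.columns) == ["text", "Positive", "Negative"]
    assert set(df["Positive"]).issubset({0, 1})
```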

A5. VISUALIZATION

The fifth part of the program takes csv results and creates charts from them.

tbd
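
As an illustration only (this part is still tbd), a first chart could be the share of positive predictions per candidate, read from a results csv produced by A3c:

```python
# Sketch only: bar chart of the share of positive predictions per candidate.
import matplotlib.pyplot as plt
import pandas as pd


def plot_positive_share(results_csv: str) -> None:
    df = pd.read_csv(results_csv)
    share = df.groupby("Candidate")["pred"].mean().sort_values()
    share.plot(kind="barh")
    plt.xlabel("share of positive tweets")
    plt.tight_layout()
    plt.show()
```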

B. DATA ARCHITECTURE

All the data is stored in the data folder located at the root of the project. The stored data is not pushed to git; to retrieve it, run dvc pull.

tbd