A tool for quickly adding labels to unlabeled datasets
Running on streamlit!
We can walk through a simple example of going from an unlabeled dataset to some usable labels in just a few minutes.
First, go to the Streamlit app above, or run it locally.
Then upload a CSV file with your text. The only requirement is that the file has a column named "text". Any other columns can be used for coloring the embedding plot. If you don't have a dataset, you can use the conv-intent dataset from this repo!
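If you want to check a file before uploading it, here is a minimal sketch using only the standard library. The helper name `has_text_column` and the sample rows are made up for illustration; the app itself may validate uploads differently.

```python
import csv
import io


def has_text_column(csv_source, required="text"):
    """Return True if the CSV header contains the required column name."""
    reader = csv.DictReader(csv_source)
    # fieldnames is None when the file is empty, so fall back to an empty list.
    return required in (reader.fieldnames or [])


# A tiny in-memory stand-in for an uploaded file (made-up rows).
sample = io.StringIO(
    "text,source\n"
    "what's the weather like tomorrow,app\n"
    "book a table for two,web\n"
)
print(has_text_column(sample))
```

For a file on disk, pass an open file handle instead, for example `open("my_data.csv", newline="")`.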
Once the embeddings have been computed, you'll see your dataframe on the left and the embeddings on the right. The dataframe view comes with an extra "text_length" column that you can sort by or use to color the embeddings plot (in case text length is useful to you).
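To make the extra column concrete, here is a small sketch of how a "text_length" value could be derived. It assumes the length is a plain character count, which is a guess about the app's behavior, and `add_text_length` is a hypothetical helper.

```python
def add_text_length(rows):
    """Return copies of the rows with an added "text_length" key (character count)."""
    return [{**row, "text_length": len(row["text"])} for row in rows]


# Made-up rows standing in for the uploaded data.
rows = [
    {"text": "will it rain today"},
    {"text": "book a table for two at seven"},
]

# Sort longest first, similar to sorting the dataframe by the column.
for row in sorted(add_text_length(rows), key=lambda r: r["text_length"], reverse=True):
    print(row["text_length"], row["text"])
```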
You can filter with the text search (regex coming soon!) or by lasso-selecting embedding clusters from the chart. You can also color the chart and resize the points using the menu on the left.
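Roughly speaking, a non-regex text search behaves like a substring filter. The sketch below assumes a case-insensitive match, which may not be exactly what the app does; `filter_by_text` is a hypothetical name.

```python
def filter_by_text(texts, query):
    """Keep only the texts that contain the query, ignoring case."""
    needle = query.casefold()
    return [text for text in texts if needle in text.casefold()]


# Made-up samples for illustration.
samples = [
    "Book a table for two",
    "what's the weather like tomorrow",
    "I need to book a flight",
]
print(filter_by_text(samples, "book"))
```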
Since we see some clear clusters already, let's start by investigating them. We can see one cluster with a lot of references to weather. Let's select this cluster.
Having confirmed that this cluster is about weather, we can register a new label "weather" and assign it to our samples.
The UI will reset automatically. Let's look at another one. This cluster has a lot of references to bookings and reservations. Let's select that one.
We can use the Streamlit table's built-in text search (click on the table, then press CMD+F) to see how many references to "book" there are. Unlike the text search filter, this won't actually filter the selection.
Loads of samples have "book" in them, but we can be a bit more generic and call this "reservations". Let's register a new label "reservations" and label these samples.
We can inspect our labeled samples in the label-viewer page.
Once we are ready, we simply click "Export assigned labels" and then click the "Download" button.
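If you want to post-process labels yourself, a labeled export can be as simple as a two-column CSV. The sketch below is an assumption about a reasonable layout ("text" and "label" columns), not the app's actual export format, and `export_labels` is a hypothetical helper.

```python
import csv
import io


def export_labels(samples, labels):
    """Serialize labeled samples as CSV text with "text" and "label" columns.

    labels maps a sample index to its label; unlabeled samples are skipped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "label"])
    for index, text in enumerate(samples):
        if index in labels:
            writer.writerow([text, labels[index]])
    return buffer.getvalue()


# Made-up samples; only the first two have been labeled.
samples = ["will it rain today", "book a table for two", "play some jazz"]
print(export_labels(samples, {0: "weather", 1: "reservations"}))
```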
We just labeled N samples in a few minutes!
There are some pretty funny "mistakes" in the embeddings that are worth keeping in mind: samples that semantically belong to other categories but contain words that trigger weather or reservations. The embeddings aren't perfect, because we use a smaller model (paraphrase-MiniLM-L3-v2) to compute them at a reasonable speed. But it's a good start! Feel free to run this locally and use a better model.
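To poke at these "mistakes" yourself, you can embed a few sentences and compare them directly. This is a minimal sketch that assumes the sentence-transformers package is installed; the helper names `embed` and `cosine_similarity` are illustrative and not necessarily how the app computes its embeddings.

```python
import math


def embed(texts, model_name="paraphrase-MiniLM-L3-v2"):
    """Encode texts into vectors with a sentence-transformers model."""
    # Imported lazily so cosine_similarity works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Example usage (downloads the model on first run):
#   vectors = embed(["will it rain today", "book a table in case it rains"])
#   print(cosine_similarity(vectors[0], vectors[1]))
```

Swapping `model_name` for a larger model is one way to try "a better model" when running locally, at the cost of slower encoding.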
If you have a GPU running locally, want to try different encoder algorithms, or don't want to upload your data, you can run this locally.
- Create a virtual environment (I recommend pyenv)
pyenv install $(cat .python-version)
python -m venv .venv
source .venv/bin/activate
# Check that it worked
which python pip
- Install the requirements
pip install -r requirements.txt && pyenv rehash
- Run the app:
streamlit run app.py