bulk-labeling

A tool for quickly adding labels to unlabeled datasets

How to use

Let's walk through a simple example of going from an unlabeled dataset to usable labels in just a few minutes.

First, go to the Streamlit app linked above, or run it locally (see "Run locally" below).

Then upload a CSV file with your text. The only requirement is that the file must have a text column. Any other columns can be used for coloring the embedding plot. If you don't have a dataset, you can use the conv-intent dataset from this repo (conv_intent.csv)!
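
Before uploading, it can help to confirm that your file meets this requirement. The following is a minimal sketch, not code from the app: it assumes the required column is literally named `text`, and the helper name `check_text_column` and the extra `intent` column are made up for illustration.

```python
import csv
import io


def check_text_column(csv_source, required="text"):
    """Return the CSV header if the required column is present, else raise ValueError."""
    reader = csv.DictReader(csv_source)
    header = reader.fieldnames or []
    if required not in header:
        raise ValueError(f"CSV must have a '{required}' column, found: {header}")
    return header


# Hypothetical sample data; the extra "intent" column could be used for coloring.
sample = io.StringIO(
    "text,intent\n"
    "will it rain tomorrow,weather\n"
    "book a table for two,reservation\n"
)
print(check_text_column(sample))
```

For a file on disk, pass `open("conv_intent.csv", newline="")` instead of the in-memory sample.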

Once the embeddings have been computed, you'll see your dataframe on the left and the embeddings plot on the right. The dataframe view comes with an extra text_length column that you can sort by or use to color the embeddings plot (in case text length is useful to you).
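
To make the extra column concrete, here is a rough sketch of how a `text_length` value could be derived for each row. It assumes the length is a character count, which the README does not state, and `add_text_length` is a hypothetical helper rather than the app's own function.

```python
def add_text_length(rows):
    """Return copies of the rows with an extra text_length column.

    Assumption: length means number of characters, not tokens or words.
    """
    return [{**row, "text_length": len(row["text"])} for row in rows]


# Hypothetical rows standing in for an uploaded dataset.
rows = [
    {"text": "will it rain tomorrow"},
    {"text": "book a table for two"},
]

# Sort by the new column, longest text first, as you might in the dataframe view.
for row in sorted(add_text_length(rows), key=lambda r: r["text_length"], reverse=True):
    print(row["text_length"], row["text"])
```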

You can filter with the text search (regex coming soon!) or by lasso-selecting embedding clusters in the chart. You can also color the chart and resize the points using the menu on the left.
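
As an illustration of what a plain text-search filter does, the sketch below keeps only rows whose text contains the query. It is an assumption that matching is a case-insensitive substring check; the app's actual behavior may differ, and `filter_by_text` is a hypothetical name.

```python
def filter_by_text(rows, query):
    """Keep rows whose text contains the query, ignoring case (assumed behavior)."""
    needle = query.lower()
    return [row for row in rows if needle in row["text"].lower()]


# Hypothetical rows standing in for an uploaded dataset.
rows = [
    {"text": "Will it rain tomorrow?"},
    {"text": "Book a table for two"},
    {"text": "I need to book a flight"},
]

matches = filter_by_text(rows, "book")
print(len(matches), "of", len(rows), "rows match")
```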

Since we can already see some clear clusters, let's start by investigating them. One cluster has a lot of references to weather. Let's select this cluster.

Having confirmed that this cluster is about weather, we can register a new label "weather" and assign it to our selected samples.
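
Conceptually, registering a label and assigning it to a selection amounts to something like the sketch below. This is not the app's implementation: the `label` key, the set of registered labels, and the `assign_label` helper are all assumptions made for illustration.

```python
def assign_label(rows, selected_indices, label, registered_labels):
    """Register the label (if new) and assign it to the selected rows in place."""
    registered_labels.add(label)
    for i in selected_indices:
        rows[i]["label"] = label
    return rows


# Hypothetical rows; indices 0 and 2 stand in for a lasso selection.
rows = [
    {"text": "will it rain tomorrow"},
    {"text": "book a table for two"},
    {"text": "is it going to snow this weekend"},
]
registered_labels = set()

assign_label(rows, [0, 2], "weather", registered_labels)
print(sorted(registered_labels))
print([row.get("label") for row in rows])
```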

The UI will reset automatically. Let's look at another one. This cluster has a lot of references to bookings and reservations. Let's select that one.

We can use the Streamlit table's built-in text search (click on the table, then press CMD+F) to see how many references to "book" there are. Unlike the text search filter, this won't actually filter the selection.

Loads of samples have "book" in them, but we can be a bit more generic and call this "reservations". Let's register a new label "reservations" and label these samples.

We can inspect our labeled samples in the label-viewer page.

Once we are ready, we simply click "Export assigned labels" and then click the "Download" button.
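
If you want to picture what an export of assigned labels contains, the sketch below writes only the labeled rows to CSV. The output columns (`text`, `label`) and the `export_labeled` helper are assumptions for illustration; the file the app actually produces may be laid out differently.

```python
import csv
import io


def export_labeled(rows, out):
    """Write rows that have a label to `out` as CSV and return how many were written."""
    labeled = [row for row in rows if row.get("label")]
    # extrasaction="ignore" drops any columns other than text and label.
    writer = csv.DictWriter(out, fieldnames=["text", "label"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(labeled)
    return len(labeled)


# Hypothetical rows; the middle one has not been labeled yet.
rows = [
    {"text": "will it rain tomorrow", "label": "weather", "text_length": 21},
    {"text": "play some jazz", "text_length": 14},
    {"text": "book a table for two", "label": "reservations", "text_length": 20},
]

buffer = io.StringIO()
count = export_labeled(rows, buffer)
print(count, "labeled rows exported")
print(buffer.getvalue())
```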

We just labeled N samples in a few minutes!

There are some pretty funny "mistakes" in the embeddings (samples that are semantically closer to other categories but contain words that trigger weather/reservation), and these should be kept in mind! The embeddings aren't perfect. We are using a smaller model (paraphrase-MiniLM-L3-v2) in order to compute embeddings at a reasonable speed. But it's a good start! Feel free to run this locally and use a better model.
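
For readers who want to experiment with a different encoder, here is a minimal sketch using the sentence-transformers library. It is not taken from app.py: the `embed_texts` helper is hypothetical, and the larger model mentioned in the comment is only one possible choice. Calling the function requires `sentence-transformers` to be installed and downloads the model on first use.

```python
DEFAULT_MODEL = "paraphrase-MiniLM-L3-v2"  # the small model named in this README


def embed_texts(texts, model_name=DEFAULT_MODEL):
    """Encode texts into embedding vectors with a sentence-transformers model."""
    # Imported lazily so this module loads even if the dependency is missing.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), show_progress_bar=False)


# Example usage (not run here because it downloads model weights):
# vectors = embed_texts(["will it rain tomorrow", "book a table for two"])
# A larger model such as "all-mpnet-base-v2" can be passed as model_name,
# trading speed for embedding quality.
```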

Run locally

If you have a GPU running locally, want to try different encoder algorithms, or don't want to upload your data, you can run this locally.

Create a virtual environment (I recommend pyenv):

pyenv install $(cat .python-version)
python -m venv .venv
source .venv/bin/activate
# Check that it worked
which python pip

Install the requirements: pip install -r requirements.txt && pyenv rehash
Run the app: streamlit run app.py
