The Label Studio ML backend is an SDK that lets you wrap your machine learning code and turn it into a web server. The web server can be connected to a running Label Studio instance to automate labeling tasks.
If you just need to load static pre-annotated data into Label Studio, running an ML backend might be overkill for you. Instead, you can import pre-annotated data.
To start using the models, use docker-compose to run the ML backend server.
Use the following commands to start serving the ML backend at http://localhost:9090:
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
cd label-studio-ml-backend/label_studio_ml/examples/{MODEL_NAME}
docker-compose up
Replace {MODEL_NAME} with the name of the model you want to use (see below).
Allow the ML backend to access Label Studio data
In most cases, you will need to set the LABEL_STUDIO_URL and LABEL_STUDIO_API_KEY environment variables to allow the ML backend to access the media data in Label Studio.
Read more in the documentation.
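As a minimal sketch, backend code could confirm that both variables are present before it tries to fetch media. The helper name `read_label_studio_settings` is hypothetical and not part of the SDK; only the two variable names come from the text above.

```python
import os


def read_label_studio_settings(environ=None):
    """Return (url, api_key), or raise if either variable is missing.

    Hypothetical helper for illustration; not part of label-studio-ml.
    """
    env = os.environ if environ is None else environ
    missing = [
        name
        for name in ("LABEL_STUDIO_URL", "LABEL_STUDIO_API_KEY")
        if not env.get(name)
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so the URL can be joined with API paths safely.
    return env["LABEL_STUDIO_URL"].rstrip("/"), env["LABEL_STUDIO_API_KEY"]
```

Failing early with a clear message tends to be easier to debug than a failed media download later in `predict`.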
The following models are supported in the repository. Some of them work without any additional setup, and some of them require additional parameters to be set.
Check the Required parameters column to see if you need to set any additional parameters.
- The Pre-annotation column indicates whether the model can be used for pre-annotation in Label Studio: you can see pre-annotated data when opening the labeling page or after running predictions for a batch of data.
- The Interactive mode column indicates whether the model can be used for interactive labeling in Label Studio: you see interactive predictions when performing actions on the labeling page.
- The Training column indicates whether the model can be used for training in Label Studio: the model state is updated based on the submitted annotations.
MODEL_NAME | Description | Pre-annotation | Interactive mode | Training | Required parameters |
---|---|---|---|---|---|
segment_anything_model | Image segmentation by Meta | ❌ | ✅ | ❌ | None |
llm_interactive | Prompt engineering with OpenAI, Azure LLMs. | ✅ | ✅ | ✅ | OPENAI_API_KEY |
grounding_dino | Object detection with prompts. Details | ❌ | ✅ | ❌ | None |
tesseract | Interactive OCR. Details | ❌ | ✅ | ❌ | None |
easyocr | Automated OCR. EasyOCR | ✅ | ❌ | ❌ | None |
spacy | NER by SpaCy | ✅ | ❌ | ❌ | None |
flair | NER by flair | ✅ | ❌ | ❌ | None |
bert_classifier | Text classification with Huggingface | ✅ | ❌ | ✅ | None |
huggingface_llm | LLM inference with Hugging Face | ✅ | ❌ | ❌ | None |
huggingface_ner | NER by Hugging Face | ✅ | ❌ | ✅ | None |
nemo_asr | Speech ASR by NVIDIA NeMo | ✅ | ❌ | ❌ | None |
mmdetection | Object Detection with OpenMMLab | ✅ | ❌ | ❌ | None |
sklearn_text_classifier | Text classification with scikit-learn | ✅ | ❌ | ✅ | None |
interactive_substring_matching | Simple keywords search | ❌ | ✅ | ❌ | None |
langchain_search_agent | RAG pipeline with Google Search and Langchain | ✅ | ✅ | ✅ | OPENAI_API_KEY, GOOGLE_CSE_ID, GOOGLE_API_KEY |
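To pick a {MODEL_NAME} by capability, the table can be treated as data. The sketch below copies the three capability columns from the table; the `MODELS` mapping and the `models_with` helper are illustrative names, not something the repository provides.

```python
# Capability flags copied from the table above:
# (pre-annotation, interactive mode, training).
MODELS = {
    "segment_anything_model": (False, True, False),
    "llm_interactive": (True, True, True),
    "grounding_dino": (False, True, False),
    "tesseract": (False, True, False),
    "easyocr": (True, False, False),
    "spacy": (True, False, False),
    "flair": (True, False, False),
    "bert_classifier": (True, False, True),
    "huggingface_llm": (True, False, False),
    "huggingface_ner": (True, False, True),
    "nemo_asr": (True, False, False),
    "mmdetection": (True, False, False),
    "sklearn_text_classifier": (True, False, True),
    "interactive_substring_matching": (False, True, False),
    "langchain_search_agent": (True, True, True),
}

CAPABILITIES = ("pre_annotation", "interactive", "training")


def models_with(*required):
    """Return model names that support every capability in `required`."""
    indexes = [CAPABILITIES.index(name) for name in required]
    return [
        model
        for model, flags in MODELS.items()
        if all(flags[i] for i in indexes)
    ]
```

For example, `models_with("interactive", "training")` lists the models that support both interactive labeling and training.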
To start developing your own ML backend, follow the instructions below.
Download and install label-studio-ml from the repository:
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
cd label-studio-ml-backend/
pip install -e .
label-studio-ml create my_ml_backend
You can go to the my_ml_backend directory and modify the code to implement your own inference logic.
The directory structure should look like this:
my_ml_backend/
├── Dockerfile
├── docker-compose.yml
├── model.py
├── _wsgi.py
├── README.md
└── requirements.txt
- Dockerfile and docker-compose.yml are used to run the ML backend with Docker.
- model.py is the main file where you can implement your own training and inference logic.
- _wsgi.py is a helper file that is used to run the ML backend with Docker (you don't need to modify it).
- README.md is a readme file with instructions on how to run the ML backend.
- requirements.txt is a file with Python dependencies.
In your model directory, locate the model.py file (for example, my_ml_backend/model.py).
The model.py file contains a class declaration inherited from LabelStudioMLBase. This class provides wrappers for the API methods that Label Studio uses to communicate with the ML backend. You can override the methods to implement your own logic:
def predict(self, tasks, context, **kwargs):
    """Make predictions for the tasks."""
    predictions = []  # placeholder: build one prediction per task here
    return predictions
The predict method is used to make predictions for the tasks. It uses the following:
- tasks: Label Studio tasks in JSON format
- context: Label Studio context in JSON format, for the interactive labeling scenario
- predictions: Predictions array in JSON format
Once you implement the predict method, you can see predictions from the connected ML backend in Label Studio.
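To make the shape of the predictions array concrete, here is a minimal sketch of what the body of predict might build for a text classification project. It assumes a labeling config with a Choices tag named `sentiment` applied to a Text tag named `text`; those tag names, the constant label, and the `build_predictions` helper are illustrative assumptions, not part of the SDK.

```python
def build_predictions(tasks, label="Positive", score=0.5, model_version="sketch-v1"):
    """Return one constant 'choices' prediction per task.

    Illustrative only: a real model would compute the label and score
    from each task's data instead of using fixed values.
    """
    predictions = []
    for _task in tasks:
        predictions.append(
            {
                "model_version": model_version,
                "score": score,
                "result": [
                    {
                        # Assumed tag names; they must match the labeling config.
                        "from_name": "sentiment",
                        "to_name": "text",
                        "type": "choices",
                        "value": {"choices": [label]},
                    }
                ],
            }
        )
    return predictions
```

Inside a real backend, predict could return the output of such a helper, with one entry per incoming task and in the same order as the tasks.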
You can also implement the fit method to train your model. The fit method is typically used to train the model on the labeled data, although it can be used for any arbitrary operations that require data persistence (for example, storing labeled data in a database, saving model weights, keeping LLM prompt history, etc.).
By default, the fit method is called on any data action in Label Studio, like creating a new task or updating annotations. You can modify this behavior from the project settings under Webhooks.
To implement it, override the fit method in your model.py file:
def fit(self, event, data, **kwargs):
    """Train the model on the labeled data."""
    old_model = self.get('old_model')
    # write your logic to update the model
    new_model = old_model  # placeholder: replace with the updated model
    self.set('new_model', new_model)
with
- event: event type can be 'ANNOTATION_CREATED', 'ANNOTATION_UPDATED', etc.
- data: the payload received from the event (see the Webhook event reference for details)
Additionally, there are two helper methods that you can use to store and retrieve data from the ML backend:
- self.set(key, value) - store data in the ML backend
- self.get(key) - retrieve data from the ML backend
Both methods can be used elsewhere in the ML backend code, for example, in the predict method to get the new model weights.
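The interplay between fit, predict, and the two helpers can be sketched with a stand-in class. `SketchBackend` below is not the real LabelStudioMLBase: its get and set are backed by a plain dictionary so the example runs on its own, and the `annotations_seen` key is a hypothetical name. Values are stored as strings as a conservative assumption about what the real storage accepts.

```python
class SketchBackend:
    """Stand-in with dict-backed get/set; not the real LabelStudioMLBase."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

    def fit(self, event, data, **kwargs):
        # Only react to annotation events; ignore everything else.
        if event not in ("ANNOTATION_CREATED", "ANNOTATION_UPDATED"):
            return
        seen = int(self.get("annotations_seen") or 0)
        self.set("annotations_seen", str(seen + 1))

    def predict(self, tasks, context=None, **kwargs):
        # Read back what fit stored, as predict would read new model weights.
        seen = int(self.get("annotations_seen") or 0)
        return [{"result": [], "model_version": f"seen-{seen}"} for _ in tasks]
```

In a real backend the same pattern applies, with the inherited self.get and self.set replacing the dictionary.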
Other methods and parameters are available within the LabelStudioMLBase class:
- self.label_config - returns the Label Studio labeling config as an XML string.
- self.parsed_label_config - returns the Label Studio labeling config as JSON.
- self.model_version - returns the current model version.
- self.get_local_path(url, task_id) - this helper function downloads and caches a URL that is typically stored in task['data'], and returns the local path to it. The URL can be: LS uploaded file, LS Local Storage, LS Cloud Storage or any other http(s) URL.
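Since self.label_config is an XML string, it can be inspected with the standard library. The sketch below reads the Choices tags from a sample config; both the sample config and the `choices_controls` helper are illustrative, and in a real backend self.parsed_label_config already provides a parsed form.

```python
import xml.etree.ElementTree as ET

# Sample labeling config used only for this illustration.
SAMPLE_LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
</View>
"""


def choices_controls(label_config_xml):
    """Map each <Choices> tag name to its target tag and label values."""
    root = ET.fromstring(label_config_xml)
    controls = {}
    for tag in root.iter("Choices"):
        controls[tag.get("name")] = {
            "to_name": tag.get("toName"),
            "labels": [choice.get("value") for choice in tag.iter("Choice")],
        }
    return controls
```

The tag name and its toName target are the values a prediction needs for its from_name and to_name fields.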
To run without Docker (for example, for debugging purposes), you can use the following command:
label-studio-ml start my_ml_backend
Modify the my_ml_backend/test_api.py file to ensure that your ML backend works as expected.
To modify the port, use the -p parameter:
label-studio-ml start my_ml_backend -p 9091
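Once the server is running, a quick reachability check can confirm that it is listening on the chosen port. The sketch below assumes the server exposes a /health route that answers with a JSON body; treat that route as an assumption and adjust it if your version differs. The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def health_url(host="localhost", port=9090):
    """Build the URL of the assumed /health route."""
    return f"http://{host}:{port}/health"


def backend_is_up(host="localhost", port=9090, timeout=2.0):
    """Return True if the backend answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))  # expect a JSON body
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return False
```

For the command above, the check would be `backend_is_up(port=9091)`.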
Before you start:
- Install gcloud.
- Initialize billing for your account if it's not activated.
- Initialize gcloud: enter the following command and log in with your browser:
gcloud auth login
- Activate your Cloud Build API.
- Find your GCP project ID.
- (Optional) Add GCP_REGION with your default region to your environment variables.
To start deployment:
- Create your own ML backend
- Start deployment to GCP:
label-studio-ml deploy gcp {ml-backend-local-dir} \
--from={model-python-script} \
--gcp-project-id {gcp-project-id} \
--label-studio-host {https://app.heartex.com} \
--label-studio-api-key {YOUR-LABEL-STUDIO-API-KEY}
- After Label Studio deploys the model, you can find the model endpoint in the console.
If you encounter an error similar to the following when running docker-compose up --build on Windows:
exec /app/start.sh : No such file or directory
exited with code 1
This issue is likely caused by Windows' handling of line endings in text files, which can affect scripts like start.sh. To resolve this issue, follow the steps below:
Before cloning the repository, ensure your Git is configured not to automatically convert line endings to Windows-style (CRLF) when checking out files. This can be achieved by setting core.autocrlf to false. Open Git Bash or your preferred terminal and execute the following command:
git config --global core.autocrlf false
If you have already cloned the repository before adjusting your Git configuration, you'll need to clone it again to ensure that the line endings are preserved correctly:
- Delete the existing local repository. Ensure you have backed up any changes or work in progress.
- Clone the repository again. Use the standard Git clone command to clone the repository to your local machine.
Navigate to the directory within your cloned repository that contains the Dockerfile and docker-compose.yml. Then, proceed with the Docker commands:
- Build the Docker containers: Run docker-compose build to build the Docker containers based on the configuration specified in docker-compose.yml.
- Start the Docker containers: Once the build process is complete, start the containers using docker-compose up.
- This solution specifically addresses issues encountered on Windows due to the automatic conversion of line endings. If you're using another operating system, this solution may not apply.
- Remember to check your project's .gitattributes file, if it exists, as it can also influence how Git handles line endings in your files.
By following these steps, you should be able to resolve issues related to Docker not recognizing the start.sh script on Windows due to line ending conversions.
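To check whether a script such as start.sh was checked out with CRLF line endings, a small diagnostic can help. This is a sketch using only the standard library; the helper names are hypothetical, and converting a file in place is an alternative to re-cloning rather than a step from the instructions above.

```python
from pathlib import Path


def has_crlf(path):
    """Return True if the file contains Windows-style (CRLF) line endings."""
    return b"\r\n" in Path(path).read_bytes()


def convert_to_lf(path):
    """Rewrite the file in place with Unix-style (LF) line endings."""
    file_path = Path(path)
    file_path.write_bytes(file_path.read_bytes().replace(b"\r\n", b"\n"))
```

Running `has_crlf("start.sh")` from the model directory shows whether the script is affected before you rebuild the containers.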
Sometimes, you want to reset the pip cache to ensure that the latest versions of the dependencies are installed.
For example, the Label Studio ML Backend library is referenced as label-studio-ml @ git+https://github.com/HumanSignal/label-studio-ml-backend.git in requirements.txt. Suppose the library has been updated and you want to pick up the latest version in the Docker image with your ML model.
You can rebuild the Docker image from scratch with the following command:
docker compose build --no-cache
You might see these errors if you send multiple concurrent requests.
Note that the provided ML backend examples are offered in development mode, and do not support production-level inference serving.