diff --git a/docs/netlify.toml b/docs/netlify.toml index d239567253ae..371a804923c1 100644 --- a/docs/netlify.toml +++ b/docs/netlify.toml @@ -22,6 +22,30 @@ to = "https://humansignal.com/platform/starter-cloud/manage" status = 301 force = true +[[redirects]] +from = "/guide/dataset_create" +to = "/guide" +status = 301 +force = true + +[[redirects]] +from = "/guide/dataset_manage" +to = "/guide" +status = 301 +force = true + +[[redirects]] +from = "/guide/dataset_overview" +to = "/guide" +status = 301 +force = true + +[[redirects]] +from = "/guide/dataset_search" +to = "/guide" +status = 301 +force = true + [[redirects]] from = "/" to = "/guide" diff --git a/docs/source/guide/dataset_create.md b/docs/source/guide/dataset_create.md deleted file mode 100644 index f19aea6f860b..000000000000 --- a/docs/source/guide/dataset_create.md +++ /dev/null @@ -1,452 +0,0 @@ ---- -title: Create a dataset for Data Discovery - Beta 🧪 -short: Import unstructured data -tier: enterprise -type: guide -order: 0 -order_enterprise: 205 -meta_title: Create a dataset to use with Data Discovery in Label Studio Enterprise -meta_description: How to create a dataset in Label Studio Enterprise using Google Cloud, Azure, or AWS. -section: "Curate Datasets" -date: 2023-08-16 11:52:38 ---- - -!!! note - * At this time, we only support building datasets from a bucket of unstructured data, meaning that the data must be in individual files rather than a structured format such as CSV or JSON. - * To create a new dataset, your [user role](manage_users#Roles-in-Label-Studio-Enterprise) must have Owner or Administrator permissions. - -## Before you begin - -Datasets are retrieved from your cloud storage environment. As such, you will need to provide the appropriate access key to pull data from your cloud environment. 
- -If you are using a firewall, ensure you whitelist the following IP addresses (in addition to the [app.humansignal.com range](saas#IP-Range)): - -`34.85.250.235` -`35.245.250.139` -`35.188.239.181` - -## Datasets using AWS - -Requirements: - -- Your data is located in an AWS S3 bucket. -- You have an AWS access key with view permissions for the S3 bucket. -- Your AWS S3 bucket has CORS configured properly. Configuring CORS allows you to view the data in Label Studio. When CORS is not configured, you are only able to view links to the data. - -{% details Configure CORS for the AWS S3 bucket %} - -**Prerequisites:** - -You have edit access to the bucket. - -###### Configure CORS access to your bucket - -Set up cross-origin resource sharing (CORS) access to your bucket using a policy that allows GET access from the same host name as your Label Studio deployment. - -You can use the AWS Management Console, the API, or SDKs. For more information, see [Configuring cross-origin resource sharing (CORS)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html) - -You can use or modify the following example: - -```shell -[ - { - "AllowedHeaders": [ - "*" - ], - "AllowedMethods": [ - "GET" - ], - "AllowedOrigins": [ - "*" - ], - "ExposeHeaders": [ - "x-amz-server-side-encryption", - "x-amz-request-id", - "x-amz-id-2" - ], - "MaxAgeSeconds": 3000 - } -] -``` - -{% enddetails %} - -{% details Create an AWS access key %} - - -**Prerequisites:** - -- You must have the admin permissions in your AWS account to generate keys and create service accounts. 
- -For more information on completing the following steps, see the following pages in the AWS user guides: - -[Creating an IAM user in your AWS account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) -[Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) -[Policies and permissions in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html) - -###### Create a policy for the user - -You need a permissions policy that can list S3 buckets and read objects within buckets. If you already have a policy that does this, or if you feel comfortable using the pre-configured **AmazonS3ReadOnlyAccess** policy, then you can skip this step. - -1. From the AWS Management Console, use the search bar or navigation menu to locate the **IAM** service. -2. Select **Access Management > Policies** from the menu on the left. -3. Click **Create policy**. -4. From the policy editor, select the **JSON** option and paste the following: - - ```json -{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Action": [ - "s3:ListBucket", - "s3:GetObject" - ], - "Resource": "*" - } - ] -} - ``` - - If you want to further restrict the permissions to certain buckets, edit the `Resource` key as follows: - - ```json - - "Resource": [ - "arn:aws:s3:::", - "arn:aws:s3:::/*" - ] - - ``` - - -###### Create a user - -This user can be tied to a specific person or a group. - -1. From the AWS Management Console, use the search bar or navigation menu to locate the **IAM** service. - -2. Select **Access Management > Users** from the menu on the left. - -3. Click **Create user**. - -4. Enter a descriptive name for this user, such as “Label_Studio_access”. - - Leave **Provide user access to the AWS Management Console** unselected. Click **Next**. - -5. Select **Attach policies directly**. - -6. 
Under **Permissions policies**, use the search field to find and select the policy you are using with the user (see above). Click **Next**. - -7. Click **Create user**. - - - -###### Generate an access key for the user - -1. From the **Users** page, click the user you created in the previous section. - -2. Click the **Security Credentials** tab. - - ![Screenshot of the Security Credentials option](/images/data_discovery/aws_key.png) - -3. Scroll down to **Access keys** and click **Create access key**. -4. Select **Other** and note the recommendations provided by AWS. Click **Next**. -5. Optionally, add a description for the key. -6. Click **Create access key**. -7. Copy the access key ID and your secret access key and keep them somewhere safe, or export the key to a CSV file. - - ![Screenshot of the copy icon next to the key](/images/data_discovery/aws_key2.png) - -
-    **Important:** This is the only time you will be able to copy the secret access key. Once you click **Done**, you will not be able to view or copy it again.
- -8. Click **Done**. - -{% enddetails %} - -### Create a dataset from an AWS S3 bucket - -1. From Label Studio, navigate to the Datasets page and click **Create Dataset**. - - ![Create a dataset action](/images/data_discovery/dataset_create.png) - -2. Complete the following fields and then click **Next**: - -
-
- | | |
- | --- | --- |
- | Name | Enter a name for the dataset. |
- | Description | Enter a brief description for the dataset. |
- | Source | Select AWS S3. |
-
-3. Complete the following fields:
-
-
- | | |
- | --- | --- |
- | Bucket Name | Enter the name of the AWS S3 bucket. |
- | Bucket Prefix | Enter the folder name within the bucket that you would like to use. For example, `data-set-1` or `data-set-1/subfolder-2`. |
- | File Name Filter | Use glob format to filter which file types to sync. For example, to sync all JPG files, enter `*jpg`. To sync all JPG and PNG files, enter `**/**+(jpg\|png)`.<br><br>At this time, we support the following file types: .jpg, .jpeg, .png, .txt, .text |
- | Region Name | By default, the region is `us-east-1`. If your bucket is located in a different region, overwrite the default and enter your region here. Otherwise, keep the default. |
- | S3 Endpoint | Enter an S3 endpoint if you want to override the URL created by S3 to access your bucket. |
- | Access Key ID | Enter the ID for the access key you created in AWS. Ensure this access key has read permissions for the S3 bucket you are targeting (see [Create an AWS access key](#Create-a-policy-for-the-user) above). |
- | Secret Access Key | Enter the secret portion of the [access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) you created earlier. |
- | Session Token | If you are using a session token as part of your authorization (for example, [MFA](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa.html)), enter it here. |
- | Treat every bucket object as a source file | **Enabled** - Each object in the bucket will be imported as a separate record in the dataset.<br>You should leave this option enabled if you are importing a bucket of unstructured data files such as JPG, PNG, or TXT.<br><br>**Disabled** - Disable this option if you are importing structured data, such as JSON or CSV files.<br><br>**NOTE:** At this time, we only support unstructured data. Structured data support is coming soon. |
- | Recursive scan | Perform recursive scans over the bucket contents if you have nested folders in your S3 bucket. |
- | Use pre-signed URLs | If your tasks contain `s3://…` links, they must be pre-signed in order to be displayed in the browser. |
- | Expiration minutes | Adjust the counter for how many minutes the pre-signed URLs are valid. |
-
- -4. Click **Check Connection** to verify your credentials. If your connection is valid, click **Next**. - - ![Check Dataset connection](/images/data_discovery/dataset_check_connection_aws.png) - -5. Provide a name for your dataset column and select a data type. The data type that you select tells Label Studio how to store your data in a way that can be searched using an AI-powered semantic search. - - ![Select dataset column](/images/data_discovery/dataset_column_aws.png) - -6. Click **Create Dataset**. - -Data sync initializes immediately after creating the dataset. Depending on how much data you have, syncing might take several minutes to complete. - - - -## Datasets using Google Cloud Storage - -Requirements: - -- Your data is located in a Google Cloud Storage bucket. -- You have a Google Cloud access key with view permissions for the Google Cloud Storage bucket. -- Your Google Cloud Storage bucket has CORS configured. - -{% details Configure CORS for the Google Cloud Storage bucket %} - -Configuring CORS allows you to view the data in Label Studio. When CORS is not configured, you are only able to view links to the data. - -**Prerequisites:** - -* You have installed the gcloud CLI. For more information, see [Google Cloud Documentation - Install the gcloud CLI](https://cloud.google.com/sdk/docs/install). -* You have edit access to the bucket. - -###### Configure CORS access to your bucket - -Set up cross-origin resource sharing (CORS) access to your bucket using a policy that allows GET access from the same host name as your Label Studio deployment. - -For instructions, see [Configuring cross-origin resource sharing (CORS)](https://cloud.google.com/storage/docs/configuring-cors#configure-cors-bucket) in the Google Cloud User Guide. 
- -You can use or modify the following example: - -```shell -echo '[ - { - "origin": ["*"], - "method": ["GET"], - "responseHeader": ["Content-Type","Access-Control-Allow-Origin"], - "maxAgeSeconds": 3600 - } -]' > cors-config.json -``` - -Replace `YOUR_BUCKET_NAME` with your actual bucket name in the following command to update CORS for your bucket: - -```shell -gsutil cors set cors-config.json gs://YOUR_BUCKET_NAME -``` - -{% enddetails %} - - - -{% details Create an access key for Google Cloud Storage %} - - -**Prerequisites:** - - -- You must have the appropriate Google Cloud permissions to create a service account. -- If you have not yet used a service account in your Google Cloud project, you may need to enable the service account API. See [Create service accounts](https://cloud.google.com/iam/docs/service-accounts-create?hl=en) in the Google Cloud documentation. - -###### Create a service account - -1. From the Google Cloud console, go to **IAM & Admin > Service Accounts**. - - ![Screenshot of the Google Cloud Console](/images/data_discovery/gcp_service_accounts.png) - -2. Click **Create service account** and complete the following fields: - -
-
- | | |
- |---|---|
- | **Service account name** | Enter a name for the service account that will appear in the console. |
- | **Service account ID** | The account ID is generated from the service name. |
- | **Description** | Optionally, provide a description for the service account. |
-
- -3. Click **Create and continue**. -4. When selecting a role, use the search fields provided to select the **Storage Object Viewer** role. -5. Optionally, you can link the service account to a user or group. For more information, see [Manage access to service accounts](https://cloud.google.com/iam/docs/manage-access-service-accounts) in the Google Cloud documentation. -6. Click **Done**. - -###### Generate a key for the service account - -1. From the Service Accounts page in the Google Cloud console, click the name of the service account you just created to go to its details. -2. Select the **Keys** tab. -3. Select **Add key > Create new key**. -4. Select **JSON** and then click **Create**. - -The private key is automatically downloaded. This is the only time you can download the key. - -![Screenshot of the access key page](/images/data_discovery/gcp_key.png) - -{% enddetails %} - - -### Create a dataset from Google Cloud Storage - -1. From Label Studio, navigate to the Datasets page and click **Create Dataset**. - - ![Create a dataset action](/images/data_discovery/dataset_create.png) - -2. Complete the following fields and then click **Next**: - -
-
- | | |
- | --- | --- |
- | Name | Enter a name for the dataset. |
- | Description | Enter a brief description for the dataset. |
- | Source | Select Google Cloud Storage. |
-
-3. Complete the following fields:
-
-
- | | |
- | --- | --- |
- | Bucket Name | Enter the name of the Google Cloud bucket. |
- | Bucket Prefix | Optionally, enter the folder name within the bucket that you would like to use. For example, `data-set-1` or `data-set-1/subfolder-2`. |
- | File Name Filter | Use glob format to filter which file types to sync. For example, to sync all JPG files, enter `*jpg`. To sync all JPG and PNG files, enter `**/**+(jpg\|png)`.<br>At this time, we support the following file types: .jpg, .jpeg, .png, .txt, .text |
- | Treat every bucket object as a source file | **Enabled** - Each object in the bucket will be imported as a separate record in the dataset.<br>You should leave this option enabled if you are importing a bucket of unstructured data files such as JPG, PNG, or TXT.<br><br>**Disabled** - Disable this option if you are importing structured data, such as JSON or CSV files.<br><br>**NOTE:** At this time, we only support unstructured data. Structured data support is coming soon. |
- | Use pre-signed URLs | If your tasks contain `gs://…` links, they must be pre-signed in order to be displayed in the browser. |
- | Pre-signed URL counter | Adjust the counter for how many minutes the pre-signed URLs are valid. |
- | Google Application Credentials | Copy and paste the full contents of the JSON file you downloaded when you created your service account key (see above). |
- | Google Project ID | Optionally, you can specify a specific Google Cloud project. In most cases, you can leave this blank to inherit the project from the application credentials. |
-
- -4. Click **Check Connection** to verify your credentials. If your connection is valid, click **Next**. - - ![Check Dataset connection](/images/data_discovery/dataset_check_connection.png) - -5. Provide a name for your dataset column and select a data type. The data type that you select tells Label Studio how to store your data in a way that can be searched using an AI-powered semantic search. - - ![Select dataset column](/images/data_discovery/dataset_column.png) - -6. Click **Create Dataset**. - -Data sync initializes immediately after creating the dataset. Depending on how much data you have, syncing might take several minutes to complete. - - - -## Datasets using Microsoft Azure - -Requirements: - -- Your data is saved as blobs in an Azure storage account. We do not currently support Azure Data Lake. -- You have access to retrieve the [storage account access key](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage). -- Your storage container has CORS configured properly. Configuring CORS allows you to view the data in Label Studio. When CORS is not configured, you are only able to view links to the data. - -{% details Configure CORS for the Azure storage account %} - - -Configure CORS at the storage account level. - -1. In the Azure portal, navigate to the page for the storage account. -2. From the menu on the left, scroll down to **Settings > Resource sharing (CORS)**. -3. Under **Blob service** add the following rule: - - * **Allowed origins:** `*` - * **Allowed methods:** `GET` - * **Allowed headers:** `*` - * **Exposed headers:** `Access-Control-Allow-Origin` - * **Max age:** `3600` - -4. Click **Save**. 
- -![Screenshot of the Azure portal page for configuring CORS](/images/azure-storage-cors.png) - - -{% enddetails %} - -{% details Retrieve the Azure storage access key %} - -###### Get the Azure storage account access key - -When you create a storage account, Azure automatically generates two keys that will provide access to objects within that storage account. For more information about keys, see [Azure documentation - Manage storage account access keys](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage). - -1. Navigate to the storage account page in the portal. -2. From the menu on the left, scroll down to **Security + networking > Access keys**. -3. Copy the **key** value for either Key 1 or Key 2. - -![Screenshot of the Azure portal access keys page](/images/azure-access-key.png) - - -{% enddetails %} - -### Create a dataset from an Azure blob storage container - -1. From Label Studio, navigate to the Datasets page and click **Create Dataset**. - - ![Create a dataset action](/images/data_discovery/dataset_create.png) - -2. Complete the following fields and then click **Next**: - -
-
- | | |
- | --- | --- |
- | Name | Enter a name for the dataset. |
- | Description | Enter a brief description for the dataset. |
- | Source | Select Microsoft Azure. |
-
-3. Complete the following fields:
-
-
- | | |
- | --- | --- |
- | Container Name | Enter the name of a container within the Azure storage account. |
- | Container Prefix | Enter the folder name within the container that you would like to use. For example, `data-set-1` or `data-set-1/subfolder-2`. |
- | File Name Filter | Use glob format to filter which file types to sync. For example, to sync all JPG files, enter `*jpg`. To sync all JPG and PNG files, enter `**/**+(jpg\|png)`.<br><br>At this time, we support the following file types: .jpg, .jpeg, .png, .txt, .text |
- | Account Name | Enter the name of the Azure storage account. |
- | Account Key | Enter the access key for the Azure storage account (see [Retrieve the Azure storage access key](#Get-the-Azure-storage-account-access-key) above). |
- | Treat every bucket object as a source file | **Enabled** - Each object in the bucket will be imported as a separate record in the dataset.<br>You should leave this option enabled if you are importing a bucket of unstructured data files such as JPG, PNG, or TXT.<br><br>**Disabled** - Disable this option if you are importing structured data, such as JSON or CSV files.<br><br>**NOTE:** At this time, we only support unstructured data. Structured data support is coming soon. |
- | Use pre-signed URLs | If your tasks contain `azure-blob://…` links, they must be pre-signed in order to be displayed in the browser. |
- | Expiration minutes | Adjust the counter for how many minutes the pre-signed URLs are valid. |
-
- -4. Click **Check Connection** to verify your credentials. If your connection is valid, click **Next**. - - ![Check Dataset connection](/images/data_discovery/dataset_check_connection_azure.png) - -5. Provide a name for your dataset column and select a data type. The data type that you select tells Label Studio how to store your data in a way that is [searchable](dataset_search). - - ![Select dataset column](/images/data_discovery/dataset_column_azure.png) - -6. Click **Create Dataset**. - -Data sync initializes immediately after creating the dataset. Depending on how much data you have, syncing might take several minutes to complete. diff --git a/docs/source/guide/dataset_manage.md b/docs/source/guide/dataset_manage.md deleted file mode 100644 index cf1574cd79cc..000000000000 --- a/docs/source/guide/dataset_manage.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -title: Manage datasets for Data Discovery - Beta 🧪 -short: Manage datasets -tier: enterprise -type: guide -order: 0 -order_enterprise: 215 -meta_title: Manage a dataset in Label Studio Enterprise -meta_description: How to manage your datasets in Label Studio Enterprise -date: 2023-08-23 12:07:13 -section: "Curate Datasets" ---- - - -## Dataset settings - -From the Datasets page, click the overflow menu next to dataset and select **Settings**. - -![Overflow menu next to a dataset](/images/data_discovery/dataset_settings.png) - - -| Settings page             | Description | -| ---------------- | --- | -| **General** | Edit the dataset name and description. | -| **Storage** | Review the storage settings. For information about the storage setting fields, see their descriptions in [Create a dataset](dataset_create). | -| **Members** | Manage dataset members. See [Add or remove members](#Add-or-remove-members). | - - - -## Create project tasks from a dataset - -Select the records you want to annotate and click ***n* Records**. From here you can select a project or you can create a new project. 
-
-The selected records are added to the project as individual tasks.
-
-![Screenshot of the button to add tasks to project](/images/data_discovery/add_tasks.png)
-
-## Add or remove members
-
-From here you can add and remove members. Only users in the Manager role can be added to or removed from a dataset. Reviewers and Annotators cannot be dataset members.
-
-By default, all users with the Owner or Administrator role are dataset members and cannot be removed.
-
-| Permission | Roles |
-| ---------------- | --- |
-| **Create a dataset** | Owner<br><br>Administrator |
-| **Delete a dataset** | Owner<br><br>Administrator |
-| **View and update dataset settings** | Owner<br><br>Administrator |
-| **View and search dataset** | Owner<br><br>Administrator<br><br>Manager |
-| **Export records to projects** | Owner<br><br>Administrator<br><br>
Manager | - - - - -## Delete a dataset - -From the Datasets page, select the overflow menu next to dataset and select **Delete**. A confirmation prompt appears. - -Deleting a dataset does not affect any project tasks you created using the dataset. diff --git a/docs/source/guide/dataset_overview.md b/docs/source/guide/dataset_overview.md deleted file mode 100644 index 01c8a24ed5f4..000000000000 --- a/docs/source/guide/dataset_overview.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -title: Data Discovery overview - Beta 🧪 -short: Data Discovery overview -tier: enterprise -type: guide -order: 0 -order_enterprise: 201 -meta_title: Data Discovery overview and features -meta_description: An overview of Label Studio's Data Discovery functionality, including features and limitations. -section: "Curate Datasets" -date: 2023-11-10 15:23:18 ---- - -> Streamline your data preparation process using Data Discovery in Label Studio. - -In machine learning, the quality and relevance of the data used for training directly affects model performance. However, sifting through extensive unstructured datasets to find relevant items can be cumbersome and time-consuming. - -Label Studio's Data Discovery simplifies this by allowing users to perform targeted, [AI-powered searches](dataset_search) within their data. This is incredibly beneficial for projects where specific data subsets are required for training specialized models. - -For example, imagine a scenario in a retail context where a company wants to develop an AI model to recognize and categorize various products in their inventory. Using Label Studio's Data Discovery functionality, they can quickly gather images of specific product types from their extensive database, significantly reducing the time and effort needed for manual data labeling and sorting. This efficiency not only speeds up the model development process, but also enhances the model's accuracy by ensuring a well-curated training dataset. 
- -This targeted approach to data gathering not only saves valuable time but also contributes to the development of more accurate and reliable machine learning models. - -!!! info Tip - You can use the label distribution charts on a project's [dashboard](dashboards) to identify areas within the project that are underrepresented. You can then use Data Discovery to identify the appropriate dataset records to add to your project for more uniform coverage. - - -#### Process overview - -1. Create a dataset by connecting your cloud environment to Label Studio and importing your data. See [Create datasets](dataset_create). -2. Use our AI-powered search to sort and filter the dataset. See [Search and filter datasets](dataset_search). -3. Select the data you want to use and add it to a labeling project. See [Manage datasets](dataset_manage). -4. Start labeling data! - -## Terminology - -| Term | Description | -| --- | --- | -| **Dataset** | In general terms, a dataset is a collection of data.
When referred to here, it means a collection of data created using the Datasets page in Label Studio. |
- | **Data discovery** | In general terms, data discovery is the process of gathering, refining, and classifying data. A data discovery tool helps teams find relevant data for labeling. This covers a full spectrum of tasks, from finding data to include in your initial ground truth dataset to finding very specific data points to remedy underperforming classes or address edge cases. |
- | **Natural language search**<br><br>**Semantic search** | These two terms are used interchangeably and, in simple terms, mean using text as the search query. |
- | **Similarity search** | Similarity search is when you select one or more records and then sort the dataset by similarity to your selections. |
- | **Record** | An item in a dataset. Each record can be added to a Label Studio project as a task. |
-
-## Features, requirements, and constraints
-
-
- | Feature | Support |
- | --- | --- |
- | **Supported file types** | .txt<br><br>.png<br><br>.jpg/.jpeg |
- | **Indexable/searchable data** | Image and text |
- | **Supported storage for import** | Google Cloud storage<br><br>AWS S3<br><br>Azure blob storage |
- | **Number of storage sources per dataset** | One |
- | **Maximum number of records per dataset** | 1 million |
- | **Maximum number of records returned per search** | 16,384 |
- | **Number of datasets per org** | 10 |
- | **Supported search types** | Natural language search<br><br>Similarity search |
- | **Supported filter types** | Similarity score |
- | **Required permissions** | **Owners and Administrators** -- Can create datasets and have full administrative access to any existing datasets<br><br>**Managers** -- Must be invited to a dataset. Once invited, they can view the dataset and export records as project tasks. Managers cannot create new datasets or perform administrative tasks on existing ones.<br><br>**Reviewers and Annotators** -- No access to datasets and cannot be added as dataset members. |
- | **Enterprise vs. Open Source** | Label Studio Enterprise only |
-
- - - diff --git a/docs/source/guide/dataset_search.md b/docs/source/guide/dataset_search.md deleted file mode 100644 index 9a5802687fc4..000000000000 --- a/docs/source/guide/dataset_search.md +++ /dev/null @@ -1,107 +0,0 @@ ---- -title: Use Data Discovery search to refine datasets for labeling - Beta 🧪 -short: Search and filter datasets -tier: enterprise -type: guide -order: 0 -order_enterprise: 210 -meta_title: Data Discovery search and filtering in Label Studio -meta_description: Use filters, natural language search, and similarity search to refine your datasets. -date: 2023-08-23 12:18:50 -section: "Curate Datasets" ---- - -Once your dataset is created, you will want to add records to projects as tasks. For some projects, this might be the complete dataset. But in most cases you will likely want to select a subset of data based on certain criteria. - -If your dataset consists of several thousand unstructured items, then manually sorting, categorizing, and structuring that data can take a significant amount of time and effort. Instead, you can use Label Studio's AI-powered search capabilities to refine your datasets. - -Label studio provides several search mechanisms: - -* **Natural language searching** - Also known as "semantic searching." Use keywords and phrases to explore your data. -* **Similarity searching** - Select one or more records and then sort the data based on semantic similarity to your selections. -* **Combined searches** - Combine similarity searching and natural language searching. -* **Filtering** - Reduce your dataset to only show records that have a certain threshold of similarity to your searches. - -!!! attention - Search results are limited to 16,384 records at a time. All records with the dataset are stored, but the Label Studio interface is limited to returning a smaller subset per search. As you change your search query, you’ll see different sets of records (all with a max of 16,384 at a time). 
- -## How searches work - -### Embeddings - -When you sync a dataset, Label Studio generates an embedding for each item (or "record"). An embedding is a way of converting complex, often high-dimensional data (like text or images) into a simpler, lower-dimensional form. Our embeddings are generated using an off-the-shelf CLIP model. - -For example, say you have a library of books and you want to catalog and sort them. Now imagine how helpful it would be to have a summary of each book written out on a small card, capturing its most important themes or ideas. These cards represent the books in a more manageable way, just like embeddings represent complex data in a simpler format. - -Embeddings also convey meaning in ways that things like keywords and metadata do not. For example, you might want to search your library for "healthy eating." A traditional search might just look for books with those exact words in the title or text. But an AI-powered semantic search using embeddings would understand the concept of *healthy eating* and find books related to nutrition, dieting, healthy recipes, and more, even if they don't use the exact words "healthy eating." - - -### Reference embeddings - -When you perform a search, we generate a single "reference embedding" representing your search query. - -If your search consists of one natural language query or one record (in a similarity search), then generating the reference embedding is straightforward. - -But when you start performing more complex searches with multiple search terms and records, we need to combine your queries to create the reference embedding. - -If you are combining multiple queries in a natural language search, we generate embeddings for each search term and then average them to create a singular reference embedding. Each query is afforded equal weight when calculating the average for the reference embedding. 
Meaning, for example, that we do not take the order in which you entered your queries into consideration. - -Likewise, if you select multiple records when performing a similarity search, we average the embeddings of each record to create the reference embedding. - -If you combine similarity searching and natural language search terms, the natural language search reference embedding is created in one operation and the similarity search reference embedding is created in another operation. Then, in a third operation, the two reference embeddings are combined. By doing this, we can calculate a lower weight for the similarity search portion. This is because natural language queries tend to be more accurate descriptions of what you're trying to find. - - - -### Calculating similarity - -In its raw form, an embedding is essentially multiple floating points within an array. This means that we can mathematically calculate similarity and assign a numerical **similarity score** to each record. The similarity score is the distance between those floating points. - -First, we take your reference embedding, and then we compare it to the embeddings generated from each record in the dataset. From there we calculate the distance between points within their arrays, and the result is the similarity score. - -The closer the similarity score is to `0`, the less distance between points, and the more confident we are that the record is a match to your search criteria. The higher the score, the more distance between points, and similarity diminishes. - -You can use this principle to filter your datasets by a similarity threshold (see [Search results and refining by similarity](#Search-results-and-refining-by-similarity) below). - - - -## Natural language searching - -You are probably already very familiar with natural language searching (also known as "semantic" searching). Natural language search is simply searching with text queries like keywords (e.g. "plants") or phrases (e.g. 
"which plants grow the fastest").
-
-To perform a natural language search, enter your query into the search field provided. You can keep adding search queries to refine your search.
-
-
-![Animated gif of semantic search](/images/data_discovery/semantic_search_flower.gif)
-
-
-!!! note
-    Searches are sticky and cumulative. For example, when you execute two searches in a row, both queries are used to [calculate similarity](#Calculating-similarity). To start over, you must manually clear your previous search.
-    ![Screenshot of how to remove a query](/images/data_discovery/search_remove.png)
-
-
-## Similarity searches
-
-To perform a similarity search, select one or more records. From the **Actions** menu, select **Find similar**.
-
-![Animated gif similarity search](/images/data_discovery/similarity_search_lake.gif)
-
-You can continue to refine your similarity search by adding and removing records from the query.
-
-To adjust your search, select or deselect records and click **Find similar** again.
-
-## Search results and refining by similarity
-
-When you perform a search against a dataset, you are not applying a filter. The search does not return a subset of "matches." Instead, it returns your entire dataset sorted by similarity.
-
-To reduce your dataset to records that meet a certain similarity threshold, click the search field to view your search criteria and a similarity score filter.
-
-To set the similarity threshold, use the slider or enter a value into the field provided.
-
-![Animated gif similarity score slide](/images/data_discovery/similarity_score_filter.gif)
-
-### Create labeling tasks
-
-Once you have refined your dataset, you can create tasks by [exporting dataset records to a project](dataset_manage#Create-project-tasks-from-a-dataset).
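The search mechanics described in the removed guide above — averaging query embeddings with equal weight into a single reference embedding, scoring every record by its distance from that reference, and then optionally refining by a threshold — can be sketched in a few lines. This is an illustrative sketch only: the hypothetical three-dimensional vectors and file names stand in for real CLIP embeddings (which have hundreds of dimensions), and it is not Label Studio's actual implementation.

```python
import math

def mean_vector(vectors):
    """Average several embeddings component-wise, giving each equal weight."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two embeddings; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical stand-ins for the per-record embeddings generated at sync time.
record_embeddings = {
    "lake_photo.jpg":   [0.9, 0.1, 0.0],
    "forest_photo.jpg": [0.7, 0.3, 0.1],
    "city_photo.jpg":   [0.0, 0.2, 0.9],
}

# Two query embeddings are averaged into one reference embedding.
query_embeddings = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.2]]
reference = mean_vector(query_embeddings)

# Every record gets a similarity score: its distance from the reference.
scores = {name: distance(reference, emb) for name, emb in record_embeddings.items()}

# The search returns the entire dataset sorted by score, not a filtered subset.
ranked = sorted(scores, key=scores.get)

# Refining by similarity: keep only records at or under a threshold score.
threshold = 0.5
refined = [name for name in ranked if scores[name] <= threshold]
```

Note that sorting, not filtering, is the default: every record receives a score, and the threshold is an optional refinement applied afterwards, mirroring how the similarity score filter behaves in the UI.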
- - - diff --git a/docs/source/guide/setup_project.md b/docs/source/guide/setup_project.md index a66388b6d023..3f0044dd9c71 100644 --- a/docs/source/guide/setup_project.md +++ b/docs/source/guide/setup_project.md @@ -103,13 +103,6 @@ When you're done, click **Save**. -
- -!!! info Tip - Rather than importing data directly into the project, you can [create a dataset](dataset_create). From here, you can use an AI-powered search to refine your data, which can then be added to different projects as tasks. For more information, see [Data Discovery overview](dataset_overview). - -
-
diff --git a/docs/source/guide/tasks.md b/docs/source/guide/tasks.md index da4af298ff4a..339f24982564 100644 --- a/docs/source/guide/tasks.md +++ b/docs/source/guide/tasks.md @@ -18,24 +18,6 @@ Get data into Label Studio by importing files, referencing URLs, or syncing with - If your data is stored locally, [import it into Label Studio](#Import-data-from-a-local-directory). - If your data contains predictions or pre-annotations, see [Import pre-annotated data into Label Studio](predictions.html). -
- -!!! info Tip - If your data is stored in Google Cloud, AWS, or Azure, you can [import your unstructured data as a dataset in Label Studio Enterprise](dataset_create). - - From here, you can use semantic search and similarity search to curate data for labeling, which can then be added to different projects as tasks. For more information, see [Data Discovery overview](dataset_overview). - -
- -
- -!!! error Enterprise - If your data is stored in Google Cloud, AWS, or Azure, you can [import your unstructured data as a dataset in Label Studio Enterprise](https://docs.humansignal.com/guide/dataset_create). - - From here, you can use semantic search and similarity search to curate data for labeling, which can then be added to different projects as tasks. For more information, see [Data Discovery overview](https://docs.humansignal.com/guide/dataset_overview). - -
- ## General guidelines for importing data * It’s best to keep about 100k tasks / 100k annotations per project for optimal performance.