[LLM pipeline] Language filter component #232

Merged (15 commits) on Jul 5, 2023
Changes from 3 commits
18 changes: 18 additions & 0 deletions components/language_filter/Dockerfile
@@ -0,0 +1,18 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
7 changes: 7 additions & 0 deletions components/language_filter/README.md
@@ -0,0 +1,7 @@
# Language filter

## Description
This component is based on the `TransformComponent` and is used to filter a dataframe based on language.
It allows you to remove rows that do not match the provided language, thus providing a way to focus
on specific languages within your data.

Empty file.
14 changes: 14 additions & 0 deletions components/language_filter/fondant_component.yaml
@@ -0,0 +1,14 @@
name: Language filter
description: A component that filters a provided dataframe based on the language.
image: ghcr.io/ml6team/language_filter:latest

consumes:
  passages:
    fields:
      text:
        type: string

args:
  language:
    description: A valid language code or identifier (e.g., "en", "fr", "de").
    type: string
4 changes: 4 additions & 0 deletions components/language_filter/requirements.txt
@@ -0,0 +1,4 @@
fondant
pyarrow>=7.0
gcsfs==2023.4.00
fasttext==0.9.2
Binary file added components/language_filter/src/lid.176.ftz
Binary file not shown.
70 changes: 70 additions & 0 deletions components/language_filter/src/main.py
@@ -0,0 +1,70 @@
"""A component that filters a provided dataframe based on the language."""
import logging
import dask.dataframe as dd
from fondant.component import DaskTransformComponent
from fondant.logger import configure_logging
import fasttext

configure_logging()
logger = logging.getLogger(__name__)


class LanguageIdentification:
Contributor

Rather than including the ftz file, can we load it from the hub, since FastText is now hosted there?

Just:

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
model = fasttext.load_model(model_path)

Contributor Author

What speaks against including the ftz file in the repository? Alternatively, we could download the file during the image build process. I just want to avoid the component failing to execute because an external dependency cannot be reached.
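
One way to reconcile both positions in this thread is to prefer the bundled model file and fall back to a download only when it is missing. The sketch below is a hypothetical helper, not part of this PR: `resolve_model_path` and its `download` parameter are assumed names, and `download` stands in for any callable that returns a local path (for example a thin wrapper around the `hf_hub_download` call quoted above).

```python
from pathlib import Path
from typing import Callable


def resolve_model_path(local_path: str, download: Callable[[], str]) -> str:
    """Return a usable model path, preferring the bundled file.

    `download` is only invoked when `local_path` does not exist, so the
    component keeps working offline as long as the file ships with the image.
    """
    if Path(local_path).is_file():
        return local_path
    # Hypothetical fallback: the caller decides how the model is fetched.
    return download()
```

The result could then be passed to `fasttext.load_model`, so an unreachable external host only matters when the bundled file is absent.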

"""A class for language detection using FastText."""

    def __init__(self, model_path: str = "lid.176.ftz"):
        """
        Initializes the LanguageIdentification class.

        Args:
            model_path (str): The path to the FastText language identification model.
        """
        pretrained_lang_model_weight_path = model_path
        self.model = fasttext.load_model(pretrained_lang_model_weight_path)

    def predict_lang(self, text: str):
        """
        Detects the language of a text sequence.

        Args:
            text (str): The text for language detection.

        Returns:
            str: The predicted language label, e.g. "__label__de".
        """
        predictions = self.model.predict(text, k=1)
        return predictions[0][0]

    def is_language(self, row, language):
        # Compare the full label: a substring check would also match codes
        # such as "la" or "el" that occur inside the "__label__" prefix.
        return self.predict_lang(row["text"]) == f"__label__{language}"


class LanguageFilterComponent(DaskTransformComponent):
    """Component that filters rows based on the provided language."""

    def transform(
        self,
        *,
        dataframe: dd.DataFrame,
        language: str,
    ) -> dd.DataFrame:
        """
        Args:
            dataframe: Dask dataframe.
            language: Only keep text passages which are in the provided language.

        Returns:
            Dask dataframe
        """

        lang_detector = LanguageIdentification()
        mask = dataframe.map_partitions(
            lambda df: df.apply(lambda row: lang_detector.is_language(row, language), axis=1),
            meta=bool)

        return dataframe[mask]


if __name__ == "__main__":
    component = LanguageFilterComponent.from_args()
    component.run()
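
fastText's `predict` returns labels prefixed with `__label__` (for `lid.176.ftz`, for example `__label__de`) and raises a `ValueError` when the input contains a newline. A minimal, standard-library-only sketch of helpers that handle both points follows; `label_to_code`, `single_line`, and `matches_language` are hypothetical names that do not appear in this PR.

```python
LABEL_PREFIX = "__label__"


def label_to_code(label: str) -> str:
    """Strip the fastText label prefix, e.g. "__label__de" -> "de"."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def single_line(text: str) -> str:
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return " ".join(text.split())


def matches_language(label: str, language: str) -> bool:
    """Compare the language code exactly instead of using a substring check."""
    return label_to_code(label) == language
```

With a plain substring check, `"el" in "__label__de"` evaluates to true because `el` occurs inside the prefix, which is why the sketch compares the stripped code exactly.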
Empty file.
31 changes: 31 additions & 0 deletions components/language_filter/tests/language_filter_component_test.py
Member

Interesting to see these tests.

This could probably also be easier if we split the general component behavior from the user implementation into separate classes as discussed in chat. Since then we could test the user implementation without having to provide dummy variables for all the general component behavior.

@@ -0,0 +1,31 @@
import pandas as pd
from components.language_filter.src.main import LanguageFilterComponent
from fondant.component_spec import ComponentSpec
from dask.dataframe import from_pandas


def test_run_component_test():
    """Test language filter component"""

    # Given: Dataframe with text in different languages
    data = [{"text": "Das hier ist ein Satz in deutscher Sprache"}, {"text": "This is a sentence in English"},
            {"text": "Dit is een zin in het Nederlands"}]
    df = pd.DataFrame(data)
    ddf = from_pandas(df, npartitions=1)

    # When: The language filter component processes the dataframe
    # and filters out all entries that are not written in German
    spec = ComponentSpec.from_file("../fondant_component.yaml")

    component = LanguageFilterComponent(spec, input_manifest_path="./dummy_input_manifest.json",
                                        output_manifest_path="./dummy_input_manifest.json",
                                        metadata={},
                                        user_arguments={"language": "de"}
                                        )

    ddf = component.transform(dataframe=ddf, **component.user_arguments)

    # Then: dataframe only contains one German row
    df = ddf.compute()
    assert len(df) == 1
    assert df.loc[0]["text"] == "Das hier ist ein Satz in deutscher Sprache"
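
The review comment on this test file suggests testing the user logic separately from the general component behavior. That could look roughly like the sketch below. It is an illustration only: `filter_rows` and `stub_predict` are hypothetical names, plain dictionaries stand in for dataframe rows, and the stub replaces the fastText model so that no manifest or component spec is needed.

```python
from typing import Callable, Dict, Iterable, List


def filter_rows(
    rows: Iterable[Dict[str, str]],
    language: str,
    predict_label: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep rows whose predicted label equals "__label__<language>"."""
    expected = f"__label__{language}"
    return [row for row in rows if predict_label(row["text"]) == expected]


def stub_predict(text: str) -> str:
    """Stand-in for a real model: a fixed lookup for the test sentences."""
    known = {
        "Das hier ist ein Satz in deutscher Sprache": "__label__de",
        "This is a sentence in English": "__label__en",
        "Dit is een zin in het Nederlands": "__label__nl",
    }
    return known[text]


rows = [
    {"text": "Das hier ist ein Satz in deutscher Sprache"},
    {"text": "This is a sentence in English"},
    {"text": "Dit is een zin in het Nederlands"},
]
german_rows = filter_rows(rows, "de", stub_predict)
```

Because the predictor is injected, the filtering rule can be checked without loading a model file, while a separate test can still cover the real fastText predictions.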