Implement fuzzy search in Masader #25

Draft
wants to merge 1 commit into
base: main

@AMR-KELEG AMR-KELEG commented Sep 6, 2024

Current behavior

Currently, Masader matches the search query against dataset fields, most importantly the dataset name, using exact (case-insensitive substring) matching. A less strict matching algorithm could improve the quality of the results.

For instance, here are the results of searching for "hate speech" and "hatespeech":

| Query | Results | Search query |
| --- | --- | --- |
| hate speech | [{"Name":"MLMA hate speech"},{"Name":"Religious Hate Speech"},{"Name":"Arabic OSACT5 : Arabic Hate Speech"},{"Name":"Arabic Hate Speech 2022 Shared Task"}] | "https://web-production-25a2.up.railway.app/datasets?query=Name.str.contains%28%27%28%3Fi%29hate+speech%27%29&features=Name" |
| hatespeech | [] | "https://web-production-25a2.up.railway.app/datasets?query=Name.str.contains%28%27%28%3Fi%29hatespeech%27%29&features=Name" |
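To make the current behavior concrete, here is a minimal sketch that decodes the second search URL above and emulates the filter locally. It assumes the backend evaluates the `query` parameter as a pandas-style `str.contains` regex filter; the sketch only imitates that with Python's `re` module on the four names from the table and does not call the actual API.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The search URL issued for the query "hatespeech" (copied from the table above).
url = (
    "https://web-production-25a2.up.railway.app/datasets"
    "?query=Name.str.contains%28%27%28%3Fi%29hatespeech%27%29&features=Name"
)

# Percent-decode the query string to see the filter expression being sent.
params = parse_qs(urlsplit(url).query)
decoded_filter = params["query"][0]
print(decoded_filter)

# Dataset names returned for "hate speech" in the table above.
names = [
    "MLMA hate speech",
    "Religious Hate Speech",
    "Arabic OSACT5 : Arabic Hate Speech",
    "Arabic Hate Speech 2022 Shared Task",
]


def substring_hits(pattern: str, candidates: list[str]) -> list[str]:
    """Emulate a case-insensitive `contains` filter (assumed backend behavior)."""
    regex = re.compile(f"(?i){pattern}")
    return [name for name in candidates if regex.search(name)]


hits_with_space = substring_hits("hate speech", names)
hits_without_space = substring_hits("hatespeech", names)
print(hits_with_space)
print(hits_without_space)
```

Because the filter is a literal substring match, dropping the space from the query is enough to miss every one of these names.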

Proposal

If we use difflib from the Python standard library for the matching (after lowercasing both the query and the dataset names), we get the following:

| Query | Results |
| --- | --- |
| hate speech | ['mlma hate speech', 'a-speechdb', 'religious hate speech', 'mediaspeech ', 'arabic hate speech 2022'] |
| hatespeech | ['mlma hate speech', 'religious hate speech', 'a-speechdb', 'arabic hate speech 2022', 'mediaspeech '] |

using the following code snippet:

import difflib
import pandas as pd
from datasets import load_dataset

masader = load_dataset("arbml/masader")
masader_df = pd.DataFrame(masader["train"])


# Run both spellings of the query through difflib's fuzzy matcher.
for search_query in ["hatespeech", "hate speech"]:
    print(
        difflib.get_close_matches(
            word=search_query.lower(),
            # Lowercase the dataset names so the comparison is case-insensitive.
            possibilities=masader_df["Name"].apply(lambda s: s.lower()),
            cutoff=0.6,  # the minimum normalized similarity score for a close match
            n=20,  # the maximum number of results to get after sorting using the similarity score
        )
    )

P.S.: I noticed that because the similarity score is normalized by the combined length of the query and the dataset name, short queries give long dataset names (e.g., Arabic Hate Speech 2022 Shared Task) lower similarity scores.

More on how the algorithm works here: https://en.wikipedia.org/wiki/Gestalt_pattern_matching#Sample
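The length effect described in the P.S. can be reproduced with a small sketch. `difflib.SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T is the total length of both strings, so the same matched characters count for less when the dataset name is long. The helper name `similarity` is made up for this illustration, and the two names are taken from the results in the first table.

```python
from difflib import SequenceMatcher


def similarity(name: str, query: str) -> float:
    """Return difflib's normalized similarity (2*M/T) after lowercasing both strings."""
    return SequenceMatcher(None, name.lower(), query.lower()).ratio()


query = "hatespeech"
short_name = "MLMA hate speech"
long_name = "Arabic Hate Speech 2022 Shared Task"

short_score = similarity(short_name, query)
long_score = similarity(long_name, query)

# Both names contain "hate" and "speech", but the longer name has a larger T,
# which pulls its score down.
print(f"{short_name!r}: {short_score:.3f}")
print(f"{long_name!r}: {long_score:.3f}")

# The cutoff used in the snippet above.
cutoff = 0.6
print("short name passes cutoff:", short_score >= cutoff)
print("long name passes cutoff:", long_score >= cutoff)
```

If long names should still surface for short queries, the cutoff or the scoring function would need to account for this normalization; which option fits Masader best is left open here.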

@AMR-KELEG AMR-KELEG marked this pull request as draft September 6, 2024 17:00

AMR-KELEG commented Sep 7, 2024

We will need to modify the query filters in this script as well: https://github.com/ARBML/masader/blob/98e1e175da707d8a3564b6b17c88793cde839d47/assets/js/search.js#L141
