Implement fuzzy search in Masader #25

AMR-KELEG · 2024-09-06T17:00:24Z

Current behavior

Currently, Masader filters results with an exact, case-insensitive substring match of the search query against a field, most importantly the dataset name. A less strict matching algorithm could improve the quality of the results.

For instance, here are the results of searching for hate speech and hatespeech:

Query	Results	Request URL
hate speech	[{"Name":"MLMA hate speech"},{"Name":"Religious Hate Speech"},{"Name":"Arabic OSACT5 : Arabic Hate Speech"},{"Name":"Arabic Hate Speech 2022 Shared Task"}]	"https://web-production-25a2.up.railway.app/datasets?query=Name.str.contains%28%27%28%3Fi%29hate+speech%27%29&features=Name"
hatespeech	[]	"https://web-production-25a2.up.railway.app/datasets?query=Name.str.contains%28%27%28%3Fi%29hatespeech%27%29&features=Name"
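
The behavior in the table above can be reproduced without the API. The following is a minimal sketch, assuming the filter amounts to a case-insensitive substring test on the name; `substring_search` and `NAMES` are hypothetical names used only for illustration, and the list contains just the four names shown in the table.

```python
import re

# The four dataset names returned for "hate speech" in the table above.
NAMES = [
    "MLMA hate speech",
    "Religious Hate Speech",
    "Arabic OSACT5 : Arabic Hate Speech",
    "Arabic Hate Speech 2022 Shared Task",
]


def substring_search(query, names):
    """Return the names that contain `query` as a case-insensitive substring."""
    pattern = re.compile(re.escape(query), flags=re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


print(substring_search("hate speech", NAMES))  # all four names match
print(substring_search("hatespeech", NAMES))   # no name contains "hatespeech"
```

Because the query must appear verbatim, dropping the space is enough to lose every result.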

Proposal

If we use Python's standard-library difflib module for the matching (after lowercasing both the query and the dataset names), we get the following:

Query	Results
hate speech	['mlma hate speech', 'a-speechdb', 'religious hate speech', 'mediaspeech ', 'arabic hate speech 2022']
hatespeech	['mlma hate speech', 'religious hate speech', 'a-speechdb', 'arabic hate speech 2022', 'mediaspeech ']

using the following code snippet:

import difflib
import pandas as pd
from datasets import load_dataset

masader = load_dataset("arbml/masader")
masader_df = pd.DataFrame(masader["train"])


for search_query in ["hatespeech", "hate speech"]:
    print(
        difflib.get_close_matches(
            word=search_query.lower(),
            possibilities=masader_df["Name"].str.lower().tolist(),
            cutoff=0.6,  # the minimum normalized similarity score for a close match
            n=20,  # the maximum number of results to return, sorted by similarity score
        )
    )
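
The snippet above prints lowercased names, whereas an endpoint would presumably need to return the original ones. One way to handle that is sketched below, assuming a plain list of names as input; `fuzzy_search` and its parameters are hypothetical and not part of Masader.

```python
import difflib


def fuzzy_search(query, names, cutoff=0.6, limit=20):
    """Return the original-cased names whose lowercased form is close to `query`."""
    # Map each lowercased name back to the first original spelling seen.
    original_by_lowered = {}
    for name in names:
        original_by_lowered.setdefault(name.lower(), name)

    hits = difflib.get_close_matches(
        query.lower(),
        list(original_by_lowered),
        n=limit,
        cutoff=cutoff,
    )
    # get_close_matches returns the best match first; keep that order.
    return [original_by_lowered[hit] for hit in hits]


example_names = [
    "MLMA hate speech",
    "Religious Hate Speech",
    "Arabic Hate Speech 2022 Shared Task",
]
print(fuzzy_search("hatespeech", example_names))
```

With the default cutoff of 0.6, the longest of the three example names is filtered out for the query hatespeech, which is the length effect described in the P.S. below.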

P.S.: I noticed that the similarity score is normalized by the combined length of the query and the dataset name, so for short queries, long dataset names (e.g., Arabic Hate Speech 2022 Shared Task) end up with lower similarity scores.
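
To make the length effect concrete, here is a small sketch that recomputes the score by hand. It relies on the documented definition of `SequenceMatcher.ratio()`, which is 2*M/T, where M is the number of matched characters and T is the total length of both strings. The helper `similarity` is a hypothetical name used only for this illustration.

```python
from difflib import SequenceMatcher


def similarity(query, name):
    """Return (ratio, matched_chars, total_chars) for a query and a dataset name."""
    matcher = SequenceMatcher(None, name.lower(), query.lower())
    # M: total size of all matching blocks (the final dummy block has size 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both strings.
    total = len(name) + len(query)
    return matcher.ratio(), matched, total


for name in ["MLMA hate speech", "Arabic Hate Speech 2022 Shared Task"]:
    ratio, matched, total = similarity("hatespeech", name)
    print(f"{name!r}: M={matched}, T={total}, ratio={ratio:.3f}")
```

Both names share the same matched characters with the query (hate and speech), but the longer name has a much larger T, which pushes its score below the 0.6 cutoff.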

More on how the algorithm works here: https://en.wikipedia.org/wiki/Gestalt_pattern_matching#Sample

AMR-KELEG · 2024-09-07T13:03:50Z

We will need to modify the query filters in this script as well: https://github.com/ARBML/masader/blob/98e1e175da707d8a3564b6b17c88793cde839d47/assets/js/search.js#L141

Add a comment about fuzzy search

3dfa0c3

AMR-KELEG marked this pull request as draft September 6, 2024 17:00
