How does MongoDB free text search work exactly? #9

Open
jotelha opened this issue Nov 10, 2023 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments


jotelha commented Nov 10, 2023

This question needs to be answered in the documentation.

jic-dtool/dtool-lookup-webapp#2 belongs here.

The following question reached me via email (translated from German):

It looks as if only whole words are matched, and there appear to be no wildcards. Then again, I can search for "toluen" and find many datasets whose README does not contain that string, only "toluene". Which data is indexed for the search? Only the README? And there only the values, not the keys? If I enter two separate words in the search, the two terms seem to be combined with "or". Can I somehow turn that into an "and"?

jotelha added the documentation label Nov 10, 2023

jotelha commented Nov 10, 2023

Internally, MongoDB stores a dataset as a JSON-like document such as

{
  _id: ObjectId("64907e6b59deeb406cde0e98"),
  uuid: '1a1f9fad-8589-413e-9602-5bbd66bfe675',
  dtoolcore_version: '3.17.0',
  name: 'simple_test_dataset',
  type: 'dataset',
  creator_username: 'jotelha',
  created_at: ISODate("2020-11-08T18:38:40.736Z"),
  frozen_at: ISODate("2020-11-08T19:42:05.691Z"),
  uri: 'smb://test-share/1a1f9fad-8589-413e-9602-5bbd66bfe675',
  base_uri: 'smb://test-share',
  readme: '---\n' +
    'project: testing project\n' +
    'description: testing description\n' +
    'owners:\n' +
    '  - name: Testing User\n' +
    '    email: [email protected]\n' +
    '    username: testing_user\n' +
    '    orcid: testing_orcid\n' +
    'funders:\n' +
    '  - organization: testing_organization\n' +
    '    program: testing_program\n' +
    '    code: testing_code\n' +
    'creation_date: 2020-11-08\n' +
    'expiration_date: 2022-11-08\n',
  manifest: {
    dtoolcore_version: '3.18.2',
    hash_function: 'md5sum_hexdigest',
    items: {
      eb58eb70ebcddf630feeea28834f5256c207edfd: {
        hash: '2f7d9c3e0cfd47e8fcab0c12447b2bf0',
        relpath: 'simple_text_file.txt',
        size_in_bytes: 17,
        utc_timestamp: 1689169397.658288
      }
    }
  },
  annotations: {},
  tags: [],
  number_of_items: 1,
  size_in_bytes: 17
}
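
Text search concerns string data, so it can help to see which parts of such a document hold strings at all. The following is a minimal sketch in plain Python, not code from the plugin: `string_field_paths` is a hypothetical helper, and the sample dictionary is an abbreviated copy of the document above.

```python
def string_field_paths(doc, prefix=""):
    """Yield the dotted path of every string value in a nested document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            child = f"{prefix}.{key}" if prefix else key
            yield from string_field_paths(value, child)
    elif isinstance(doc, list):
        for item in doc:
            yield from string_field_paths(item, prefix)
    elif isinstance(doc, str):
        yield prefix


# Abbreviated copy of the dataset document shown above (illustration only).
dataset = {
    "uuid": "1a1f9fad-8589-413e-9602-5bbd66bfe675",
    "name": "simple_test_dataset",
    "readme": "---\nproject: testing project\ndescription: testing description\n",
    "manifest": {
        "hash_function": "md5sum_hexdigest",
        "items": {
            "eb58eb70ebcddf630feeea28834f5256c207edfd": {
                "hash": "2f7d9c3e0cfd47e8fcab0c12447b2bf0",
                "relpath": "simple_text_file.txt",
                "size_in_bytes": 17,
            }
        },
    },
    "tags": [],
    "number_of_items": 1,
}

paths = sorted(string_field_paths(dataset))
for path in paths:
    print(path)
```

Note that the README is stored as one string, so its YAML keys such as `project` are part of that string value, whereas the item identifier under `manifest.items` appears only as a field name.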

The lines at https://github.com/jic-dtool/dtool-lookup-server-search-plugin-mongo/blob/0a65df3aeaf9c88e61664f9bb66da44d8de183fa/dtool_lookup_server_search_plugin_mongo/utils_search.py#L151-L156 build a text index over the whole document, that is, over every field that contains string data. MongoDB calls this a wildcard text index, see https://www.mongodb.com/docs/manual/core/indexes/index-types/index-text/create-wildcard-text-index/#std-label-create-wildcard-text-index
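
For readers unfamiliar with this index type, here is a minimal sketch of how a wildcard text index is typically created from Python. It is not a copy of the plugin code linked above. `ensure_wildcard_text_index` and `RecordingCollection` are hypothetical names; the latter is only a stand-in so the example runs without a database. With pymongo, a real `Collection` object would be passed instead.

```python
# "$**" is MongoDB's wildcard specifier; "text" is the index type
# (the same string as the pymongo.TEXT constant).
WILDCARD_TEXT_INDEX = [("$**", "text")]


def ensure_wildcard_text_index(collection):
    """Create a wildcard text index on a pymongo-style collection."""
    return collection.create_index(WILDCARD_TEXT_INDEX)


class RecordingCollection:
    """Stand-in for a pymongo Collection that only records index specs."""

    def __init__(self):
        self.created = []

    def create_index(self, keys, **kwargs):
        self.created.append(keys)
        # Build a name in the "<field>_<type>" style for illustration.
        return "_".join(f"{field}_{kind}" for field, kind in keys)


fake = RecordingCollection()
index_name = ensure_wildcard_text_index(fake)
print(index_name)

# Against a real server (database and collection names are assumptions):
#   from pymongo import MongoClient
#   ensure_wildcard_text_index(MongoClient()["dtool_info"]["datasets"])
```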

To me, this means the following behavior:

This needs to be reflected in the documentation.

More query examples at https://www.mongodb.com/docs/manual/reference/operator/query/text/
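
As a starting point for the documentation, the following sketch builds `$text` query documents for the cases raised in the email. The helper names are hypothetical, and the described matching behavior reflects my reading of the MongoDB `$text` documentation rather than tests against the lookup server.

```python
def text_query(search):
    """Wrap a search string in a MongoDB $text query document."""
    return {"$text": {"$search": search}}


def any_of(terms):
    """Plain space-separated terms: a document matches if ANY term occurs."""
    return text_query(" ".join(terms))


def all_of(terms):
    """Each term as a quoted phrase: ALL phrases have to occur."""
    return text_query(" ".join(f'"{term}"' for term in terms))


def excluding(term, unwanted):
    """A leading minus sign excludes documents containing that term."""
    return text_query(f"{term} -{unwanted}")


print(any_of(["toluene", "water"]))
print(all_of(["toluene", "water"]))
print(excluding("toluene", "water"))

# With pymongo, such a document would be passed to find(), e.g.
#   collection.find(all_of(["toluene", "water"]))
```

Two hedged observations that may answer the original question: text indexes stem terms according to the index language (English by default), which would plausibly explain why "toluen" finds documents containing "toluene"; and as far as I understand, quoted phrases are matched more literally than plain terms, so an "and" query built this way may behave differently with respect to stemming.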
