PaNOSC Search Scoring V2.x #29

Draft: wants to merge 19 commits into base: master


nitrosx (Collaborator) commented Feb 20, 2023

This PR implements incremental weight computation, which should scale better to a higher number of datasets.
It still needs a lot of testing and in-depth review.
I created it so that other people can work on it, as I do not have time to work on this at the moment.

VKTB commented Feb 22, 2023

Hi @nitrosx, posting my findings here as they will hopefully be useful to anyone that may be testing/reviewing/working on this PR.

From your last email, my understanding is that the /compute endpoint does not necessarily have to be used for the weights to be computed anymore. If I understood you correctly this PR changes the logic so that when a new item is inserted or an existing one is updated, the database should automatically update all the components of the weights that are influenced by the update. Also, when a query is sent, the database should compute the weights (on the fly) of the words extracted from the query and present in the items, and return the relevant ones.
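To make the behaviour described above concrete, here is a minimal in-memory sketch of incremental TF/IDF bookkeeping. It is not the code from this PR: the class name `IncrementalTfIdf`, its methods, the smoothed IDF formula and the plain sum used as a score are all assumptions chosen for illustration. Inserting or updating an item only touches the counts of the terms in that item, and weights are computed at query time for the query terms only.

```python
import math
from collections import Counter, defaultdict


class IncrementalTfIdf:
    """Toy index (hypothetical, not the PR's implementation)."""

    def __init__(self):
        self.tf = {}                # item id -> Counter of term occurrences
        self.df = defaultdict(int)  # term -> number of items containing it

    def remove(self, item_id):
        """Forget an item and decrement the document frequency of its terms."""
        old_counts = self.tf.pop(item_id, None)
        if old_counts is None:
            return
        for term in old_counts:
            self.df[term] -= 1
            if self.df[term] == 0:
                del self.df[term]

    def upsert(self, item_id, terms):
        """Insert or update one item; only the affected terms are touched."""
        self.remove(item_id)
        counts = Counter(terms)
        self.tf[item_id] = counts
        for term in counts:
            self.df[term] += 1

    def score(self, query_terms):
        """Compute weights on the fly, for the query terms only."""
        n_items = len(self.tf)
        scores = {}
        for item_id, counts in self.tf.items():
            total = sum(counts.values())
            value = 0.0
            for term in set(query_terms):
                if term in counts:
                    # Smoothed IDF; an assumption, the PR may use another formula.
                    idf = math.log((1 + n_items) / (1 + self.df[term])) + 1.0
                    value += (counts[term] / total) * idf
            if value > 0:
                scores[item_id] = value
        return scores
```

With this sketch, an item whose terms include "propos" and "part" would receive a non-zero score for the query terms `["propos", "part"]`, which is the behaviour the test below was expecting from the service.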

Listed below are the things I did and my findings:

  • I deleted the whole database and started from scratch.
  • I updated the configuration file to include incrementalWeightsComputation, which I set to True.
  • Populated the search scoring component with 100 documents by sending a POST request to the /items endpoint which included the documents.
  • I know that one of the documents has the id pid:123 and a summary that starts with "This proposal is part of a ...", so I sent a POST request to the /score endpoint with the following JSON: {"query": "This proposal is part of a"}.
  • I expected to get back the item with the id pid:123 and the summary "This proposal is part of a ..." (along with a score); however, as shown below, I did not get any items back.
{
    "request": {
        "query": "This proposal is part of a",
        "itemIds": [],
        "group": "",
        "limit": -1
    },
    "query": {
        "query": "This proposal is part of a",
        "terms": [
            "propos",
            "part"
        ]
    },
    "scores": [],
    "dimension": 0,
    "computeInProgress": false,
    "started": "2023-02-22T16:12:22.575992",
    "ended": "2023-02-22T16:12:22.581431"
}
  • I also tested it by specifying all the parameters in the JSON ({ "query": "This proposal is part of a", "group": "Documents", "limit": 1000, "itemIds": ["pid:123"] }), but I did not get any items back.
  • I also did not get any items back after modifying the values of some fields of the document with the id pid:123. It's worth mentioning that the PATCH /items/<id> endpoint does not behave as expected: it replaces the entire item rather than updating only the fields supplied in the request.
  • While populating the database and sending score requests, I kept checking the database and could only see the items collection in it, so no collections for weights etc.
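The PATCH observation in the list above comes down to the difference between replacing an item and merging into it. The sketch below contrasts the two on a plain dictionary. The function names and the flat item layout are hypothetical, and the merge is shallow (nested fields would need a recursive merge); it only illustrates the expected semantics, not how the service stores items.

```python
def replace_item(stored, payload):
    """Behaviour reported above: the stored item is overwritten by the payload."""
    return dict(payload)


def patch_item(stored, payload):
    """Behaviour usually expected from PATCH: only the supplied fields change."""
    merged = dict(stored)   # copy, so the stored item is not mutated
    merged.update(payload)  # shallow merge of the supplied fields
    return merged


# Hypothetical item layout, for illustration only.
stored = {
    "id": "pid:123",
    "group": "Documents",
    "summary": "This proposal is part of a ...",
}
update = {"summary": "Updated summary"}

replaced = replace_item(stored, update)  # loses "id" and "group"
patched = patch_item(stored, update)     # keeps "id" and "group"
```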

nitrosx (Collaborator, Author) commented Feb 22, 2023

@VKTB thank you so much for testing the new version and the details.
Would you be able to do the following on your testing environment:

  • Make a GET on /items and see if you get all your items back
  • Make a GET on /terms/count and check how many terms have been extracted
  • Make a GET on /terms and check the output.

You could also connect to the database directly and check whether there are any entries in the tf and idf collections.

Let me know.
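The three GET checks requested above can be scripted so that they are easy to repeat after each change. The following is a small sketch using only the Python standard library. The endpoint paths come from this comment; the base URL passed to `build_checks` is an assumption about the test environment, and the response bodies are returned as-is because their exact shape is not documented here.

```python
import json
import urllib.error
import urllib.request

# Endpoints named in the comment above.
ENDPOINTS = ("/items", "/terms/count", "/terms")


def build_checks(base_url):
    """Build one GET request per diagnostic endpoint."""
    base = base_url.rstrip("/")
    return [urllib.request.Request(base + path, method="GET") for path in ENDPOINTS]


def run_check(request, timeout=10):
    """Send one request and return (status, parsed JSON or raw text).

    HTTP errors such as a 500 are reported through the status code
    instead of being raised, so every endpoint gets checked.
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status, raw = response.status, response.read()
    except urllib.error.HTTPError as err:
        status, raw = err.code, err.read()
    text = raw.decode("utf-8", errors="replace")
    try:
        return status, json.loads(text)
    except json.JSONDecodeError:
        return status, text
```

A typical use would be to loop over `build_checks("http://localhost:8000")` (replace the address with that of your deployment), call `run_check` on each request, and print the status next to `request.full_url`.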

VKTB commented Feb 22, 2023

@nitrosx Thank you for your reply.

> • Make a GET on /items and see if you get all your items back

Yes, I can see all the items that I posted to the search scoring component.

> • Make a GET on /terms/count and check how many terms have been extracted

I get a 500 - Internal Server Error.

> • Make a GET on /terms and check the output.

I get an empty list back ([]), presumably because no terms were created when I inserted the items or modified the item with id pid:123?

> You could always connect to the database directly and see if there is any entry in the tf and idf collections

As I said in my previous comment, I can only see the items collection in the database, so there are no collections for weights, tf, idf etc. I am not sure why this is the case.

VKTB commented Jul 4, 2023

Hi @nitrosx, I thought I would post my findings here as well in case they are useful to anyone that may be testing/reviewing/working on this PR.

I pulled the latest changes from the v2.x branch and modified the docker-compose.yml file to build from the Dockerfile to ensure that the Docker image uses the latest code changes. I then tried testing the changes but I am getting the following error when I post an item of group Documents to the /items endpoint: 400 Bad Request – An exception of type TypeError occurred. Arguments:\n('string indices must be integers',).
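For anyone debugging this, the message "string indices must be integers" is what Python raises when a string is indexed with a string key, which typically happens when code expects a dict (or a list of dicts) but receives a string. One common way to get there is iterating over a single dict as if it were a list of items, since iteration yields the keys. The sketch below reproduces that pattern; the function names and the item layout are hypothetical, and this is only one possible cause, not a diagnosis of the code in this branch.

```python
def collect_ids(items):
    """Read the id of every item; expects a list of dicts."""
    return [item["id"] for item in items]


def as_item_list(payload):
    """Wrap a single item (a dict) in a list so callers can always iterate over items."""
    if isinstance(payload, dict):
        return [payload]
    return list(payload)


single_item = {"id": "pid:123", "group": "Documents"}

try:
    # Iterating over a dict yields its keys, so "id"["id"] is evaluated.
    collect_ids(single_item)
except TypeError as err:
    message = str(err)  # starts with "string indices must be integers"

ids = collect_ids(as_item_list(single_item))  # works once the payload is normalised
```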

I can see that the item gets added to the items collection in the database, but judging from the entry in the status collection (see below), the computation seems to be stuck: the entry has not changed for the past 30 minutes.

[
  {
    _id: ObjectId("64a3ed5180fa4d2d2668250e"),
    inProgress: true,
    incrementalWeightsComputation: true,
    progressDescription: 'Computing weights TF',
    progressPercent: 0.2
  }
]

2 participants