Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryStore: replace mongomock with pymongo-inmemory #846

Open
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

rkingsbury
Copy link
Collaborator

@rkingsbury rkingsbury commented Aug 25, 2023

Summary

Major changes:

pymongo-inmemory creates a thin wrapper around MongoClient() using a downloaded instance of mongod, so it guaranteed to work and perform exactly like "real" pymongo, unlike mongomock which has some limitations and inconstencies.

Performance impact

I did a basic timing test for connecting to a FileStore (400 files) on a system with slow (HDD) storage. On first connection, pymongo-inmemory was slower by about 50% (19s vs. 13s), presumably because it has to download and spin up the mongod instance. On subsecquent fs.connect calls pymongo-inmemory was at least as fast as mongomock (12s vs. 13s).

On a system with faster storage (SSD) and 800 files, pymongo-inmemory and mongomock had comparable speed (26-30s for pymongo-inmemory vs. 27s for mongomock)

Todos

If this is work in progress, what else needs to be done?

  • Pending response to issue, move configuration settings to pyproject.toml - (I need to open a PR for this; not necessary to complete before merging this PR).
  • Resolve remaining CI problems for which I need assistance. See below.

Remaining test failures / CI problems

I need some help (please!) from @munrojm and maybe @gpetretto to resolve the remaining test failures. I'll break them down one by one

test_jsonstore_orjson_options (@gpetretto )

The changes here prevent the JSONEncodeError from being raised, for reasons I haven't yet identified.

    def test_jsonstore_orjson_options(test_dir):
        class SubFloat(float):
            pass
    
        with ScratchDir("."):
            jsonstore = JSONStore("d.json", read_only=False)
            jsonstore.connect()
            with pytest.raises(orjson.JSONEncodeError):
>               jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
E               Failed: DID NOT RAISE <class 'TypeError'>

tests/stores/test_mongolike.py:498: Failed

By updating MemoryStore to use "real" Mongo, I was able to inherit the update method from MongoStore rather than use the custom update method that had previously been used. That custom update is now inside MontyStore. I suspect that a subtle difference between the two is causing this test to fail, but I haven't been able to pin it down. I would appreciate a 2nd set of eyes. I suggest comparing the two update methods side by side as a starting point.

test_submission_patch (@munrojm )

For some reason the API is now generating a Bad Response code instead of code 200. This must somehow be related to owner_store which is a MemoryStore, but I'm not familiar enough with the API to figure out the connection

    def test_submission_patch(owner_store, post_query_op, patch_query_op):
/home/runner/work/maggma/maggma/tests/cli/test_serial.py:5: PytestCollectionWarning: cannot collect test class 'TestBuilder' because it has a __init__ constructor (from: tests/cli/test_serial.py)
        endpoint = SubmissionResource(
  class TestBuilder(Builder):
            store=owner_store,
            get_query_operators=[PaginationQuery()],
            post_query_operators=[post_query_op],
            patch_query_operators=[patch_query_op],
            calculate_submission_id=True,
            model=Owner,
        )
        app = FastAPI()
        app.include_router(endpoint.router)
    
        client = TestClient(app)
        update = json.dumps({"last_updated": "2023-06-22T17:32:11.645713"})
    
        assert client.get("/").status_code == 200
>       assert client.patch(f"/?name=PersonAge9&update={update}").status_code == 200
E       assert 400 == 200
E        +  where 400 = <Response [400]>.status_code
E        +    where <Response [400]> = <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>>('/?name=PersonAge9&update={"last_updated": "2023-06-22T17:32:11.645713"}')
E        +      where <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>> = <starlette.testclient.TestClient object at 0x7f75aa0f4f10>.patch

tests/api/test_submission_resource.py:118: AssertionError

ServerSelectionTimeout Errors

Occasionally some tests error out because the MemoryStore times out. I suspect this occurs because pymongo-inmemory is downloading and spinning up its own mongod before it can run the tests. It looks like this wasn't a problem after the most recent update, so perhaps it will take care of itself. If the issue persists, I will experiment with having pymongo-inmemory use the built in mongo instance provided on the GH runner instead of downloading its own.

Checklist

  • Google format doc strings added.
  • Code linted with ruff. (For guidance in fixing rule violates, see rule list)
  • Type annotations included. Check with mypy.
  • Tests added for new features/fixes.
  • I have run the tests locally and they passed.

@mkhorton
Copy link
Member

+1 I’ve observed some serious differences between mongomock and pymongo that can result in some tedious debugging! I think the idea of using real Mongo is much preferred.

@@ -538,14 +535,50 @@
assert jsonstore.last_updated > start_time


def test_eq(mongostore, memorystore, jsonstore):
def test_eq(mongostore, memorystore, jsonstore, montystore):
assert montystore == montystore

Check warning

Code scanning / CodeQL

Comparison of identical values Warning test

Comparison of identical values; use cmath.isnan() if testing for not-a-number.
src/maggma/stores/mongolike.py Fixed Show fixed Hide fixed
Comment on lines +639 to +644
# def __del__(self):
# """
# Ensure collection is dropped from memory on object destruction, even if .close() has not been called.
# """
# if self._coll is not None:
# self._collection.drop()

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@rkingsbury
Copy link
Collaborator Author

@munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have

(adding this comment b/c I'm not sure if editing the post would trigger an email notification).

@munrojm
Copy link
Member

munrojm commented Aug 29, 2023

@munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have

(adding this comment b/c I'm not sure if editing the post would trigger an email notification).

Just committed a fix for the patch test. Was an issue with how the body data was being passed to the patch method of the test client.

@gpetretto
Copy link
Contributor

Thanks @rkingsbury for updating the MemoryStore. The limited number of features supported by mongomock was ineed an issue.
Concerning the test that fails, I made a few tests and figured out what is happining. The reason why I introduced those features and that test was because it was failing when the document contains a FloatWithUnit. I added that option since orjson does not handle by default subclasses of native python types, but allows to pass a function to deal with values that cannot serialize. With the mongomock implementation, when passing something like

jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})

the dictionary stored internally still contained an instance of SubFloat. So when update_json_file was executed orjson needed to know how to convert it. It seems that now MongoDB already stores the instance of SubFloat as a simple float. So, when it is queried back to perform update_json_file it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, as MongoDB will already deals with those cases before getting to orjson. I am not sure if it would be better to remove the serialization_default option at this point.

@rkingsbury
Copy link
Collaborator Author

I added that option since orjson does not handle by default subclasses of native python types, but allows to pass a function to deal with values that cannot serialize. With the mongomock implementation, when passing something like

jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})

the dictionary stored internally still contained an instance of SubFloat. So when update_json_file was executed orjson needed to know how to convert it. It seems that now MongoDB already stores the instance of SubFloat as a simple float. So, when it is queried back to perform update_json_file it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, as MongoDB will already deals with those cases before getting to orjson.

Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

I am not sure if it would be better to remove the serialization_default option at this point.

But it sounds like you're saying that if I were to do this:

a = SubFloat(1.1.)
jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
jsonstore.query({"task_id": 3})

I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.

@gpetretto
Copy link
Contributor

Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

Indeed I think that this would have avoided the need for #791. I will try to make a test with the new version and see if there is no issue with that.

But it sounds like you're saying that if I were to do this:

a = SubFloat(1.1.)
jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
jsonstore.query({"task_id": 3})

I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.

Yes, you will get back a float. A simple example is:

class SubFloat(float):
    pass

ms = MemoryStore()
ms.connect()

d = {"test": SubFloat(1.1), "normal": 1.1, "task_id": 3}
ms.update(d)

out_d = ms.query_one()
print(out_d["test"].__class__, out_d["normal"].__class__)

The output with the old version is:

<class '__main__.SubFloat'> <class 'float'>

While with the new version is:

<class 'float'> <class 'float'>

In the end it is not really a problem, since this makes the MemoryStore more similar to the MongoStore. However, any object that is not handled by jsanitize (e.g. MSONable, dataclasses, ...) will probably be converted to a simple type during the update() (a string in the worst case). So I don't know if there is any corner case where the object survives after the insertion in MongoDB, but that cannot be handled by orjson by default. I suppose leaving the option will not do much harm (at least will not force me to make a backward incompatible change to use the latest version of maggma), however I couldn't find a way to create a test example to replace the test_jsonstore_orjson_options.

@rkingsbury rkingsbury self-assigned this Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhancement: more performant MemoryStore MemoryStore __eq__ does not behave as expected
4 participants