
Create NREL-hosted instance of OpenPATH for ease of use by public agencies #721

Closed
shankari opened this issue Apr 18, 2022 · 78 comments · Fixed by e-mission/e-mission-server#851


@shankari
Contributor

shankari commented Apr 18, 2022

@aGuttman As we discussed, the plan is to have multiple enclaves, one for each program or study.

  • each enclave would contain multiple inter-connected containers and be accessible via reverse proxy
  • we would also create a study-specific landing page, on which we would describe the study and ask users to install OpenPATH and then scan a QR code to join the study
@shankari
Contributor Author

@aGuttman

The first step for this will be to create one enclave for internal NREL use.
Creating this enclave will require permission from multiple NREL departments:

  • cyber security
  • legal
  • contracting
  • communications
  • etc.

Since this is a complicated endeavor we will use an internal NREL project as an example to get started. @shankari will continue coordinating this since she has the context.

In parallel, we should start preparing for the more complex solution above. The immediate tasks that I can list to prepare include:

@shankari
Contributor Author

shankari commented Apr 18, 2022

This should fix #326 (once we expand from the NREL internal project to an open enrollment option) and provide a template for fixing #452

@aGuttman

Of these, I am first focusing on compatibility issues between MongoDB and DocumentDB. I will familiarize myself with both to the extent needed to understand the problems and what a proper solution should look like.

Secondarily, I will follow @shankari's communication and progress on creating the internal NREL enclave so that I can understand the process and be able to take point on future enclave creation.

@shankari
Contributor Author

shankari commented Apr 25, 2022

Another, more serious compatibility issue wrt DocumentDB vs. MongoDB: https://github.nrel.gov/nrel-cloud-computing/emissionlhd/issues/10#issuecomment-38430

Extracting the issue out here: the main problem is that operations on DocumentDB appear to be asynchronous, while MongoDB is synchronous. In particular, a write followed by a read does not return the newly written value.

Sample test

>>> import emission.core.get_database as edb
>>> from uuid import UUID

>>> def test_save_and_list():
...     edb.save(edb.get_pipeline_state_db(), {'user_id': UUID('39bf2b1c-e254-48da-aef1-e3cc7e6cf27f'), 'pipeline_stage': 6, 'curr_run_ts': 1622065279.3911324, 'last_processed_ts': None, 'last_ts_run': None})
...     print(list(edb.get_pipeline_state_db().find()))
...
>>> edb.get_pipeline_state_db().delete_many({})
<pymongo.results.DeleteResult object at 0x7f5c7a9a5280>
>>> test_save_and_list()
[]

Adding a sleep fixes it

>>> import time
>>> def test_save_and_list():
...     edb.save(edb.get_pipeline_state_db(), {'user_id': UUID('39bf2b1c-e254-48da-aef1-e3cc7e6cf27f'), 'pipeline_stage': 6, 'curr_run_ts': 1622065279.3911324, 'last_processed_ts': None, 'last_ts_run': None})
...     time.sleep(15)
...     print(list(edb.get_pipeline_state_db().find()))
...
>>> edb.get_pipeline_state_db().delete_many({})
<pymongo.results.DeleteResult object at 0x7f5c7a9a2b90>
>>> test_save_and_list()
[{'_id': ObjectId('60aec4b6be757e10c9069a27'), 'user_id': UUID('39bf2b1c-e254-48da-aef1-e3cc7e6cf27f'), 'pipeline_stage': 6, 'curr_run_ts': 1622065279.3911324, 'last_processed_ts': None, 'last_ts_run': None}]

@shankari
Contributor Author

I have currently worked around this on the NREL setup by adding a sleep to every DB call, but that is so bad and hacky that I will never merge it to master (https://github.nrel.gov/nrel-cloud-computing/emissionlhd/pull/16)

@shankari
Contributor Author

shankari commented Apr 25, 2022

As pointed out by @jgu2, here are the pages on DocumentDB related to the issue:

https://docs.aws.amazon.com/documentdb/latest/developerguide/how-it-works.html#how-it-works.replication
https://docs.aws.amazon.com/documentdb/latest/developerguide/transactions.html#transactions-isolation-level
https://docs.aws.amazon.com/documentdb/latest/developerguide/how-it-works.html#durability-consistency-isolation.read-consistency

The DocDB cluster is distributed: it has a primary node and replicas; the primary node supports read/write, and the replicas are read-only. Reads from a read replica are eventually consistent, often with less than 100ms replica lag. To keep read-after-write consistency, you can set up the read preference:

db.example.find().readPref('primary')
or 
db.example.find().readPref('primaryPreferred')

Since we should be using the timeseries interface (emission.storage.timeseries.abstract_timeseries) for most calls, there should not be a lot of direct calls to find(). So maybe we should start with this fix and see if it works?
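For reference, a minimal pymongo sketch of setting the read preference once at connection time instead of on each find() call (the URL and the collection name below are placeholders, not the actual staging configuration):

# Sketch: set the read preference once when connecting, instead of per-query.
from pymongo import MongoClient, ReadPreference

url = "mongodb://<user>:<password>@<docdb-host>:27017/?ssl=true&replicaSet=rs0&retryWrites=false"

# Equivalent to appending &readPreference=primaryPreferred to the URL
client = MongoClient(url, readPreference="primaryPreferred")

# Or override it for a single collection when only some reads need it
states = client.Stage_database.get_collection(
    "Stage_pipeline_state", read_preference=ReadPreference.PRIMARY_PREFERRED)

# Reads through either handle go to the primary (falling back to a replica only
# if the primary is unavailable), which avoids the read-after-write replica lag.
print(states.find_one())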

@shankari
Contributor Author

Although the real fix should be to look at the control flow and see where we have read-after-write dependencies.
Here are all the write calls:

$ egrep -R -l "insert_one|replace_one|update_one" emission | grep -v tests | grep -v .pyc

which currently returns

emission/net/usercache/builtin_usercache.py
emission/net/ext_service/push/notify_interface_impl/firebase.py
emission/net/api/usercache.py
emission/core/wrapper/user.py
emission/core/wrapper/client.py
emission/core/get_database.py
emission/storage/timeseries/builtin_timeseries.py

@shankari
Contributor Author

shankari commented May 2, 2022

Tasks for the documentDB support:

  • Talk to cyber and get access to the documentDB staging environment
  • Read the current documentDB support for indices and see whether HASHED and GEOSPHERE are now supported
    • If yes, remove the hack to remove those unsupported indices
    • If still not supported:
      • run the unit test suite against a mongodb on your laptop
      • run the unit test suite against the documentDB staging environment
      • if documentDB is comparable to mongodb on the laptop (no more than 10% slower), we will revisit in 6 months
      • if documentDB is significantly slower than mongodb on the laptop, instead of removing the indices, replace the HASHED index type with one that is supported (e.g. ASCENDING)
        • GEOSPHERE cannot be replaced by ascending/descending, but it is not used extensively just yet
  • for the timing issue, revise "read replicas" and the read-after-write problem

@shankari
Contributor Author

shankari commented May 2, 2022

Tasks for May 3rd @aGuttman

  • Request access to OpenPATH staging database
  • Get docker installed on your laptop
  • Figure out how to run all the tests
  • Investigate current status of DocumentDB indices
  • stretch goal: if they do now support HASHED and GEOSPHERE, remove the SIMPLE_INDEX hack from the docker file linked above and submit a PR

@shankari
Contributor Author

shankari commented May 3, 2022

We also need to have a demographic survey for each user. We have typically used external surveys (Qualtrics/Google Forms) before, but those have the following limitations:

  • they are not approved survey tools at NREL
  • they are not integrated with the rest of the data, so we can't ensure that we only pop up the survey if the user has not already answered it
  • we also can't experiment with fancy solutions where we break up the onboarding survey and ask the questions in little dribbles

A solution that would address all those issues is to store the survey information in mongodb, just like everything else. But we don't want to create a survey builder. Instead, we will use kobotoolbox to create a survey which we will display using the enketo library.

The UNSW group has already integrated with enketo core in the https://github.com/e-mission/e-mission-phone/tree/rciti branch, so let's start by exploring that approach.

This feature is tracked in #727

@shankari
Contributor Author

shankari commented May 6, 2022

From @jgu2:

It looks like you are using a hard-coded DB name, "Stage_database", in the emission code. As a best practice, it's recommended to use a different DB for each deploy environment; as we move towards a production deploy, a production DB will be created on the same DocDB cluster.

What's more, since OpenPATH needs multiple containers for different centers/organizations, ideally each center/organization would have its own DB setup.

We can set the DB name in an environment variable, for example DB_NAME, and you can access the corresponding DB via that variable. If we go this way, you will need to change your code to adapt to it. Otherwise, all your stage/production data from different centers/orgs would be written into the same DB in the future. What are your thoughts?

From @shankari:

Makes sense.

For the record, the reason that the DB/collection name is hardcoded is that I typically run multiple different MongoDB containers for the different instances. Too bad that isn't an option on the AWS cluster, but we can adapt it pretty easily, I think.

@aGuttman, can you take care of this after the index changes and the readPrefs?
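For reference, a minimal sketch of the DB_NAME environment-variable approach suggested above; the fallback to the current hard-coded "Stage_database" is an assumption, kept so that existing local setups keep working:

# Sketch of the DB_NAME environment-variable approach (the fallback name is an assumption)
import os
from pymongo import MongoClient

def get_current_db(url):
    db_name = os.environ.get("DB_NAME", "Stage_database")
    print("Connecting to database URL " + url + ", db = " + db_name)
    return MongoClient(url)[db_name]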

@shankari
Contributor Author

shankari commented May 9, 2022

@aGuttman will have the primary DB changes done by tomorrow:

  • Fix indices
  • Deal with read replicas and readPreferred
  • Configure DB name

Then he will work on #724
And then he will work on at least part of #709, the part about storing the generated models into the database instead of a file.

@aGuttman

aGuttman commented May 10, 2022

@aGuttman will have the primary DB changes done by tomorrow:

  • Fix indices
  • Deal with read replicas and readPreferred
  • Configure DB name

Then he will work on #724 And then he will work on at least part of #709, the part about storing the generated models into the database instead of a file.

  • Indices made compatible with DocumentDB. HASHED is not supported, while GEOSPHERE is, but is not allowed to be sparse. Replaced instances of HASHED with ASCENDING and removed sparse=true from instances of GEOSPHERE. No impact on speed noticed when running ./runAllTests.sh. Fix indexes e-mission-server#849 (a minimal sketch of these index changes appears after this list)

  • URL when connecting to remote documentDB instance can include a read preference setting. Setting this to primaryPreferred should avoid read-after-write problem without changing every find() call in code to find().readPref('primaryPreferred').
    Specifically, use a url in the format:
    mongodb://<user>:<password>@documentdb-moderate-1.<rest_of_host>:27017/?ssl=true&ssl_ca_certs=<path_to_certs>/rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=primaryPreferred&retryWrites=false
    in db.config (e-mission-server/conf/storage/) when connecting to documentDB.

  • Investigating configuring the DB name through the URL as well. Should be able to include a name and then parse it off using pymongo's uri_parser, as in:
    https://stackoverflow.com/questions/52536093/get-the-database-name-from-a-pymongo-object
    Not sure if this is the best way to do it, but it seemed attractive after seeing the URL solution to read preference, since it keeps all these considerations in a single part of the already used db.config file.
    The current hard-coded name Stage_database is mainly a concern in get_database.py, but the string "Stage_database" appears in:

    • ./emission/core/__pycache__/get_database.cpython-37.pyc
    • ./emission/core/get_database.py
    • ./emission/analysis/classification/inference/Classifier.ipynb
    • ./emission/integrationTests/storageTests/TestMongodbAuth.py
    • ./emission/storage/timeseries/abstract_timeseries.py
    • ./emission/storage/timeseries/__pycache__/abstract_timeseries.cpython-37.pyc
    • ./bin/historical/migrations/rename_client_ts_to_ts.py
    • ./bin/plot_trip.py
    • ./setup/db_auth.py
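For reference, a minimal pymongo sketch of the index changes described in the first bullet above (the collection and field names here are illustrative, not necessarily the exact ones used in the codebase):

# Sketch of the DocumentDB-compatible index changes; names are illustrative
import pymongo
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017/")["Stage_database"]["Stage_timeseries"]

# Before (MongoDB only): hashed index, and a sparse geosphere index
# coll.create_index([("user_id", pymongo.HASHED)])
# coll.create_index([("data.loc", pymongo.GEOSPHERE)], sparse=True)

# After (works on both): HASHED -> ASCENDING, GEOSPHERE kept but no longer sparse
coll.create_index([("user_id", pymongo.ASCENDING)])
coll.create_index([("data.loc", pymongo.GEOSPHERE)])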

@shankari
Contributor Author

shankari commented May 10, 2022

@aGuttman

URL when connecting to remote documentDB instance can include a read preference setting. Setting this to primaryPreferred should avoid read-after-write problem without changing every find() call in code to find().readPref('primaryPreferred').

can you confirm that you tested this? the simple test-after-set function above should be sufficient

@shankari
Contributor Author

shankari commented May 10, 2022

Not sure if this is the best way to do it, but thought it was attractive after seeing the URL solution to read preference to keep all these considerations in a single part of the already used db.config file.

@aGuttman Agree that this is attractive. What are some additional design options, and can you list the pros and cons before choosing this one?

@shankari
Contributor Author

shankari commented May 10, 2022

@aGuttman wrt this list, here's my initial pass-through. ** marks the files that you need to investigate and potentially adapt. The others can either be ignored or removed, although double check their history once as well 😄

    ./emission/core/__pycache__/get_database.cpython-37.pyc : compiled artifact, ignore
    ./emission/core/get_database.py : already identified
    ./emission/analysis/classification/inference/Classifier.ipynb : fairly obsolete, ignore
    **./emission/integrationTests/storageTests/TestMongodbAuth.py**: this is a test of the authenticated connections to mongodb. Use it to test the solution we come up with
    **./emission/storage/timeseries/abstract_timeseries.py**: Look at the history/blame for the code which includes it, and see if it can be rewritten to use get_database.
    ./emission/storage/timeseries/__pycache__/abstract_timeseries.cpython-37.pyc: compiled artifact, ignore
    ./bin/historical/migrations/rename_client_ts_to_ts.py: historical script, ignore. Maybe add a readme to the directory indicating that these are for reference and are not guaranteed to work now.
    ./bin/plot_trip.py: pretty sure this is obsolete. Check history and remove.
    **./setup/db_auth.py**: this is a script to set up the mongodb auth properly based on username and password. Adapt it to handle the DB name as well.

@shankari
Contributor Author

shankari commented May 10, 2022

@aGuttman we don't store anything outside that database, so do you even need to parse it out? Maybe if you pass in a URL with the database embedded, you don't need to specify the database any more. From the SO post that you linked, it looks like maybe something like the following would just work

MongoClient("mongodb://<user>:<password>@documentdb-moderate-1.<rest_of_host>:27017/<dbName>?ssl=true&ssl_ca_certs=<path_to_certs>/rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=primaryPreferred&retryWrites=false").Stage_uuids.count_documents()

you could then replace

print("Connecting to database URL "+url)
_current_db = MongoClient(url).Stage_database

with

print("Connecting to database URL "+url)
_current_db = MongoClient(url)

@aGuttman

@aGuttman

URL when connecting to remote documentDB instance can include a read preference setting. Setting this to primaryPreferred should avoid read-after-write problem without changing every find() call in code to find().readPref('primaryPreferred').

can you confirm that you tested this? the simple test-after-set function above should be sufficient

I did test this with the test_save_and_list() example typed out above and it did work, though I'm confused about how to interpret the behavior. primary, primaryPreferred and secondaryPreferred all produced the correct output. secondary times out. Invalid settings cause a warning, but the correct output is still produced. In my testing I haven't been able to get an empty list back like the example. I'm not including any sleeps.

Using primary read preference in the URL to avoid read-after-write problems is definitely an intended feature, as seen in the first example under Multiple Connection Pools here: https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-to-replica-set.html

Also, if we don't take advantage of replica sets, it seems like we could just leave out the rs0 part of the URL.

If you omit /?replicaSet=rs0, the client routes all requests to the cluster endpoint, that is, your primary instance

@shankari
Contributor Author

In my testing I haven't been able to get an empty list back like the example. I'm not including any sleeps.

That might be because your connection to the database is slow so it acts as an implicit sleep. That is also probably why the tests take a long time to run. I think we can rely on the documentation for now (maybe leave out the rs0) and revisit if we still have issues.

@aGuttman

@aGuttman we don't store anything outside that database, so do you even need to parse it out? Maybe if you pass in a URL with the database embedded, you don't need to specify the database any more. From the SO post that you linked, it looks like maybe something like the following would just work

MongoClient("mongodb://<user>:<password>@documentdb-moderate-1.<rest_of_host>:27017/<dbName>?ssl=true&ssl_ca_certs=<path_to_certs>/rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=primaryPreferred&retryWrites=false").Stage_uuids.count_documents()

you could then replace

print("Connecting to database URL "+url)
_current_db = MongoClient(url).Stage_database

with

print("Connecting to database URL "+url)
_current_db = MongoClient(url)

I don't think this works. I'm trying to do something like it now and it's returning errors like TypeError: 'Collection' object is not callable. If you meant to call the 'delete_many' method on a 'Database' object it is failing because no such method exists. I might be misunderstanding something, though I am able to get it to work parsing out the name and using dictionary style access.

url = config_data["timeseries"]["url"]
parsed = pymongo.uri_parser.parse_uri(url)
...
print("Connecting to database URL " + url)
_current_db = MongoClient(url)[parsed['database']]

It would be nice to avoid more parsing than necessary, but I think the name does need to be specified on the client object (?) in order to use it. Given that, it needs to be passed somehow from an external source. We can make it its own field in a json, but that just changes where the parse happens. Maybe the pymongo parser is slower, but we only need it once at the start and it just feels cleaner to me to specify it all in the url instead of adding another field to the config (or making an additional config file). Those are the only options I'm able to think of right now, and of those, I just like the url option better, even at the cost of using a potentially slightly less effective parser.
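Putting those pieces together, a minimal sketch of what the connection code could look like with the DB name embedded in the URL (assuming conf/storage/db.conf is the JSON file that config_data above is read from; the fallback to "Stage_database" when the URL has no database component is also an assumption, not something decided here):

# Sketch: parse the database name out of the configured URL once, at startup
import json
import pymongo
from pymongo import MongoClient

with open("conf/storage/db.conf") as fp:
    config_data = json.load(fp)

url = config_data["timeseries"]["url"]
parsed = pymongo.uri_parser.parse_uri(url)
# fall back to the existing hard-coded name if the URL has no database component
db_name = parsed["database"] or "Stage_database"

print("Connecting to database URL " + url)
_current_db = MongoClient(url)[db_name]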

@aGuttman

Also, let's hold off on my pull request for a minute. tests/netTests/TestBuiltinUserCacheHandlerInput.py is failing. I can go back on all my changes except the ones that need to be there (swap ASCENDING for HASHED, remove GEOSPHERE entirely or just remove sparse=True) but testMoveDuplicateKey and testTwoLongTermCalls fail. I'm going to look at what those two do.

@shankari
Contributor Author

If you see https://github.nrel.gov/nrel-cloud-computing/emissionlhd/issues/10#issuecomment-38429, laptop to DocumentDB always worked with the read preference.

on the laptop, the communication overhead for the call is high, which effectively functions as a sleep

You need to get @jgu2 to create an AWS EC2 instance (maybe to ssh into the staging instance?) and then run the script from there.

@shankari
Contributor Author

Given that, it needs to be passed somehow from an external source. We can make it its own field in a json, but that just changes where the parse happens. Maybe the pymongo parser is slower, but we only need it once at the start and it just feels cleaner to me to specify it all in the url instead of adding another field to the config (or making an additional config file). Those are the only options I'm able to think of right now, and of those, I just like the url option better, even at the cost of using a potentially slightly less effective parser.

This makes sense to me. Please submit a PR for this as well. You probably want to communicate the new URL format (after it is finalized) with @jgu2 so he can set the environment variables correctly.

@shankari
Contributor Author

  • Get access to an AWS EC2 instance to test out the read preference
  • Debug and fix the test failures on DocumentDB
  • Clean up if needed and submit PR for DB name

@shankari
Contributor Author

shankari commented May 10, 2022

@aGuttman every line is different because the timestamps are different. If you grep -v all the lines related to "Updated result for" and cut -d ":" -f ... to remove all the timestamp stuff, you should be able to see exactly what is different. Or at least, narrow it down significantly.

@aGuttman

Ran 237 tests in 39327.855s

FAILED (failures=2, errors=15)

Just about 11 hours to run the tests remotely this time. No sleeping this time. I guess the timer from the last run didn't count that time either.

@aGuttman

For the TestBuiltinUserCacheHandlerInput.py test failures:

  • It appears to be a timing issue. Different individual tests in the file will fail from run to run, sometimes even at different places.
  • Adding sleeps at certain points seems to help some failures, but not always, and often getting one test to pass is coupled with a new one failing.
  • It does not appear to be an issue with replicas/read preference. Changing these in the url at least hasn't corresponded with more or less success.
  • These failures do not seem to be a result of the index changes, which are the only code changes being tested. Commenting out/replacing the index changes doesn't affect the tests.
  • The failure is almost always an assert statement enforcing that a user cache is empty after calling moveToLongTerm() from emission/net/usercache/builtin_usercache_handler.py
  • ie AssertionError: 30 != 0 on
# Then we move entries for user1 into longterm
enuah.UserCacheHandler.getUserCacheHandler(self.testUserUUID1).moveToLongTerm()

# Now, we have two sets of entries, so we will have 60 entries in longterm
self.assertEqual(len(self.uc1.getMessage()), 0)

Again, adding sleeps around this call seems to help sometimes, but not always.

Looking at the definition of moveToLongTerm() and seeing that the problem is that the cache is failing to be emptied, it seems like the problem would be in its call to clearProcessedMessages() in emission/net/usercache/builtin_usercache.py but that just boils down to

def clearProcessedMessages(self, timeQuery, key_list=None):
        del_query = self._get_msg_query(key_list, timeQuery)
        logging.debug("About to delete messages matching query %s" % del_query)
        del_result = self.db.delete_many(del_query)
        logging.debug("Delete result = %s" % del_result)

There doesn't seem to be very much online discussion of DocumentDB issues in general, and I'm not finding much about synchronization issues with delete_many(), but I don't know. I'm not sure what to do from here.

@shankari
Contributor Author

@aGuttman I would:

  1. document what you did with the sleeps and the amount of the sleep and document the failures that you encountered
  2. try my prior sleep hack and see if that fixes it

Note that if we have moveToLongTerm() followed by an assertion to 0 but the actual value is 30, this could also just be a read of stale information.

@shankari
Contributor Author

The next thing I would try is to come up with an MRE that doesn't use any of the openpath code, but just pymongo directly.
For example, something like:

from pymongo import MongoClient

# url is the configured DocumentDB connection URL from db.conf
test_db = MongoClient(url).Test_inconsistency_database
test_coll = test_db.Test_collection
lengths = [10, 100, 1000, 10000]
for l in lengths:
    # Insert l entries for l = 10, 100...
    for i in range(l):
        test_coll.insert_one({"i": i})
    # busy loop until we see all l documents
    while test_coll.count_documents({}) < l:
        # potentially add a sleep here
        print(test_coll.count_documents({}))
    # Delete all the entries
    test_coll.delete_many({})
    # busy loop until we see 0 documents
    while test_coll.count_documents({}) > 0:
        # potentially add a sleep here
        print(test_coll.count_documents({}))

@shankari
Contributor Author

Re-ran with additional logging. Results this time were:

grep AssertionError *
test_run_more_logs_1:AssertionError: 60 != 0
test_run_more_logs_2:AssertionError: 30 != 0
test_run_more_logs_3:AssertionError: 30 != 0
test_run_more_logs_4:AssertionError: 60 != 0
test_run_more_logs_5:AssertionError: 60 != 0
test_run_more_logs_6:AssertionError: 60 != 0
test_run_more_logs_7:AssertionError: 60 != 0
test_run_more_logs_8:AssertionError: 60 != 0
test_run_more_logs_9:AssertionError: 60 != 0

Let's start by comparing run 1 and run 2

@shankari
Contributor Author

shankari commented May 17, 2022

So the new logs don't actually have the additional logs that I thought I added, but I was able to use them anyway.

The difference seems to happen when we have multiple entries with the same write_ts. Note that we sort our entries by write_ts. In that case, it looks like MongoDB returns them in a consistent order, but DocumentDB sometimes returns them in a different order.

For example, consider runs 1 and 2 above.

For the first set of entries processed (write_ts = 1652644367 on the first run and 1652644404 on the second run), we read entries in the order (activity, location, transition) for both runs.

[Screenshot: Screen Shot 2022-05-16 at 10 03 04 PM]

For write_ts = 1652644367 and 1652644434 right below, we read entries in the order (activity, location, transition) in one and in the order (transition, activity, location) in the second.

[Screenshot: Screen Shot 2022-05-16 at 9 58 21 PM]

and for the last set of entries processed before the tests diverge, the order on the second run is with the location last.

[Screenshot: Screen Shot 2022-05-16 at 10 12 36 PM]

So the last timestamp is the incorrect one and we don't delete anything and the test fails.

[Screenshot: Screen Shot 2022-05-16 at 10 13 38 PM]

@aGuttman not sure what the order should be on MongoDB vs documentDB if the sort key is the same.

@shankari
Contributor Author

And in the final failure on run 1 also, location is last

2022-05-15 12:57:52,107:DEBUG:4475928064:module_name = emission.net.usercache.formatters.android.activity
2022-05-15 12:57:52,107:DEBUG:4475928064:write_ts = 1652644607
2022-05-15 12:57:52,107:DEBUG:4475928064:module_name = emission.net.usercache.formatters.android.transition
2022-05-15 12:57:52,107:DEBUG:4475928064:Timestamp conversion: 1652644607 -> 1652644607 done
2022-05-15 12:57:52,107:DEBUG:4475928064:write_ts = 1652644607
2022-05-15 12:57:52,108:DEBUG:4475928064:Mapped local.state.waiting_for_trip_start -> 1
2022-05-15 12:57:52,108:DEBUG:4475928064:No existing timestamp, copyied from metadata1652644607
2022-05-15 12:57:52,108:DEBUG:4475928064:module_name = emission.net.usercache.formatters.android.location
2022-05-15 12:57:52,108:DEBUG:4475928064:write_ts = 1652644.607

@aGuttman

I should have checked back here earlier, I just found the same thing myself and was excited to report back.

I added an extra log to format_location_raw() and it looks like it always creates a number that is unreasonably small. There doesn't seem to be a legitimate example where dividing by 1000 creates a time stamp that we want. Can we just get rid of the division? Then the ordering shouldn't matter.

I guess to add to that, I never see format_location_simple() being used. Perhaps if ("mLatitude" in entry.data): is branching off the wrong condition?

I think that's the logical root of this. All entries have a mLatitude. entry.data["mLatitude"] is 37.3885529 for all of the entries being tested which makes it seem like some sort of default value. I would look into it further right now, but I just spent 5 minutes getting really frustrated with grep before realizing I was in a completely wrong directory, so I think I'm at my limit for the day.

@aGuttman

Ok, I didn't really understand how the test examples were being generated before but I see now.

All entries are based off of those in emission/tests/data/netTests/. android.location.raw.txt from there is an "old style" entry with an mLatitude field and a write_ts in milliseconds (13 digits). However, when we populate the user cache with entries, we overwrite write_ts with times based on the current time, in seconds (10 digits). So we are giving these old style android location entries a write_ts in the wrong format, which was always a problem, regardless of what database we use. The problem was hidden on MongoDB since the entries always came in (activity, location, transition) order, so the location entries were never last and thus their write_ts were never used in the query to clearProcessedMessages().

While the ordering being consistent on mongo and not on document is notable and good to know going forward, the way location entries were formatted was already wrong.

@aGuttman

aGuttman commented May 17, 2022

I tried giving location entries a write_ts in milliseconds in setUp() to see what would happen, thinking it would probably break moveToLongTerm()'s query, but it didn't. Turns out sync_phone_to_server() in emission/net/api/usercache.py already checks for timestamps in milliseconds and converts them.

if ecc.isMillisecs(data["metadata"]["write_ts"]):
    data["metadata"]["write_ts"] = old_div(float(data["metadata"]["write_ts"]), 1000)

if "ts" in data["data"] and ecc.isMillisecs(data["data"]["ts"]):
    data["data"]["ts"] = old_div(float(data["data"]["ts"]), 1000)

Given this, is there any case where timestamps in the user cache can be in milliseconds? What other functions can put data in the user cache?

Is format_location_raw() from emission/net/usercache/formatters/android/location.py ever used on entries not in the user cache? If no, we can remove the division from it. If yes, we need another function or a check inside the existing function for entries that have already been converted. Otherwise we divide timestamps for old style entries by 1000*1000 instead of 1000.

@aGuttman

Adding the check

    if ecc.isMillisecs(entry.metadata.write_ts):
        metadata.write_ts = old_div(float(entry.metadata.write_ts), 1000)
    else:
        metadata.write_ts = entry.metadata.write_ts

to format_location_raw() fixes the problem.

However, this pattern is copied from sync_phone_to_server() in emission/net/api/usercache.py, which includes the comment "Hack to deal with milliseconds until we have moved everything over", which implies that this is not a great long term fix.

Additionally, format_location_raw() is preceded by the comment:

TODO: Remove the RAW code since we don't really have any clients that
are sending it. But while it is there, we should still convert ms to sec
since the old data did have that in place

Is this a task worth the effort? Hunting down how milliseconds are dealt with and making sure that they all get divided once, but not twice (or more) and cleaning the code of these hacks? Seems like a pain, but I don't understand the way things are processed enough to judge whether it's worth it or not.

@aGuttman

Added commits for the DB name configuration through the URL and the hack around the millisecond division to my pull request.

@shankari
Contributor Author

shankari commented May 17, 2022

@aGuttman I understand the urge to fix the test, and yes, we can remove the hacks/workarounds finally now that we don't use "old style" formats any more.

But wrt the meeting that we are going to attend in a few hours with @jgu2, the issue is another inconsistency between mongodb and documentDB

In other words:

  • how many other places in the code does this inconsistency affect?
  • how many other such inconsistencies between mongo and documentDB exist?

It's great that we have a test framework and automated tests, but we don't currently have a defensible estimate of the code coverage for the tests.

So my first question holds: "not sure what the order should be on MongoDB vs documentDB if the sort key is the same." Is this actually documented somewhere as an inconsistency?

In other other words, wrt:

While the ordering being consistent on mongo and not on document is notable and good to know going forward, the way location entries were formatted was already wrong.

The way that location entries were formatted was not wrong, it was right for that obsolete code branch that ended up being inconsistent with that particular test case. Since we don't send "old style" entries from the phones any more, the location formatting is not a problem. The more serious issue is that it exposed this inconsistency between mongodb and documentDB, and we don't know the extent of its effect on the codebase.

@aGuttman

Understood. Another hit against the use of DocumentDB is the lack of online resources covering it. There is the official Amazon documentation and very little else. If I don't understand something from the documentation or suspect it might be inaccurate, there are very often no discussions about it.

I looked for information about the ordering inconsistency and didn't find anything.

@shankari
Contributor Author

@aGuttman ok, let's see what the cloud services (@jgu2) team says about a mongodb cluster.

@aGuttman

The way that location entries were formatted was not wrong, it was right for that obsolete code branch that ended up being inconsistent with that particular test case.

I should have been more specific. The test inserts a timestamp in seconds into old style entries, which should be in milliseconds, and when the old style formatter is applied to them, it misformats them since it was expecting milliseconds. Inserting the seconds timestamp into the old style entries in this test was already creating misformatted entries, regardless of database.

@shankari
Contributor Author

Inserting the seconds timestamp into the old style entries in this test was already creating misformatted entries, regardless of database.

Agreed. But the "incorrect format of the entries in the database" are:

  • something we would not encounter in the real world any more
  • something that the test doesn't really care about since we don't explicitly check for it

What we care about is that the correct number of entries is moved to long term

Maybe we should care about the format of the entries as well and not just their being copied correctly, but that is part of the code coverage discussion...

@shankari
Contributor Author

@aGuttman I see that you have included the fix for the incorrectly formatted ts in your PR, but I'd rather just remove this obsolete code path than hack around it some more.

I originally included the hack in 2015 (e-mission/e-mission-server@8cab86f) to provide backwards compatibility and a consistent migration path while changing the location format, since the phone clients can take a while to be upgraded.

It is now almost 5 years later. We are working on publishing a brand new client. Keeping obsolete code paths around increases complexity unnecessarily, and (as we saw) can make the codebase harder to reason about.

Let's remove this codepath completely. @aGuttman I know it can be scary to remove code that you haven't written, but creative destruction is an important part of keeping code manageable. LMK if you would like me to make the change instead this first time.

@shankari
Contributor Author

wrt "Maybe we should care about the format of the entries as well", I checked and we do have a test explicitly for that

https://github.com/e-mission/e-mission-server/blame/2177582fdea9a52faa8c1a238e62b111dbbcedba/emission/tests/netTests/TestFormatters.py#L39

Of course, in this case, the write_ts is not overridden so it still works.

@aGuttman

I did some more looking around while on the call and did find documentation of sort ordering issues:
https://docs.aws.amazon.com/documentdb/latest/developerguide/functional-differences.html#functional-differences.result-ordering
If I understand it correctly, this confirms that our problem is known: sorts will be ordered by the key they are given, but nothing else is guaranteed. Also, if something is sorted by one field and then sorted by another field, MongoDB will preserve the first ordering within ties in the second, whereas DocumentDB does not.

I don't know if this occurs and is important anywhere else in our code, but if it does, doing a compound sort could fix it. I added a compound sort to getMessage()

retrievedMsgs = list(self.db.find(combo_query).sort([("metadata.write_ts", pymongo.ASCENDING),("metadata.key", pymongo.ASCENDING)]).limit(edb.result_limit))

This prevents locations from coming last, and it works. Adding this and removing my hack fix allows the tests to pass.

@shankari
Contributor Author

@aGuttman @jgu2 We are looking into a couple of options:

  1. using mongodb in a container as a transition while we experiment further with documentDB
     • this will require additional cyber approval and may take more time
     • it is only a stopgap
  2. go ahead with documentDB
     • get the current tests to pass by removing the code path for the "old style" locations
     • figure out a way to ask our questions to an AWS engineer
     • figure out a way to run tests in the AWS environment
     • determine the code coverage of the existing codebase and potentially bulk up on tests to ensure that we don't have other gremlins lurking in the code

tracking integration with a code coverage tool at #729

@aGuttman

aGuttman commented May 23, 2022

I missed something important.

Traceback (most recent call last):
  File "emission/tests/storageTests/TestMongoGeoJSONQueries.py", line 35, in testGeoWithinPostsData
    for a in self.Sections.find({ "loc" : { "$geoWithin" : { "$polygon" :[ [ 90,31 ],[90,40] ,[ 110,40 ],[110,31]] } } }):

...

pymongo.errors.OperationFailure: unknown geo specifier: $polygon, full error: {'ok': 0.0, 'operationTime': Timestamp(1653342214, 1), 'code': 2, 'errmsg': 'unknown geo specifier: $polygon'}

When I ran the unit tests remotely, I had the problem of my computer going to sleep several times. That seemed to produce some timeout errors, which I thought weren't a worry, because when I went back and reran the tests without sleeping/individually, they seemed to go away.

I must have overlooked this one somehow. Also, since it produces an error in testing instead of a failure, I think I may have filed it in my mind as lower priority and then forgot to track it anywhere after looking into the failures.

From https://docs.aws.amazon.com/documentdb/latest/developerguide/geospatial.html:

Amazon DocumentDB does not support querying or indexing of Polygons, LineString, MultiPoint, MultiPolygon, MultiLineString, and GeometryCollection.

Which is odd, because earlier on the same page:

A good example of inclusion querying is to find all points (all airports) that located in a specified area/polygon (state of New York). A good example of intersection query is finding a polygon (state) which intersects with a point (city). You can use the following Geospatial operators to gain insights from your data.
...

  • $geoWithin - $geoWithin is a find operator that supports finding documents with geospatial data that exists entirely within a specified shape such as a polygon.

This, of course, all works fine on my local mongodb.

@shankari
Contributor Author

shankari commented May 23, 2022

@aGuttman great catch! We should definitely add this to the list of DocumentDB compat issues to flag.

Fortunately, IIRC, we are not actively using the $geoWithin functionality right now.

To think this through:

  • I used to use it for heatmaps, but we won't have the deployer dashboard up immediately.
  • I also use it for targeted surveys (people who typically travel a route), but again, we can defer that until we actually have some data to survey people about
  • I will want to use it to automatically split the open access data into geographic regions, but that is in the future

@aGuttman is there a way I can get a list of queries executed by a mongodb server in a past month or something? We can then verify that we are not using geoWithin in the CEO setup.
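One possible way to get such a list (a sketch on the MongoDB side, not something already set up): turn on MongoDB's built-in profiler and scan the system.profile collection. Note that the profiler is off by default and system.profile is a small capped collection, so this captures recent activity going forward rather than a full month of past history:

# Sketch: profile all operations on the MongoDB server, then look for $geoWithin
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["Stage_database"]

# Level 2 = record all operations (level 1 = only slow ones)
db.command("profile", 2)

# ... let the server run for a while, then scan the recorded operations ...
for op in db["system.profile"].find():
    if "$geoWithin" in str(op.get("command", op.get("query", {}))):
        print(op["ts"], op.get("ns"))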

@shankari
Contributor Author

@aGuttman I see that you have made the changes for e-mission/e-mission-server#849
I will merge it as soon as the tests finish.

Can you run the tests against DocumentDB from the newly created test AWS instance (see instructions from @jgu2) and see if there are any additional timing issues that we encounter in the AWS <-> AWS environment?

@shankari
Contributor Author

shankari commented May 27, 2022

Now that e-mission/e-mission-server#849 has been merged and #727 has been resolved, we are ready to deploy the NREL OpenPATH instance to staging.

Let's track that at #732

In the meanwhile, @aGuttman will continue to test the server in the AWS environment.

@aGuttman

aGuttman commented Jun 6, 2022

Running unit tests on AWS ec2 instance with DocumentDB:

  • runAllTests.sh runs in about 40 minutes, as opposed to ~15 when running locally with mongoDB
    • Generates 15 errors
      • 5 errors due to use of polygon queries
        • Issue discussed above
      • 10 errors due to auth error on pymongo's drop_database()
        • From David Rager: "FYI – on shared clusters we do not allow create/drop database permissions to protect the assets of other projects on cluster."
    • Other than these expected errors, passes all other tests

Database Name

Database names are configurable through conf/storage/db.conf. Use the desired name after :27017/ in the url.
ie:
mongodb://<insertYourUsername>:<insertYourPassword>@documentdb-moderate-1.cluster-ccklrxkcenui.us-west-2.docdb.amazonaws.com:27017/<db_name>?<...>

Note that cloud services will still need to create the database with the desired name and give the user access to it. This only accesses existing DBs that have been set up with the proper permissions, it does not create them.

Read Pref on DocumentDB

  • Setting read preference through database url seems to be the cleanest solution
    • Put &readPreference=primaryPreferred in url
    • ie: mongodb://<insertYourUsername>:<insertYourPassword>@documentdb-moderate-1.cluster-ccklrxkcenui.us-west-2.docdb.amazonaws.com:27017/<db_name>?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=primaryPreferred&retryWrites=false
  • Solves the read-after-write issue documented above
    • The read-after-write test fails when tested on EC2 with readPreference=secondaryPreferred and no sleep. It succeeds with readPreference=primaryPreferred and no sleep. When tested with the server run locally connecting to remote DocumentDB, the network creates enough delay that it never fails, regardless of read pref.

Code Coverage on DocumentDB

Code coverage drops from 72% with MongoDB to 70% on DocumentDB. The following files make up this drop:

  • analysis/classification/inference/mode/seed/pipeline.py
    • 90% to 21%
  • emission/analysis/classification/inference/mode/seed/section_features.py
    • 38% to 13%
  • core/common.py
    • 82% to 69%
  • emission/core/get_database.py
    • 70% to 68%
  • emission/tests/analysisTests/modeinferTests/TestPipelineSeed.py
    • 98% to 27%
  • emission/tests/storageTests/TestMongoGeoJSONQueries.py
    • 90% to 68%

Looking at this quickly, it appears that these drops are almost all due to the errors mentioned above causing tests to quit early (ie: pipeline.py has functions that get called in /emission/tests/analysisTests/modeinferTests/TestPipelineSeed.py, which has a drop_database() early on in a few tests, so it never calls those functions.) The differences between MongoDB and DocumentDB are not so large that they cause different branches to fire between the test cases.

Encountered DocumentDB Incompatibilities

  • Hashed indices
    • Not supported. Switched hashed to ascending. Makes no noticeable time difference in local MongoDB test.
  • Geosphere indices
    • Supported, but cannot be sparse. Removed sparse, makes no noticeable time difference in local MongoDB tests.
  • Stable sorting
    • MongoDB sorts are stable and consistent. DocumentDB sorts are not. The best thing to do is to avoid design decisions that rely on stable sorting, but if necessary, compound sorts can be used to create similar (but not exactly the same) results*.
  • Polygon queries/indices
    • Not supported. No easy workaround found.
  • Replica Sets/Read Preference
    • read-after-write issue documented above. Solved with readPreference=primaryPreferred in the url.
*Sorting Details: MongoDB sorts return the same result over the same collection of items every time. That is, ties will always be broken the same way; whatever was first in the original collection will be first in the result. This is known as stable sorting. DocumentDB does not guarantee this. Sorts will not be incorrect, but could differ from run to run: items that are tied might switch places. Ideally, things would be designed in a way where this is not a problem, but stable sorting can hide assumptions. Say we sort Collection by Index1, ascending. Later we sort Collection by Index2, ascending. After this second sort, we might unknowingly rely on the fact that no item with a lower Index1 value can follow an item with a higher Index1 value when they all have equal Index2 values. Making this assumption in DocumentDB will cause errors. We can fix this by doing a compound sort by (Index2 ascending, Index1 ascending), but we must notice that this is happening first, which is tricky to see. Even with a compound sort, we are not guaranteed exactly the same result in DocumentDB that we got in MongoDB, because ties are still broken "randomly" when they exist (ie: Index2 and Index1 are equal); we just have a second (or more) rule to fall back on after the first, but orderings can still change run to run. That said, if this causes problems, the code is very fragile and the design should be looked at anyway.
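A short pymongo illustration of the scenario just described (Collection, Index1 and Index2 are hypothetical names):

# Illustrative sketch of the stable-sort assumption and the compound-sort fix
import pymongo
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017/")["Test_sorting"]["Collection"]

# Fragile: relying on a previous sort's order surviving ties only holds on MongoDB
fragile = coll.find().sort("Index2", pymongo.ASCENDING)

# Robust: make the tie-break explicit with a compound sort (works on both)
robust = coll.find().sort([("Index2", pymongo.ASCENDING), ("Index1", pymongo.ASCENDING)])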
