Create NREL-hosted instance of OpenPATH for ease of use by public agencies #721
The first step for this will be to create one enclave for internal NREL use.
Since this is a complicated endeavor, we will use an internal NREL project as an example to get started. @shankari will continue coordinating this since she has the context.
In parallel, we should start preparing for the more complex solution above. The immediate tasks that I can list to prepare include:
Of these, I am first focusing on compatibility issues between mongodb and documentDB. I will familiarize myself with both to the extent needed to understand the problems present and what a proper solution should look like. Secondarily, I will pay attention to @shankari's communication and progress on creating the internal NREL enclave so that I can understand the process and be able to take point on future enclave creation.
Another, more serious compatibility issue wrt DocumentDB vs. mongodb: https://github.nrel.gov/nrel-cloud-computing/emissionlhd/issues/10#issuecomment-38430 Extracting the issue out here: the main problem is that operations on DocumentDB appear to be asynchronous, while mongodb is synchronous. In particular, a write followed by a read does not return the newly written value. Sample test:
Adding a sleep fixes it
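The failing pattern and the sleep workaround look roughly like this (a sketch with placeholder URL and names, not the exact sample test):

```python
import time
import pymongo

# Placeholder URL; the real test ran against the DocumentDB cluster.
client = pymongo.MongoClient("mongodb://localhost:27017/")
coll = client["Stage_database"]["Test_collection"]  # hypothetical names

# Write, then immediately read back.
coll.insert_one({"user_id": "test-user", "value": 42})
print(coll.find_one({"user_id": "test-user"}))  # can print None on DocumentDB

# Workaround: sleep before reading.
time.sleep(1)
print(coll.find_one({"user_id": "test-user"}))  # now returns the document
```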
I have currently worked around this on the NREL setup by adding a sleep to every DB call, but that is so bad and hacky that I will never merge it to master (https://github.nrel.gov/nrel-cloud-computing/emissionlhd/pull/16)
As pointed out by @jgu2, here are the pages on DocumentDB related to the issue: https://docs.aws.amazon.com/documentdb/latest/developerguide/how-it-works.html#how-it-works.replication
Since we should be using the timeseries interface (…)
Although the real fix should be to look at the control flow and see where we have read-after-write dependencies.
which currently returns
Tasks for the documentDB support:
Tasks for May 3rd @aGuttman
We also need to have a demographic survey for each user. We have typically used external surveys (Qualtrics/Google Forms) before, but those have the following limitations:
A solution that would address all those issues is to store the survey information in mongodb, just like everything else. But we don't want to create a survey builder. Instead, we will use kobotoolbox to create a survey which we will display using the enketo library. The UNSW group has already integrated with enketo core in the https://github.com/e-mission/e-mission-phone/tree/rciti branch, so let's start by exploring that approach. This feature is tracked in #727
From @jgu2:
From @shankari:
@aGuttman, can you take care of this after the index changes and the …
Can you confirm that you tested this? The simple test-after-set function above should be sufficient.
@aGuttman Agree that this is attractive. What are some additional design options, and can you list the pros and cons before choosing this one?
@aGuttman wrt this list, here's my initial pass-through. ** are the files that you need to investigate and potentially adapt. The others can either be ignored or removed, although double-check their history once as well 😄
@aGuttman we don't store anything outside that database, so do you even need to parse it out? Maybe if you pass in a URL with the database embedded, you don't need to specify the database any more. From the SO post that you linked, it looks like maybe something like the following would just work
you could then replace … with …
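A sketch of that idea (placeholder URL; `get_default_database()` is the pymongo call that resolves the database named in the URL path, so no separate parsing of the name is needed):

```python
import pymongo

# Hypothetical URL with the database name ("Stage_database") embedded in the path.
url = "mongodb://user:pass@db-host:27017/Stage_database?readPreference=primary"
client = pymongo.MongoClient(url)

# get_default_database() returns the database named in the URL path;
# it raises ConfigurationError if the URL did not name one.
db = client.get_default_database()
print(db.name)  # -> Stage_database
```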
I did test this with the test-after-set function above. Using primary read preference in the URL to avoid read-after-write problems is definitely an intended feature, as seen in the first example under Multiple Connection Pools here: https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-to-replica-set.html Also, if we don't take advantage of replica sets, it seems like we could just leave out the rs0 part of the URL.
That might be because your connection to the database is slow, so it acts as an implicit sleep. That is also probably why the tests take a long time to run. I think we can rely on the documentation for now (maybe leave out the rs0 part, as you suggest).
I don't think this works. I'm trying to do something like it now and it's returning errors like
It would be nice to avoid more parsing than necessary, but I think the name does need to be specified on the client object (?) in order to use it. Given that, it needs to be passed somehow from an external source. We could make it its own field in a JSON config, but that just changes where the parse happens. Maybe the pymongo parser is slower, but we only need it once at the start, and it just feels cleaner to me to specify it all in the URL instead of adding another field to the config (or making an additional config file). Those are the only options I can think of right now, and of those, I just like the URL option better, even at the cost of a potentially slightly less effective parser.
Also let's hold off on my pull request for a minute.
If you see https://github.nrel.gov/nrel-cloud-computing/emissionlhd/issues/10#issuecomment-38429, laptop to DocumentDB always worked with the read preference.
You need to get @jgu2 to create an AWS EC2 instance (maybe to ssh into the staging instance?) and then run the script from there.
This makes sense to me. Please submit a PR for this as well. You probably want to communicate the new URL format (after it is finalized) to @jgu2 so he can set the environment variables correctly.
@aGuttman every line is different because the timestamps are different. If you …
Just about 11 hours to run the tests remotely this time, and the laptop did not go to sleep this time. I guess the timer from the last run didn't count that time either.
For the …
Again, adding sleeps around this call seems to help sometimes, but not always. Looking at the definition of …
There doesn't seem to be much discussion of documentDB issues online in general, and I'm not finding much about synchronization issues with …
@aGuttman I would:
Note that if we have …
The next thing I would try is to come up with an MRE that doesn't use any of the openpath code, but just pymongo directly.
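Such an MRE might look like this (a sketch with a placeholder URL; point it at the DocumentDB cluster to reproduce):

```python
import pymongo

# Placeholder URL; replace with the DocumentDB connection string.
client = pymongo.MongoClient("mongodb://localhost:27017/")
coll = client["mre_db"]["mre_coll"]
coll.drop()

# Write-then-read in a tight loop and count how often the read misses the write.
misses = 0
for i in range(1000):
    coll.insert_one({"_id": i})
    if coll.find_one({"_id": i}) is None:
        misses += 1

print(f"{misses}/1000 reads did not see the preceding write")
```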
Re-ran with additional logging. Results this time were:
Let's start by comparing run 1 and run 2.
So the new logs don't actually have the additional logs that I thought I added, but I was able to use them anyway. The difference seems to happen when we have multiple entries with the same …
For example, consider runs 1 and 2 above. For the first set of entries processed (…), for write_ts = …, and for the last set of entries processed before the tests diverge, the order on the second run has the location entry last. So the last timestamp is the incorrect one, we don't delete anything, and the test fails.
@aGuttman not sure what the order should be on MongoDB vs documentDB if the sort key is the same.
And in the final failure on run 1 also, the location entry is last
I should have checked back here earlier; I just found the same thing myself and was excited to report back. I added an extra log to …
I guess to add to that, I never see … I think that's the logical root of this. All entries have a mLatitude.
Ok, I didn't really understand how the test examples were being generated before, but I see now. All entries are based off of those in …
While the ordering being consistent on mongo and not on document is notable and good to know going forward, the way location entries were formatted was already wrong.
I tried giving location entries a write_ts in milliseconds in
Given this, is there any case where timestamps in the user cache can be in milliseconds? What other functions can put data in the user cache? Is …
Adding the check
to …. However, this pattern is copied from …. Additionally, …
Is this a task worth the effort? Hunting down how milliseconds are dealt with, making sure that they all get divided once but not twice (or more), and cleaning the code of these hacks? It seems like a pain, but I don't understand the way things are processed well enough to judge whether it's worth it or not.
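For reference, the check being described is presumably something like this hypothetical helper (the actual committed code and threshold may differ):

```python
# Hypothetical helper, not the actual PR code.
def to_seconds(ts):
    # Epoch seconds stay below 10 digits until the year 2286, so a much
    # larger value was almost certainly recorded in milliseconds.
    if ts > 9999999999:
        return ts / 1000
    return ts

assert to_seconds(1651555200) == to_seconds(1651555200000)
```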
Added the server-name-in-URL configuration and the hack around millisecond division as commits to my pull request.
@aGuttman I understand the urge to fix the test, and yes, we can remove the hacks/workarounds finally now that we don't use "old style" formats any more. But wrt the meeting that we are going to attend in a few hours with @jgu2, the issue is another inconsistency between mongodb and documentDB. In other words:
It's great that we have a test framework and automated tests, but we don't currently have a defensible estimate of the code coverage for the tests. So my first question holds: "not sure what the order should be on MongoDB vs documentDB if the sort key is the same." Is this actually documented somewhere as an inconsistency? In other words, wrt:
The way that location entries were formatted was not wrong; it was right for that obsolete code branch that ended up being inconsistent with that particular test case. Since we don't send "old style" entries from the phones any more, the location formatting is not a problem. The more serious issue is that it exposed this inconsistency between mongodb and documentDB, and we don't know the extent of its effect on the codebase.
Understood. Another hit against the use of documentDB is the lack of online resources covering it. There is the official Amazon documentation and very little else. If I don't understand something from the documentation or suspect it might be inaccurate, there are very often no discussions about it. I looked for information about the ordering inconsistency and didn't find anything.
I should have been more specific. The test inserts a timestamp in seconds into old style entries which should be in milliseconds, and when the old style formatter is applied to them, it misformats them since it was expecting milliseconds. Inserting the seconds timestamp into the old style entries in this test was already creating misformatted entries, regardless of database.
Agreed. But the "incorrect format of the entries in the database" cases are:
What we care about is that the correct number of entries is moved to long-term storage. Maybe we should care about the format of the entries as well and not just that they are copied correctly, but that is part of the code coverage discussion...
@aGuttman I see that you have included the fix for the incorrectly formatted ts in your PR, but I'd rather just remove this obsolete code path than hack around it some more. I originally included the hack in 2015 (e-mission/e-mission-server@8cab86f) to provide backwards compatibility and a consistent migration path while changing the location format, since the phone clients can take a while to be upgraded. It is now almost 5 years later. We are working on publishing a brand new client. Keeping obsolete code paths around increases complexity unnecessarily, and (as we saw) can make the codebase harder to reason about. Let's remove this codepath completely. @aGuttman I know it can be scary to remove code that you haven't written, but creative destruction is an important part of keeping code manageable. LMK if you would like me to make the change instead this first time.
wrt "Maybe we should care about the format of the entries as well", I checked and we do have a test explicitly for that Of course, in this case, the write_ts is not overridden so it still works. |
I did some more looking around while on the call and did find documentation of sort ordering issues: …
I don't know if this occurs and is important anywhere else in our code, but if it is, doing a compound sort could fix it. I added a compound sort to …
This prevents location entries from coming last, and it works. Adding this and removing my hack fix allows the tests to pass.
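A sketch of what that compound sort looks like in pymongo (the URL, collection, filter, and secondary key here are illustrative, not the exact code):

```python
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder
coll = client["Stage_database"]["Stage_usercache"]  # hypothetical collection
query = {"user_id": "test-user"}  # hypothetical filter

# Sorting on write_ts alone leaves ties in arbitrary order on DocumentDB;
# a secondary key (metadata.key here) makes the tie-breaking deterministic.
docs = coll.find(query).sort([
    ("metadata.write_ts", pymongo.ASCENDING),
    ("metadata.key", pymongo.ASCENDING),
])
```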
@aGuttman @jgu2 We are looking into a couple of options:
tracking integration with a code coverage tool at #729 |
I missed something important.
When I ran the unit tests remotely, I had the problem of my computer going to sleep several times. These seemed to produce some timeout errors, which I thought weren't a worry, because when I went back and reran the tests individually, without the computer sleeping, they seemed to go away. I must have overlooked this one somehow. Also, since it produces an error in testing instead of a failure, I think I filed it in my mind as lower priority and then forgot to track it anywhere after looking into the failures. From https://docs.aws.amazon.com/documentdb/latest/developerguide/geospatial.html:
Which is odd, because earlier on the same page:
This, of course, all works fine on my local mongodb. |
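For context, this is the kind of standard MongoDB geospatial usage involved (a sketch with hypothetical collection and field names; the specific operator quoted from the AWS docs above is what differs on DocumentDB):

```python
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder
coll = client["Stage_database"]["Stage_analysis_timeseries"]  # hypothetical

# A 2dsphere index plus a $geoWithin polygon query: routine on MongoDB,
# but DocumentDB's geospatial support is more limited (see the AWS page above).
coll.create_index([("data.loc", pymongo.GEOSPHERE)])
results = coll.find({
    "data.loc": {
        "$geoWithin": {
            "$geometry": {
                "type": "Polygon",
                "coordinates": [[[-105.3, 39.7], [-105.3, 39.8],
                                 [-105.1, 39.8], [-105.1, 39.7],
                                 [-105.3, 39.7]]],
            }
        }
    }
})
```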
@aGuttman great catch! We should definitely add this to the list of DocumentDB compat issues to flag. Fortunately, IIRC, we are not actively using the …
To think this through:
@aGuttman is there a way I can get a list of queries executed by a mongodb server over the past month or something? We can then verify that we are not using …
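One way to get such a list is MongoDB's built-in database profiler, though note it only records operations from the time it is enabled, so it cannot recover last month's queries retroactively. A sketch:

```python
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder
db = client["Stage_database"]

# Level 2 records every operation to the system.profile collection.
db.command("profile", 2)

# ... run the workload, then inspect what was executed:
for op in db["system.profile"].find({}, {"op": 1, "ns": 1, "command": 1}):
    print(op)
```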
@aGuttman I see that you have made the changes for e-mission/e-mission-server#849 Can you run the tests against DocumentDB from the newly created test AWS instance (see instructions from @jgu2) and see if there are any additional timing issues that we encounter in the AWS <-> AWS environment? |
Now that e-mission/e-mission-server#849 has been merged and #727 has been resolved, we are ready to deploy the NREL OpenPATH instance to staging. Let's track that at #732. In the meantime, @aGuttman will continue to test the server in the AWS environment.
Running unit tests on an AWS EC2 instance with DocumentDB:
Database Name
Database names are configurable through conf/storage/db.conf. Use the desired name after … Note that cloud services will still need to create the database with the desired name and give the user access to it. This only accesses existing DBs that have been set up with the proper permissions; it does not create them.
Read Pref on DocumentDB
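Putting the two together, a connection URL might look like this sketch (placeholder host and credentials; the database name sits in the URL path, readPreference=primary pins reads to the writer to avoid the read-after-write issues discussed above, and retryWrites=false is required by DocumentDB):

```python
import pymongo

# Placeholder host/credentials; shown only to illustrate the URL format.
url = ("mongodb://user:pass@docdb-cluster.cluster-xxxx.us-west-2.docdb.amazonaws.com:27017/"
       "Stage_database?tls=true&replicaSet=rs0&readPreference=primary&retryWrites=false")
client = pymongo.MongoClient(url)
```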
Code Coverage on DocumentDB
Code coverage drops from 72% with MongoDB to 70% on DocumentDB. The following files make up this drop:
Looking at this quickly, it appears that these drops are almost all due to the errors mentioned above causing tests to quit early (i.e., pipeline.py has functions that get called in /emission/tests/analysisTests/modeinferTests/TestPipelineSeed.py, which has a drop_database() early on in a few tests, so it never calls those functions). The differences between MongoDB and DocumentDB are not so large that they cause different branches to fire between the test cases.
Encountered DocumentDB Incompatibilities
Sorting Details
MongoDB sorts return the same result over the same collection of items every time. That is, ties will always be broken the same way; whatever was first in the original collection will be first in the result. This is known as stable sorting. DocumentDB does not guarantee this. Sorts will not be incorrect, but could differ from run to run, with tied items switching places.
Ideally, things would be designed in a way where this is not a problem, but stable sorting can hide assumptions. Say we sort Collection by Index1, ascending. Later we sort Collection by Index2, ascending. After this second sort, we might unknowingly rely on the fact that no item with a lower Index1 value can follow an item with a higher Index1 value when they all have equal Index2 values. Making this assumption in DocumentDB will cause errors. We can fix this by doing a compound sort by (Index2 ascending, Index1 ascending), but we must notice that this is happening first, which is tricky to see.
Even with a compound sort we are not guaranteed exactly the same result in DocumentDB that we got in MongoDB, because ties are still broken "randomly" when they exist (i.e., when Index2 and Index1 are both equal). We just have a second (or more) rule to fall back on after the first, but orderings can still change run to run. That said, if this causes problems, the code is very fragile and the design should be looked at anyway.
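A small sketch of the hidden assumption described above (hypothetical data; the stability claim for MongoDB is per the description above):

```python
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder
coll = client["demo"]["sort_demo"]
coll.drop()
coll.insert_many([
    {"Index1": 1, "Index2": 5},
    {"Index1": 2, "Index2": 5},  # tied on Index2
])

# Single-key sort: MongoDB's stable sort keeps the Index1=1 document first,
# but DocumentDB may return the two tied documents in either order.
unstable = list(coll.find().sort("Index2", pymongo.ASCENDING))

# Compound sort: the Index2 tie is broken explicitly by Index1, so the
# result is deterministic on both databases.
stable = list(coll.find().sort([
    ("Index2", pymongo.ASCENDING),
    ("Index1", pymongo.ASCENDING),
]))
```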
@aGuttman As we discussed, the plan is to have multiple enclaves, one for each program or study.