Changes how the ids and positions are retrieved #174
Conversation
How does the change impact fetching multiple timesteps with a subset of GIDs?
@mgeplf If the GIDs are quite far from each other (e.g., 2 GIDs, one at the very beginning and one at the very end), then the performance can be worse if the report is extremely large. However, the data will always be fetched following the same technique, meaning that even if we leave the implementation as it is, the cost to fetch the data will still be worse in this particular edge case. This was the trade-off we made last year when we switched from having millions of hyperslabs to just one hyperslab per timestep; overall, it provides large performance benefits in comparison.

But just to clarify, my next work item would be in that direction! My plan is to have a threshold that creates ranges of data selection, instead of a single range like now. This way, following the previous example, if the gap between the GIDs is large, then we would create two (or more) separate hyperslabs, each covering as much data as possible. This would still be friendly to the file system because we issue fewer requests, while still handling the edge cases that can occur. However, that is a separate task; for now, I wanted to unify both implementations and merge this first.

What do you think?
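To make the idea concrete, a minimal sketch of what such threshold-based grouping could look like (a hypothetical helper and threshold value, not actual libsonata code):

```python
import numpy as np

def group_into_ranges(gids, gap_threshold=1024):
    """Split sorted GIDs into half-open (start, stop) ranges, starting a new
    range whenever the gap between consecutive GIDs exceeds the threshold."""
    gids = np.sort(np.asarray(gids))
    ranges = []
    start = prev = int(gids[0])
    for gid in gids[1:]:
        gid = int(gid)
        if gid - prev > gap_threshold:
            ranges.append((start, prev + 1))  # close the current hyperslab range
            start = gid                       # and open a new one after the gap
        prev = gid
    ranges.append((start, prev + 1))
    return ranges

# Two far-apart GIDs produce two small ranges instead of one huge one.
print(group_into_ranges([10, 4_200_000]))  # [(10, 11), (4200000, 4200001)]
```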
Cool. Can we at least run the performance tests before merging?
Yup, I will run a test throughout the morning and add it here, no problem at all.

Just an additional note that I forgot to mention: keeping the old implementation is heavy for GPFS when multiple processes open reports. For instance, Joseph was involuntarily blocking GPFS for everybody with his simulations, and most of it was just the initial fetching of the IDs and positions. Regardless of whether we merge this or not, the data itself will still be fetched with the same technique.

Nonetheless, I 100% agree; let me just do a couple of runs so we understand where we are, and we can take it from there.
After rebasing, here are the benchmark results.

The figure illustrates Execution Time in seconds on the y-axis and the number of GIDs requested on the x-axis. The test script uses a reference ~60TB report with 4.2M GIDs and randomly selects GIDs from the input, mimicking some sort of spatial binning. The tests ask for a single timestep, which first triggers internally the fetching of the GIDs and the positions (i.e., the part that involves the changes from this PR), plus accessing the data to retrieve a single timestep (i.e., the part that is identical in both versions and was previously optimized in #104*). The table at the bottom shows the execution time for each test.

I must admit that, after the compilation fix, the original code doesn't look as bad as it used to. For instance, selecting chunks of sequential GIDs is not that much worse. However, I would like to kindly point out that we cannot keep the code as it is, otherwise it will be quite stressful for GPFS when scientists utilize hundreds of processes on big reports. Moreover, this PR is required for the next optimization step that reads in chunks up to a certain threshold, as I mentioned earlier in one of my comments. That would improve the performance when requesting a small number of GIDs that are located far apart, as in the figure above.

For reference purposes, here is the Python script used for the tests:

```python
import libsonata
import time
import sys
import numpy as np

# Open the reference ~60TB report and select the 'All' population.
elements = libsonata.ElementReportReader('/gpfs/.../002/AllCurrents.h5')
population_elements = elements['All']
print("File loaded!")

# Request 10^0 .. 10^6 randomly chosen GIDs and time a single-timestep fetch.
for i in range(0, 7, 1):
    count = pow(10, i)
    node_ids = population_elements.get_node_ids()
    np.random.seed(21)
    node_ids = list(np.random.choice(node_ids, count, replace=False))
    start = time.time()
    data_frame = population_elements.get(node_ids, tstart=5000.0, tstop=5000.1)
    end = time.time()
    print(str(count) + " / " + str(end - start) + "\n-----")
```

\* Note that fetching more than one timestep in the tests will not illustrate any significant difference, as this part of the code is identical.
Looks good to me, some top-level abstract documentation would be good, though.
Thank you very much for the comments and suggestions, @mgeplf. @jorblancoa and I have addressed most of the hints and also verified that the performance remains equivalent. If you like this version, let's go ahead and merge it and create a new tag.

Have a nice weekend and thank you once again!
@mgeplf With all due respect, I think that we are missing the point here. Once again: this PR unifies the IO pattern used for the metadata that is required to obtain the data with the one already used for the data itself. Without this change, the code triggers millions (I insist, millions) of small IO requests on big reports, just before fetching a single timestep of the data. The performance numbers listed above are for fetching a single timestep, just one! Then you have to add the time to fetch the other 9999 timesteps of the report, which will have an identical cost in both implementations.

I will upload now some of your suggested changes, ok?
I think this is actually pessimizing some use cases, though; for instance, fetching a couple of widely separated node IDs (see the test case below).
Can you provide the scripts and report you used? Also, I repeatedly mentioned that this PR is needed in order to include an optimization for the edge cases, while handling the big reports. We have already optimized the code in 4-5 contributions, and this one adds to the previous contributions. |
The test case is fetching two widely separated node IDs from the same report.

The reason this is slower is likely the single large hyperslab that now spans the whole range between the two IDs. The behavior has also changed (which we have to be careful about because of Hyrum's law): since there is a map now, the returned values come back in a different order than the caller requested.
Seems to me that the test case is fetching two well-separated IDs, which is a known regression to be followed up on and optimized. It just seemed that doing it all in one go would lead to a more complex PR, and that breaking the work up would be better. As for the return values… re-sorting them according to the user input could be costly. Seems doable, though.
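A rough sketch of what that re-sorting could look like (hypothetical array names; not the actual implementation):

```python
import numpy as np

requested = np.array([42, 7, 1000])        # the order the caller asked for
returned_ids = np.array([7, 42, 1000])     # IDs as returned (sorted by the map)
data = np.random.rand(5, 3)                # 5 timesteps x 3 columns, one per returned ID

# Find the column of each requested ID in the returned (sorted) layout...
cols = np.searchsorted(returned_ids, requested)
# ...and permute the columns back into the caller's original order.
data_in_request_order = data[:, cols]
ids_in_request_order = returned_ids[cols]  # equals `requested`
```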
Yeah, fair enough, we just need to be clear on what is improving, and this seems like a reasonable tradeoff.
Yup, for sure.
* struct ElementIdsData -> NodeIdElementLayout
* getElementIds -> getNodeIdElementLayout
commit 79ce37c
Author: Mike Gevaert <[email protected]>
Date: Mon Feb 7 08:16:36 2022 +0100

Try using a vector instead of a map

* make NodeIdElementLayout::node_ranges a vector instead of a map; the key of the map wasn't used in the calling functions, and only iteration was performed
* rename NodeIdElementLayout::range -> min_max_range
* only fill .times if there is a data payload
Looks good to me; I asked @NadirRoGue about what the usage patterns were on the viz side, and it sounds like the usual number of node ids examined works well with this change. I can merge, or you can do the honours, @sergiorg-hpc. Thanks for the contribution!
Do not thank me, this was a team effort (props to @jorblancoa, @matz-e and others). Also, thanks to you and your team, @mgeplf. I cannot merge because I do not have permissions, so please go ahead. Also, if you don't mind, please create a new tag afterwards.

Once again, thank you for the insight, and let's hope this fix can further help our users.
The PR alters how the node IDs and positions are retrieved from storage. The node pointers are now filtered into a sub-map and used directly for different purposes, instead of splitting the IDs and positions as in the original code. The IO pattern for the IDs is also different and avoids the use of multiple hyperslabs, following a similar approach to the one introduced when fetching the actual data from storage (#104). In addition, the changes simplify and unify parts of the code (e.g., removing the use of lambda functions).
Using a reference report of around 60TB, fetching one single timestep from the file for all the GIDs would imply the following:
These millions of small reads from the original implementation were recently causing problems on the GPFS file system of BB5, especially when scientists were launching multiple processes across multiple nodes issuing similar requests on big reports.
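As a simplified illustration of the difference between the two read patterns (using h5py; the file name, dataset layout, and column mapping here are assumptions for the sake of the example, not the actual SONATA report structure):

```python
import h5py
import numpy as np

# Columns of the report that belong to the requested GIDs (illustrative values).
columns = np.array([10, 5_000, 1_200_000])

with h5py.File('report.h5', 'r') as f:      # hypothetical file and dataset names
    dset = f['data']                        # assumed shape: (timesteps, elements)

    # Original pattern: one tiny read per GID -> many small requests hitting GPFS.
    per_gid = [dset[0, int(c)] for c in columns]

    # Unified pattern: one hyperslab covering min..max, then slice in memory.
    lo, hi = int(columns.min()), int(columns.max()) + 1
    block = dset[0, lo:hi]                  # a single large, contiguous read
    values = block[columns - lo]            # pick the wanted columns afterwards
```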