Changes how the ids and positions are retrieved #174

Merged
merged 8 commits into master from sandbox/srivas/ids_perf_improvements on Feb 9, 2022

Conversation

@sergiorg-hpc (Contributor) commented Dec 22, 2021

This PR changes how the node IDs and positions are retrieved from storage. The node pointers are now filtered into a sub-map and used directly for the different purposes, instead of splitting the IDs and positions as in the original code. The IO pattern for the IDs also changes to avoid multiple hyperslabs, following the approach introduced for fetching the actual data from storage (#104). In addition, the changes simplify and unify parts of the code (e.g., by removing lambda functions).

Using a reference report of around 60TB, fetching a single timestep from the file for all the GIDs implies the following:

  • In the former implementation, getting the metadata of each of the 4,234,929 GIDs of the report triggers a hyperslab selection plus a read of around 300 values per GID (one per compartment), i.e., 4,234,929 reads of around 300 values each.
  • In the proposed implementation, a single hyperslab is triggered to fetch the data for all 4,234,929 IDs at once; this data is then conveniently filtered in memory (see the sketch after this list).
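
To make the contrast concrete, here is a minimal, hypothetical sketch in plain Python/h5py of the two IO patterns (the function names and dataset layout are illustrative only, not libsonata's actual code):

def fetch_old(dataset, rows):
    # Former pattern: one hyperslab selection + read per requested row,
    # i.e., millions of small IO requests on a big report.
    return [dataset[r] for r in rows]

def fetch_new(dataset, rows):
    # Proposed pattern: a single hyperslab covering the [min, max] range of
    # the requested rows, read once and then filtered in memory.
    lo, hi = min(rows), max(rows)
    block = dataset[lo:hi + 1]
    return [block[r - lo] for r in rows]

Here dataset stands for the on-disk mapping dataset (e.g., an h5py.Dataset) and rows for the row indices of the requested GIDs.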

These millions of small reads in the original implementation were recently causing problems in the GPFS file system of BB5, especially when scientists launched multiple processes over multiple nodes issuing similar requests on big reports.

@sergiorg-hpc changed the title from "Temp. commit - Changes how the ids and positions are retrieved" to "Changes how the ids and positions are retrieved" on Dec 22, 2021
@sergiorg-hpc self-assigned this on Dec 22, 2021
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from f754003 to 1920499 on December 22, 2021 02:10
@mgeplf added the "WIP work in progress" label on Dec 22, 2021
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from 02f77c6 to 1630710 on December 22, 2021 16:01
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from 1630710 to 8f26d4f on January 4, 2022 09:55
@sergiorg-hpc removed the "WIP work in progress" label on Jan 10, 2022
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from 8f26d4f to d2b9b61 on January 10, 2022 13:41
@sergiorg-hpc marked this pull request as ready for review on January 10, 2022 13:58
@mgeplf (Contributor) commented Jan 11, 2022

Using a reference report of around 60TB with a single process, fetching a single timestep from the file for all the GIDs takes the following:

How does the change impact fetching multiple timesteps w/ a subset of GIDs?

@sergiorg-hpc (Contributor, Author) commented

How does the change impact fetching multiple timesteps w/ a subset of GIDs?

@mgeplf If the GIDs are far apart (e.g., 2 GIDs, one at the very beginning and one at the very end), then performance can be worse if the report is extremely large. However, the data itself is always fetched with this same technique, meaning that even if we leave the implementation as it is, the cost of fetching the data will be just as bad in this particular edge case. This is the trade-off we made last year when we switched from millions of hyperslabs to a single hyperslab per timestep; overall, it provides large performance benefits in comparison.

But just to clarify, my next work item goes exactly in that direction! My plan is to introduce a threshold that creates multiple ranges of data selection, instead of the single range used now (see the sketch below). Following the previous example, if the gap between the GIDs is large, we would create two (or more) separate hyperslabs, each covering as much data as possible. This would still be efficient for the file system, because we keep the number of requests low, while handling the edge cases that can occur.
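
As a rough illustration of that plan, here is a minimal, hypothetical Python sketch (the function name and the gap_threshold parameter are placeholders, not a committed API):

def coalesce_ranges(sorted_ids, gap_threshold):
    # Hypothetical sketch: split the selection into contiguous ranges
    # whenever two consecutive IDs are further apart than the threshold,
    # instead of always using a single [min, max] range.
    ranges = []
    start = prev = sorted_ids[0]
    for current in sorted_ids[1:]:
        if current - prev > gap_threshold:
            ranges.append((start, prev + 1))  # half-open [start, end)
            start = current
        prev = current
    ranges.append((start, prev + 1))
    return ranges

# For example, coalesce_ranges([1, 2, 3, 4000000], 1000) yields
# [(1, 4), (4000000, 4000001)]: two hyperslabs instead of one huge one.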

However, this is a separate task. For now, I wanted to unify both implementations and merge this into libsonata; the next step would then be to add that extra optimization. Both Joseph and Andras have tried this patch and obtained clear performance benefits.

What do you think?

@mgeplf (Contributor) commented Jan 11, 2022

But just to clarify, my next work item would be in that direction actually!

Cool.

Can we at least run the "fetching multiple timesteps w/ a subset of GIDs" case with the old code and the new code, and put the results in this ticket?

@sergiorg-hpc (Contributor, Author) commented

Yup, I will run a test throughout the morning and add it here, no problem at all.

Just an additional note that I forgot to mention: keeping the old implementation is heavy on GPFS when multiple processes open reports. For instance, Joseph was involuntarily blocking GPFS for everybody with his simulations, and most of the load came right at the beginning, from fetching the IDs and positions. Regardless of whether we merge this or not, the data itself is already fetched over the min/max range anyway. So, fetching the IDs the same way is equivalent to adding one extra timestep to the report, if that makes sense (i.e., 10001 timesteps instead of 10000).

Nonetheless, I 100% agree; let me do a couple of runs so we understand where we are, and we can take it from there.

@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from d2b9b61 to 4d794dd on January 18, 2022 08:32
@sergiorg-hpc (Contributor, Author) commented Jan 21, 2022

After rebasing on master to fix the compilation-mode bug (see #175), here are some results using a single process to fetch one timestep with the current version from master (i.e., v0.1.11) versus the same version with the changes from this PR (i.e., v0.1.11+Patch):
[Figure: execution time (s) vs. number of GIDs requested, v0.1.11 vs. v0.1.11+Patch, with a table of per-test times below the plot]

The figure shows execution time in seconds on the y-axis and the number of GIDs requested on the x-axis. The test script uses the reference ~60TB report with 4.2M GIDs and randomly selects GIDs from the input, mimicking some sort of spatial binning. Each test requests a single timestep, which internally first triggers the fetching of the GIDs and positions (i.e., the part changed by this PR) and then accesses the data to retrieve the timestep (i.e., a part that is identical in both versions and was previously optimized in #104*). The table at the bottom shows the exact execution time for each test, with the 4M test corresponding to all 4.2M GIDs in the report.

I must admit that, after the compilation fix, the original code doesn't look as bad as it used to. For instance, selecting chunks of sequential GIDs is not that much worse. However, I would like to kindly point out that we cannot keep the code as it is; otherwise it will put considerable stress on GPFS when scientists run hundreds of processes on big reports. Moreover, this PR is a prerequisite for the next optimization step, which reads in chunks up to a certain threshold, as I mentioned in an earlier comment. That step would improve performance when requesting a small number of GIDs located far apart, as in the figure above.

For reference purposes, here is the Python script used for the tests:

import libsonata
import time
import numpy as np

# Open the ~60TB reference report (path shortened here)
elements = libsonata.ElementReportReader('/gpfs/.../002/AllCurrents.h5')
population_elements = elements['All']
print("File loaded!")

# Time a single-timestep fetch for 10^0 .. 10^6 randomly selected GIDs
for i in range(0, 7):
    count = pow(10, i)
    node_ids = population_elements.get_node_ids()

    # Fixed seed, so every run samples the same GIDs
    np.random.seed(21)
    node_ids = list(np.random.choice(node_ids, count, replace=False))

    start = time.time()
    data_frame = population_elements.get(node_ids, tstart=5000.0, tstop=5000.1)
    end = time.time()

    print(str(count) + " / " + str(end - start) + "\n-----")

* Note that fetching more than one timestep in the tests would not show any significant difference, as that part of the code is identical.

@matz-e (Member) previously approved these changes on Jan 24, 2022 and left a comment:

Looks good to me; some top-level abstract documentation would be good, though.

[Review threads on include/bbp/sonata/report_reader.h and src/report_reader.cpp — outdated/resolved]
@sergiorg-hpc (Contributor, Author) commented

Thank you very much for the comments and suggestions, @mgeplf. @jorblancoa and I have addressed most of them and also verified that performance remains equivalent.

If you like this version, let's go ahead and merge it to create a new libsonata release for deployment.

Have a nice weekend and thank you once again!

[Review threads on include/bbp/sonata/report_reader.h and src/report_reader.cpp — outdated/resolved]
@sergiorg-hpc (Contributor, Author) commented Feb 1, 2022

@mgeplf With all due respect, I think we are missing the point here. Once again: the current libsonata already reads the report data with the min/max optimization. That part is no different here, and the old implementation is penalized just the same when fetching the first and the last GID.

This PR unifies the IO pattern used for the metadata that is required to obtain the data. Without this change, the code triggers millions (I insist, millions) of small IO requests on big reports, just before fetching a single timestep of data. The performance numbers listed above are for fetching a single timestep, just one! Then, you have to add the time to fetch the other 9999 timesteps of the report, which will have an identical cost in both implementations.

I will now upload some of your suggested changes, ok?

@mgeplf (Contributor) commented Feb 1, 2022

This PR unifies the IO pattern used for the metadata that is required to obtain the data.

I think this actually pessimizes some use cases, though; for instance, if I use get_node_id_element_id_mapping(sel) on a selection of the large report, the time to get the result goes from about 6s to almost 10s.

@sergiorg-hpc (Contributor, Author) commented

This PR unifies the IO pattern used for the metadata that is required to obtain the data.

I think this actually pessimizes some use cases, though; for instance, if I use get_node_id_element_id_mapping(sel) on a selection of the large report, the time to get the result goes from about 6s to almost 10s.

Can you provide the scripts and the report you used? Also, I have repeatedly mentioned that this PR is needed in order to add an optimization for the edge cases while still handling the big reports.

We have already optimized the code across 4-5 contributions, and this one builds on those.

@mgeplf (Contributor) commented Feb 2, 2022

The test case is:

import libsonata

# Fetch the element mapping for two widely separated node IDs
p = '/gpfs/bbp.cscs.ch/data/scratch/proj85/lotsofSims/baseline/633cbccb-f6b2-4e10-8f83-0a2655dcbcc7/000/AllCurrents.h5'
r = libsonata.ElementReportReader(p)['All']
mapping = r.get_node_id_element_id_mapping(libsonata.Selection([4228232, 15872]))
print(len(mapping))
print(mapping[0], mapping[-1])

The reason this is slower is likely that getNodeIdElementIdMapping calls getElementIds, which now does more work than before.

The behavior has also changed (which we have to be careful about because of Hyrum's law): since a map is now used internally, the returned values come back in a different order.
Before, the second print gave [4228232, 0] [15872, 176], and now it gives [15872, 0] [4228232, 579]; so we should mention that the order of the returned values isn't guaranteed to be stable.
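
For callers that depended on the old ordering, an illustrative workaround (assuming, as the prints above suggest, that the mapping behaves like a sequence of [node_id, element_id] pairs) is to re-sort on their side:

# Illustrative only: impose a deterministic order on the caller's side,
# since the order returned by get_node_id_element_id_mapping is not stable.
stable = sorted(mapping, key=lambda pair: pair[0])  # sort by node ID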

@matz-e (Member) commented Feb 2, 2022

Seems to me that the test case is fetching two well-separated IDs, which is a known regression to be followed up and optimized on. It just seemed that doing it all in one go would lead to a more complex PR and breaking the work up would be better.

As for the return values… re-sorting according to the user input could be costly. Seems doable, though.

@mgeplf (Contributor) commented Feb 4, 2022

Seems to me that the test case is fetching two well-separated IDs, which is a known regression to be followed up and optimized on.

Yeah, fair enough, we just need to be clear on what is improving, and this seems like a reasonable tradeoff.

It just seemed that doing it all in one go would lead to a more complex PR and breaking the work up would be better.

Yup, for sure.

mgeplf and others added 2 commits February 4, 2022 10:44
* struct ElementIdsData -> NodeIdElementLayout
* getElementIds -> getNodeIdElementLayout
commit 79ce37c
Author: Mike Gevaert <[email protected]>
Date:   Mon Feb 7 08:16:36 2022 +0100

    Try using a vector instead of a map

    * make NodeIdElementLayout::node_ranges a vector instead of a map; the
      key of the map wasn't used in the calling functions, and only
      iteration was performed
    * rename NodeIdElementLayout::range -> min_max_range
    * only fill .times if there is a data payload
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from 492bab1 to d6e7e5b on February 8, 2022 14:01
@sergiorg-hpc force-pushed the sandbox/srivas/ids_perf_improvements branch from d6e7e5b to 5ea2a3b on February 8, 2022 14:02
[Review thread on src/report_reader.cpp — resolved]
@mgeplf (Contributor) commented Feb 8, 2022

Looks good to me; I asked @NadirRoGue about the usage patterns on the viz side, and it sounds like the number of node IDs examined usually goes from 100 to 200k; so I think it's fine to pessimize the low end for the gains in the larger views.

I can merge, or you can do the honours, @sergiorg-hpc. Thanks for the contribution!

@sergiorg-hpc (Contributor, Author) commented

Looks good to me; I asked @NadirRoGue about the usage patterns on the viz side, and it sounds like the number of node IDs examined usually goes from 100 to 200k; so I think it's fine to pessimize the low end for the gains in the larger views.

I can merge, or you can do the honours, @sergiorg-hpc. Thanks for the contribution!

Do not thank me, this was a team effort (props to @jorblancoa, @matz-e and others). Also, thanks to you and your team, @mgeplf.

I cannot merge because I do not have permissions, so please go ahead. Also, if you don't mind, create a new tag v0.1.11 so that I can open a PR in Spack to release the Debug->Release patch and this one on BB5 as soon as possible, ok? I will add you as a reviewer of that PR as well.

Once again, thank you for the insight and let's hope this fix can further help our users.

@mgeplf merged commit 2863397 into master on Feb 9, 2022
@mgeplf deleted the sandbox/srivas/ids_perf_improvements branch on February 9, 2022 07:24