Feature request: add collector accumulator #353
Comments
I think the return value from this would be exactly an Awkward Array, FYI. @jpivarski
Do I understand correctly that a Collector for a 2-axis histogram with 2 bins in the first axis and 3 bins in the second axis would collect data like
That is, the data to be collected (for an unbinned fit, KDE, etc.) are variable-length lists of numbers only (no records or n-tuples of numbers), and there's one per bin for some regular or sparse binning? If so, then at most what you need is a jagged array. That, by itself, is not too complicated and might fall on the "reimplement" side of the reimplement/dependency trade-off. If the objects in each of these bins are n-tuples of numbers, it's still feasible, but if they get any more general, then you might want to use Awkward Array as a dependency.
For presenting these in Python as sliceable objects, you might want to have a custom implementation in C++ and only wrap them as Awkward Arrays as an optional dependency in Python. That way, you get all the slicing/broadcasting/etc. logic without taking Awkward as a C++ dependency, which Boost Histogram should not do (one of its selling points is lightweight dependencies, and I don't think it could be included in Boost with a non-Boost dependency, right?).
Another thing to consider: while filling it, it can't be a jagged array implemented with offsets. It needs to be an array of pointers to growable buffers (
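To make the fill-time versus read-time distinction in this comment concrete, here is a minimal sketch using only the Python standard library. The function names `fill` and `to_offsets`, and the row-major flattening of the 2 x 3 bins into 6 cells, are assumptions made for illustration; they are not part of boost-histogram or Awkward Array.

```python
from array import array
from itertools import accumulate


def fill(n_bins, bin_indices, values):
    """Fill phase: one independently growable buffer of doubles per bin."""
    buffers = [array("d") for _ in range(n_bins)]
    for i, x in zip(bin_indices, values):
        buffers[i].append(x)  # only the buffer of bin i grows
    return buffers


def to_offsets(buffers):
    """Read phase: pack the per-bin buffers into offsets plus flat content."""
    offsets = [0, *accumulate(len(b) for b in buffers)]
    content = array("d")
    for b in buffers:
        content.extend(b)
    return offsets, content


# 2 x 3 bins flattened to 6 cells (row-major order is an assumption of this sketch).
buffers = fill(6, [0, 4, 0, 5, 4, 4], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
offsets, content = to_offsets(buffers)

# The samples of bin k are content[offsets[k]:offsets[k + 1]].
bin4 = list(content[offsets[4]:offsets[5]])
```

With an offsets-based layout, appending to an early bin would shift the content of every later bin, which is why the sketch keeps separate buffers while filling and converts only once at the end.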
I started working on this, because I need it now for an analysis. We can implement accumulators in boost-histogram that are not in boostorg/histogram, so in principle we are free here to add third-party dependencies. For boostorg/histogram, adding third-party dependencies would be an issue. I realized that there are two collectors that seem useful. One just keeps a collection of weights per bin, so it would be a variable-length array of doubles in each bin (a std::vector in each cell in C++). That's actually what I need right now. This would be the collector that corresponds to the The other collector would keep a variable-length array of two doubles, for a weight and a sample. That would be the collector that corresponds to
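The two per-bin collectors described in this comment could look roughly like the following sketch. The class names `WeightCollector` and `WeightedSampleCollector`, their methods, and the interleaved storage of weight and sample are all hypothetical choices for illustration, not the implementation that ended up in boost-histogram.

```python
from array import array


class WeightCollector:
    """Hypothetical collector: keeps every weight that landed in one bin."""

    def __init__(self):
        self.weights = array("d")  # variable-length array of doubles

    def fill(self, weight=1.0):
        self.weights.append(weight)


class WeightedSampleCollector:
    """Hypothetical collector: keeps a (weight, sample) pair per fill."""

    def __init__(self):
        self._data = array("d")  # interleaved: w0, s0, w1, s1, ...

    def fill(self, sample, weight=1.0):
        self._data.extend((weight, sample))

    def __len__(self):
        return len(self._data) // 2

    def pairs(self):
        # De-interleave into a list of (weight, sample) tuples.
        return list(zip(self._data[0::2], self._data[1::2]))


w = WeightCollector()
w.fill(0.5)
w.fill(2.0)

s = WeightedSampleCollector()
s.fill(3.0, weight=0.5)
s.fill(4.0)
```

Storing the pairs in one flat buffer keeps each bin to a single allocation; two parallel buffers would work equally well for this sketch.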
A simple and widely useful accumulator would be the Collector (name to be refined if needed). The accumulator holds a std::vector and appends sample values to the vector. Users should be able to view the contents as a numpy array.
Motivation
The usual accumulators were designed to have a very small state, so that one can have very many of them. This accumulator comes from the other end of the spectrum: it uses the maximum amount of storage to hold all samples that ended up in a given bin. Having the full sample of values in each bin is useful in a variety of contexts: to do unbinned fits in each bin, to compute the median, or to compute a kernel density estimate.
Technical challenges?
Do we need a custom view again for the accumulator, or maybe just a buffer interface? It would be nice if a collector instance acted like a normal NumPy array (at least read-only, possibly read-write), which means it should support slicing, masking, and advanced indexing, and it should be accepted by NumPy ufuncs, etc.
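As a rough illustration of the "just a buffer interface" option, the sketch below backs a hypothetical `Collector` class with `array.array`, which already implements Python's buffer protocol, and hands out read-only or writable `memoryview` objects. The class and its `fill` and `view` methods are assumptions for this example only.

```python
from array import array


class Collector:
    """Hypothetical per-bin collector backed by a contiguous buffer of doubles."""

    def __init__(self):
        self._buf = array("d")

    def fill(self, sample):
        self._buf.append(sample)

    def view(self, writable=False):
        # Zero-copy view of the stored samples via the buffer protocol.
        mv = memoryview(self._buf)
        return mv if writable else mv.toreadonly()


c = Collector()
for x in (1.5, 2.5, 3.5):
    c.fill(x)

with c.view() as ro:
    first_two = ro[:2].tolist()  # slicing works directly on the view
    is_readonly = ro.readonly

with c.view(writable=True) as rw:
    rw[0] = 10.0  # writes through to the collector's storage
```

An object exposing the buffer protocol like this can be wrapped without copying by `np.frombuffer`, which would then supply masking, advanced indexing, and ufunc support. One caveat of this particular sketch: while a view is alive, `array.array` refuses to grow (appending raises `BufferError`), so filling and viewing cannot overlap.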