[FEATURE] "group by" for cat axes / index based slicing #211

andrzejnovak · 2021-04-30T15:51:00Z

I don't think this is currently implemented, but it would be super useful: it would allow merging samples that were processed separately.

I am imagining syntax like:
h[{'category': {'merged': ['sampleA', 'sampleB'], ...}}]

Also I thought

h[...,[0, 2], ...]

would work, but it doesn't seem to be possible currently.

andrzejnovak · 2021-05-01T14:46:03Z

@henryiii is there currently no way to do this even manually by assignment due to https://github.com/scikit-hep/boost-histogram/blob/6f4813f1e4b326ca14074b739f2214383f46bec6/src/boost_histogram/_internal/hist.py#L819 ?

henryiii · 2021-05-01T14:47:49Z

h2[...] = h1.view(flow=True) allows assignment. Histogram-to-histogram assignment would probably need some checks on the axes, which are not implemented yet.

henryiii · 2021-05-01T14:52:00Z

h[...,[0, 2], ...] would probably need (or be best with) support in Boost.Histogram, see scikit-hep/boost-histogram#296. @HDembinski, is this something that can be supported upstream? If we have to, we can implement it in boost-histogram via a workaround.

PS: Assuming this is for unordered axes only.

henryiii · 2021-05-01T14:56:11Z

PS: The issue that was opened and fixed in Boost.Histogram was about slicing on categorical axes, which enabled h[..., 0:2, ...] to work. It did not cover selecting a subset of a categorical axis, which is what is requested here and was mentioned in the original boost-histogram issue.

andrzejnovak · 2021-05-01T15:16:39Z

OK, thanks @henryiii. In case anyone stumbles across this, the following seems to work as a workaround:

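import hist

def groupby(h, groupmap, axis='dataset'):
    # New histogram: every axis except the one being regrouped, plus a
    # growable category axis holding the group names.
    new = hist.Hist(
        *[ax for ax in h.axes if ax.name != axis],
        hist.axis.StrCategory(list(groupmap), name=axis, growth=True),
        hist.storage.Weight(),
    )

    for name, cats in groupmap.items():
        # Sum the slices of all categories that belong to this group.
        grouped = sum(h[{axis: cat}] for cat in cats)
        new[{axis: name}] = grouped.view(flow=True)
    return new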

HDembinski · 2021-05-01T16:08:19Z

Seems like a nice feature for boost-histogram.

andrzejnovak · 2021-05-09T20:41:54Z

Related issue. This new[{axis: name}] = grouped.view(flow=True) syntax fails when growth axis dimensions don't match.

henryiii · 2021-05-18T21:04:00Z

> Seems like a nice feature for boost-histogram.

boost-histogram doesn't have named axes, so it wouldn't be as pretty, and would need another layer of wrapping in Hist anyway, just like fill, project, ... (not against it, but probably best to implement it here first)

> This new[{axis: name}] = grouped.view(flow=True) syntax fails when growth axis dimensions don't match.

How would it know what entries to add?

> h[...,[0, 2], ...]

This is almost implementable on top of scikit-hep/boost-histogram#576, save for the caveats mentioned there.

andrzejnovak · 2021-05-18T21:27:56Z

> How would it know what entries to add?

Admittedly I didn't think about it too deeply, but couldn't it just pad with zeros along the missing categorical entries? That should be equivalent to adding two histograms whose growth/category axes have different entries.
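To make the zero-padding idea concrete, here is a minimal plain-Python sketch. It does not use hist or boost-histogram at all: the per-category counts are modelled as a dict of lists, and the helper name `add_padded` and the `nbins` parameter are hypothetical, introduced only for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a histogram with one category axis:
# category name -> per-bin counts along the remaining axis.
Counts = Dict[str, List[float]]


def add_padded(a: Counts, b: Counts, nbins: int) -> Counts:
    """Add two tables, treating a category missing from one side as all zeros."""
    zeros = [0.0] * nbins
    # Keep the categories of `a` first, then those that only appear in `b`.
    categories = list(a) + [c for c in b if c not in a]
    return {
        c: [x + y for x, y in zip(a.get(c, zeros), b.get(c, zeros))]
        for c in categories
    }


a = {"sampleA": [1.0, 2.0], "sampleB": [3.0, 4.0]}
b = {"sampleB": [10.0, 20.0], "sampleC": [5.0, 6.0]}
total = add_padded(a, b, nbins=2)
print(total)
```

Matching entries by category name rather than by position is what answers the "which entries to add" question in this model; whether the assignment syntax should do the same for real growth axes is the open point here.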

nsmith- · 2023-01-05T17:52:50Z

Since we already know the new axis elements (the dictionary keys), I think we could have a workaround without growth, as follows:

import hist

def group(h: hist.Hist, oldname: str, newname: str, grouping: dict[str, list[str]]):
    hnew = hist.Hist(
        hist.axis.StrCategory(grouping, name=newname),
        *(ax for ax in h.axes if ax.name != oldname),
        storage=h._storage_type,
    )
    for i, indices in enumerate(grouping.values()):
        hnew.view(flow=True)[i] = h[{oldname: indices}][{oldname: sum}].view(flow=True)

    return hnew

Note that the new axis is put at the beginning (for convenience in implementation). I couldn't find a public accessor for the storage type though.

An example

h = (
    hist.Hist.new
    .StrCat("abcde", name="letter")
    .Reg(10, 0, 1, name="number")
    .Double()
)

grouping = {
    "vowel": ["a", "e"],
    "consonant": ["b", "c", "d"],
}

print(group(h, "letter", "type", grouping))

returning

Hist(
  StrCategory(['vowel', 'consonant'], name='type', label='type'),
  Regular(10, 0, 1, name='number', label='number'),
  storage=Double())
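The repr only shows the resulting axes, so as a rough illustration of what ends up in each new category, here is a plain-Python model of the same grouping. It is only a sketch: the counts are made-up illustration values, the histogram is modelled as a dict of lists rather than a hist.Hist, and the helper name `group_counts` is hypothetical.

```python
def group_counts(counts, grouping):
    """Sum per-category bin counts into the groups named by `grouping`."""
    grouped = {}
    for new_name, members in grouping.items():
        rows = [counts[m] for m in members]
        # Column-wise sum over all member categories of this group.
        grouped[new_name] = [sum(col) for col in zip(*rows)]
    return grouped


# Made-up counts: letter -> counts in two bins of the "number" axis.
counts = {"a": [1, 0], "b": [2, 1], "c": [0, 3], "d": [4, 4], "e": [5, 0]}
grouping = {
    "vowel": ["a", "e"],
    "consonant": ["b", "c", "d"],
}
grouped = group_counts(counts, grouping)
print(grouped)
```

Each new category holds the bin-by-bin sum of its members, and the order of the new categories follows the order of the dictionary keys.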

nsmith- · 2023-10-17T20:45:50Z

A small update to my previous comment: the workaround now needs h._storage_type() due to a warning about passing the type and not an instance. Perhaps we can have a public accessor for the storage type that is stable?

henryiii · 2023-10-17T21:29:33Z

Can't you use h.storage_type?

nsmith- · 2023-10-17T21:35:15Z

Oops, guess it exists now!

andrzejnovak added the enhancement New feature or request label Apr 30, 2021

andrzejnovak assigned henryiii and LovelyBuggies Apr 30, 2021

andrzejnovak changed the title ~~[FEATURE] "group by" for cat axes~~ [FEATURE] "group by" for cat axes / index based slicing May 1, 2021
