Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with awkward array masking with exclusive_jets_constituent_index() #238

Open
jbrewster7 opened this issue Jul 14, 2023 · 9 comments · May be fixed by #274
Open

Problem with awkward array masking with exclusive_jets_constituent_index() #238

jbrewster7 opened this issue Jul 14, 2023 · 9 comments · May be fixed by #274

Comments

@jbrewster7
Copy link

Hello, I am working with @kpachal, @mswiatlo, and @lgray on the development of Coffea and some analysis that involves the use of fastjet. We've been very happy with how smoothly everything using awkward arrays integrates with fastjet.

However, we have noticed a problem when running exclusive_jets_constituent_index() with a masked awkward array. If an awkward array has a boolean mask applied to it on axis 0 before clustering, exclusive_jets_constituent_index() returns indices that are out of range. An error is also thrown when trying to call exclusive_jets_constituents() since the out of range indices are being applied to the array.

Here is an example. Running this

import awkward as ak 
import fastjet

px = ak.Array([[-0.181, 0.377],
               [0.54, 0.253],
               [0.294, 4.65],
               [1.45, 12.8],
               [-3.55, -1.33]])

py = ak.Array([[-0.798, 0.116],
               [-0.65, -0.0342],
               [0.254, -1.88],
               [-0.179, -1.8],
               [-1.64, -1.03]])

pz = ak.Array([[-0.652, -0.0749],
               [-0.00527, -0.0731],
               [-0.259, -3.29],
               [-0.876, -7.18],
               [-0.0941, 0.147]])

E = ak.Array([[1.06, 0.425],
              [0.857, 0.3],
              [0.467, 6],
              [1.71, 14.8],
              [3.91, 1.7]])

mask = ak.Array([True,True,False,True,True])

eg = ak.zip(
    {
        'px': px,
        'py': py,
        'pz': pz,
        'E': E,
    },
)

jetdef = fastjet.JetDefinition(fastjet.kt_algorithm,1)

fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituent_index(njets=2)

returns

[[[1], [0]],
 [[0], [1]],
 [[1], [0, 2, 3]],
 [[0], [1]]]
---------------------------
type: 4 * var * var * int32

which contains out of range indices at index 2 on axis 0. If we then run

fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)

it throws the error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:950, in Array.__getitem__(self, where)
    949 with ak._errors.SlicingErrorContext(self, where):
--> 950     out = self._layout[where]
    951     if isinstance(out, ak.contents.NumpyArray):

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:510, in Content.__getitem__(self, where)
    509 def __getitem__(self, where):
--> 510     return self._getitem(where)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:563, in Content._getitem(self, where)
    562 elif isinstance(where, ak.highlevel.Array):
--> 563     return self._getitem(where.layout)
    565 # Convert between nplikes of different backends

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:638, in Content._getitem(self, where)
    637 elif isinstance(where, Content):
--> 638     return self._getitem((where,))
    640 elif is_sized_iterable(where):
    641     # Do we have an array

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:555, in Content._getitem(self, where)
    548 next = ak.contents.RegularArray(
    549     this,
    550     this.length,
    551     1,
    552     parameters=None,
    553 )
--> 555 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
    557 if out.length is not unknown_length and out.length == 0:

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/regulararray.py:708, in RegularArray._getitem_next(self, head, tail, advanced)
    693 self._maybe_index_error(
    694     self._backend[
    695         "awkward_RegularArray_getitem_jagged_expand",
   (...)
    706     slicer=head,
    707 )
--> 708 down = self._content._getitem_next_jagged(
    709     multistarts, multistops, head._content, tail
    710 )
    712 return RegularArray(
    713     down, headlength, self._length, parameters=self._parameters
    714 )

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py:409, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    406 out = ak.contents.ListArray(
    407     self.starts, self.stops, self._content, parameters=self._parameters
    408 )
--> 409 return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:497, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    495 sliceoffsets = slicecontent._offsets
--> 497 outcontent = next_content._getitem_next_jagged(
    498     sliceoffsets[:-1], sliceoffsets[1:], slicecontent._content, tail
    499 )
    501 return ak.contents.ListOffsetArray(
    502     outoffsets, outcontent, parameters=self._parameters
    503 )

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:544, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    535 assert (
    536     outoffsets.nplike is self._backend.index_nplike
    537     and nextcarry.nplike is self._backend.index_nplike
   (...)
    542     and self._stops.nplike is self._backend.index_nplike
    543 )
--> 544 self._maybe_index_error(
    545     self._backend[
    546         "awkward_ListArray_getitem_jagged_apply",
    547         outoffsets.dtype.type,
    548         nextcarry.dtype.type,
    549         slicestarts.dtype.type,
    550         slicestops.dtype.type,
    551         sliceindex.dtype.type,
    552         self._starts.dtype.type,
    553         self._stops.dtype.type,
    554     ](
    555         outoffsets.data,
    556         nextcarry.data,
    557         slicestarts.data,
    558         slicestops.data,
    559         slicestarts.length,
    560         sliceindex.data,
    561         sliceindex.length,
    562         self._starts.data,
    563         self._stops.data,
    564         self._content.length,
    565     ),
    566     slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
    567 )
    568 nextcontent = self._content._carry(nextcarry, True)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:271, in Content._maybe_index_error(self, error, slicer)
    270 message = self._backend.format_kernel_error(error)
--> 271 raise ak._errors.index_error(self, slicer, message)

IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)

File ~/anaconda3/lib/python3.10/site-packages/fastjet/_pyjet.py:147, in AwkwardClusterSequence.exclusive_jets_constituents(self, njets)
    146 def exclusive_jets_constituents(self, njets=10):
--> 147     return self._internalrep.exclusive_jets_constituents(njets)

File ~/anaconda3/lib/python3.10/site-packages/fastjet/_multievent.py:307, in _classmultievent.exclusive_jets_constituents(self, njets)
    305 duplicate = ak.unflatten(np.zeros(total, np.int64), shape)
    306 prepared = self.data[:, np.newaxis][duplicate]
--> 307 return prepared[outputs_to_inputs]

File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:949, in Array.__getitem__(self, where)
    520 def __getitem__(self, where):
    521     """
    522     Args:
    523         where (many types supported; see below): Index of positions to
   (...)
    947     have the same dimension as the array being indexed.
    948     """
--> 949     with ak._errors.SlicingErrorContext(self, where):
    950         out = self._layout[where]
    951         if isinstance(out, ak.contents.NumpyArray):

File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:63, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     60 try:
     61     # Handle caught exception
     62     if exception_type is not None and self.primary() is self:
---> 63         self.handle_exception(exception_type, exception_value)
     64 finally:
     65     # `_kwargs` may hold cyclic references, that we really want to avoid
     66     # as this can lead to large buffers remaining in memory for longer than absolutely necessary
     67     # Let's just clear this, now.
     68     self._kwargs.clear()

File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:78, in ErrorContext.handle_exception(self, cls, exception)
     76     self.decorate_exception(cls, exception)
     77 else:
---> 78     raise self.decorate_exception(cls, exception)

IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

This error occurred while attempting to slice

    <Array [[[{px: -0.181, ...}, ...], ...], ...] type='4 * var * var * {px...'>

with

    <Array [[[1], [0]], [[0], ...], ..., [[0], [1]]] type='4 * var * var * int32'>

due to trying to use the out of range indices.

This is not a problem if we just run the unmasked array

fastjet.ClusterSequence(eg, jetdef).exclusive_jets_constituent_index(njets=2)

which gives

[[[1], [0]],
 [[0], [1]],
 [[0], [1]],
 [[0], [1]],
 [[0], [1]]]
---------------------------
type: 5 * var * var * int32

Thank you for the help on this issue, it will be greatly appreciated!

@jpivarski
Copy link
Member

I'm assuming that the error happens in eg[mask], not in fastjet.ClusterSequence or exclusive_jets_constituent_index, right?

Where does ak.num(eg, axis=1) not equal ak.num(mask, axis=1)?

Or if str(mask.type) has two "vars" just like str(eg.type), where does ak.num(eg, axis=2) not equal ak.num(mask, axis=2)?

That would be the first step. The next step is figuring out why they're not equal, because computing eg[mask] presupposes that they fit together.

I made the wrong assumption. You showed that the error definitely happens in exclusive_jets_constituents, but it looks like an Awkward slicing error, somewhere in its Python implementation.

@kpachal
Copy link

kpachal commented Jul 14, 2023

Hi @jpivarski, OK! So .... what is best for us to do here? Would you like us to open an issue in Awkward, or should we leave that to you? Since the working example we have relies on Fastjet, it's not entirely clear to us how to provide useful feedback to Awkward on this. Thoughts welcome!!

@jpivarski
Copy link
Member

The crossed-out part was assuming that it's a bug in Awkward (in eg[mask]). It might still be, but it's also possible that exclusive_jets_constituents is using it wrong.

It's implemented here:

def exclusive_jets_constituents(self, njets):
if njets <= 0:
raise ValueError("Njets cannot be <= 0")
outputs_to_inputs = self.exclusive_jets_constituent_index(njets)
shape = ak.num(outputs_to_inputs)
total = np.sum(shape)
duplicate = ak.unflatten(np.zeros(total, np.int64), shape)
prepared = self.data[:, np.newaxis][duplicate]
return prepared[outputs_to_inputs]

You've verified that exclusive_jets_constituent_index returns a value, so it should be possible to walk through each one of those steps. Do the results of the intermediate steps look right?

@jbrewster7
Copy link
Author

Hi @jpivarski, the error is thrown on line 307 when slicing prepared. The real issue seems to be happening before that in exclusive_jets_constituent_index since it is returning indices out of range. I tried to look into this a bit but all I could tell was that line 208 already had these out of range indices returned from to_numpy_exclusive_njet_with_constituents. I couldn't find a way to look any further into this.

@jpivarski
Copy link
Member

So exclusive_jets_constituent_index returned a value, but it was the wrong value because there are indexes out of bounds. (It also means that there is no pure Python work-around, since you can't switch to using exclusive_jets_constituent_index instead of exclusive_jets_constituent.)

I followed exclusive_jets_constituent_index back to the C++ routine, which is to_numpy_exclusive_njet_with_constituents. The error must be in here:

fastjet/src/_ext.cpp

Lines 365 to 436 in 0ede4d3

.def("to_numpy_exclusive_njet_with_constituents",
[](const output_wrapper ow, const int n_jets = 0) {
auto css = ow.cse;
int64_t len = css.size();
auto jk = 0;
auto sizepar = 0;
for(int i = 0; i < len; i++){
jk += css[i]->exclusive_jets(n_jets).size();
sizepar += css[i]->n_particles();
}
jk++;
auto parid = py::array(py::buffer_info(nullptr, sizeof(int), py::format_descriptor<int>::value, 1, {sizepar}, {sizeof(int)}));
auto bufparid = parid.request();
int *ptrid = (int *)bufparid.ptr;
auto eventoffsets = py::array(py::buffer_info(nullptr, sizeof(int), py::format_descriptor<int>::value, 1, {len+1}, {sizeof(int)}));
auto bufeventoffsets = eventoffsets.request();
int *ptreventoffsets = (int *)bufeventoffsets.ptr;
size_t eventidx = 0;
ptreventoffsets[eventidx] = 0;
eventidx++;
auto jetoffsets = py::array(py::buffer_info(nullptr, sizeof(int), py::format_descriptor<int>::value, 1, {jk}, {sizeof(int)}));
auto bufjetoffsets = jetoffsets.request();
int *ptrjetoffsets = (int *)bufjetoffsets.ptr;
size_t jetidx = 0;
size_t idxh = 0;
ptrjetoffsets[jetidx] = 0;
jetidx++;
auto eventprev = 0;
for (unsigned int i = 0; i < css.size(); i++){ // iterate through events
auto jets = css[i]->exclusive_jets(n_jets);
int size = css[i]->exclusive_jets(n_jets).size();
auto idx = css[i]->particle_jet_indices(jets);
int64_t sizz = css[i]->n_particles();
auto prev = ptrjetoffsets[jetidx-1];
for (unsigned int j = 0; j < jets.size(); j++){
ptrjetoffsets[jetidx] = jets[j].constituents().size() + prev;
prev = ptrjetoffsets[jetidx];
jetidx++;
}
for(int k = 0; k < size; k++){ // iterate through jets in event
for(int j = 0; j <sizz; j++){ // iterate through particles in event
if(idx[j] == k){ // if particle jet index matches jet index assign it to jet j
ptrid[idxh] = j;
idxh++;
}
}
}
ptreventoffsets[eventidx] = jets.size()+eventprev;
eventprev = ptreventoffsets[eventidx];
eventidx++;
}
return std::make_tuple(
jetoffsets,
parid,
eventoffsets
);
}, "n_jets"_a = 0, R"pbdoc(
Retrieves the constituents of n exclusive jets from multievent clustering and converts them to numpy arrays.
Args:
n_jets: Number of exclusive subjets. Default: 0.
Returns:
jet offsets, particle indices, and event offsets
)pbdoc")

which is unpacking the NumPy array inputs, constructing FastJet objects, running FastJet algoriths, and then packing the results into NumPy array outputs. Since the issue is that an index is off, it's probably not the translation between NumPy arrays and FastJet objects, but in handling the FastJet objects. Maybe an off-by-one error somewhere?

@jbrewster7
Copy link
Author

Hi, after looking into this a bit more I've realized that this is a problem with more than just exclusive_jets_constituent_index, though that is the function where the differences are apparent enough to cause noticeable problems. I believe the problem is most likely occurring in ClusterSequence. If I take the same example as I originally used, with the array eg, there are problems happening in the same place for the arrays returned from the functions exclusive_jets and inclusive_jets as well. This is noticeable when mask is applied before clustering vs not applied until afterwards.

>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
 [{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
 [{px: 4.65, py: -1.88, pz: -3.29, E: 6}, {px: 14.5, py: -1.73, ...}],
 [{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 4 * var * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
 [{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
 [{px: 0.294, py: 0.254, pz: -0.259, E: 0.467}, {px: 4.65, py: ..., ...}],
 [{px: 1.45, py: -0.179, pz: -0.876, E: 1.71}, {px: 12.8, py: -1.8, ...}],
 [{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 5 * var * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> mask
[True,
 True,
 False,
 True,
 True]
--------------
type: 5 * bool

Specifically I noticed that fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2] should be equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2] (which is equivalent to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3]). However, we have

>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2]
[{px: 4.65, py: -1.88, pz: -3.29, E: 6},
 {px: 14.5, py: -1.73, pz: -8.31, E: 17}]
-----------------------------------------
type: 2 * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2]
[{px: 1.45, py: -0.179, pz: -0.876, E: 1.71},
 {px: 12.8, py: -1.8, pz: -7.18, E: 14.8}]
---------------------------------------------
type: 2 * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]

Instead, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][0] is equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][1].

Also, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][1] is equal (within a decimal place) to the sum of fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][0], fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][0], and fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][1].

Please let me know if any of this is unclear and thank you for all the help with this!

@chrispap95
Copy link
Collaborator

You are absolutely right. The way we handle masked arrays is wrong. Running

px, py, pz, E, offsets = self.extract_cons(self.data)
print(offsets)

gives

array([ 0,  2,  4,  8, 10])

The correct layout is

>>> eg[mask].layout
<ListArray len='4'>
    <starts><Index dtype='int64' len='4'>
        [0 2 6 8]
    </Index></starts>
    <stops><Index dtype='int64' len='4'>
        [ 2  4  8 10]
    </Index></stops>
...

This explains why you see all these extra elements between offsets 4 and 8.
I think we should implement this function with start & stop offsets instead of using only the stop as the offset. @jpivarski any suggestions would be very welcome. Do you think my proposal makes any sense?

@jpivarski
Copy link
Member

It sounds like a good idea. (I haven't looked deeply into the details.)

@lgray
Copy link
Collaborator

lgray commented Aug 29, 2023

@chrispap95 / @jpivarski any movement on this? It's blocking progress for future collider analysis, it would nice to have it sorted out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants