How does `how='zip'` work exactly? #1134

sfrances · 2024-02-20T00:25:52Z

Short intro
I am trying to dump a TTree into an h5 file. The TTree has several vector-valued branches that I'd like to zip together.
To give an idea, I have branches representing particle four-momenta and particle properties, like particles_pt, particles_eta, particles_property1, particles_property2, and for each entry I'd like to have a simple 2D array (nParticles * nVariables).

I saw that when calling arrays() I can specify how='zip', but the output is not exactly what I expect. I see the difference with respect to the default option, but the result is still not a plain array.

Question
How is the name/type of the zipped object assigned? Is it just assuming that all the branches start with the same string?
E.g. I see that tree.arrays(['particles_pt', 'particles_eta', 'particles_property1', 'particles_property2'], how='zip') gives something with type = {"particles": var * {"pt": float32, "eta": float32, ...}}.

Is there also a way to get natively a plain 2D array? (for later padding and dump into the h5)

Answered by jpivarski

jpivarski · 2024-02-20T02:07:55Z

Maintainer

I had forgotten about how="zip", and I didn't find it in any documentation: it's not in uproot.TTree.arrays or uproot.interpretation.library.Awkward, but I found the implementation here:

uproot5/src/uproot/interpretation/library.py

Lines 631 to 715 in 94c085b

    
    elif how == "zip":
        nonjagged = []
        offsets = []
        jaggeds = []
        renamed_arrays = {}
        for name, context in expression_context:
            array = renamed_arrays[_rename(name, context)] = arrays[name]
            if context["is_jagged"]:
                if (
                    isinstance(array.layout, awkward.contents.ListArray)
                    or array.layout.offsets[0] != 0
                ):
                    array_layout = array.layout.to_ListOffsetArray64(True)
                else:
                    array_layout = array.layout
                if len(offsets) == 0:
                    offsets.append(array_layout.offsets)
                    jaggeds.append([_rename(name, context)])
                else:
                    for o, j in zip(offsets, jaggeds):
                        if numpy.array_equal(array_layout.offsets, o):
                            j.append(_rename(name, context))
                            break
                    else:
                        offsets.append(array_layout.offsets)
                        jaggeds.append([_rename(name, context)])
            else:
                nonjagged.append(_rename(name, context))
        out = None
        if len(nonjagged) != 0:
            if len(nonjagged) == 0:
                out = awkward.Array(
                    awkward.contents.RecordArray([], fields=[], length=0)
                )
            else:
                out = awkward.Array(
                    {name: renamed_arrays[name] for name in nonjagged},
                )
        for number, jagged in enumerate(jaggeds):
            cut = len(jagged[0])
            for name in jagged:
                cut = min(cut, len(name))
                while cut > 0 and (
                    name[:cut] != jagged[0][:cut]
                    or name[cut - 1] not in ("_", ".", "/")
                ):
                    cut -= 1
                if cut == 0:
                    break
            if (
                out is not None
                and cut != 0
                and jagged[0][:cut].strip("_./") in awkward.fields(out)
            ):
                cut = 0
            if cut == 0:
                common = f"jagged{number}"
                if len(jagged) == 0:
                    subarray = awkward.Array(
                        awkward.contents.RecordArray([], fields=[], length=0)
                    )
                else:
                    subarray = awkward.zip(
                        {name: renamed_arrays[name] for name in jagged}
                    )
            else:
                common = jagged[0][:cut].strip("_./")
                if len(jagged) == 0:
                    subarray = awkward.Array(
                        awkward.contents.RecordArray([], fields=[], length=0)
                    )
                else:
                    subarray = awkward.zip(
                        {
                            name[cut:].strip("_./"): renamed_arrays[name]
                            for name in jagged
                        }
                    )
            if out is None:
                out = awkward.Array({common: subarray})
            else:
                out = awkward.with_field(out, subarray, common)
        return out

That makes it an undocumented feature. According to this repo's git history, I implemented it 4 years ago.

I would guess that how="zip" would convert data with types like

N * {
  field1: var * float64,
  field2: var * float64,
}

into data with types like

N * var * {
  field1: float64,
  field2: float64,
}

because that's what ak.zip does. This can only work if the variable-length lists (var *) for all fields have the same variable lengths.

I'll try it out on uproot-HZZ.root:

>>> import awkward as ak
>>> import numpy as np
>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
>>> tree.show(filter_name=["Electron_*", "Muon_*"])
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Muon_Px              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Py              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Pz              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_E               | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Charge          | int32_t[]                | AsJagged(AsDtype('>i4'))
Muon_Iso             | float[]                  | AsJagged(AsDtype('>f4'))
Electron_Px          | float[]                  | AsJagged(AsDtype('>f4'))
Electron_Py          | float[]                  | AsJagged(AsDtype('>f4'))
Electron_Pz          | float[]                  | AsJagged(AsDtype('>f4'))
Electron_E           | float[]                  | AsJagged(AsDtype('>f4'))
Electron_Charge      | int32_t[]                | AsJagged(AsDtype('>i4'))
Electron_Iso         | float[]                  | AsJagged(AsDtype('>f4'))

Without how="zip", we get these arrays:

>>> tree.arrays(filter_name=["Electron_*", "Muon_*"]).show(type=True)
type: 2421 * {
    Muon_Px: var * float32,
    Muon_Py: var * float32,
    Muon_Pz: var * float32,
    Muon_E: var * float32,
    Muon_Charge: var * int32,
    Muon_Iso: var * float32,
    Electron_Px: var * float32,
    Electron_Py: var * float32,
    Electron_Pz: var * float32,
    Electron_E: var * float32,
    Electron_Charge: var * int32,
    Electron_Iso: var * float32
}
[{Muon_Px: [-52.9, 37.7], Muon_Py: [-11.7, 0.693], Muon_Pz: [...], ...},
 {Muon_Px: [-0.816], Muon_Py: [-24.4], Muon_Pz: [20.2], Muon_E: [31.7], ...},
 {Muon_Px: [49, 0.828], Muon_Py: [-21.7, 29.8], Muon_Pz: [...], ...},
 {Muon_Px: [22.1, 76.7], Muon_Py: [-85.8, -14], Muon_Pz: [...], ...},
 {Muon_Px: [45.2, 39.8], Muon_Py: [67.2, 25.4], Muon_Pz: [...], ...},
 {Muon_Px: [9.23, -5.79], Muon_Py: [40.6, -30.3], Muon_Pz: [...], ...},
 {Muon_Px: [12.5, 29.5], Muon_Py: [-42.5, -4.45], Muon_Pz: [...], ...},
 {Muon_Px: [34.9], Muon_Py: [-16], Muon_Pz: [156], Muon_E: [160], ...},
 {Muon_Px: [-53.2, 11.5], Muon_Py: [92, -4.42], Muon_Pz: [...], ...},
 {Muon_Px: [-67, -18.1], Muon_Py: [53.2, -35.1], Muon_Pz: [...], ...},
 ...,
 {Muon_Px: [14.9], Muon_Py: [32], Muon_Pz: [-156], Muon_E: [160], ...},
 {Muon_Px: [-24.2], Muon_Py: [-35], Muon_Pz: [-19.2], Muon_E: [46.7], ...},
 {Muon_Px: [-9.2], Muon_Py: [-42.2], Muon_Pz: [-64.3], Muon_E: [77.4], ...},
 {Muon_Px: [34.5, -31.6], Muon_Py: [28.8, -10.4], Muon_Pz: [...], ...},
 {Muon_Px: [-39.3], Muon_Py: [-14.6], Muon_Pz: [61.7], Muon_E: [74.6], ...},
 {Muon_Px: [35.1], Muon_Py: [-14.2], Muon_Pz: [161], Muon_E: [165], ...},
 {Muon_Px: [-29.8], Muon_Py: [-15.3], Muon_Pz: [-52.7], Muon_E: [62.4], ...},
 {Muon_Px: [1.14], Muon_Py: [63.6], Muon_Pz: [162], Muon_E: [174], ...},
 {Muon_Px: [23.9], Muon_Py: [-35.7], Muon_Pz: [54.7], Muon_E: [69.6], ...}]

and with how="zip", we get these arrays:

>>> tree.arrays(filter_name=["Electron_*", "Muon_*"], how="zip").show(type=True)
type: 2421 * {
    Muon: var * {
        Px: float32,
        Py: float32,
        Pz: float32,
        E: float32,
        Charge: int32,
        Iso: float32
    },
    Electron: var * {
        Px: float32,
        Py: float32,
        Pz: float32,
        E: float32,
        Charge: int32,
        Iso: float32
    }
}
[{Muon: [{Px: -52.9, Py: ..., ...}, ...], Electron: []},
 {Muon: [{Px: -0.816, Py: -24.4, ...}], Electron: []},
 {Muon: [{Px: 49, Py: -21.7, ...}, ...], Electron: []},
 {Muon: [{Px: 22.1, Py: -85.8, ...}, ...], Electron: []},
 {Muon: [{Px: 45.2, Py: 67.2, ...}, ...], Electron: [...]},
 {Muon: [{Px: 9.23, Py: 40.6, ...}, ...], Electron: []},
 {Muon: [{Px: 12.5, Py: -42.5, ...}, ...], Electron: []},
 {Muon: [{Px: 34.9, Py: -16, ...}], Electron: []},
 {Muon: [{Px: -53.2, Py: 92, ...}, ...], Electron: []},
 {Muon: [{Px: -67, Py: 53.2, ...}, ...], Electron: []},
 ...,
 {Muon: [{Px: 14.9, Py: 32, ...}], Electron: []},
 {Muon: [{Px: -24.2, Py: -35, ...}], Electron: []},
 {Muon: [{Px: -9.2, Py: -42.2, ...}], Electron: []},
 {Muon: [{Px: 34.5, Py: 28.8, ...}, ...], Electron: []},
 {Muon: [{Px: -39.3, Py: -14.6, ...}], Electron: []},
 {Muon: [{Px: 35.1, Py: -14.2, ...}], Electron: []},
 {Muon: [{Px: -29.8, Py: -15.3, ...}], Electron: []},
 {Muon: [{Px: 1.14, Py: 63.6, ...}], Electron: []},
 {Muon: [{Px: 23.9, Py: -35.7, ...}], Electron: []}]

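That's nice: it noticed that some branches have compatible list lengths and made a nested structure that groups the two equivalence classes. (Muon_Px, Muon_Py, etc. all have the same list lengths as each other, and Electron_Px, Electron_Py, etc. all have the same list lengths as each other, but the number of electrons and the number of muons in an event are generally different.) It also parsed the names: according to the implementation above, it looks for the longest common prefix that ends in "_", "." or "/", uses that prefix (with the delimiter stripped) as the group name, and uses the rest of each branch name as the field name. If there is no such common prefix, the group is named jagged0, jagged1, etc.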
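
The naming step can be sketched on its own in pure Python. This is a standalone re-implementation of the prefix logic from the source quoted above, not a function that uproot exposes. The name `split_common_prefix` is hypothetical, and the sketch leaves out the extra check that avoids a clash with non-jagged field names.

```python
DELIMITERS = ("_", ".", "/")


def split_common_prefix(names, group_number=0):
    """Return (group name, field names) for one group of branch names."""
    cut = len(names[0])
    for name in names:
        cut = min(cut, len(name))
        # Shrink the prefix until it is shared and ends on a delimiter.
        while cut > 0 and (
            name[:cut] != names[0][:cut] or name[cut - 1] not in DELIMITERS
        ):
            cut -= 1
        if cut == 0:
            break
    if cut == 0:
        # No usable common prefix: fall back to a generic group name.
        return f"jagged{group_number}", list(names)
    common = names[0][:cut].strip("_./")
    return common, [name[cut:].strip("_./") for name in names]


print(split_common_prefix(["Muon_Px", "Muon_Py", "Muon_Pz", "Muon_E"]))
print(split_common_prefix(
    ["particles_pt", "particles_eta", "particles_property1", "particles_property2"]
))
print(split_common_prefix(["pt", "eta"], group_number=1))
```

The second call corresponds to the branch names in the question, which is why the zipped object there is called "particles" with fields "pt", "eta", and so on.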

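The names are not what determines the groupings: the implementation groups branches whose list offsets are identical (it compares them with numpy.array_equal) and only afterward derives a name for each group. So if a branch that is supposed to belong to a group has different list lengths (for instance, because it was filtered more tightly than the others), it will be identified as a group of its own. That's one way the results can be unexpected.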
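
A rough NumPy-only sketch of that grouping step, under the assumption that comparing per-event list lengths is equivalent to comparing offsets that start at zero. The function name and the branch "Muon_Iso_tight" are hypothetical, and the counts are made up.

```python
import numpy as np


def group_by_list_lengths(counts_by_branch):
    """Group branch names whose per-event list lengths are identical."""
    groups = []  # list of (reference counts, [branch names])
    for name, counts in counts_by_branch.items():
        counts = np.asarray(counts)
        for reference, members in groups:
            if np.array_equal(counts, reference):
                members.append(name)
                break
        else:
            # No existing group has these list lengths: start a new one.
            groups.append((counts, [name]))
    return [members for _, members in groups]


counts_by_branch = {
    "Muon_Px": [2, 1, 2],
    "Muon_Py": [2, 1, 2],
    "Electron_Px": [0, 0, 1],
    "Muon_Iso_tight": [1, 1, 2],  # hypothetical branch with different lengths
}
groups = group_by_list_lengths(counts_by_branch)
print(groups)
```

Here the hypothetical "Muon_Iso_tight" lands in a group by itself even though its name starts with "Muon_", because only the list lengths are compared.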

But finally, these are not going to be good data structures for converting data into HDF5. HDF5 can't represent the hierarchical nesting within an array that ak.zip deliberately creates. (HDF5's hierarchy is for groups of different arrays.) To get data into an HDF5 file, you don't want to zip them together, you want to ak.unzip them apart. how=tuple or how=dict (which are documented) will do that:

>>> arrays = tree.arrays(filter_name=["Electron_*", "Muon_*"], how=dict)
>>> type(arrays)
<class 'dict'>
>>> arrays.keys()
dict_keys(['Muon_Px', 'Muon_Py', 'Muon_Pz', 'Muon_E', 'Muon_Charge', 'Muon_Iso', 'Electron_Px', 'Electron_Py', 'Electron_Pz', 'Electron_E', 'Electron_Charge', 'Electron_Iso'])
>>> arrays["Muon_Px"]
<Array [[-52.9, 37.7], [-0.816], ..., [23.9]] type='2421 * var * float32'>
>>> arrays["Muon_Py"]
<Array [[-11.7, 0.693], [-24.4], ..., [-35.7]] type='2421 * var * float32'>
>>> arrays["Electron_Px"]
<Array [[], [], [], [], [...], ..., [], [], [], []] type='2421 * var * float32'>

And then you have to find some way to flatten them for HDF5. (I'm assuming that you won't be using HDF5's vlen feature: it's not efficient. It's a record-oriented list-nesting, not columnar.) You could do it with

>>> muon_px = ak.flatten(arrays["Muon_Px"])
>>> nmuon = ak.num(arrays["Muon_Px"])
>>> muon_px
<Array [-52.9, 37.7, -0.816, 49, ..., -29.8, 1.14, 23.9] type='3825 * float32'>
>>> nmuon
<Array [2, 1, 2, 2, 2, 2, 2, 1, ..., 1, 2, 1, 1, 1, 1, 1] type='2421 * int64'>

because then you could get the ragged shape back with

>>> ak.unflatten(muon_px, nmuon)
<Array [[-52.9, 37.7], [-0.816], ..., [23.9]] type='2421 * var * float32'>
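
If the flat values and the counts are what end up on disk, the same ragged shape can also be recovered without Awkward Array. A minimal NumPy-only sketch, using a few values copied from the output above and hypothetical variable names:

```python
import numpy as np

# Flattened values for the first three events and the number of muons in each.
flat = np.array([-52.9, 37.7, -0.816, 49.0, 0.828], dtype=np.float32)
counts = np.array([2, 1, 2])

# Offsets mark where each event starts and stops in the flat array.
offsets = np.concatenate([[0], np.cumsum(counts)])

# One slice of the flat array per event.
events = [flat[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]
for event in events:
    print(event)
```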

But if padding is better for your application, you could

>>> ak.to_numpy(ak.fill_none(ak.pad_none(arrays["Muon_Px"], np.max(nmuon)), np.nan))
array([[-52.89945602,  37.73778152,          nan,          nan],
       [ -0.81645936,          nan,          nan,          nan],
       [ 48.98783112,   0.82756668,          nan,          nan],
       ...,
       [-29.75678635,          nan,          nan,          nan],
       [  1.14186978,          nan,          nan,          nan],
       [ 23.9132061 ,          nan,          nan,          nan]])

Or hard-code a padding length, perhaps even clipping the lists that are too long. (See ak.pad_none.)

1 reply

sfrances Feb 20, 2024
Author

Great! This is awesome and super clear!

(Happy to have helped find an undocumented corner!)
