support for std::list #1017

DingXuefeng · 2023-11-10T02:18:30Z

Currently our collaboration keeps part of its data in a TTree holding objects of the CalibEvent class. CalibEvent stores its members in a std::list, which uproot5 cannot parse. Updating the software on our side is not easy, since it is used by many packages.

It would be nice if uproot5 could add support for std::list. If there is not enough manpower, I'd like to help, and some hints on where to start would be very useful.

jpivarski · 2023-11-10T17:26:51Z

I'm willing to help you implement that. The first step is to get a file with a std::list in it, preferably a small file, a simple file, or a file for which the values in the list are known. The process of looking at the bytes and recognizing how a std::list is to be interpreted will be much easier if you can recognize the values in the list as bytes. Thus, it's easier if the values in the list are integers smaller than 256 or floating point numbers that are powers of 2—those are much easier to recognize by eye.
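To illustrate why such values are easy to spot, here is a minimal sketch using only Python's struct module. The helper name big_endian_bytes is made up for this illustration; it just lists the big-endian byte values that encode a few numbers.

```python
import struct

def big_endian_bytes(values, code):
    """Return the big-endian byte values (0-255) that encode `values`.

    `code` is a struct format character: "f" for float32, "i" for int32.
    """
    return list(struct.pack(f">{len(values)}{code}", *values))

# Powers of 2 as float32: the last two bytes of each value are always zero.
powers = big_endian_bytes([1.0, 2.0, 4.0, 8.0], "f")

# Integers below 256 as int32: three zero bytes, then the value itself.
small = big_endian_bytes([123, 45], "i")

print(powers)
print(small)
```

Values like 1.1 or 1000000, by contrast, spread nonzero bits over most of their bytes and are much harder to pick out of a dump.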

The next step will likely be in uproot/interpretation/identify.py, to recognize std::list in the C++ type string. It would likely be very similar to std::vector, so probably just another stanza after this one:

uproot5/src/uproot/interpretation/identify.py

Lines 958 to 971 in ccb56b2

    
elif tokens[i].group(0) == "vector" or _simplify_token(tokens[i]) == "std::vector":
    _parse_expect("<", tokens, i + 1, typename, file)
    i, values = _parse_node(
        tokens, i + 2, typename, file, quote, inner_header, inner_header
    )
    i = _parse_ignore_extra_arguments(tokens, i, typename, file, 1)
    _parse_expect(">", tokens, i, typename, file)
    if quote:
        return (
            i + 1,
            f"uproot.containers.AsVector({header}, {values})",
        )
    else:
        return i + 1, uproot.containers.AsVector(header, values)
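The idea of "another stanza" can be shown with a toy stand-in for the parser. This is not uproot's tokenizer: the regular expression and the function identify_container are hypothetical and handle only a single outer container. It shows list and std::list being routed to an AsList interpretation the same way vector and std::vector are routed to AsVector.

```python
import re

# Hypothetical, simplified stand-in for the real parser: it only looks at the
# outermost "container<inner>" and does not handle extra template arguments.
_CONTAINER = re.compile(r"^\s*(?:std::)?(vector|list)\s*<\s*(.+?)\s*>\s*$")

def identify_container(typename):
    """Map a C++ type string to an (interpretation name, inner type) pair."""
    match = _CONTAINER.match(typename)
    if match is None:
        raise ValueError(f"unrecognized container type: {typename!r}")
    kind, inner = match.groups()
    # "AsList" is the name assumed here for the new interpretation.
    name = {"vector": "AsVector", "list": "AsList"}[kind]
    return name, inner

print(identify_container("std::list<int>"))
print(identify_container("vector<float>"))
```

In the real identify.py, the new stanza would reuse _parse_expect, _parse_node, and _parse_ignore_extra_arguments exactly as the std::vector stanza does.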

In the above, you can see that we use uproot.containers.AsVector to represent the std::vector interpretation. This is in uproot/containers.py, which defines both AsVector, the interpretation of the type, and STLVector, the instantiation of a value when the output is not an Awkward Array (e.g. if read with library="np"). Here is the definition of AsVector:

uproot5/src/uproot/containers.py

Lines 1009 to 1209 in ccb56b2

    
class AsVector(AsContainer):
    """
    Args:
        header (bool): Sets the :ref:`uproot.containers.AsContainer.header`.
        values (:doc:`uproot.model.Model` or :doc:`uproot.containers.Container`): Data
            type for data nested in the container.

    A :doc:`uproot.containers.AsContainer` for ``std::vector``.
    """

    def __init__(self, header, values):
        self.header = header
        if isinstance(values, AsContainer):
            self._values = values
        elif isinstance(values, type) and issubclass(
            values, (uproot.model.Model, uproot.model.DispatchByVersion)
        ):
            self._values = values
        else:
            self._values = numpy.dtype(values)

    def __hash__(self):
        return hash((AsVector, self._header, self._values))

    @property
    def values(self):
        """
        Data type for data nested in the container.
        """
        return self._values

    def __repr__(self):
        if isinstance(self._values, type):
            values = self._values.__name__
        else:
            values = repr(self._values)
        return f"AsVector({self._header}, {values})"

    @property
    def cache_key(self):
        return f"AsVector({self._header},{_content_cache_key(self._values)})"

    @property
    def typename(self):
        return f"std::vector<{_content_typename(self._values)}>"

    def awkward_form(self, file, context):
        awkward = uproot.extras.awkward()
        return awkward.forms.ListOffsetForm(
            context["index_format"],
            uproot._util.awkward_form(self._values, file, context),
        )

    def read(self, chunk, cursor, context, file, selffile, parent, header=True):
        # AwkwardForth testing: test_0637's 00,03,04,06,07,08,09,10,11,12,13,14,15,16,17,23,24,26,27,28,31,33,36,38,41,42,43,44,45,46,49,50,55,56,57,58,59,60,61,62,63,67,68,72,73,76,77,80
        forth_stash = uproot._awkward_forth.forth_stash(context)
        if forth_stash is not None:
            forth_obj = forth_stash.get_gen_obj()

        if self._header and header:
            start_cursor = cursor.copy()
            (
                num_bytes,
                instance_version,
                is_memberwise,
            ) = uproot.deserialization.numbytes_version(chunk, cursor, context)
            if forth_stash is not None:
                temp_jump = cursor._index - start_cursor._index
                if temp_jump != 0:
                    forth_stash.add_to_pre(f"{temp_jump} stream skip\n")
        else:
            is_memberwise = False

        # Note: self._values can also be a NumPy dtype, and not necessarily a class
        # (e.g. type(self._values) == type)
        _value_typename = _content_typename(self._values)

        if is_memberwise:
            if forth_stash is not None:
                context["cancel_forth"] = True
            # let's hard-code in logic for std::pair<T1,T2> for now
            if not _value_typename.startswith("pair"):
                raise NotImplementedError(
                    """memberwise serialization of {}({})
in file {}""".format(
                        type(self).__name__, _value_typename, selffile.file_path
                    )
                )

            if not issubclass(self._values, uproot.model.DispatchByVersion):
                raise NotImplementedError(
                    """streamerless memberwise serialization of class {}({})
in file {}""".format(
                        type(self).__name__, _value_typename, selffile.file_path
                    )
                )

            # uninterpreted header
            cursor.skip(6)

            length = cursor.field(chunk, _stl_container_size, context)

            # no known class version number (maybe in that header? unclear...)
            model = self._values.new_class(file, "max")

            values = numpy.empty(length, dtype=_stl_object_type)

            # only do anything if we have anything to read...
            if length > 0:
                for i in range(length):
                    values[i] = model.read(
                        chunk,
                        cursor,
                        dict(context, reading=False),
                        file,
                        selffile,
                        parent,
                    )

                # memberwise reading!
                for member_index in range(len(values[0].member_names)):
                    for i in range(length):
                        values[i].read_member_n(
                            chunk, cursor, context, file, member_index
                        )
        else:
            length = cursor.field(chunk, _stl_container_size, context)
            if forth_stash is not None:
                key = forth_obj.get_keys(1)
                node_key = f"node{key}"
                form_key = f"node{key}-offsets"
                forth_stash.add_to_header(f"output node{key}-offsets int64\n")
                forth_stash.add_to_init(f"0 node{key}-offsets <- stack\n")
                forth_stash.add_to_pre(
                    f"stream !I-> stack\n dup node{key}-offsets +<- stack\n"
                )
                # forth_stash.add_to_post("loop\n")
                if (
                    forth_obj.should_add_form()
                    and forth_obj.awkward_model["name"] != node_key
                ):
                    forth_obj.add_form_key(form_key)
                    temp_aform = f'{{ "class":"ListOffsetArray", "offsets":"i64", "content": "NULL", "parameters": {{}}, "form_key": "node{key}"}}'
                    forth_obj.add_form(json.loads(temp_aform))
                if not isinstance(self._values, numpy.dtype):
                    forth_stash.add_to_pre("0 do\n")
                    forth_stash.add_to_post("loop\n")
                if forth_obj.awkward_model["name"] == node_key:
                    temp = forth_obj.awkward_model
                else:
                    temp = forth_obj.add_node(
                        node_key,
                        forth_stash.get_attrs(),
                        "i64",
                        1,
                        {},
                    )
                context["temp_ref"] = temp

            values = _read_nested(
                self._values, length, chunk, cursor, context, file, selffile, parent
            )

        if forth_stash is not None and not context["cancel_forth"]:
            forth_obj.go_to(temp)

        out = STLVector(values)

        if self._header and header:
            uproot.deserialization.numbytes_check(
                chunk,
                start_cursor,
                cursor,
                num_bytes,
                self.typename,
                context,
                file.file_path,
            )

        return out

    def __eq__(self, other):
        if not isinstance(other, AsVector):
            return False

        if self.header != other.header:
            return False

        if isinstance(self.values, numpy.dtype) and isinstance(
            other.values, numpy.dtype
        ):
            return self.values == other.values
        elif not isinstance(self.values, numpy.dtype) and not isinstance(
            other.values, numpy.dtype
        ):
            return self.values == other.values
        else:
            return False

and here is STLVector:

uproot5/src/uproot/containers.py

Lines 1729 to 1782 in ccb56b2

    
class STLVector(Container, Sequence):
    """
    Args:
        values (``numpy.ndarray`` or iterable): Contents of the ``std::vector``.

    Representation of a C++ ``std::vector`` as a Python ``Sequence``.
    """

    def __init__(self, values):
        if isinstance(values, types.GeneratorType):
            values = numpy.asarray(list(values))
        elif isinstance(values, Set):
            values = numpy.asarray(list(values))
        elif isinstance(values, (list, tuple)):
            values = numpy.asarray(values)

        self._values = values

    def __str__(self, limit=85):
        def tostring(i):
            return _tostring(self._values[i])

        return _str_with_ellipsis(tostring, len(self), "[", "]", limit)

    def __repr__(self, limit=85):
        return f"<STLVector {self.__str__(limit=limit - 30)} at 0x{id(self):012x}>"

    def __getitem__(self, where):
        return self._values[where]

    def __len__(self):
        return len(self._values)

    def __contains__(self, what):
        return what in self._values

    def __iter__(self):
        return iter(self._values)

    def __reversed__(self):
        return STLVector(self._values[::-1])

    def __eq__(self, other):
        if isinstance(other, STLVector):
            return self._values == other._values
        elif isinstance(other, Sequence):
            return self._values == other
        else:
            return False

    def tolist(self):
        return [
            x.tolist() if isinstance(x, (Container, numpy.ndarray)) else x for x in self
        ]

The STLVector is very straightforward; just a class with __getitem__ and __iter__ and such, so that it acts like a Sequence in Python. The AsVector is more complex because it's handling several cases:

Content as a value type, like int32 or float64, versus content as a more complex kind of record.
Reading the data from an entry of a TTree versus reading the data as an object that has been saved directly in a TDirectory.
Filling an Awkward Array (only TTree with library="ak" or library="pd" through awkward-pandas) versus filling a STLVector.
Using AwkwardForth if it is available (subset of Awkward case).
Reading memberwise or non-memberwise data (only non-memberwise has been implemented, but the other case needs to raise an error).
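By analogy with STLVector above, the value class for std::list could be as small as the following sketch. The name STLList and every detail here are assumptions for illustration only: it stores values in a plain Python list rather than a NumPy array and leaves out uproot's Container base class and string-formatting helpers.

```python
from collections.abc import Sequence

class STLList(Sequence):
    """Hypothetical read-only view of a C++ std::list as a Python Sequence."""

    def __init__(self, values):
        # Copy into a plain list; the real class might keep a NumPy array.
        self._values = list(values)

    def __getitem__(self, where):
        return self._values[where]

    def __len__(self):
        return len(self._values)

    def __eq__(self, other):
        if isinstance(other, STLList):
            return self._values == other._values
        if isinstance(other, Sequence) and not isinstance(other, (str, bytes)):
            return self._values == list(other)
        return NotImplemented

    def __repr__(self):
        return f"<STLList {self._values!r}>"

    def tolist(self):
        # Recurse into nested containers that provide their own tolist().
        return [x.tolist() if hasattr(x, "tolist") else x for x in self]

example = STLList([1, 2, 3])
print(example, len(example), list(reversed(example)))
```

Deriving from Sequence supplies __iter__, __contains__, __reversed__, index, and count once __getitem__ and __len__ exist, which is why the class stays short.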

The first question I should have asked you is whether your std::list and CalibEvent are inside of a TTree or on their own in a TDirectory, since that cuts out some of the cases.

Even if it is inside of a TTree, which has more subcases, there is a minimal implementation that you can do to avoid the complex cases:

Do implement the value type versus complex record, even if you only have one kind of data, because this switch doesn't add much complexity and it would be confusing to future users if it handles std::list for one type of content but not another.
The things you need to worry about for inside-of-TTree are a strict superset of outside-of-TTree, so if your data are in a TTree, we'll get the outside-of-TTree case for free.
Don't worry about special-casing for Awkward Arrays or AwkwardForth. AwkwardForth is especially complicated and is being deeply refactored right now (feat: refactoring the AwkwardForth code-discovery process #943), so it would not pay to solve that problem in the main branch. You can do

if forth_stash is not None:
    context["cancel_forth"] = True

We have not been implementing memberwise deserialization anywhere, except in std::map, for which our only examples are memberwise (so for that one, we don't implement non-memberwise). You can check for memberwise (or non-memberwise, whichever your case isn't) and raise an error for the unhandled case.

As test-driven development, you can stub the read method of AsList with cursor.debug(chunk), followed by an exception just to stop the program flow. The debugging output looks like

--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
123 123 123  63 140 204 205  64  12 204 205  64  83  51  51  64 140 204 205  64
  {   {   {   ? --- --- ---   @ --- --- ---   @   S   3   3   @ --- --- ---   @
                        1.1             2.2             3.3             4.4
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
176   0   0  64 211  51  51  64 246 102 102  65  12 204 205  65  30 102 102  66
--- --- ---   @ ---   3   3   @ ---   f   f   A --- --- ---   A ---   f   f   B
        5.5             6.6             7.7             8.8             9.9
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
202   0   0  67  74   0   0  67 151 128   0 123 123
--- --- ---   C   J --- ---   C --- --- ---   {   {
      101.0           202.0           303.0

(with some of the options turned on, dtype=">f4" and offset=3). The three rows that are always present are the --+---+---+---+--- separators, the decimal-valued bytes, and the interpretation as printable characters. With a given dtype, the debugging output will also show you the values interpreted as a numeric type, but you have to get the offset correct for this to be useful. Since not all of the data belong to a given dtype, it's usually easier to put data in the file that correspond to easy-to-read bytes. For example, big-endian (ROOT is big-endian) int32 values look like this as bytes:

>>> np.array([1, 2, 3, 4, 5], dtype=">i4").view("u1")
array([0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5],
      dtype=uint8)

That's very readable, even if it's embedded among some headers and strings. (Strings are easy to identify from the character interpretation lines.) Keep in mind that you want the numbers you're using as anchors to be distinguishable from the surrounding headers, which often have a lot of zeros, so pick numbers that are not zero (or one). 123 is a great one to use; it's easy to pick out by eye and it's small enough to fit in one byte.
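If you want to practice reading such dumps before touching a real file, the three always-present rows can be imitated with a few lines of plain Python. This is a rough imitation for practice only, not uproot's implementation: the function name dump_bytes and the choice of which bytes count as printable are assumptions.

```python
def dump_bytes(data, width=20):
    """Render bytes as separator, decimal, and printable-character rows."""
    lines = []
    for start in range(0, len(data), width):
        row = data[start:start + width]
        lines.append("--+-" * len(row))
        lines.append(" ".join(f"{b:3d}" for b in row))
        # Assumption: only visible ASCII (33..126) is shown as a character.
        lines.append(" ".join(
            f"{chr(b):>3}" if 33 <= b < 127 else "---" for b in row
        ))
    return "\n".join(lines)

# Two int32 anchors (123, 123) surrounded by zero-filled "header" bytes.
sample = bytes([0, 0, 0, 0, 0, 0, 0, 123, 0, 0, 0, 123, 0, 0])
print(dump_bytes(sample))
```

In the output, each 123 shows up as a `{` in the character row, standing out from the runs of `---` around it.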

After having said all of that, I strongly suspect that the byte-serialization of std::list will be just like that of std::vector: a 6-byte header that you can ignore, which starts with a decimal 64 (that's a high-bit flag in the 4-byte integer part of the 6-byte header), followed by a 4-byte "number of items in the std::list," followed by that many data values.

For example:

--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
 64   0   0  26   0   3   0   0   0   5   0   0   0   1   0   0   0   2   0   0
  @ --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
  0   3   0   0   0   4   0   0   0   5
--- --- --- --- --- --- --- --- --- ---

where the byte count in the header (which you don't need) is 26, covering the 2-byte version, the 4-byte length, and the 20 bytes of values that follow the count itself; the std::list serialization version (that I just made up) is 3; there are 5 elements in the list; and then come the values 1, 2, 3, 4, and 5.

That's a guess, but the reason I guessed that is because it's how std::vector is serialized, how std::set is serialized, how std::map would be serialized except that it's memberwise and the data come in key-value pairs, and it's how ROOT's RVec is serialized. I'd be surprised if they break pattern for std::list. (How these STL objects are implemented in memory in C++ doesn't matter for how they are serialized to disk.)
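The guessed layout can be turned into a small decoder that works on raw bytes, which is a convenient way to check the guess against whatever cursor.debug prints. Everything in this sketch is an assumption built on that guess: the function name decode_guessed_list, the default int32 item format, and the memberwise bit taken to be bit 14 of the version field.

```python
import struct

K_BYTE_COUNT_FLAG = 0x40000000  # high bit that shows up as a leading 64
K_MEMBERWISE_FLAG = 0x4000      # assumed memberwise bit in the 2-byte version

def decode_guessed_list(data, item_format=">i"):
    """Decode (version, length, items) under the guessed std::list layout."""
    raw_count, version = struct.unpack_from(">IH", data, 0)
    if not raw_count & K_BYTE_COUNT_FLAG:
        raise ValueError("expected the byte-count flag in the first byte")
    if version & K_MEMBERWISE_FLAG:
        # Mirror the advice above: refuse the case that is not handled.
        raise NotImplementedError("memberwise serialization is not handled")
    (length,) = struct.unpack_from(">I", data, 6)
    item_size = struct.calcsize(item_format)
    items = [
        struct.unpack_from(item_format, data, 10 + i * item_size)[0]
        for i in range(length)
    ]
    return version, length, items

# 4-byte count (ignored apart from its flag), version 3, 3 items: 123, 45, 67.
data = bytes([64, 0, 0, 18, 0, 3, 0, 0, 0, 3,
              0, 0, 0, 123, 0, 0, 0, 45, 0, 0, 0, 67])
print(decode_guessed_list(data))
```

If the real bytes do not decode cleanly this way, that is the signal that std::list breaks the pattern and the dump needs a closer look.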

Good luck, and I'm available for help if you have any questions!

DingXuefeng added the feature New feature or request label Nov 10, 2023

ioanaif self-assigned this Feb 9, 2024

ioanaif linked a pull request Mar 22, 2024 that will close this issue

feat: add support for std::list #1181

Merged

ioanaif closed this as completed in #1181 Mar 22, 2024

github-project-automation bot added this to Finalization Aug 28, 2024

github-project-automation bot moved this to Done! in Finalization Aug 28, 2024
