support for std::list #1017

Closed
DingXuefeng opened this issue Nov 10, 2023 · 1 comment · Fixed by #1181

@DingXuefeng

Currently our collaboration keeps part of its data in a TTree holding objects of the CalibEvent class. The CalibEvent class holds its members in a std::list, and std::list is not parsed by uproot5. It is not easy to update the software from our side, since it's used by many packages.

It would be nice if uproot5 could add support for std::list. If there is not enough manpower, I'd like to help; some hints on where to start would be very useful.

@DingXuefeng DingXuefeng added the feature New feature or request label Nov 10, 2023
@jpivarski
Member

I'm willing to help you implement that. The first step is to get a file with a std::list in it, preferably a small file, a simple file, or a file for which the values in the list are known. The process of looking at the bytes and recognizing how a std::list is to be interpreted will be much easier if you can recognize the values in the list as bytes. Thus, it's easier if the values in the list are integers smaller than 256 or floating point numbers that are powers of 2—those are much easier to recognize by eye.

The next step will likely be in uproot/interpretation/identify.py, to recognize std::list in the C++ type string. It would likely be very similar to std::vector, so probably just another stanza after this one:

elif tokens[i].group(0) == "vector" or _simplify_token(tokens[i]) == "std::vector":
    _parse_expect("<", tokens, i + 1, typename, file)
    i, values = _parse_node(
        tokens, i + 2, typename, file, quote, inner_header, inner_header
    )
    i = _parse_ignore_extra_arguments(tokens, i, typename, file, 1)
    _parse_expect(">", tokens, i, typename, file)
    if quote:
        return (
            i + 1,
            f"uproot.containers.AsVector({header}, {values})",
        )
    else:
        return i + 1, uproot.containers.AsVector(header, values)
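
To make the recognition step concrete outside of uproot, here is a minimal, self-contained sketch of the same idea. It is not uproot's parser: `parse_typename`, `_first_template_argument`, and the `"AsList"` label are hypothetical names chosen for illustration, and the sketch only knows about `vector` and `list`.

```python
import re

# Hypothetical mapping from C++ container name to an interpretation label.
# "AsList" is an assumed name, used here only for illustration.
_CONTAINERS = {"vector": "AsVector", "list": "AsList"}


def _first_template_argument(arguments):
    """Return the text before the first top-level comma (drops allocators)."""
    depth = 0
    for index, char in enumerate(arguments):
        if char == "<":
            depth += 1
        elif char == ">":
            depth -= 1
        elif char == "," and depth == 0:
            return arguments[:index]
    return arguments


def parse_typename(typename):
    """Parse e.g. 'std::list<float>' into ('AsList', 'float')."""
    typename = typename.strip()
    match = re.fullmatch(r"(?:std::)?(\w+)\s*<(.*)>", typename, flags=re.DOTALL)
    if match is None or match.group(1) not in _CONTAINERS:
        return typename  # not a container this sketch knows: treat as a leaf
    inner = _first_template_argument(match.group(2))
    return (_CONTAINERS[match.group(1)], parse_typename(inner))


print(parse_typename("std::list<std::vector<float> >"))
```

The point of the sketch is that `list` goes through exactly the same steps as `vector` (expect `<`, parse the inner type, ignore extra template arguments, expect `>`); only the name of the resulting interpretation differs.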

The identify.py stanza uses uproot.containers.AsVector to represent the std::vector interpretation. It lives in uproot/containers.py, which defines both AsVector, the interpretation of the type, and STLVector, the instantiation of a value when the output is not an Awkward Array (e.g. if read with library="np"). Here is the definition of AsVector:

class AsVector(AsContainer):
    """
    Args:
        header (bool): Sets the :ref:`uproot.containers.AsContainer.header`.
        values (:doc:`uproot.model.Model` or :doc:`uproot.containers.Container`): Data
            type for data nested in the container.

    A :doc:`uproot.containers.AsContainer` for ``std::vector``.
    """

    def __init__(self, header, values):
        self.header = header
        if isinstance(values, AsContainer):
            self._values = values
        elif isinstance(values, type) and issubclass(
            values, (uproot.model.Model, uproot.model.DispatchByVersion)
        ):
            self._values = values
        else:
            self._values = numpy.dtype(values)

    def __hash__(self):
        return hash((AsVector, self._header, self._values))

    @property
    def values(self):
        """
        Data type for data nested in the container.
        """
        return self._values

    def __repr__(self):
        if isinstance(self._values, type):
            values = self._values.__name__
        else:
            values = repr(self._values)
        return f"AsVector({self._header}, {values})"

    @property
    def cache_key(self):
        return f"AsVector({self._header},{_content_cache_key(self._values)})"

    @property
    def typename(self):
        return f"std::vector<{_content_typename(self._values)}>"

    def awkward_form(self, file, context):
        awkward = uproot.extras.awkward()
        return awkward.forms.ListOffsetForm(
            context["index_format"],
            uproot._util.awkward_form(self._values, file, context),
        )

    def read(self, chunk, cursor, context, file, selffile, parent, header=True):
        # AwkwardForth testing: test_0637's 00,03,04,06,07,08,09,10,11,12,13,14,15,16,17,23,24,26,27,28,31,33,36,38,41,42,43,44,45,46,49,50,55,56,57,58,59,60,61,62,63,67,68,72,73,76,77,80
        forth_stash = uproot._awkward_forth.forth_stash(context)
        if forth_stash is not None:
            forth_obj = forth_stash.get_gen_obj()

        if self._header and header:
            start_cursor = cursor.copy()
            (
                num_bytes,
                instance_version,
                is_memberwise,
            ) = uproot.deserialization.numbytes_version(chunk, cursor, context)
            if forth_stash is not None:
                temp_jump = cursor._index - start_cursor._index
                if temp_jump != 0:
                    forth_stash.add_to_pre(f"{temp_jump} stream skip\n")
        else:
            is_memberwise = False

        # Note: self._values can also be a NumPy dtype, and not necessarily a class
        # (e.g. type(self._values) == type)
        _value_typename = _content_typename(self._values)

        if is_memberwise:
            if forth_stash is not None:
                context["cancel_forth"] = True
            # let's hard-code in logic for std::pair<T1,T2> for now
            if not _value_typename.startswith("pair"):
                raise NotImplementedError(
                    """memberwise serialization of {}({})
in file {}""".format(
                        type(self).__name__, _value_typename, selffile.file_path
                    )
                )

            if not issubclass(self._values, uproot.model.DispatchByVersion):
                raise NotImplementedError(
                    """streamerless memberwise serialization of class {}({})
in file {}""".format(
                        type(self).__name__, _value_typename, selffile.file_path
                    )
                )

            # uninterpreted header
            cursor.skip(6)

            length = cursor.field(chunk, _stl_container_size, context)

            # no known class version number (maybe in that header? unclear...)
            model = self._values.new_class(file, "max")

            values = numpy.empty(length, dtype=_stl_object_type)

            # only do anything if we have anything to read...
            if length > 0:
                for i in range(length):
                    values[i] = model.read(
                        chunk,
                        cursor,
                        dict(context, reading=False),
                        file,
                        selffile,
                        parent,
                    )

                # memberwise reading!
                for member_index in range(len(values[0].member_names)):
                    for i in range(length):
                        values[i].read_member_n(
                            chunk, cursor, context, file, member_index
                        )
        else:
            length = cursor.field(chunk, _stl_container_size, context)
            if forth_stash is not None:
                key = forth_obj.get_keys(1)
                node_key = f"node{key}"
                form_key = f"node{key}-offsets"
                forth_stash.add_to_header(f"output node{key}-offsets int64\n")
                forth_stash.add_to_init(f"0 node{key}-offsets <- stack\n")
                forth_stash.add_to_pre(
                    f"stream !I-> stack\n dup node{key}-offsets +<- stack\n"
                )
                # forth_stash.add_to_post("loop\n")
                if (
                    forth_obj.should_add_form()
                    and forth_obj.awkward_model["name"] != node_key
                ):
                    forth_obj.add_form_key(form_key)
                    temp_aform = f'{{ "class":"ListOffsetArray", "offsets":"i64", "content": "NULL", "parameters": {{}}, "form_key": "node{key}"}}'
                    forth_obj.add_form(json.loads(temp_aform))
                if not isinstance(self._values, numpy.dtype):
                    forth_stash.add_to_pre("0 do\n")
                    forth_stash.add_to_post("loop\n")
                if forth_obj.awkward_model["name"] == node_key:
                    temp = forth_obj.awkward_model
                else:
                    temp = forth_obj.add_node(
                        node_key,
                        forth_stash.get_attrs(),
                        "i64",
                        1,
                        {},
                    )
                context["temp_ref"] = temp

            values = _read_nested(
                self._values, length, chunk, cursor, context, file, selffile, parent
            )
            if forth_stash is not None and not context["cancel_forth"]:
                forth_obj.go_to(temp)

        out = STLVector(values)

        if self._header and header:
            uproot.deserialization.numbytes_check(
                chunk,
                start_cursor,
                cursor,
                num_bytes,
                self.typename,
                context,
                file.file_path,
            )

        return out

    def __eq__(self, other):
        if not isinstance(other, AsVector):
            return False

        if self.header != other.header:
            return False

        if isinstance(self.values, numpy.dtype) and isinstance(
            other.values, numpy.dtype
        ):
            return self.values == other.values
        elif not isinstance(self.values, numpy.dtype) and not isinstance(
            other.values, numpy.dtype
        ):
            return self.values == other.values
        else:
            return False

and here is STLVector:

class STLVector(Container, Sequence):
    """
    Args:
        values (``numpy.ndarray`` or iterable): Contents of the ``std::vector``.

    Representation of a C++ ``std::vector`` as a Python ``Sequence``.
    """

    def __init__(self, values):
        if isinstance(values, types.GeneratorType):
            values = numpy.asarray(list(values))
        elif isinstance(values, Set):
            values = numpy.asarray(list(values))
        elif isinstance(values, (list, tuple)):
            values = numpy.asarray(values)

        self._values = values

    def __str__(self, limit=85):
        def tostring(i):
            return _tostring(self._values[i])

        return _str_with_ellipsis(tostring, len(self), "[", "]", limit)

    def __repr__(self, limit=85):
        return f"<STLVector {self.__str__(limit=limit - 30)} at 0x{id(self):012x}>"

    def __getitem__(self, where):
        return self._values[where]

    def __len__(self):
        return len(self._values)

    def __contains__(self, what):
        return what in self._values

    def __iter__(self):
        return iter(self._values)

    def __reversed__(self):
        return STLVector(self._values[::-1])

    def __eq__(self, other):
        if isinstance(other, STLVector):
            return self._values == other._values
        elif isinstance(other, Sequence):
            return self._values == other
        else:
            return False

    def tolist(self):
        return [
            x.tolist() if isinstance(x, (Container, numpy.ndarray)) else x for x in self
        ]

The STLVector is very straightforward; just a class with __getitem__ and __iter__ and such, so that it acts like a Sequence in Python. The AsVector is more complex because it's handling several cases:

  • Content as a value type, like int32 or float64, versus content as a more complex kind of record.
  • Reading the data from an entry of a TTree versus reading the data as an object that has been saved directly in a TDirectory.
  • Filling an Awkward Array (only TTree with library="ak" or library="pd" through awkward-pandas) versus filling an STLVector.
  • Using AwkwardForth if it is available (subset of Awkward case).
  • Reading memberwise or non-memberwise data (only non-memberwise has been implemented, but the other case needs to raise an error).
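
For the simpler of the two classes, a std::list counterpart to STLVector could be sketched as below. This is only an illustration under assumptions: the name `STLList` is hypothetical, it subclasses only `collections.abc.Sequence` (not uproot's `Container`) so that it runs on its own, and its `__eq__` returns a single bool rather than STLVector's elementwise comparison.

```python
from collections.abc import Sequence

import numpy as np


class STLList(Sequence):
    """Hypothetical Python stand-in for a C++ std::list, modeled on STLVector."""

    def __init__(self, values):
        # Accept generators, sets, lists, and tuples; keep NumPy arrays as they are.
        if not isinstance(values, np.ndarray):
            values = np.asarray(list(values))
        self._values = values

    def __getitem__(self, where):
        return self._values[where]

    def __len__(self):
        return len(self._values)

    def __repr__(self):
        return f"<STLList {self.tolist()!r}>"

    def __eq__(self, other):
        if isinstance(other, STLList):
            other = other._values
        elif not isinstance(other, (list, tuple, np.ndarray)):
            return NotImplemented
        other = np.asarray(other)
        return self._values.shape == other.shape and bool(
            np.all(self._values == other)
        )

    def tolist(self):
        return self._values.tolist()


numbers = STLList(x for x in (1, 2, 3))
print(numbers, len(numbers), list(reversed(numbers)))
```

`Sequence` supplies `__iter__`, `__contains__`, and `__reversed__` from `__getitem__` and `__len__`, which is why the sketch is shorter than STLVector.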

The first question I should have asked you is whether your std::list and CalibEvent are inside of a TTree or on their own in a TDirectory, since that cuts out some of the cases.

Even if it is inside of a TTree, which has more subcases, there is a minimal implementation that you can do to avoid the complex cases:

  • Do implement the value type versus complex record, even if you only have one kind of data, because this switch doesn't add much complexity and it would be confusing to future users if it handles std::list for one type of content but not another.
  • The things you need to worry about for inside-of-TTree are a strict superset of outside-of-TTree, so if your data are in a TTree, we'll get the outside-of-TTree case for free.
  • Don't worry about special-casing for Awkward Arrays or AwkwardForth. AwkwardForth is especially complicated and is being deeply refactored right now (see #943, "feat: refactoring the AwkwardForth code-discovery process"), so it would not pay to solve that problem in the main branch. To opt out of AwkwardForth, you can do
    if forth_stash is not None:
        context["cancel_forth"] = True
  • We have not been implementing memberwise deserialization anywhere, except in std::map, for which our only examples are memberwise (so for that one, we don't implement non-memberwise). You can check for memberwise (or non-memberwise, whichever your case isn't) and raise an error for the unhandled case.
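
The minimal choices above (handle both value types and records, refuse memberwise data) can be sketched without uproot as follows. Everything here is an assumption for illustration: `read_list_items` and its arguments are hypothetical names, plain `bytes` and an integer position stand in for uproot's chunk and cursor, and the memberwise flag is taken to be bit 14 of the version field, which is my understanding of ROOT's `kStreamedMemberWise`.

```python
import struct

import numpy as np

# Assumption: ROOT marks memberwise streaming with bit 14 of the version field.
K_STREAMED_MEMBERWISE = 1 << 14


def read_list_items(data, position, version, values):
    """Read a 4-byte big-endian count followed by that many items.

    `values` is either a NumPy dtype (simple value type) or a callable
    `reader(data, position) -> (item, new_position)` (complex record).
    """
    if version & K_STREAMED_MEMBERWISE:
        # Unhandled case: fail loudly instead of misreading the bytes.
        raise NotImplementedError("memberwise serialization of std::list")

    (length,) = struct.unpack_from(">i", data, position)
    position += 4

    if isinstance(values, np.dtype):
        items = np.frombuffer(data, dtype=values, count=length, offset=position)
        position += length * values.itemsize
    else:
        items = []
        for _ in range(length):
            item, position = values(data, position)
            items.append(item)
    return items, position


# Usage: three big-endian float32 values preceded by their count.
payload = struct.pack(">i3f", 3, 1.0, 2.0, 4.0)
floats, end = read_list_items(payload, 0, 3, np.dtype(">f4"))
print(floats.tolist(), end)
```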

As test-driven development, you can stub the read method of AsList with cursor.debug(chunk), followed by an exception just to stop the program flow. The debugging output looks like

--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
123 123 123  63 140 204 205  64  12 204 205  64  83  51  51  64 140 204 205  64
  {   {   {   ? --- --- ---   @ --- --- ---   @   S   3   3   @ --- --- ---   @
                        1.1             2.2             3.3             4.4
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
176   0   0  64 211  51  51  64 246 102 102  65  12 204 205  65  30 102 102  66
--- --- ---   @ ---   3   3   @ ---   f   f   A --- --- ---   A ---   f   f   B
        5.5             6.6             7.7             8.8             9.9
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
202   0   0  67  74   0   0  67 151 128   0 123 123
--- --- ---   C   J --- ---   C --- --- ---   {   {
      101.0           202.0           303.0

(with some of the options turned on, dtype=">f4" and offset=3). The three rows that are always present are the --+---+---+---+--- separators, the decimal-valued bytes, and the interpretation as printable characters. With a given dtype, the debugging output will also show you the values interpreted as a numeric type, but you have to get the offset correct for this to be useful. Since not all of the data belong to a given dtype, it's usually easier to put data in the file that correspond to easy-to-read bytes. For example, big-endian (ROOT is big-endian) int32 values look like this as bytes:

>>> import numpy as np
>>> np.array([1, 2, 3, 4, 5], dtype=">i4").view("u1")
array([0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5],
      dtype=uint8)

That's very readable, even if it's embedded among some headers and strings. (Strings are easy to identify from the character interpretation lines.) Keep in mind that you want the numbers you're using as anchors to be distinguishable from the surrounding headers, which often have a lot of zeros, so pick numbers that are not zero (or one). 123 is a great one to use; it's easy to pick out by eye and it's small enough to fit in one byte.
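
As a cross-check of the debugging output shown earlier, the same bytes can be reinterpreted with plain NumPy, and candidate anchor values can be previewed before writing a test file. This is a sketch using only NumPy; the variable names are arbitrary, and the anchor values (123, and the floats 0.5, 2.0, 4.0) are just examples of the kind suggested above.

```python
import numpy as np

# The 53 bytes from the debugging output: 3 marker bytes, 12 floats, 2 marker bytes.
dump = bytes([
    123, 123, 123, 63, 140, 204, 205, 64, 12, 204, 205, 64, 83, 51, 51, 64, 140, 204, 205, 64,
    176, 0, 0, 64, 211, 51, 51, 64, 246, 102, 102, 65, 12, 204, 205, 65, 30, 102, 102, 66,
    202, 0, 0, 67, 74, 0, 0, 67, 151, 128, 0, 123, 123,
])

# Same settings as the debug call: big-endian float32, starting 3 bytes in.
floats = np.frombuffer(dump, dtype=">f4", count=12, offset=3)
print(floats)

# Preview how candidate anchor values would appear as bytes in a ROOT file.
anchors_int = np.array([123, 123], dtype=">i4").view("u1")
anchors_float = np.array([0.5, 2.0, 4.0], dtype=">f4").view("u1")
print(anchors_int)    # each 123 is three zero bytes followed by 123
print(anchors_float)  # powers of 2 have only one or two nonzero bytes
```

If the offset is wrong by even one byte, the decoded floats turn into nonsense, which is why recognizable bytes are a more robust anchor than the numeric interpretation.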

After having said all of that, I highly suspect that the byte-serialization of std::list will be just like that of std::vector: a 6-byte header that you can ignore, which starts with a decimal 64 (that's a high-bit flag in the 4-byte integer part of the 6-byte header), followed by a 4-byte "number of items in the std::list," followed by that many data values.

For example:

--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
 64   0   0  22   0   3   0   0   0   5   0   0   0   1   0   0   0   2   0   0
  @ --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
  0   3   0   0   0   4   0   0   0   5
--- --- --- --- --- --- --- --- --- ---

where the total number of bytes in the object (which you don't need) is 22, the std::list serialization version (that I just made up) is 3, there are 5 elements in the list, followed by the values for 1, 2, 3, 4, and 5.

That's a guess, but the reason I guessed that is because it's how std::vector is serialized, how std::set is serialized, how std::map would be serialized except that it's memberwise and the data come in key-value pairs, and it's how ROOT's RVec is serialized. I'd be surprised if they break pattern for std::list. (How these STL objects are implemented in memory in C++ doesn't matter for how they are serialized to disk.)
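
The guessed layout can be decoded field by field with the standard library, using the example bytes exactly as written above. This is a sketch of the hypothesis, not a confirmed description of std::list on disk; `K_BYTE_COUNT_MASK` is my assumption for the flag bit that shows up as the leading 64.

```python
import struct

# Assumption: the high-bit flag that makes the first header byte a decimal 64.
K_BYTE_COUNT_MASK = 0x40000000

# The 30 example bytes: 6-byte header, 4-byte count, five 4-byte integers.
example = bytes([
    64, 0, 0, 22, 0, 3, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0,
    0, 3, 0, 0, 0, 4, 0, 0, 0, 5,
])

# Header: 4-byte flagged byte count, then a 2-byte version.
raw_count, version = struct.unpack_from(">IH", example, 0)
num_bytes = raw_count & ~K_BYTE_COUNT_MASK

# Body: number of items, then that many big-endian int32 values.
(length,) = struct.unpack_from(">i", example, 6)
values = list(struct.unpack_from(f">{length}i", example, 10))

print(num_bytes, version, length, values)
```

If the real std::list bytes from a test file decode cleanly this way, the std::vector code path can likely be reused almost unchanged.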

Good luck, and I'm available for help if you have any questions!

@ioanaif ioanaif self-assigned this Feb 9, 2024
@ioanaif ioanaif linked a pull request Mar 22, 2024 that will close this issue