Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement TGraph-writing #1144

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jpivarski
Copy link
Member

This is a draft. It needs TGraphErrors, TGraphAsymmErrors, tests of writing to actual files and reading them back with ROOT, and maybe some high-level interface so that we can

output_file["name"] = XYZ

for a Python object XYZ that would be interpreted as a TGraph (or TGraphErrors, or TGraphAsymmErrors). What should that XYZ be? Not raw NumPy arrays, since a tuple of NumPy arrays is already interpreted as a histogram. Does it need to be some kind of uproot.make_TGraph function? It would be nice to have something more Pythonic.

Should any Pandas DataFrame with columns named x and y be interpreted as a TGraph? A Pandas DataFrame would supply a title.

Thoughts, @grzanka?

@grzanka
Copy link

grzanka commented Feb 24, 2024

Should any Pandas DataFrame with columns named x and y be interpreted as a TGraph? A Pandas DataFrame would supply a title.

Thoughts, @grzanka?

Another candidate may be a python dataclass with x and y fields containing arrays of numbers ? Or numpy recarray ?

@grzanka
Copy link

grzanka commented Feb 27, 2024

As discussed here: #1142 (comment) we may create confusion by considering an object that can be converted both to TTree and TGraph.

@jpivarski jpivarski added the inactive A pull request that hasn't been touched in a long time label Mar 20, 2024
@Pepesob
Copy link
Contributor

Pepesob commented Jul 23, 2024

Hello @jpivarski! I am a student of @grzanka, and I would like to join to this issue. I have begun to understand how different objects are transformed so they can be saved to a root file, and I believe I got good grasp of it. I have started to implement prototype of a function to_TGraph(), which, when given two arrays of values (and some other arguments), returns a writable TGraph object. It works, but more things still need to be done (especially TGraphErrors and TGraphAsymErrors support)

def to_TGraph(
    fName,
    fTitle,

    fNpoints,
    fX,
    fY,
    fFunctions,  # not used
    fHistogram,  # not used
    fMinimum, 
    fMaximum,

    fLineColor=602,
    fLineStyle=1,
    fLineWidth=1,
    fFillColor=0,
    fFillStyle=1001,
    fMarkerColor=1,
    fMarkerStyle=1,
    fMarkerSize=1.0,):
    
    tobject = uproot.models.TObject.Model_TObject.empty()

    tnamed = uproot.models.TNamed.Model_TNamed.empty()
    tnamed._deeply_writable = True
    tnamed._bases.append(tobject)
    tnamed._members["fName"] = fName
    tnamed._members["fTitle"] = fTitle

    tattline = uproot.models.TAtt.Model_TAttLine_v2.empty()
    tattline._deeply_writable = True
    tattline._members["fLineColor"] = fLineColor
    tattline._members["fLineStyle"] = fLineStyle
    tattline._members["fLineWidth"] = fLineWidth

    tattfill = uproot.models.TAtt.Model_TAttFill_v2.empty()
    tattfill._deeply_writable = True
    tattfill._members["fFillColor"] = fFillColor
    tattfill._members["fFillStyle"] = fFillStyle

    tattmarker = uproot.models.TAtt.Model_TAttMarker_v2.empty()
    tattmarker._deeply_writable = True
    tattmarker._members["fMarkerColor"] = fMarkerColor
    tattmarker._members["fMarkerStyle"] = fMarkerStyle
    tattmarker._members["fMarkerSize"] = fMarkerSize


    tGraph = uproot.models.TGraph.Model_TGraph_v4.empty()

    tGraph._bases.append(tnamed)
    tGraph._bases.append(tattline)
    tGraph._bases.append(tattfill)
    tGraph._bases.append(tattmarker)
    

    tGraph._members["fNpoints"] = fNpoints
    tGraph._members["fX"] = fX
    tGraph._members["fY"] = fY
    # fFuncitons - do i need it? probably not because it is array of zero items
    # fHistogram - do i need it? probably not because it is a nullptr
    tGraph._members["fMinimum"] = fMinimum
    tGraph._members["fMaximum"] = fMaximum

    return tGraph

Now, the issue is deciding what we want to interpret as a TGraph. After some time of thinking, I came up with a few solutions:
(when not explicitly said, all rules that are mentioned for TGraph also apply for TGraphErrors and TGraphAsymErrors)

  1. make function that saves to TGraph:

    Pros:

    • explicit method
    • will work with popular datastructure libraries
    • can't be confused with anything
    • can add some arguments to customize TGraph (like title)

    Cons:

    • not pythonic
    • not within UpRoot convention by asigning using [] (setitem)
  2. crate class that will be always interpreted as TGraph (something like TGraphConvertable)

    Pros:

    • within UpRoot convention by asigning using [] (setitem)
    • will work with popular datastructure libraries
    • can add some arguments to customize TGraph (like title)
    • can't be confused with anything

    Cons:

    • user has to create specific object
    • only this class can be interpreted as TGraph
  3. use metadata hiden in some most popular datastructure libraries (eg. numpy.dtype.metadata), when certain flag is set (eg. {"as_TGraph": True}) then object will be interpreted as TGraph

    Pros:

    • saving to TGraph will work with many popular datastructure libraries
    • within UpRoot convention by asigning using [] (setitem)

    Cons:

    • interfering with internal informations about datastructures
    • some of popular datastructure libraries don't have metadata, don't give access to it or access is difficult
    • every datastructure librariary have different access to metadata so implementing support for all of them will be tedious
    • will require additional function to set this metadata
  4. interpret dict/NamedTuple with keys: {'x','y','errors_x_high','errors_x_low','errors_y_high','errors_y_low', ...} as TGraphAsymmErrors, when 'errors_*_low' are set to "None" interpret it as TGraphErrors, when 'errors_*_*' are set to "None" interpret is as TGraph

    Pros:

    • within UpRoot convention by asigning using [] (setitem)
    • it will work with all popular datastructure libraries
    • require little effort from user
    • can add aditional keys to customize TGraph (eg. {title,...]})

    Cons:

    • it overlaps with existing TTree/histogram convention
  5. point 4. solution with flag "implicit_TGraph_convertion", if set to True point 4. works normally, if set to False no implicit TGraph convertion (require other method, maybe combined with point 1. or/and 2.)) (in my opinion default set to True)

    Pros:

    • within UpRoot convention by asigning using [] (setitem)
    • it will work with all popular datastructure libraries
    • require little effort from user
    • user can chose between many saving methods
    • reduce unintentional overlapping

    Cons:

    • added hiden complexity
    • does not remove overlapping entirely

In my opinion solution number 5. is the best one. It gives the most flexibility with little risk of unintentional conversion.

What do you think? Which solution will suit the best? Or maybe you have your own vision that would be better? I would like to know your opinion!

@jpivarski
Copy link
Member Author

It can't be a dict because all dicts are already interpreted as TTree (as a set of TBranch names to arrays).

Adding attributes, as in point 5, is somewhat cumbersome—it would have to be extra lines of Python, hard to do interactively—and hidden. You would just need to know the name, and even if you know that there is a special name, you might need to look it up to get the right spelling.

How about a DataFrame? It can be recognized if it has column names like x, y, errors_x, errors_y, errors_x_high, errors_x_low, errors_y_high, errors_y_low, with rules for how errors_x_high and errors_x_low overrule errors_x (or raise an error if both are present), and rules for whether to make a TGraph, TGraphErrors, or TGraphAsymmErrors.

DataFrames (just Pandas or all of the new ones, Polars, CuDF, Modin? Can they be recognized in bulk?) are the basic way of describing scatter-point data in Python, and the TGraph* classes are the basic way of describing scatter-point data in ROOT.

@Pepesob
Copy link
Contributor

Pepesob commented Jul 23, 2024

Isn't there the same problem with DataFrame as it is with Dictionary? Right now pandas DataFrame is transformed into dictionary and then interpreted as TTree.

if uproot._util.from_module(obj, "pandas"):
import pandas
if isinstance(
obj, pandas.DataFrame
) and uproot._util.pandas_has_attr_is_numeric(pandas)(obj.index):
obj = uproot.writing._cascadetree.dataframe_to_dict(obj)

If I understand correctly you suggest to add logic, so whenever there are columns in DataFrame with given names {x, y, errors_x, etc... }, then recognize DataFrame as TGraph (or TGraphErrors, TGraphAsymmErrors)?

@jpivarski
Copy link
Member Author

We could catch the DataFrame type before asking if it's a Mapping. This would take away some of the types that are currently being identified as TTrees, but accidentally. There needs to be simple, easy-to-remember criteria for what will become a TTree and what will become a TGraph*, and DataFrame vs other Mapping could be that criteria. Presence of a special attribute is more subtle.

@Pepesob
Copy link
Contributor

Pepesob commented Jul 23, 2024

I think that always converting DataFrame into TGraph is a good idea, because usually DataFrame do not have entries of variable size in column (at least in pandas), so they could be rightly interpreted as [x,y] (+errors) coordinates. However wouldn't that change break people existing code that transforms pandas DataFrame into TTree? I would still propose the flag implicit_DataFrame_TGraph_convertion to give choice to people.

@jpivarski
Copy link
Member Author

I had forgotten that DataFrame was a major, recommended way to make TTrees (not just an accident of DataFrames being mappings).

Okay.

But still, setting a flag like implicit_DataFrame_TGraph_convertion on the DataFrame (can't do it to a Python dict) is really obscure. It's hard to memorize; if I type implicitly_DataFrame_TGraph_convertion, it won't work.

What about this?

f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)

Then the user only needs to remember a function named uproot.as_TGraph, and it might be discoverable by tab-complete. ("What was it? uproot.<tab> (lots of things), uproot.as_<tab>, ah, there it is!")

This function could just add the flag, and a DataFrame/namedtuple with that flag is recognized as a TGraph. So it's possible to do both.

@Pepesob
Copy link
Contributor

Pepesob commented Jul 23, 2024

Okay there is slight misconception with the flag. What i mean was something like this:

implicit_DataFrame_TGraph_convertion = True

def set_implicit_DataFrame_TGraph_convertion(value: bool):
    if not isinstance(value, bool):
        raise ValueError("Flag must be boolean")
    
    global implicit_DataFrame_TGraph_convertion
    implicit_DataFrame_TGraph_convertion = value

and then in the add_to_directory function

if implicit_DataFrame_TGraph_convertion and uproot._util.is_dataframe(obj):
    if uproot._util.from_module(obj, "pandas"):
        import pandas
        [...]

So it is up to the user what behaviour he wants, implicit DataFrame conversion to TGraph or not. But let's move on.

Now I will refer to my 2. proposition point.

f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)

uproot.as_TGraph(df) would return something like TGraphConvertable and then this object would be written to file.

On a second thought, there is something like this uproot.models.TGraph.Model_TGraph_v4

class Model_TGraph_v4(uproot.behaviors.TGraph.TGraph, uproot.model.VersionedModel):
"""
A :doc:`uproot.model.VersionedModel` for ``TGraph`` version 4.
"""
behaviors = (uproot.behaviors.TGraph.TGraph,)
def read_members(self, chunk, cursor, context, file):
if uproot._awkwardforth.get_forth_obj(context) is not None:
raise uproot.interpretation.objects.CannotBeForth()
if self.is_memberwise:
raise NotImplementedError(
f"memberwise serialization of {type(self).__name__}\nin file {self.file.file_path}"
)
self._bases.append(
file.class_named("TNamed", 1).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
self._bases.append(
file.class_named("TAttLine", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
self._bases.append(
file.class_named("TAttFill", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
self._bases.append(
file.class_named("TAttMarker", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
self._members["fNpoints"] = cursor.field(chunk, self._format0, context)
tmp = self._dtype0
if context.get("speedbump", True):
cursor.skip(1)
self._members["fX"] = cursor.array(chunk, self.member("fNpoints"), tmp, context)
tmp = self._dtype1
if context.get("speedbump", True):
cursor.skip(1)
self._members["fY"] = cursor.array(chunk, self.member("fNpoints"), tmp, context)
self._members["fFunctions"] = uproot.deserialization.read_object_any(
chunk, cursor, context, file, self._file, self
)
self._members["fHistogram"] = uproot.deserialization.read_object_any(
chunk, cursor, context, file, self._file, self
)
self._members["fMinimum"], self._members["fMaximum"] = cursor.fields(
chunk, self._format1, context
)
def read_member_n(self, chunk, cursor, context, file, member_index):
if member_index == 0:
self._bases.append(
file.class_named("TNamed", 1).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
if member_index == 1:
self._bases.append(
file.class_named("TAttLine", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
if member_index == 2:
self._bases.append(
file.class_named("TAttFill", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
if member_index == 3:
self._bases.append(
file.class_named("TAttMarker", 2).read(
chunk,
cursor,
context,
file,
self._file,
self._parent,
concrete=self.concrete,
)
)
if member_index == 4:
self._members["fNpoints"] = cursor.field(
chunk, self._format_memberwise0, context
)
if member_index == 5:
tmp = self._dtype0
if context.get("speedbump", True):
cursor.skip(1)
self._members["fX"] = cursor.array(
chunk, self.member("fNpoints"), tmp, context
)
if member_index == 6:
tmp = self._dtype1
if context.get("speedbump", True):
cursor.skip(1)
self._members["fY"] = cursor.array(
chunk, self.member("fNpoints"), tmp, context
)
if member_index == 7:
self._members["fFunctions"] = uproot.deserialization.read_object_any(
chunk, cursor, context, file, self._file, self
)
if member_index == 8:
self._members["fHistogram"] = uproot.deserialization.read_object_any(
chunk, cursor, context, file, self._file, self
)
if member_index == 9:
self._members["fMinimum"] = cursor.field(
chunk, self._format_memberwise1, context
)
if member_index == 10:
self._members["fMaximum"] = cursor.field(
chunk, self._format_memberwise2, context
)
@classmethod
def strided_interpretation(
cls, file, header=False, tobject_header=True, breadcrumbs=(), original=None
):
if cls in breadcrumbs:
raise uproot.interpretation.objects.CannotBeStrided(
"classes that can contain members of the same type cannot be strided because the depth of instances is unbounded"
)
breadcrumbs = (*breadcrumbs, cls)
members = [(None, None)]
if header:
members.append(("@num_bytes", numpy.dtype(">u4")))
members.append(("@instance_version", numpy.dtype(">u2")))
members.extend(
file.class_named("TNamed", 1)
.strided_interpretation(file, header, tobject_header, breadcrumbs)
.members
)
members.extend(
file.class_named("TAttLine", 2)
.strided_interpretation(file, header, tobject_header, breadcrumbs)
.members
)
members.extend(
file.class_named("TAttFill", 2)
.strided_interpretation(file, header, tobject_header, breadcrumbs)
.members
)
members.extend(
file.class_named("TAttMarker", 2)
.strided_interpretation(file, header, tobject_header, breadcrumbs)
.members
)
members.append(("fNpoints", numpy.dtype(">u4")))
raise uproot.interpretation.objects.CannotBeStrided(
"class members defined by Model_TStreamerBasicPointer of type double* in member fX of class TGraph"
)
raise uproot.interpretation.objects.CannotBeStrided(
"class members defined by Model_TStreamerBasicPointer of type double* in member fY of class TGraph"
)
raise uproot.interpretation.objects.CannotBeStrided(
"class members defined by Model_TStreamerObjectPointer of type TList* in member fFunctions of class TGraph"
)
raise uproot.interpretation.objects.CannotBeStrided(
"class members defined by Model_TStreamerObjectPointer of type TH1F* in member fHistogram of class TGraph"
)
members.append(("fMinimum", numpy.dtype(">f8")))
members.append(("fMaximum", numpy.dtype(">f8")))
return uproot.interpretation.objects.AsStridedObjects(
cls, members, original=original
)
@classmethod
def awkward_form(cls, file, context):
from awkward.forms import ListOffsetForm, RecordForm
if cls in context["breadcrumbs"]:
raise uproot.interpretation.objects.CannotBeAwkward(
"classes that can contain members of the same type cannot be Awkward Arrays because the depth of instances is unbounded"
)
context = context.copy()
context["breadcrumbs"] = context["breadcrumbs"] + (cls,)
contents = {}
if context["header"]:
contents["@num_bytes"] = uproot._util.awkward_form(
numpy.dtype("u4"), file, context
)
contents["@instance_version"] = uproot._util.awkward_form(
numpy.dtype("u2"), file, context
)
tmp_awkward_form = file.class_named("TNamed", 1).awkward_form(file, context)
contents.update(zip(tmp_awkward_form.fields, tmp_awkward_form.contents))
tmp_awkward_form = file.class_named("TAttLine", 2).awkward_form(file, context)
contents.update(zip(tmp_awkward_form.fields, tmp_awkward_form.contents))
tmp_awkward_form = file.class_named("TAttFill", 2).awkward_form(file, context)
contents.update(zip(tmp_awkward_form.fields, tmp_awkward_form.contents))
tmp_awkward_form = file.class_named("TAttMarker", 2).awkward_form(file, context)
contents.update(zip(tmp_awkward_form.fields, tmp_awkward_form.contents))
contents["fNpoints"] = uproot._util.awkward_form(
numpy.dtype(">u4"), file, context
)
contents["fX"] = ListOffsetForm(
context["index_format"],
uproot._util.awkward_form(cls._dtype0, file, context),
)
contents["fY"] = ListOffsetForm(
context["index_format"],
uproot._util.awkward_form(cls._dtype1, file, context),
)
contents["fMinimum"] = uproot._util.awkward_form(
numpy.dtype(">f8"), file, context
)
contents["fMaximum"] = uproot._util.awkward_form(
numpy.dtype(">f8"), file, context
)
return RecordForm(
list(contents.values()),
list(contents.keys()),
parameters={"__record__": "TGraph"},
)
_format0 = struct.Struct(">I")
_format1 = struct.Struct(">dd")
_format_memberwise0 = struct.Struct(">I")
_format_memberwise1 = struct.Struct(">d")
_format_memberwise2 = struct.Struct(">d")
_dtype0 = numpy.dtype(">f8")
_dtype1 = numpy.dtype(">f8")
base_names_versions = [
("TNamed", 1),
("TAttLine", 2),
("TAttFill", 2),
("TAttMarker", 2),
]
member_names = [
"fNpoints",
"fX",
"fY",
"fFunctions",
"fHistogram",
"fMinimum",
"fMaximum",
]
class_flags = {"has_read_object_any": True}
class_rawstreamers = (
uproot.models.TH._rawstreamer_THashList_v0,
uproot.models.TH._rawstreamer_TAttAxis_v4,
uproot.models.TH._rawstreamer_TAxis_v10,
uproot.models.TH._rawstreamer_TH1_v8,
uproot.models.TH._rawstreamer_TH1F_v3,
uproot.models.TH._rawstreamer_TCollection_v3,
uproot.models.TH._rawstreamer_TSeqCollection_v0,
uproot.models.TH._rawstreamer_TList_v5,
uproot.models.TH._rawstreamer_TAttMarker_v2,
uproot.models.TH._rawstreamer_TAttFill_v2,
uproot.models.TH._rawstreamer_TAttLine_v2,
uproot.models.TH._rawstreamer_TString_v2,
uproot.models.TH._rawstreamer_TObject_v1,
uproot.models.TH._rawstreamer_TNamed_v1,
_rawstreamer_TGraph_v4,
)
writable = True
def _serialize(self, out, header, name, tobject_flags):
where = len(out)
for x in self._bases:
x._serialize(out, True, name, tobject_flags)
raise NotImplementedError("FIXME")
if header:
num_bytes = sum(len(x) for x in out[where:])
version = 4
out.insert(where, uproot.serialization.numbytes_version(num_bytes, version))

So uproot.as_TGraph(df) would just return uproot.models.TGraph.Model_TGraph_v4 and this object would be written to file. Moreover this solution won't require any changes in identify.py because uproot.models.TGraph.Model_TGraph_v4 is Model object, and for those there is already logic in identify.py . At least i think so. I need to doulbe check it but i think this would work.

@jpivarski
Copy link
Member Author

I didn't realize that you were suggesting a global flag, but we shouldn't do that for multiple reasons. (What if the user wants to write TTrees and TGraphs? What if the user is not using Uproot directly, but through a library and that library changes this state? What if the user is using a library of a library, and one level does it and another doesn't know that it has happened?)

But having uproot.as_TGraph create a uproot.models.TGraph.Model_TGraph_v4 is a direct way to do it, if you're not interested in the per-instance flag.

@Pepesob
Copy link
Contributor

Pepesob commented Jul 24, 2024

Hey @jpivarski! I implemented to_TGraph() function. I also wrote test for this feature. What should be my next step? Should I create pull request to this branch (jpivarski/start-implementation-of-TGraph-writing) or directly to main?

@jpivarski
Copy link
Member Author

To simplify the git topology, please open a PR for inclusion into main.

Is your branch derived from this branch? If so, then I can close this PR in favor of yours and @grzanka's commits should still be in it, so that the final PR will count as being co-authored by the two of you.

@Pepesob
Copy link
Contributor

Pepesob commented Jul 24, 2024

Yes I made branch derived from this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inactive A pull request that hasn't been touched in a long time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants