chore: update docs #83

Merged
merged 20 commits into from
Apr 7, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
# hepconvert
<img src="https://github.com/scikit-hep/hepconvert/blob/f461872ec41473b14fdb7ebd76e68798ef8bb394/docs/docs-img/hepconvert_logo.svg" width="400px">

[![Actions Status][actions-badge]][actions-link]
[![Documentation Status][rtd-badge]][rtd-link]
@@ -24,7 +24,7 @@
[rtd-badge]: https://readthedocs.org/projects/hepconvert/badge/?version=latest
[rtd-link]: https://hepconvert.readthedocs.io/en/latest/

The hepconvert library is a bridge between columnar file formats, currently **ROOT, and Parquet** and soon eventually include **Feather, and HDF5.** It aims to simplify file conversions in Python, replacing what is usually a multi-step process with one line of code, with builtin features for managing large datasets and choosing compression levels.
The hepconvert library is a bridge between columnar file formats, currently **ROOT and Parquet**, and will soon include **Feather and HDF5**. It aims to simplify file conversions in Python, replacing what is usually a multi-step process with one line of code, with built-in features for managing large datasets and choosing compression levels.

# Installation

43 changes: 43 additions & 0 deletions docs/source/add.rst
@@ -0,0 +1,43 @@
CLI Guide for add_histograms (add)
==================================

Instructions for function `add_histograms <https://hepconvert.readthedocs.io/en/latest/hepconvert.histogram_adding.add_histograms.html>`__.

Command:
--------

.. code-block:: bash

    hepconvert add [options] [OUT_FILE] [IN_FILES]


Examples:
---------

.. code-block:: bash

    hepconvert add -f --progress-bar --union summed_hists.root hist1.root hist2.root hist3.root

Or, if files are in a directory:

.. code-block:: bash

    hepconvert add --append --same-names summed_hists.root path/directory/


Options:
--------

``--force``, ``-f`` Use flag to overwrite a file if it already exists.

``--progress-bar`` Will show a basic progress bar to show how many histograms have been summed, and how many files have been read.

``--append``, ``-a`` Will append histograms to an existing file.

``--compression``, ``-c`` Compression type. Options are "lzma", "zlib", "lz4", and "zstd". Default is "zlib".

``--compression-level`` Level of compression set by an integer. Default is 1.

``--union`` Use flag to add together histograms that have the same name and append all others to the new file.

``--same-names`` Use flag to only add histograms together if they have the same name.
9 changes: 9 additions & 0 deletions docs/source/cli.toctree
@@ -0,0 +1,9 @@
.. toctree::
    :caption: Command Line Interface Instructions
    :hidden:

    parquet-to-root <parquet_to_root>
    root-to-parquet <root_to_parquet>
    copy-root <copy_root>
    merge-root <merge_root>
    add (add_histograms) <add>
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -36,4 +36,4 @@
# Additional stuff
master_doc = "index"

# exec(open("prepare_docstrings.py").read(), dict(globals()))
exec(open("prepare_docstrings.py").read(), dict(globals()))
57 changes: 57 additions & 0 deletions docs/source/copy_root.rst
@@ -0,0 +1,57 @@
Command Line Interface Guide: copy_root
=======================================

Instructions for function `hepconvert.copy_root <https://hepconvert.readthedocs.io/en/latest/hepconvert.copy_root.copy_root.html>`__

Command:
--------

.. code-block:: bash

    hepconvert copy-root [options] [OUT_FILE] [IN_FILE]


Examples:
---------

.. code-block:: bash

    hepconvert copy-root -f --progress-bar --keep-branches 'Jet_*' out_file.root in_file.root


Branch skimming using ``cut``:

.. code-block:: bash

    hepconvert copy-root -f --keep-branches 'Jet_*' --cut 'Jet_Px > 5' out_file.root in_file.root

Options:
--------

``--drop-branches``, ``-db`` and ``--keep-branches``, ``-kb`` str, list of str, or dict. Specify names of branches to remove from, or keep in, the ROOT file. Pass a str, a list of str (for multiple branches), or a dict of the form {'tree': 'branches'} to select branches in specific TTrees. Wildcarding accepted.

``--drop-trees``, ``-dt`` and ``--keep-trees``, ``-kt`` str or list of str. Specify names of TTrees to remove from, or keep in, the ROOT file. Wildcarding accepted.

``--cut`` str. For branch skimming, passed to `uproot.iterate <https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__. If not None, this expression filters all of the ``expressions``.

``--expressions`` For branch skimming, passed to `uproot.iterate <https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__. Names of TBranches or aliases to convert to arrays, or mathematical expressions of them. If None, all TBranches selected by the filters are included.

``--force``, ``-f`` Use flag to overwrite a file if it already exists.

``--progress-bar`` Will show a basic progress bar to show how many TTrees have been copied and written.

``--append``, ``-a`` Will append new TTree to an existing file.

``--compression``, ``-c`` Compression type. Options are "lzma", "zlib", "lz4", and "zstd". Default is "zlib".

``--compression-level`` Level of compression set by an integer. Default is 1.

``--name`` Give a name to the new TTree. Default is "tree".

``--title`` Give a title to the new TTree.

``--initial-basket-capacity`` (int) Number of TBaskets that can be written to the TTree without rewriting the TTree metadata to make room. Default is 10.

``--resize-factor`` (float) When the TTree metadata needs to be rewritten, this specifies how many more TBasket slots to allocate as a multiplicative factor. Default is 10.0.

``--step-size`` Size of batches of data to read and write. If an integer, the maximum number of entries to include in each iteration step; if a string, the maximum memory size to include. The string must be a number followed by a memory unit, such as "100 MB". Default is "100 MB".
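
As a rough illustration of how ``--initial-basket-capacity`` and ``--resize-factor`` interact, the sketch below counts how often the TTree metadata would be rewritten for a given number of TBaskets. It assumes that the capacity is multiplied by the resize factor each time it fills up (and grows by at least one slot). The function name ``metadata_rewrites`` is hypothetical and is not part of hepconvert or Uproot.

.. code:: python

    import math


    def metadata_rewrites(num_baskets, initial_capacity=10, resize_factor=10.0):
        """Return (rewrites, final_capacity) under the assumed growth rule."""
        capacity = initial_capacity
        rewrites = 0
        while num_baskets > capacity:
            # Assumption: grow multiplicatively, and always by at least one slot.
            capacity = max(capacity + 1, math.ceil(capacity * resize_factor))
            rewrites += 1
        return rewrites, capacity


    # With the defaults, the capacity grows 10 -> 100 -> 1000 -> 10000.
    print(metadata_rewrites(5000))  # (3, 10000)

Under this assumption, a larger resize factor trades some unused metadata slots for fewer rewrites.
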
229 changes: 229 additions & 0 deletions docs/source/general_guide.rst
@@ -0,0 +1,229 @@
General Guide and Examples:
===========================
Is something missing from this guide? Please post your questions on the `discussions page <https://github.com/scikit-hep/hepconvert/discussions>`__!

Features of all (or most) functions:
----------------------------------------

**Automatic handling of Uproot duplicate counter issue:**
When a hepconvert function reads and writes ROOT files (ROOT -> ROOT) and the data is stored
in jagged arrays, hepconvert automatically groups branches that share the same "fLeafCount",
so that Uproot does not create a `counter branch for each branch <https://github.com/scikit-hep/uproot5/discussions/903>`__.
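
The grouping idea can be pictured with plain Python. This is a minimal sketch and not hepconvert's implementation: the branch names and the ``leaf_counts`` mapping are made up for illustration, and ``group_by_counter`` is a hypothetical helper.

.. code:: python

    from collections import defaultdict

    # Hypothetical mapping of branch name -> name of its counter leaf
    # (None marks a flat branch without a counter).
    leaf_counts = {
        "Jet_Px": "nJet",
        "Jet_Py": "nJet",
        "Muon_Px": "nMuon",
        "MET": None,
    }


    def group_by_counter(leaf_counts):
        """Group branches that share a counter; collect flat branches separately."""
        groups = defaultdict(list)
        flat = []
        for branch, counter in leaf_counts.items():
            if counter is None:
                flat.append(branch)
            else:
                groups[counter].append(branch)
        return dict(groups), flat


    groups, flat = group_by_counter(leaf_counts)
    print(groups)  # {'nJet': ['Jet_Px', 'Jet_Py'], 'nMuon': ['Muon_Px']}
    print(flat)    # ['MET']

Each group can then be written together, so that only one counter is needed per group rather than one per branch.
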

**Quick Modifications of ROOT files and TTrees:**

Functions ``copy_root``, ``merge_root``, and ``root_to_parquet`` have a few options for applying quick
modifications to ROOT files and TTree data.

**Branch slimming:**
Parameters ``keep_branches`` or ``drop_branches`` (str, list, or dict) control branch slimming.
Examples:

.. code:: python

    >>> hepconvert.copy_root("out_file.root", "in_file.root", keep_branches="x*", progress_bar=True, force=True)

    # Before:

    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # x1                   | int64_t                  | AsDtype('>i8')
    # x2                   | int64_t                  | AsDtype('>i8')
    # y1                   | int64_t                  | AsDtype('>i8')
    # y2                   | int64_t                  | AsDtype('>i8')

    # After:

    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # x1                   | int64_t                  | AsDtype('>i8')
    # x2                   | int64_t                  | AsDtype('>i8')

.. code:: python

    >>> hepconvert.copy_root("out_file.root", "in_file.root", keep_branches={"tree1": ["branch2", "branch3"], "tree2": ["branch2"]}, progress_bar=True, force=True)

    # Before:

    # tree1:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch1              | int64_t                  | AsDtype('>i8')
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')

    # tree2:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch1              | int64_t                  | AsDtype('>i8')
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')

    # After:

    # tree1:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')

    # tree2:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch2              | int64_t                  | AsDtype('>i8')
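
To make the wildcard and per-tree forms concrete, here is a small standard-library sketch of the selection logic. It is only an illustration of the idea, not hepconvert's code: ``select_branches`` is a hypothetical helper, and accepting only one of ``keep`` or ``drop`` at a time is an assumption of the sketch.

.. code:: python

    from fnmatch import fnmatchcase


    def select_branches(branches, keep=None, drop=None):
        """Return the branch names left after a keep or a drop filter.

        keep and drop are a pattern string or a list of patterns.
        """
        if keep is not None and drop is not None:
            raise ValueError("pass either keep or drop, not both")
        patterns = keep if keep is not None else drop
        if patterns is None:
            return list(branches)
        if isinstance(patterns, str):
            patterns = [patterns]
        matched = [b for b in branches if any(fnmatchcase(b, p) for p in patterns)]
        if keep is not None:
            return matched
        return [b for b in branches if b not in matched]


    # A single wildcard pattern, as in the first example.
    print(select_branches(["x1", "x2", "y1", "y2"], keep="x*"))  # ['x1', 'x2']

    # The dict form applies a separate filter to each tree.
    trees = {
        "tree1": ["branch1", "branch2", "branch3"],
        "tree2": ["branch1", "branch2", "branch3"],
    }
    keep = {"tree1": ["branch2", "branch3"], "tree2": ["branch2"]}
    kept = {name: select_branches(names, keep=keep.get(name)) for name, names in trees.items()}
    print(kept)  # {'tree1': ['branch2', 'branch3'], 'tree2': ['branch2']}
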


**Branch skimming:**
Parameters ``cut`` and ``expressions`` control branch skimming. Both of these parameters go to Uproot's `iterate
<https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__
function. See Uproot's documentation for more details.

Basic example:

.. code:: python

    hepconvert.copy_root("skimmed_HZZ.root", "HZZ.root", keep_branches="Jet_*",
                         force=True, expressions="Jet_Px", cut="Jet_Px >= 10")
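
Conceptually, ``expressions`` chooses which columns are produced and ``cut`` chooses which entries survive. The sketch below mimics that on plain Python lists. It is an illustration under simplifying assumptions: the columns are flat rather than jagged, the values are made up, ``skim`` is a hypothetical helper, and the cut is a Python callable rather than the string expression that Uproot evaluates.

.. code:: python

    # Hypothetical flat columns with one value per entry.
    columns = {
        "Jet_Px": [4.0, 12.5, 10.0, 7.3],
        "Jet_Py": [1.0, -3.2, 0.5, 2.2],
    }


    def skim(columns, expressions, cut):
        """Keep entries where cut(row) is true, and only the named columns."""
        num_entries = len(next(iter(columns.values())))
        rows = [
            {name: values[i] for name, values in columns.items()}
            for i in range(num_entries)
        ]
        kept = [row for row in rows if cut(row)]
        return {name: [row[name] for row in kept] for name in expressions}


    skimmed = skim(columns, ["Jet_Px"], lambda row: row["Jet_Px"] >= 10)
    print(skimmed)  # {'Jet_Px': [12.5, 10.0]}
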


**Remove TTrees:**
Use parameters ``keep_trees`` or ``drop_trees`` to keep or remove TTrees.

.. code:: python

    import numpy as np
    import uproot

    import hepconvert

    # Creating example data:
    with uproot.recreate("two_trees.root") as file:
        file["tree"] = {"x": np.array([1, 2, 3])}
        file["tree1"] = {"x": np.array([1, 2, 3])}

    # Keep only the TTree named "tree" in the copy.
    hepconvert.copy_root("one_tree.root", "two_trees.root", keep_trees="tree", force=True)


**How hepconvert works with ROOT**

hepconvert uses Uproot for reading and writing ROOT files, so it shares Uproot's limitations.
It currently works only with flat TTrees (nanoAOD-like data) and cannot yet read or write RNTuples.

As described in Uproot's documentation:

.. note::

    A small but growing list of data types can be written to files:

    * strings: TObjString
    * histograms: TH1*, TH2*, TH3*
    * profile plots: TProfile, TProfile2D, TProfile3D
    * NumPy histograms created with `np.histogram <https://numpy.org/doc/stable/reference/generated/numpy.histogram.html>`__, `np.histogram2d <https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html>`__, and `np.histogramdd <https://numpy.org/doc/stable/reference/generated/numpy.histogramdd.html>`__ with 3 dimensions or fewer
    * histograms that satisfy the `Universal Histogram Interface <https://uhi.readthedocs.io/>`__ (UHI) with 3 dimensions or fewer; this includes `boost-histogram <https://boost-histogram.readthedocs.io/>`__ and `hist <https://hist.readthedocs.io/>`__
    * PyROOT objects

**Memory Management**

Most hepconvert functions have automatic and customizable memory management for working with large files.

Functions that read **ROOT** files read in batches controlled by the parameter ``step_size``.
Set ``step_size`` to an ``int`` to limit each batch to that number of entries, or to a ``str`` such as
"100 MB" to limit each batch by memory size.
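
The two accepted forms of ``step_size`` can be told apart as in the following sketch. It is not the parser that hepconvert or Uproot actually use: ``interpret_step_size`` is a hypothetical helper, only a few units are handled, and treating 1 MB as 1024 * 1024 bytes is an assumption.

.. code:: python

    import re

    # Assumed unit sizes in bytes (1024-based).
    _UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


    def interpret_step_size(step_size):
        """Return ("entries", n) for an int, or ("bytes", n) for a string like "100 MB"."""
        if isinstance(step_size, int):
            return ("entries", step_size)
        match = re.fullmatch(
            r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", step_size, re.IGNORECASE
        )
        if match is None:
            raise ValueError(f"cannot interpret step_size {step_size!r}")
        number, unit = match.groups()
        return ("bytes", int(float(number) * _UNITS[unit.upper()]))


    print(interpret_step_size(5000))      # ('entries', 5000)
    print(interpret_step_size("100 MB"))  # ('bytes', 104857600)
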


**Progress Bars**

hepconvert uses the package tqdm for progress bars. If tqdm is not installed, an error message will provide installation instructions.
Progress bars are controlled with the ``progress_bar`` argument.
For example, to use a default progress bar with ``copy_root``, set ``progress_bar`` to True:

.. code:: python

    hepconvert.copy_root("out_file.root", "in_file.root", progress_bar=True)


Some functions can handle a customized tqdm progress bar.
To use one, create a progress bar object and pass it to the hepconvert function:

.. code:: python

    >>> import tqdm
    >>> import hepconvert

    >>> bar_obj = tqdm.tqdm(colour="GREEN", desc="Description")
    >>> hepconvert.add_histograms("out_file.root", "path/in_files/", progress_bar=bar_obj)

.. image:: https://raw.githubusercontent.com/scikit-hep/hepconvert/main/docs/docs-img/progress_bar.png
    :width: 450px
    :alt: hepconvert
    :target: https://github.com/scikit-hep/hepconvert


Some types of tqdm progress bar objects may not work in this way.


**Command Line Interface**

All functions can be run from the command line. See the "Command Line Interface Instructions" tab on the left for CLI
instructions on individual functions.

Adding Histograms
-----------------
``hepconvert.add_histograms`` adds the values of many histograms
and writes the summed histograms to an output file (like ROOT's hadd, but limited
to histograms).


**Parameters of note:**

``union`` If True, adds the histograms that have the same name and appends all others
to the new file.

``append`` If True, appends histograms to an existing file. Force and append
cannot both be True.

``same_names`` If True, only adds together histograms which have the same name (key). If False,
histograms are added together based on TTree structure (bins must be equal).

Memory:
``add_histograms`` currently has no memory customization. To maintain
performance, it keeps the summed histograms in memory until all files have
been read, then writes them to the output file. Only
one input ROOT file is read and kept in memory at a time.
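
The accumulation pattern described above, together with the ``union`` behaviour, can be sketched with plain dictionaries of bin counts. This is an illustration only: real histograms carry more than bin counts, the histogram names and values are made up, and ``add_all`` is a hypothetical helper, not a hepconvert function.

.. code:: python

    def add_all(files):
        """Sum same-named histograms bin by bin; carry over all others (union)."""
        summed = {}
        for hists in files:  # one input file at a time
            for name, counts in hists.items():
                if name not in summed:
                    # First time this name is seen: start its running sum.
                    summed[name] = list(counts)
                elif len(summed[name]) != len(counts):
                    raise ValueError(f"histogram {name!r} has mismatched bins")
                else:
                    summed[name] = [a + b for a, b in zip(summed[name], counts)]
        return summed  # written out only after every file has been read


    file1 = {"h_pt": [1, 2, 3], "h_eta": [0, 5]}
    file2 = {"h_pt": [10, 20, 30], "h_phi": [7, 7]}
    total = add_all([file1, file2])
    print(total)  # {'h_pt': [11, 22, 33], 'h_eta': [0, 5], 'h_phi': [7, 7]}
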


Merging TTrees
--------------
``hepconvert.merge_root`` merges TTrees in multiple ROOT files together. The end result is a single file containing data from all input files (again like ROOT's hadd, but can handle flat TTrees and histograms).

.. warning::
    At the moment, ``hepconvert.merge_root`` can only merge TTrees that have the same
    number of branches, with the same names and datatypes.
    We are working on adding backfill capabilities for mismatched TTrees.
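
Before merging, it can help to check whether two TTrees meet this requirement. The sketch below compares two branch-name-to-datatype mappings. The mappings and the helper ``schema_mismatches`` are hypothetical; in practice the names and types would come from the files themselves.

.. code:: python

    def schema_mismatches(first, second):
        """List the differences between two {branch name: datatype} mappings."""
        problems = []
        for name in sorted(set(first) | set(second)):
            if name not in second:
                problems.append(f"{name}: missing from second tree")
            elif name not in first:
                problems.append(f"{name}: missing from first tree")
            elif first[name] != second[name]:
                problems.append(f"{name}: {first[name]} vs {second[name]}")
        return problems


    tree_a = {"x": "int64", "y": "float64"}
    tree_b = {"x": "int32", "z": "float64"}
    print(schema_mismatches(tree_a, tree_a))  # []
    print(schema_mismatches(tree_a, tree_b))
    # ['x: int64 vs int32', 'y: missing from second tree', 'z: missing from first tree']

An empty list means the two trees have the same branches with the same datatypes.
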

**Features:**
merge_root has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``, ``drop_trees`` and ``keep_trees``.


Copying TTrees
--------------
``hepconvert.copy_root`` copies the TTrees of a ROOT file into a new ROOT file, optionally slimming or skimming them along the way.

**Features:**
copy_root has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``, ``drop_trees`` and ``keep_trees``.


Parquet to ROOT
---------------

Writes the data from a single Parquet file to one TTree in a ROOT file.
This function creates a new TTree (name the new tree with parameter ``tree``).


ROOT to Parquet
---------------

Writes the data from one TTree in a ROOT file to a single Parquet file.
If there are multiple TTrees in the file, specify one TTree to write to the Parquet file using the ``tree`` parameter.

**Features:**
root_to_parquet has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``.
5 changes: 5 additions & 0 deletions docs/source/guide.toctree
@@ -0,0 +1,5 @@
.. toctree::
    :caption: Guide with Examples
    :hidden:

    general_guide
6 changes: 6 additions & 0 deletions docs/source/hepconvert.add_histograms.rst
@@ -0,0 +1,6 @@
hepconvert.add_histograms
=========================

Defined in `hepconvert.histogram_adding <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/histogram_adding.py>`__ on `line 345 <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/histogram_adding.py#L345>`__.

.. autofunction:: hepconvert.add_histograms
6 changes: 0 additions & 6 deletions docs/source/hepconvert.copy_root.copy_root.rst

This file was deleted.

6 changes: 6 additions & 0 deletions docs/source/hepconvert.copy_root.rst
@@ -0,0 +1,6 @@
hepconvert.copy_root
====================

Defined in `hepconvert.copy_root <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/copy_root.py>`__ on `line 15 <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/copy_root.py#L15>`__.

.. autofunction:: hepconvert.copy_root
3 changes: 1 addition & 2 deletions docs/source/hepconvert.copy_root.toctree
@@ -2,5 +2,4 @@
:caption: copy_root
:hidden:

hepconvert.copy_root (module) <hepconvert.copy_root>
hepconvert.copy_root.copy_root <hepconvert.copy_root.copy_root>
hepconvert.copy_root <hepconvert.copy_root>