Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stats calc tool #2628

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Stats calc tool #2628

wants to merge 17 commits into from

Conversation

TjarkMiener
Copy link
Member

@TjarkMiener TjarkMiener commented Oct 28, 2024

This PR adds a generic stats-calculation tool utilizing the PixelStatisticsCalculator.

Related #2542

This comment has been minimized.

2 similar comments

This comment has been minimized.

This comment has been minimized.

pyproject.toml Outdated Show resolved Hide resolved
src/ctapipe/resources/stats_calc_config.yaml Outdated Show resolved Hide resolved
),
).tag(config=True)

dl1a_column_name = CaselessStrEnum(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is DL1a/b is "official"? Also, I'd perhaps use generic input_column_name similar to the output one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, in ctapipe we use DL1_IMAGES and DL1_PARAMETERS to distinguish between things that are per-pixel vs. single quantities per event.

https://ctapipe.readthedocs.io/en/latest/api/ctapipe.io.DataLevel.html

Copy link
Member

@maxnoe maxnoe Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also not make this an enum. In the generic tool, users should be able to chose any column that has compatible shape. Just provide a clear error when the column is not found in the input file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be a list of columns, to compute on multiple at the same time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed the column name and also polish the references to DL1a data by using pixel-wise image data which is more descriptive. ToolConfigurationError is raised once the column is not found. Having list of columns seems a little bit of an overkill here, which would just make the code more complex. Maybe the aggregation config could be shared between the columns, but especially the outlier detection will be different between the columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToolConfigurationError is raised once the column is not found. Having list of columns seems a little bit of an overkill here, which would just make the code more complex. Maybe the aggregation config could be shared between the columns, but especially the outlier detection will be different between the columns.

I think the case where you only want to know about a single column is quite rare, you are usually interested in multiple. So having to read all data again to compute metrics on a new column seems very limiting and a loop over columns shouldn't make the code much more complex.

src/ctapipe/resources/stats_calc_config.yaml Outdated Show resolved Hide resolved
src/ctapipe/tools/stats_calculation.py Outdated Show resolved Hide resolved

This comment has been minimized.

This comment has been minimized.

mexanick
mexanick previously approved these changes Oct 28, 2024
Copy link
Contributor

@kosack kosack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently fails to read prod3 files (which have no EFFECTIVE focal length information). The tool current fails with a focal_length_choice exception, however it seems there is no way to set the focal length choioce since the TableLoader is not set up to be configrable.

src/ctapipe/tools/calculate_pixel_stats.py Outdated Show resolved Hide resolved
src/ctapipe/tools/calculate_pixel_stats.py Outdated Show resolved Hide resolved
src/ctapipe/tools/calculate_pixel_stats.py Outdated Show resolved Hide resolved
@maxnoe
Copy link
Member

maxnoe commented Nov 6, 2024

Currently fails to read prod5 files (which have no EFFECTIVE focal length information).

prod 5 files should have effective focal length

# Iterate over the telescope ids and calculate the statistics
for tel_id in self.tel_ids:
# Read the whole dl1 images for one particular telescope
dl1_table = self.input_data.read_telescope_events_by_id(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a crash here:

% ctapipe-calculate-pixel-statistics -i events.dl1.h5 -o  stats.h5

    dl1_table = self.input_data.read_telescope_events_by_id(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/io/tableloader.py", line 1089, in read_telescope_events_by_id
    tel_ids = self.subarray.get_tel_ids(telescopes)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/instrument/subarray.py", line 549, in get_tel_ids
    for telescope in telescopes:
TypeError: 'numpy.int16' object is not iterable

Seems to be due to passing an integer instead of a list, which is what is required by read_telescope_events_by_id

Suggested change
dl1_table = self.input_data.read_telescope_events_by_id(
dl1_table = self.input_data.read_telescope_events_by_id(
telescopes = [tel_id,]

How does this work in the tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I do not know why the test pass here. I will include the change.

@kosack
Copy link
Contributor

kosack commented Nov 6, 2024

image

In the output, how can I tell what column was aggregated? It is always named "statistics" and there is no metadata in the group or tables that contain that information. Wouldn't it be better to name the group like monitoring/statistics/{input_column_name}? (i.e. maybe set the default of output_column_name to be the value of input_column_name? And also add the input_column namein the output table's metadata (table.meta['input_column_name']=input_column_name)

Copy link
Contributor

@kosack kosack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more general comment: with very minor changes, this could be turned into ctapipe-calculate-stats, i.e. the ability to compute stats for any column, not just pixel-wise ones.

  • Expose TableLoader as a configurable component (needed anyhow, see above)
  • minor modifications to drop assumption on data shape in calculator.py.

I would expect e.g. to be able to do:

ctapipe-calculate-pixel-statistics -i events-prod5.DL1.h5  
    --StatisticsAggregator.chunk_size=100 
    --StatisticsCalculatorTool.input_column_name hillas_length 
    -o length.h5

and get the stats on the length parameter. This is perhaps outside the scope of this PR, but should be kept in mind. It also relates to @maxnoe's comment that we could change the API to accept a mapping of columns to Aggragators.

kosack
kosack previously requested changes Nov 6, 2024
Copy link
Contributor

@kosack kosack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common error is to have too small a chunk size, but this now results in a very ugly error and a full trace-back and exception, along with an UnclosedFileWarning (bug?)

  • The former (Unexpected exception) should e caught and raises as a ToolConfigurationError, so the user gets a nice message. And please explain in the message what parameters controls this, i.e. say Change --StatisticsAggregator.chunk_size to decrease this.
  • The latter (unclosed file) seems to be a bug to fix.
2024-11-06 15:14:43,361 ERROR [ctapipe.StatisticsCalculatorTool] (tool.run): Caught unexpected exception: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
2024-11-06 15:14:43,361 ERROR [ctapipe.StatisticsCalculatorTool] (tool.run): Caught unexpected exception: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
Traceback (most recent call last):
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/core/tool.py", line 431, in run
    self.start()
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/tools/calculate_pixel_stats.py", line 134, in start
    aggregated_stats = self.stats_calculator.first_pass(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/monitoring/calculator.py", line 169, in first_pass
    aggregated_stats = aggregator(
                       ^^^^^^^^^^^
  File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/monitoring/aggregator.py", line 86, in __call__
    raise ValueError(
ValueError: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
2024-11-06 15:14:43,377 INFO [ctapipe.StatisticsCalculatorTool] (tool.write_provenance): Output:
/Users/kkosack/miniconda3/envs/ctapipe-0.21/lib/python3.12/site-packages/tables/file.py:113: UnclosedFileWarning: Closing remaining open file: /Users/kkosack/Projects/CTA/PipeWork/v0.21.3/events-prod5.DL1.h5
  warnings.warn(UnclosedFileWarning(msg))

@@ -100,3 +100,18 @@ def test_tool_config_error(tmp_path, dl1_image_file):
cwd=tmp_path,
raises=True,
)
# Check if ToolConfigurationError is raised
# when the chunk size is larger than the number of events in the input file
with pytest.raises(ToolConfigurationError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tool will raise ToolConfigurationError in a few situations. Please test all of them and use regexp matching to check whether a correct error message is displayed, e.g.

Suggested change
with pytest.raises(ToolConfigurationError):
with pytest.raises(ToolConfigurationError, match="Change --StatisticsAggregator.chunk_size")):
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Actually, I found that the first raise pytest wasn't raised on the 'column' check but on the 'chunk_size' check which comes before in the tool.

@mexanick mexanick added this to the 0.23.0 milestone Nov 14, 2024
@maxnoe maxnoe modified the milestones: 0.23.0, 0.24.0 Nov 18, 2024
)
# Add metadata to the aggregated statistics
aggregated_stats.meta["input_url"] = self.input_data.input_url
aggregated_stats.meta["input_column_name"] = self.input_column_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd not add this as input_column_name but as prefix for the columns. Using prefixes you can also aggregate multiple columns easily.

See e.g. how we write tables in ctapipe-apply-models.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxnoe , we discussed this with @TjarkMiener offline and we don't see much use cases for this. Also, having the metadata as a prefix for the column name can make queries awkward and schemas inflexible (i.e. set of aggregators will be fixed for each input source). Also, in this case, something like pandas.MultiIndex will be more appropriate in my opinion, but again, I don't see much use. I'd leave this as is for the moment, and modify if a corresponding use case appears.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usecase is consistency with what we doe elsewhere and loading such data with e.g.TableLoader.

I.e. my use case would be to load these data and look add aggregated statistics for multiple columns.

This has to match waht e.g. is planned for quality pipe, right? Aggregating high-level information for many of the data columns and other things and loading them at the same time, producing control plots etc.

But fine to do in a follow up PR.

I really don't like the limitation to a single column though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me what you want sounds like a pandas.MultiIndex, which is not supported in the current way of astropy, and I'm not sure whether it will ever be supported like that... Also, aggregating (flattening) and joining can be done elsewhere (e.g. when QualPipe reads it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that independent of the columns names / data organization I don't like that this tool can only aggregate a single column, and that using it for multiple requires separate runs of the tool with separate output files, performing the expensive data loading again and again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tool is not only aggregating stats, but also detecting outliers and treating regions of trouble properly. Each column requires a distinct set of configuration parameters. Therefore, I think it is inevitable to perform separate runs of the tool, which are independent on each other.

tel_id,
)
# Add metadata to the aggregated statistics
aggregated_stats.meta["input_url"] = self.input_data.input_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add source event type to the metadata, so we know whether this is calculated from FF, pedestal, or some other type of events.

Since we should also support the processing of MCs, we might want to run the stats calc tool over multiple tels.
rename the tool and file name

only keep dl1 table of the particular telescope into RAM

added tests for tool config errors

rename input col name

adopt yaml syntax in example config for stats calculation
renamed output_column_name to output_table_name

This comment has been minimized.

Tjark Miener and others added 2 commits December 18, 2024 14:27
remove the input url since it is irrelevant
Copy link

Passed

Analysis Details

0 Issues

  • Bug 0 Bugs
  • Vulnerability 0 Vulnerabilities
  • Code Smell 0 Code Smells

Coverage and Duplications

  • Coverage 89.50% Coverage (94.00% Estimated after merge)
  • Duplications 0.00% Duplicated Code (0.70% Estimated after merge)

Project ID: cta-observatory_ctapipe_AY52EYhuvuGcMFidNyUs

View in SonarQube

@mexanick mexanick requested review from maxnoe and mexanick December 19, 2024 10:46
@mexanick mexanick dismissed kosack’s stale review December 19, 2024 10:48

Code updated, please take a second look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants