Waveform attributes #745

Closed · wants to merge 6 commits
Changes from 5 commits
49 changes: 49 additions & 0 deletions strax/processing/peak_properties.py
@@ -100,3 +100,52 @@ def compute_widths(peaks, select_peaks_indices=None):
i = len(desired_fr) // 2
peaks['width'][select_peaks_indices] = fr_times[:, i:] - fr_times[:, ::-1][:, i:]
peaks['area_decile_from_midpoint'][select_peaks_indices] = fr_times[:, ::2] - fr_times[:, i].reshape(-1,1)

@export
#@numba.njit(nopython=True, cache=True)
Collaborator:

Does this work, or why is it commented out? In addition, isn't this the same as what this function is already doing:

def compute_index_of_fraction(peak, fractions_desired, result):

Contributor Author (@ahiguera-mx, Aug 10, 2023):

Hi Daniel,
Thanks for checking this. The functions calculate different things:
compute_index_of_fraction() returns the index at which a certain area has been reached; if the fraction is 1 it will return peak['length'], so this is in units of samples.

compute_wf_attributes(), for a given set of quantiles (fractions), returns the amount of time elapsed for each fraction of the total waveform area to be observed.

So in other words, compute_index_of_fraction() is in units of samples and compute_wf_attributes() is in units of ns.
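To make the units distinction concrete, here is a toy sketch of the two quantities (not the actual strax implementations; the waveform and names are illustrative):

```python
import numpy as np

# Toy waveform: amplitudes per sample, total area = 10
samples = np.array([2., 4., 2., 2.])
dt = 10  # ns per sample

# Fractional cumulative area at each sample boundary
frac = np.concatenate(([0.], np.cumsum(samples) / samples.sum()))

# Index at which 50% of the area is reached -> units of samples
idx_half = np.interp(0.5, frac, np.arange(len(samples) + 1))

# The same point expressed as elapsed time -> units of ns
t_half = idx_half * dt
print(idx_half, t_half)  # -> 1.75 17.5
```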

Contributor Author:

Also, compute_wf_attributes() can handle an array of peaks, whereas compute_index_of_fraction() can only work peak by peak.
Do you think it would be better to merge them into one function?

Collaborator:

Looking at this code, compute_index_of_fraction can also return floats. I think it is better to use compute_index_of_fraction or modify it.

result[current_fraction_index] = i + area_needed / x

def compute_wf_attributes(data, sample_length, n_samples: int, downsample_wf=False):
"""
Compute waveform attributes.
Quantiles: represent the amount of time elapsed for
a given fraction of the total waveform area to be observed in n_samples,
i.e. if n_samples = 10, the quantiles are equivalent to deciles.
Waveforms: the waveform downsampled to n_samples.
:param data: waveforms, e.g. peaks or peaklets
:param sample_length: sample length (dt) of each waveform
:param n_samples: number of samples to compute quantiles for
:return: waveforms and quantiles of size n_samples
"""
assert data.shape[0] == len(sample_length), "arrays must have the same size"

num_samples = data.shape[1]

waveforms = np.zeros((len(data), n_samples), dtype=np.float64)
quantiles = np.zeros((len(data), n_samples), dtype=np.float64)

# Cannot compute with more samples than the actual waveform has
assert num_samples > n_samples, "cannot compute with more samples than the actual waveform"


step_size = int(num_samples / n_samples)
steps = np.arange(0, num_samples + 1, step_size)
inter_points = np.linspace(0., 1. - (1. / n_samples), n_samples)
cumsum_steps = np.zeros(n_samples + 1, dtype=np.float64)
frac_of_cumsum = np.zeros(num_samples + 1)
sample_number_div_dt = np.arange(0, num_samples + 1, 1)
for i, (samples, dt) in enumerate(zip(data, sample_length)):
if np.sum(samples) == 0:
continue
# reset buffers
frac_of_cumsum[:] = 0
cumsum_steps[:] = 0
frac_of_cumsum[1:] = np.cumsum(samples)
frac_of_cumsum[1:] = frac_of_cumsum[1:] / frac_of_cumsum[-1]
cumsum_steps[:-1] = np.interp(inter_points, frac_of_cumsum, sample_number_div_dt * dt)
cumsum_steps[-1] = sample_number_div_dt[-1] * dt
quantiles[i] = cumsum_steps[1:] - cumsum_steps[:-1]

if downsample_wf:
Contributor:

This whole downsampling makes no sense to me.

  • If I pass downsample_wf=False to this function, it will return an array of zeros as waveforms. In what case would that be useful?
  • Why wouldn't you infer downsampling == True if num_samples > n_samples?
  • What if (num_samples / n_samples) is not an integer? Wouldn't the last bin sum over a different number of samples? So why is waveforms[i] /= (step_size * dt) correct?

Collaborator:

This is because the Bayes plugin needs the waveform but the SOM plugin does not. So if we don't need it we can save computational time by not calculating it, but if you know of a better way to achieve this goal I am more than happy to change it.

As to your last point, I don't really understand what you are saying; can you elaborate a bit more?

Contributor:

Let's take a simple example.

num_samples = 10
n_samples = 6
step_size = int(num_samples / n_samples)
steps = np.arange(0, num_samples + 1, step_size)

for j in range(n_samples):
    print(steps[j], ":", steps[j + 1])

0 : 1
1 : 2
2 : 3
3 : 4
4 : 5
5 : 6

Does this look like correct downsampling to you?

Collaborator:

Ok, I think I know what you are saying now. This is dealing with waveforms, so num_samples will always be 200, but there are no safeguards against non-integer divisions of the data. We can add an assert statement to make sure num_samples is divisible by n_samples. I will make that change.

The waveforms[i] /= (step_size * dt) I am now a bit confused about. The objective is to get rid of the dt dependence so we can compare all waveforms to each other, but I did some quick dimensional analysis and I think we should be multiplying by dt instead of dividing, since waveform amplitude is in units of PE/sample = PE/dt, so we should be multiplying the numerator by dt. @ahiguera-mx, since you worked on this more with the Bayes classification, is there something we are missing?

Contributor:

  • Like I explained when you originally asked me where to place this function: strax is an open-source framework built to be used by experiments other than XENON. Functions in strax cannot assume the data is exactly as it is in XENON.
  • The required bar in terms of code quality to have a PR merged is also higher.
  • Allocating a massive array of zeros just to be used sometimes is not a good approach. You should downsample in a different function. There are many standard techniques for downsampling, such as windowed reductions. I would stick to one of those.
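A windowed reduction can be sketched with a reshape, assuming the window count divides the waveform length evenly (illustrative code under that assumption, not a strax API):

```python
import numpy as np

def downsample_windowed(waveform, n_samples):
    """Sum fixed-size, non-overlapping windows so the output has n_samples entries."""
    assert len(waveform) % n_samples == 0, "n_samples must divide the waveform length"
    return waveform.reshape(n_samples, -1).sum(axis=1)

wf = np.arange(12, dtype=np.float64)   # 12 samples
print(downsample_windowed(wf, 4))      # -> [ 3. 12. 21. 30.]
```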

Collaborator:

Ok, we don't need the downsampling for now, so we can just remove that functionality and add an assert statement to make sure the data quantiles make sense. But the big question remains: given the difference between the functions, do we still want to merge them into one, or should I leave them as separate things?

Contributor:

I would try to either:

  • use the existing function in the plugin, or adapt it to be able to produce what you need, or
  • provide some actual concrete example that demonstrates the need for the extra function. I find the hand-wavy explanations very difficult to follow.


for j in range(n_samples):
waveforms[i][j] = np.sum(samples[steps[j]:steps[j + 1]])
waveforms[i] /= (step_size * dt)
return quantiles, waveforms
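As a sanity check on the quantile logic above: for a flat waveform, every quantile should span the same amount of time, num_samples * dt / n_samples. A self-contained sketch of that check, mirroring the interpolation idea rather than calling the strax function:

```python
import numpy as np

samples = np.ones(200)   # flat waveform: 200 samples of equal amplitude
dt = 10                  # ns per sample
n_samples = 10

# Fractional cumulative area vs. elapsed time at each sample boundary
frac = np.concatenate(([0.], np.cumsum(samples) / samples.sum()))
t_axis = np.arange(len(samples) + 1) * dt

# Times at which each tenth of the total area has been observed
edges = np.interp(np.linspace(0., 1., n_samples + 1), frac, t_axis)
quantiles = np.diff(edges)  # for a flat waveform, each decile spans 200 ns
```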
28 changes: 28 additions & 0 deletions tests/test_peak_properties.py
@@ -3,6 +3,7 @@
from hypothesis import given, strategies, example, settings
import tempfile
from math import ceil, floor
import pytest

def get_filled_peaks(peak_length, data_length, n_widths):
dtype = [(('Start time since unix epoch [ns]', 'time'), np.int64),
@@ -79,3 +80,30 @@ def test_compute_widths(peak_length, data_length, n_widths):
mess = ("Highly unlikely that from at least 10 positive area peaks "
"none were able to compute the width")
assert np.any(peaks['width'] != pre_peaks['width']), mess

@settings(max_examples=100, deadline=None)
@given(
# number of peaks
strategies.integers(min_value=0, max_value=20),
# length of the data field in the peaks
strategies.integers(min_value=2, max_value=20),
# Number of widths to compute
strategies.integers(min_value=2, max_value=10),
)
def test_compute_wf_attributes(peak_length, data_length, n_widths):
"""
Test strax.compute_wf_attribute
"""
peaks = get_filled_peaks(peak_length, data_length, n_widths)
wf = np.zeros((len(peaks), 10), dtype=np.float64)
q = np.zeros((len(peaks), 10), dtype=np.float64)

try:
q, wf = strax.compute_wf_attributes(peaks['data'], peaks['dt'], 10, True)
except AssertionError as e:
if "zero waveform" in str(e):
print("cannot compute with a zero waveform")
elif "more samples than the actual waveform" in str(e):
print("cannot compute with more samples than the actual waveform")

assert np.all(~np.isnan(q)) and np.all(~np.isnan(wf)), "attributes contains NaN values"