Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/calculate fft energy blocks #1024

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

anates
Copy link

@anates anates commented May 30, 2023

This feature allows the computation of the frequency energy content of the signal, distributed over n frequency bins. This simplifies the detection of dominant frequencies for different process stages.

This pull request is meant as initial step, to receive feedback (both wrt. code, implementation and intention of the feature) and to improve the code, before it can be merged.

@codecov-commenter
Copy link

Codecov Report

Patch coverage: 87.50% and project coverage change: -0.03 ⚠️

Comparison is base (e91e2a3) 93.42% compared to head (55696eb) 93.39%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1024      +/-   ##
==========================================
- Coverage   93.42%   93.39%   -0.03%     
==========================================
  Files          20       20              
  Lines        1886     1909      +23     
  Branches      371      375       +4     
==========================================
+ Hits         1762     1783      +21     
  Misses         85       85              
- Partials       39       41       +2     
Impacted Files Coverage Δ
tsfresh/feature_extraction/settings.py 100.00% <ø> (ø)
tsfresh/feature_extraction/feature_calculators.py 92.43% <87.50%> (-0.04%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the contribution! We love seeing more feature calculators!
I have added a few comments on a high level. There is one conceptional problem I see with the current implementation - but this can be fixed (I think), so nothing which prevents the feature in general!

Thanks again for your work :)
Once you finished implementing the comments, I will have another look.

@set_property("input", "pd.Series")
@set_property("index_type", pd.DatetimeIndex)
def energy_content_frequency_brackets(
signal: pd.Series, params: dict = {"number_of_bins": 10, "with_normalization": True}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not set the default parameter to a dictionary (https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python). This is not only related to tsfresh but for general python: this can lead to very unexpected behaviour (see the linked stackoverflow if you want to know more).
(you even disabled the "dangerous default parameter" pylint check for this ;) )

The good thing is, you do not even need a default argument. tsfresh will always fill this argument, so you can just remove the default.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding type information. But params is actually a list of dictionaries.
As you chose a combiner calculator (https://tsfresh.readthedocs.io/en/latest/text/how_to_add_custom_feature.html#step-1-decide-which-type-of-feature-you-want-to-implement) you will get all parameters for this calculator as a list of dictionaries.

This will also need some changes to your code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From that point of view, I'm a bit unsure if I should leave the calculator to be a combiner or convert it to a "simple" calculator. Converting it into a "simple" calculator would allow me to simplify the code, and speed up calculation. For the time being I therefore decided to convert it, and would appreciate feedback if that decision is appropriate.
Thanks!

Args:
signal (pd.Series): The distribution for which the frequency energy buckets \
should be calculated
params (_type_, optional): Parameters used for calculation of the frequency \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, the parameter is not optional (and the type information is not __type__)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

Comment on lines 1260 to 1269
number_of_bins: int = 10
if "number_of_bins" in params.keys():
number_of_bins: int = (
params["number_of_bins"]
if (
params["number_of_bins"] <= len(time_vec)
and params["number_of_bins"] >= 0
)
else len(time_vec) - 1
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to use a default value for number_of_bins here. You can always assume that the key will be set. The reason is that we define the settings ourself :)
This allows you to simplify the code to (basically)

number_of_bins = min(params["number_of_bins"], len(time_vec) - 1)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a user sets number_of_bins to 0 or below this should be a bug and you can just raise an exception here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to raise a warning to inform the user, and to continue with a default value of 1.

)
else len(time_vec) - 1
)
with_normalization: bool = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same - you can assume the value is set in the param config

split_freq_vec: list[np.ndarray] = np.array_split(freq_vec, number_of_bins)
frequency_energy_vec: list[float] = []
frequency_range_vec: list[tuple[float, float]] = []
for _, freq_entry in enumerate(split_freq_vec):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to use enumerate if you do not need the index (and it seems you do not need it as you use _)

Comment on lines 1280 to 1281
lower_frequency_index: int = np.where(freq_vec == freq_entry[0])[0][0]
upper_frequency_index: int = np.where(freq_vec == freq_entry[-1])[0][0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are those two lines doing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind those two lines was to find the index of the frequency in freq_vec which corresponds to the frequency limits given for each entry in split_freq_vec. Anyway, as this code was updated on my side, I will re-write that part.

tsfresh/feature_extraction/feature_calculators.py Outdated Show resolved Hide resolved
tsfresh/feature_extraction/feature_calculators.py Outdated Show resolved Hide resolved

return_val: list[tuple[str, float]] = [
(
f"bin_{entry_num}_max_{number_of_bins}_low_{int(frequency_range_vec[entry_num][0])}_high_{int(frequency_range_vec[entry_num][1])}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we have a conceptional problem.
Each feature calculator in tsfresh is allowed to only return a single number per parameter input. This is important because our feature selection only works for numbers, not lists of numbers.
Additionally, the feature extractor names (the string you return) need to be always the same (independent on the data) and should only depend on the parameters.

How can we solve this? We can use the same idea that we e.g. used in the fft_aggregated calculator. Just make the entry_num and additional parameter :) With this, you can return a single result per parameter and just have a lot more parameters.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hei,
I'm not sure how to implement that, or if I fully understood your comment.

I.e. if I use one parameter as input (f.eks. setting params = {"number_of_bins": 10}), then I can return only a single tuple consisting of (str, float), but not multiple tuples (as I would expect, when setting number_of_bins to 10). The solution for that would be (as stated above) to rather set params to [{"number_of_bins": 1}] * 10, which would increase my number of inputs to 10, and thereby allow me to return 10 tuples, instead of one. Internally, of course, I would have to replace number_of_bins with entry_num to define the number of bins for the calculation.

Is that interpretation correct? This would increase complexity for the user, though.

Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me explain a bit why we need this to give you some context.
Some users use tsfresh just for feature extraction - in which case your approach to return a dynamic number of features would be perfectly well. However, tsfresh is also used for feature selection. If the feature names and number however depends on the data - how can I use the selected features on my training dataset on my validation dataset? Therefore it is crucial that the generated features only depend on the input parameter configuration. This means, if a user sets paramA = 5, it should always give the exact same features.

The way we have implemented in other features, is similar like you described. To make a simple example, assuming our feature calculator wants to generate 10 numbers [0, 1, 2, ....]. As this is not possible, we give our feature calculator not the number of bins, but the actual bin index. So if the feature calculator is called with the parameter `bin_index=0' it would return the 0, etc. In the default configuration, you would then have a default set of bin_indices. This might not be as many bins as needed for your data (but it can be configured by the user if they want). For example, we always return the first 100 FFT coefficients (no matter how long the data is).

Another approach would be, instead of returning the raw bin numbers, returning statistics of them. Like create one feature out of the max value of all your bins, one out of the min etc. We do something similar to the fft_aggregated feature calculator.

anates and others added 2 commits June 9, 2023 14:28
…ed additional option for user-defined bins instead of automatically calculated bins. Updated test accordingly
.gitignore Outdated
@@ -48,6 +48,7 @@ docs/text/_generated/
# Virtual environment
venv*
activate
tsfresh-env
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this? It seems to be very special to your own environment (and not a general name)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the name of my virtual environment (not venv), but I'll remove it.

.gitignore Outdated
@@ -61,3 +62,4 @@ tsfresh/notebooks/data/
# dask
dask-worker-space
dask-worker-space/
*.pyc.*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this is needed? Which processes produces these files?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are bytecode files, generated when I run the code for the first time. As I don't want to upload them in git, I ignored them.

@nils-braun
Copy link
Collaborator

@anates - the combiner/simple calculator and your bin index issue makes it probably quite complex, so let me try to simplify the problem a bit:
as a start, can you implement a function energy_content_frequency_brackets which gets all parmeters as keyword arguments and just produces a single return value (float)? Your function would look like

def energy_content_frequency_brackets(signal, with_normalization, bin_index):

Would that be possible? I think the function would look something like


def energy_content_frequency_brackets(signal, with_normalization, bin_index):
    freq_vec = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
    fft_vec = np.abs(np.fft.rfft(signal.to_numpy()))
    split_freq_vec = np.array_split(freq_vec, number_of_bins)

    freq_entry = split_freq_vec[bin_index]

    lower_frequency_index = find_nearest(array=freq_vec, value=freq_entry[0])[0]
    upper_frequency_index: int = find_nearest(array=freq_vec, value=freq_entry[-1])[0]
    fft_sum: float = np.sum(fft_vec[lower_frequency_index:upper_frequency_index])
    if with_normalization:
        fft_sum /= len(fft_vec)

    return fft_sum

When I call this function for different bin_index (0, ...) I should basically get the list of values you originally had.

If this works, we can go one step further and make the implementation efficient: instead of doing the FFT for every bin independently we do it for all bins once. This is how a "combiner" works (it is just for making the calculation more efficient). But first, lets make sure the implementation with a "simple" type works!

Hope this helps :) Sorry for all the delayed-ping-pong!

Note: in your current implementation, you also return the "low" and "high" in the feature name. As described above, this is not possible. However we can think of adding yet another feature calculator which just returns the high and one which just returns the low. But this can be a separate PR

@anates
Copy link
Author

anates commented Jun 15, 2023

Hei,
I addressed the other points, but have some comments regarding the main implementation (dvs. your last comment):

  • The "low" and "high"-part of my return string intend to inform the user where the boundaries of that bin are. At least for my understanding those values are quite important, to understand my signal better. Are there other ways of returning those boundaries to the user, or is that on their side to identify them (outside the feature-extractor)?
  • I agree with you that having a variable feature name might not be the best approach, but the number of bins will always be the same, independently of the data length, as it is based on a user-defined input. This corresponds to (for example) the coefficient-approach which always returns the first 100 coefficients, independently of the data itself. Would that still be ok?

Thanks!

@nils-braun
Copy link
Collaborator

Hi!
To your first point:
Would it work to add them as a second feature calculator? One that returns the values and one that returns the high (and one the low)?

To your second point:
Yes indeed, that is the only option in the current way tsfresh is implemented. Except returning a statistics over the values as described above. Not saying this is the best way, but it is just the way tsfresh is currently working...

@anates
Copy link
Author

anates commented Jun 22, 2023

Hei,
thanks for the feedback! Based on that, would it be a solution to add an additional input parameter to this extraction function, to either return the bin value (default) for bin n, labelled with _bin_n, or return the high-/low-values for bin n? For typical operation (i.e. feature selection) it would not make a difference, but if the user is interested in the exact boundaries, they can still call this function separately with the same inputs as before, and obtain the bin limits.

This would also reduce the amount of code duplication, else I would have to re-use the same code in two different extractors.

Concerning the second point: Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended? For it being a combiner I then would just have to re-write handling of the input params, to allow it being a list of dicts.

Thanks!

@nils-braun
Copy link
Collaborator

nils-braun commented Jun 25, 2023

Based on that, would it be a solution to add an additional input parameter to this extraction function
Good idea!

Very close to what you described is possible (and preferred as you said!). It would be ok to write a function which gets an additional parameter, but if it is a simple function it should only return a single number (the feature name will be set automatically). Something like

def energy_content_frequency_brackets(signal, with_normalization, bin_index, return_mode) -> float:
   if return_mode == "value":
      return fft_sum # single float!
   elif return_mode == "high":
      return 42 # replace with actual number
   elif return_mode == "low":
      return 47 # replace with actual number

Feel free to use different names, this was just an example!

Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended?

Maybe I missed your point, but the current solution as it is implemented right now in the code does not work as expected for tsfresh (again: it is not wrong, it just does not work with the way tsfresh expects feature calculators to work). Let me rephrase what I tried to explain in #1024 (comment).
If you implement a simple calculator, which I would strongly encourage as the first implementation (as a combiner is harder to implement and only is for speeding up the implementation which can be the second step), your calculator is only allowed to return a single number. Not a list and not a feature name - but really just a single number.
In the example I gave in the thread above, I only return a single of the bins, while the bin_index is an additional parameter.

def energy_content_frequency_brackets(signal, with_normalization, bin_index) -> float:

(plus the return mode from the question above maybe)

Let me try to give another (maybe simpler) example: assuming you write a simple feature calculator which always returns the three largest number of a time series. You can not write the feature calculator like this

def my_calculator(data) -> List[float]:
   data = list(reversed(sorted(data)))
   return [data[0], data[1], data[2]]

and also not like this

def my_calculator(data) -> List[Tuple[str, float]]:
   data = list(reversed(sorted(data)))
   return [("largest_1", data[0]), ...]

but you need to write it like this

def my_calculator(data, which_largest) -> float:
   data = list(reversed(sorted(data)))
   return data[which_largest]

@anates
Copy link
Author

anates commented Jul 7, 2023

Work on this feature is on hold for the next weeks, but I'll take it up again after my break, if that's ok?
Thanks!

…nput parameter, for initial test/proof of concept
@anates
Copy link
Author

anates commented Sep 5, 2023

Hei,
I reverted the feature calculator to being a "simple" calculator as an initial proof of concept. I have not adjusted the testing-functions yet, but will do that as soon as the implementation gets closer to being final.
However, is the current implementation closer to being useful for tsfresh?
Thanks!

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great! The changes you did are very good - now it is in a shape to fit with the rest of the feature extractors.
Thanks for taking the time and sorry again for all the delays. I really appreciate your contribution.

warnings.warn(warnings_message, RuntimeWarning)
return np.NaN

number_of_bins: int = 10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also have this as an input parameter. Maybe better use that!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hei,
yes, I'll implement it, but will work on it step-by-step, to ensure that every step still fits within the framework. Thanks!

freq_vec: np.ndarray = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
fft_vec: np.ndarray = np.abs(np.fft.rfft(signal.to_numpy()))

if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you want the bracket after the < 0?

Suggested change
if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]):
if (bin_ranges[0] < 0 and bin_ranges[1] < 0) or (bin_ranges[0] >= bin_ranges[1]):

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a typo, thanks for noticing it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants