
MathFeatures seems much slower than pandas.sum() #576

Open
solegalli opened this issue Dec 12, 2022 · 22 comments
Labels
priority (need to be looked at next), urgent (urgent attention needed)

Comments

@solegalli
Collaborator

Need to check what is going on and fix

@solegalli added the urgent and priority labels Jan 14, 2023
@Morgan-Sell
Collaborator

hi @solegalli,

Were you able to pinpoint the root cause of this issue?

@solegalli
Collaborator Author

No. Didn't have time to check.

@olikra
Contributor

olikra commented Jun 5, 2024

"Seems much slower": do we actually know how much slower MathFeatures is? If desired, I can run some checks and report back...

@solegalli
Collaborator Author

I don't know exactly how much slower it is. It would be great if you could do the checks @olikra !

@olikra
Contributor

olikra commented Jun 6, 2024

@solegalli
Welcome - Happy to help

I suggest the following approach, concentrating on the four basic arithmetic operations (add, subtract, divide, multiply):

  • Use the latest versions of numpy/pandas/feature-engine
  • Generate a CSV file with two float columns and 100,000 rows so every module works from the same baseline
  • Load the data upfront
  • Repeat each operation (writing the result into a third column) 10 times for numpy, pandas, and feature-engine, measure the isolated calculation time via timeit, and average it
  • Build some graphics and discuss the ground truth

In a second step we could apply further calculations (log, sin, ...) to the two columns themselves.
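The steps above could be sketched with a small timeit harness like this (the function names, random data, and column layout are my assumptions, not the final benchmark):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical baseline data: two float columns, 100,000 rows.
rng = np.random.default_rng(42)
df = pd.DataFrame({"A": rng.uniform(0, 10_000, 100_000),
                   "B": rng.uniform(0, 10_000, 100_000)})

def pandas_add(frame):
    # Element-wise addition via pandas.
    out = frame.copy()
    out["C"] = out["A"] + out["B"]
    return out

def numpy_add(arr):
    # Element-wise addition via numpy, appending the result as a third column.
    return np.column_stack((arr, arr[:, 0] + arr[:, 1]))

# Average over 10 repetitions, as proposed above.
pandas_t = timeit.timeit(lambda: pandas_add(df), number=10) / 10
numpy_t = timeit.timeit(lambda: numpy_add(df.to_numpy()), number=10) / 10
print(f"pandas: {pandas_t * 1e3:.2f} ms, numpy: {numpy_t * 1e3:.2f} ms")
```

The same wrapper would then be reused for subtract, divide, multiply, and the feature-engine transformer.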

@solegalli
Collaborator Author

Sounds good @olikra

@olikra
Contributor

olikra commented Jun 8, 2024

@solegalli Did some investigation:

1. Performance Measurement Approach
For reproducible results I created a dataframe with 2 float features and 1,000,000 samples:

                 A            B
0       4083.179436  1029.857437
1       7451.928885  5945.188317
2       3669.416751  4133.482993
3       2288.967806  2333.608841
4       8108.977405  6750.933386
...             ...          ...
999995   623.329330  8458.191033
999996  1048.666685  1992.428440
999997   677.435021  3372.229829
999998  1934.224681  9202.690288
999999  4376.775617  5361.415726
[1000000 rows x 2 columns]

A script reads the test data (looping from 100,000 to 1,000,000 rows in 10,000-row increments) and runs each calculation 50 times to get better averages.

For pandas I used this line of code:
df['C'] = df['A'] + df['B']

For numpy:
df = np.column_stack((df, (df[:, 0] + df[:, 1])[:, None]))

For feature-engine's MathFeatures:

transformer = MathFeatures(variables=["A", "B"], func='sum', )
df = transformer.fit_transform(df)

Probably there are more performant solutions for pandas and numpy, but this is just about the comparison with feature-engine's MathFeatures!

2. Results

For the 500,000-record block, the runtime was measured via time.time_ns() directly before and after each of the three code lines:

[chart: runtime in ms, 500,000 records]

Runtime over all record blocks:

[chart: runtime over all record counts]

Conveniently, the runtime is linear in the number of records!

I stepped into the feature-engine MathFeatures code and measured the time for each code line, for both transform and fit + transform:

[screenshot: per-code-line timings]

I assume the issue is in the transform part :-)

I'm not familiar with profiling code, but I used cProfile for a first look inside:

profiler.enable()
mf.transform(df)
profiler.disable()
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.056    0.019    0.124    0.041 ../python3.11/site-packages/pandas/core/frame.py:11435(_reduce)
       12    0.023    0.002    0.023    0.002 ../python3.11/site-packages/pandas/core/array_algos/take.py:120(_take_nd_ndarray)
        7    0.023    0.003    0.023    0.003 ../python3.11/site-packages/pandas/core/internals/managers.py:180(blknos)
        1    0.017    0.017    0.023    0.023 .../python3.11/site-packages/pandas/core/nanops.py:1499(_maybe_null_out)
       25    0.012    0.000    0.012    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       10    0.012    0.001    0.012    0.001 {method 'take' of 'numpy.ndarray' objects}
        4    0.010    0.002    0.010    0.002 {method 'copy' of 'numpy.ndarray' objects}
        2    0.003    0.001    0.003    0.001 ../python3.11/site-packages/pandas/core/dtypes/missing.py:261(_isna_array)

Not sure if above output is an indicator for the issue.

Regards

Addendum:
The following versions were used:
pandas: 2.2.2
numpy: 1.26.4
feature-engine: 1.8.0

The performance measurement was done on a dedicated Debian system; no other tasks (mariadb, nginx, ...) influenced the run.

@olikra
Contributor

olikra commented Jun 9, 2024

It crossed my mind that I should of course test the pandas.agg() function in the same way. I'll step into it as soon as possible...

@olikra
Contributor

olikra commented Jun 10, 2024

Here we go:

  • Added two additional functions to the performance measurement

pandas.agg

variables = ['A', 'B']
df['C'] = df[variables].agg('sum', axis=1)

pandas.sum
df['C'] = df.loc[:, ['A', 'B']].sum(axis=1)

Performance for the 500,000-record run:

[chart: runtime in ms, 500,000 records]

Performance over all record counts (100,000 to 1,000,000 records):

[chart: runtime over all record counts]

My conclusion: there is no real performance issue in MathFeatures itself; it is dominated by the underlying pandas.agg function. The difference is just the additional wrapping that the feature-engine implementation puts around pandas.agg.
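The gap between the two pandas variants can be reproduced with a small timing snippet (absolute numbers will vary by machine; the row-wise agg path is historically more generic than DataFrame.sum, which reduces in one vectorized call):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.uniform(size=500_000),
                   "B": rng.uniform(size=500_000)})

# Row-wise agg with a string function goes through a more generic code path...
agg_t = timeit.timeit(lambda: df[["A", "B"]].agg("sum", axis=1), number=5)

# ...while DataFrame.sum reduces along axis=1 in one vectorized call.
sum_t = timeit.timeit(lambda: df[["A", "B"]].sum(axis=1), number=5)

print(f"agg: {agg_t:.3f} s, sum: {sum_t:.3f} s")
```

Both calls produce the same result, so the comparison measures only the dispatch overhead.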

@solegalli
Collaborator Author

Thank you @olikra !! This is really useful.

So, by the looks of it, we'd have to replace the transform logic, stepping away from pandas.agg and probably replacing it with numpy.

Yes, the problem is during transform, where the feature creation/combination takes place.

It would be great to have numpy sum in the comparison, to see if this makes a better solution: https://numpy.org/doc/stable/reference/generated/numpy.sum.html

Would you be able to add that one to your comparison too? Then, I guess, we can modify the logic to replace the functions with their numpy counterparts.

@olikra
Contributor

olikra commented Jun 10, 2024

@solegalli added numpy.sum to the stack:
numpy.sum

starttime = time.time_ns()
df = np.append(df, np.sum(df, axis=1, keepdims=True), axis=1)
elapsed_ns = time.time_ns() - starttime

Performance for the 500,000-record float-based run:

[chart: runtime in ms, 500,000 records, float]

Note: To check the correctness of all computations I calculated a checksum (sum of the floats across all features/samples) after each run. For the 500,000-record runs it looks like this:

[screenshot: checksums per run]

The checksum for the pandas-based calculations is the same, but the numpy-based ones are slightly different. I assume the cause is the different float summation in numpy and pandas. I'm not sure whether this can be relevant for existing models when the calculation changes from pandas to numpy.
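The checksum idea can be sketched like this. The tiny differences are expected: np.sum uses pairwise summation, while pandas may use a different reduction, so the accumulation order differs in the last few bits without affecting any meaningful digit (data and sizes here are my own illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"A": rng.uniform(0, 10_000, 500_000),
                   "B": rng.uniform(0, 10_000, 500_000)})

# Element-wise addition is bit-identical in both libraries.
result = df["A"] + df["B"]

# The checksums may differ only through the order of accumulation.
numpy_checksum = np.sum(result.to_numpy())   # pairwise summation
pandas_checksum = result.sum()               # pandas' own reduction

print(abs(numpy_checksum - pandas_checksum))
```

Any difference printed here is far below float64 precision relative to the checksum's magnitude.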

Since the measurement was already set up, I did the same with integers.
Performance for the 500,000-record run:

[chart: runtime in ms, 500,000 records, integer]

So no big differences between float and integer.

@solegalli
Collaborator Author

Thank you so much @olikra ! This is great! We need to move from pandas to numpy then. Would you also be up for trying to change the logic of this transformer?

@olikra
Contributor

olikra commented Jun 11, 2024

@solegalli I can give it a try. I need to get more familiar with feature-engine and its coding and testing style. Until now it has been a black-box exercise for me.

So the plan is to replace the generic pandas.agg functionality with native numpy functionality, wherever one is available, at this code segment:

MathFeatures.transform(self, X)

if len(new_variable_names) == 1:
    X[new_variable_names[0]] = X[self.variables].agg(self.func, axis=1)
else:
    X[new_variable_names] = X[self.variables].agg(self.func, axis=1)
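A minimal sketch of a numpy-backed replacement for that segment could look like this (the names `NUMPY_FUNCS` and `fast_agg` are mine, not the final implementation; unsupported functions fall back to pandas.agg):

```python
import numpy as np
import pandas as pd

# Map the common string functions onto their nan-aware numpy equivalents.
NUMPY_FUNCS = {
    "sum": np.nansum,
    "mean": np.nanmean,
    "min": np.nanmin,
    "max": np.nanmax,
    "prod": np.nanprod,
}

def fast_agg(X, variables, func):
    """Compute func row-wise over X[variables], using numpy when possible."""
    if isinstance(func, str) and func in NUMPY_FUNCS:
        return NUMPY_FUNCS[func](X[variables].to_numpy(), axis=1)
    # Fall back to pandas for callables and unsupported strings.
    return X[variables].agg(func, axis=1)

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
df["sum_AB"] = fast_agg(df, ["A", "B"], "sum")
```

Because the fallback keeps the existing pandas.agg behaviour, anything not covered by the dictionary works exactly as before.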

@solegalli
Collaborator Author

Yes @olikra that is the bit to be replaced. Not sure there is a numpy equivalent of pandas.agg; if there is, that would be the simplest. If not, we need to break it down function by function :/

Hopefully, you wouldn't have to change the tests much.

@olikra
Contributor

olikra commented Jun 12, 2024

I did some analysis of how pandas implements the pandas.agg function, to avoid falling into pitfalls when implementing the numpy functionality.

I used a simple dataframe for the pandas.agg sum function:

ref = pd.DataFrame.from_dict(
        {
            "A": [20, 21, 19, 18],
            "B": [0.9, 0.8, 0.7, 0.6],
            "C": [5.9, 0.8, 4.7, 0.6],
        }
    ) 

In the deeper code sections of pandas, it seems they iterate over each row/sample and do the math:

[screenshots: pandas internals]

@olikra
Contributor

olikra commented Jun 13, 2024

@solegalli Let me present a first idea of how the migration from pandas.agg to numpy in .transform could be done. We have to keep in mind that already existing uses of MathFeatures (in your courses and in the rest of the world) have to work the same way as before the migration!

Side note: the "func: function, str, list or dict" signature of pandas.agg, with its endless combinations, does not make things easier... :-)

Overall solution
I assume the most common way to call transform with functions is to submit the native keywords like 'sum', 'mean', 'min', 'max', ... .

We have to iterate over the func list, because I found no numpy equivalent of pandas.agg. So the func list is handled like a stack, and we use a dictionary to do the calculation for each element on the stack.

In this example func will get the following values:

func = ['mean', 'sum', 'np.log', 'convert_kelvin_to_celsius']

'np.log' is a direct call to numpy, because we haven't provided it in the following dictionary.
'convert_kelvin_to_celsius' is a domain-specific custom calculation.

functiondictionary = {
    "mean": np.nanmean,
    "sum": np.nansum,
    "min": np.nanmin,
    "max": np.nanmax,
    "prod": np.nanprod,
    "median": np.nanmedian,  # nan-aware, to match pandas' skipna behaviour
    "std": np.nanstd,
    "var": np.nanvar,
}

Inside the iterator the calculation is done like this (surrounding code not shown):

df_nan[stack_element] = functiondictionary[stack_element](df_nan[variables], axis=1)
# then remove stack_element from the stack

df_nan is then populated with the 'mean' and 'sum' values, and the stack looks like this:

func = ['np.log', 'convert_kelvin_to_celsius']

Unfortunately, we have to call these two remaining functions via pandas.agg. If we could find a way to identify a custom function (e.g. by a decorator), we could call the custom function with:

df_nan['convert_kelvin_to_celsius'] = np.apply_along_axis(convert_kelvin_to_celsius, axis=1, arr=df_nan[variables])
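Instead of a decorator, simply checking `callable(func)` might be enough to route custom functions. A sketch of the dispatch, using a made-up `convert_kelvin_to_celsius` (all names here are illustrative, not the final implementation):

```python
import numpy as np
import pandas as pd

def convert_kelvin_to_celsius(row):
    # Hypothetical domain-specific function operating on one row.
    return np.mean(row) - 273.15

df = pd.DataFrame({"A": [300.0, 310.0], "B": [280.0, 290.0]})
variables = ["A", "B"]

for func in ["sum", convert_kelvin_to_celsius]:
    if callable(func):
        # Custom callables are applied row by row.
        df[func.__name__] = np.apply_along_axis(func, 1, df[variables].to_numpy())
    else:
        # Known string keywords go through the fast numpy path.
        df[func] = np.nansum(df[variables].to_numpy(), axis=1)
```

This avoids pandas.agg entirely, at the price of `apply_along_axis` being a plain Python loop for custom callables.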

Specific issues with 'std' and 'var'

Pandas uses different default values, e.g. for ddof. Using the following code:

df_nan = pd.DataFrame.from_dict(
    {
        "A": [20, 21, 23, 12],
        "B": [10, 10, 10, 10],
    }
)
variables = ['A', 'B',]

df_nan['std_pandas_np'] = df_nan[variables].agg(np.nanstd, axis=1)
df_nan['std_numpy_ddof1'] = np.nanstd(df_nan[variables], axis=1, ddof=1 )
df_nan['std_numpy_ddof0'] = np.nanstd(df_nan[variables], axis=1, )

results in this:
[screenshot: std results]

Pandas uses ddof=1 as the default and numpy uses ddof=0. The same holds for 'var'.
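The ddof mismatch, and the fix of passing ddof=1 explicitly, can be shown with the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [20.0, 21.0, 23.0, 12.0],
                   "B": [10.0, 10.0, 10.0, 10.0]})

pandas_std = df.std(axis=1)                       # pandas default: ddof=1 (sample std)
numpy_ddof0 = np.std(df.to_numpy(), axis=1)       # numpy default: ddof=0 (population std)
numpy_ddof1 = np.std(df.to_numpy(), axis=1, ddof=1)

# Passing ddof=1 to numpy reproduces the pandas result exactly.
assert np.allclose(pandas_std, numpy_ddof1)
assert not np.allclose(pandas_std, numpy_ddof0)
```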

Specific issues with NaN

With some NaNs in the array:

df_nan = pd.DataFrame.from_dict(
    {
        "A": [20, 21, 23, 12],
        "B": ['N/A', 12, 'N/A', 'N/A',],
        "C": [10, 10, 10, 10],
    }
)
df_nan.replace('N/A', np.NaN, inplace=True)
df_nan['var_pandas_np'] = df_nan[variables].agg(np.nanvar, axis=1)
df_nan['var_numpyddof1'] = np.nanvar(df_nan[variables], axis=1,ddof=1)
df_nan['var_numpyddof0'] = np.nanvar(df_nan[variables], axis=1,)

results in this:

[screenshot: var results]

The open question is how we handle the different default values, given that MathFeatures is already in use.
What do you think about the general direction/approach?

@solegalli
Collaborator Author

Hey @olikra thank you so much! This is an incredible amount of work.

I think you are on the right track. The main thing is not to break backwards compatibility in terms of the functionality. That means that the user can still instantiate the transformer as usual and get the results they expect. So in principle, we should try to implement as many of the pandas-supported methods as possible.

Then, I think that using ddof 1 or 0 is a minor detail, but if we want to be absolutely strict with backward compatibility we can enforce numpy to use ddof 1 as well.

Is the idea to use numpy whenever possible, and pandas.agg for custom functions like convert_kelvin_to_celsius? How many functions are there? Did you find a list?

If this is the approach, it will accelerate the transformer quite a bit, because I presume most users use the standard functions like sum, mean, etc., and those are supported by numpy. Then, depending on how many extra functions agg supports, we can choose to reimplement them in numpy (if there are just a few) or default to pandas, which is easiest. The latter has the advantage that if/when pandas releases another function, it becomes available to MathFeatures automatically, whereas if we re-code the functions ourselves, we need to update our side every time pandas makes an update.
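Being strict about backward compatibility on the ddof point could be done by pinning ddof=1 on the numpy side with `functools.partial` (a sketch; the `NUMPY_FUNCS` name is my own):

```python
from functools import partial

import numpy as np
import pandas as pd

# Pin ddof=1 so the numpy path matches pandas' sample statistics.
NUMPY_FUNCS = {
    "std": partial(np.nanstd, ddof=1),
    "var": partial(np.nanvar, ddof=1),
}

df = pd.DataFrame({"A": [20.0, 21.0, 23.0, 12.0],
                   "B": [10.0, 10.0, 10.0, 10.0]})
df["std"] = NUMPY_FUNCS["std"](df[["A", "B"]].to_numpy(), axis=1)

# Matches the backwards-compatible pandas result.
assert np.allclose(df["std"], df[["A", "B"]].std(axis=1))
```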

@olikra
Contributor

olikra commented Jun 15, 2024

@solegalli
1. Pandas List of functions
I dug into the pandas code but found no overview of which functions pandas.agg() supports; it's quite scattered through the code. In the core component there is a list, but it is used by the plot function:

_cython_table = {
    builtins.sum: "sum",
    builtins.max: "max",
    builtins.min: "min",
    np.all: "all",
    np.any: "any",
    np.sum: "sum",
    np.nansum: "sum",
    np.mean: "mean",
    np.nanmean: "mean",
    np.prod: "prod",
    np.nanprod: "prod",
    np.std: "std",
    np.nanstd: "std",
    np.var: "var",
    np.nanvar: "var",
    np.median: "median",
    np.nanmedian: "median",
    np.max: "max",
    np.nanmax: "max",
    np.min: "min",
    np.nanmin: "min",
    np.cumprod: "cumprod",
    np.nancumprod: "cumprod",
    np.cumsum: "cumsum",
    np.nancumsum: "cumsum",
}

2. Backwards compatibility
Sure this is the most important thing.

In my Jupyter notebook I created a first version of a boosted transform method. I implemented the direct use of numpy.nan* for the following function calls:

calculations = ('mean', 'sum', 'min', 'max', 'prod', 'median', 'std', 'var')

While working through your training courses I will check/adjust it and do some fine-tuning.

3. Performance
Having both transform (with pandas.agg()) and boosted-transform (with numpy.nan*) at hand, I did a performance test with 100,000 records and two float columns.

transform (with pandas.agg()), numbers in seconds:

[chart: runtime, transform with pandas.agg, 100,000 float records]

boosted-transform (with numpy.nan*), numbers in milliseconds:

[chart: runtime, boosted transform with numpy.nan*, 100,000 float records]

4. Test of transform
From my perspective there is currently no test in place for calling custom functions like convert_kelvin_to_celsius(). We have to test this anyway when we call a custom function via the boosted transform. Should I implement one?
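Such a test could pin the row-wise result of a made-up custom function against known values, sketched here with plain pandas (in the real test suite it would go through MathFeatures instead; function name and data are illustrative):

```python
import numpy as np
import pandas as pd

def convert_kelvin_to_celsius(row):
    # Made-up custom function for the test: mean temperature of a row, in Celsius.
    return np.mean(row) - 273.15

def test_custom_function_row_wise():
    df = pd.DataFrame({"A": [273.15, 373.15], "B": [273.15, 373.15]})
    result = df[["A", "B"]].agg(convert_kelvin_to_celsius, axis=1)
    expected = pd.Series([0.0, 100.0])
    pd.testing.assert_series_equal(result, expected)

test_custom_function_row_wise()
```

The same test run against both the old pandas.agg path and the boosted transform would guard backward compatibility.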

@olikra
Contributor

olikra commented Jun 16, 2024

@solegalli
Update
On my local side I switched the following standard calculations:

('mean','sum', 'min', 'max', 'prod', 'median', 'std', 'var')

from pandas.agg to numpy.

32 of 33 tests already pass...

[screenshot: failing test]

Will look into it tomorrow evening...

@olikra
Contributor

olikra commented Jun 21, 2024

@solegalli The MathFeatures conversion is done so far. I assume some parts need discussion. Do you prefer to discuss in this issue or in the pull request?

@olikra
Contributor

olikra commented Jun 25, 2024

I have opened a PR (#774).

@solegalli
Collaborator Author

Let's discuss the code changes in the PR. I added a few comments.
