Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add profiling functionality to feature-engine #592

Open
solegalli opened this issue Jan 14, 2023 · 16 comments
Open

add profiling functionality to feature-engine #592

solegalli opened this issue Jan 14, 2023 · 16 comments

Comments

@solegalli
Copy link
Collaborator

Using cProfile, we need to be able to test the performance of the entire library.

@Okroshiashvili
Copy link
Contributor

Hi @solegalli

  1. What do you think what has to be done here?
  2. Will this profiling functionality be only for devs or for end users as well?
  3. Where profiling functionality be run? I mean, from where we (devs) will get profile output?
  4. Are we only interested in timings and not memory footprint?
  5. Are we okay with third party library for profiling?

@solegalli
Copy link
Collaborator Author

Hey @Okroshiashvili

  1. we need to understand the performance of each one of our classes. For example, drop duplicates was super slow. I'd like to be able to pick that up before making the class public by using profiler. Because we already have a lot of public classes, this would help us understand if they are performant, and if not, modify them
  2. devs
  3. I don't understand the question. I normally run pytest for tests, pytest with coverage for coverage, and here I guess we would run the profiler for the profile. Should we also make this part of the standard circleci checks? Not sure about this.
  4. both speed and memory (we have a few issues regarding using a lot of memory by pd.copy()
  5. yes. I thought of eprofiler

I am new to this, so I asked for help to @ChristopherGS

I copy here the notes that I got from him:

def profile(output_file=None, sort_by="cumulative", strip_dirs=False):

      """A time profiler decorator.
      Inspired by and modified the profile decorator of Giampaolo Rodola:

      http://code.activestate.com/recipes/577817-profile-decorator/

      Args:
      output_file: str or None. Default is None
      Path of the output file. If only name of the file is given, it's
      saved in the current directory.
      If it's None, the name of the decorated function is used.

      sort_by: str or SortKey enum or tuple/list of str/SortKey enum
      Sorting criteria for the Stats object.
      For a list of valid string and SortKey refer to:
      https://docs.python.org/3/library/profile.html#pstats.Stats.sort_stats

      strip_dirs: bool
      Whether to remove the leading path info from file names.
      This is also useful in reducing the size of the printout
      Returns:
      Profile of the decorated function
      """

    # nosemgrep
    def inner(func):
    # nosemgrep
    @wraps(func)
    # nosemgrep
    def wrapper(*args, **kwargs): # nosemgrep
    if not ENABLE_PROFILING:
    return func(*args, **kwargs)
    _output_file = output_file or UPDATE_ME / "sim_profile.prof"
    pr = cProfile.Profile()
    pr.enable()
    retval = func(*args, **kwargs)
    pr.disable()
    
    with open(_output_file, "w") as f:
    ps = pstats.Stats(pr, stream=f)
    if strip_dirs:
    ps.strip_dirs()
    if isinstance(sort_by, (tuple, list)):
    ps.sort_stats(*sort_by)
    else:
    ps.sort_stats(sort_by)
    ps.dump_stats(_output_file)
    # nosemgrep
    return retval
    
    # nosemgrep
    return wrapper # nosemgrep
    
    return inner # nosemgrep

Now add this decorator to the top most level function you call when doing something you want to profile (you might have to write a script, or you could use an e2e test)

For easy viz, checkout: https://jiffyclub.github.io/snakeviz/

@Okroshiashvili
Copy link
Contributor

Thanks for your input @solegalli

I pick this one and will update you whenever I will have some result

@Okroshiashvili
Copy link
Contributor

Hi @solegalli

So, I found great library for profiling the timings of anything in Python. It is Pyinstrument.

Briefly, it is statistical profiler and imposes less burden to the code. Meaning that, it's much faster then conventional profilers. Moreover, its output is really easy to digest. You can quickly look at the documentation and check its benefits.

I created small function which we use as a decorator to profile any method inside a class. Note that decorating and profiling class will not give us correct result. If we profile class with this decorator then we will get the timings of class constructor, that is __init__ method only.

import functools
from pyinstrument.profiler import Profiler


def profile_function(output_file="profile.html"):
    """
    Decorator to profile a function.

    Args:
        output_file: File to write profile output. Defaults to "profile.html".
    """
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            profiler = Profiler()
            profiler.start()
            result = function(*args, **kwargs)
            profiler.stop()
            output = profiler.output_html()
            with open(output_file, "w") as f:
                f.write(output)
            return result
        return wrapper
    return decorator

Put this function somewhere and import it anywhere you want. Then decorate class method with this function in the following way:

@profile_function(output_file="path_to_the_output_file/the_output_file_i_want_to_have.html")
def slow_function(*args, **kwargs):
    sleep(10)
    return "Anything I need"

This will produce the_output_file_i_want_to_have.html inside path_to_the_output_file directory. You can do not provide argument to the decorator and by default it will create profile.html in root directory. You can use any browser to open that HTML file and investigate it to see the timings of each method. Actually, that HTML file is a tree and you can examine each method as well as any child method.

Note that, is HTML file is empty that's fantastic. That means that the code/method/class took less than 1 millisecond to execute.

What's your thoughts?

@solegalli
Copy link
Collaborator Author

The package has half a million downloads per month. Sounds good 🍾

Thank you @Okroshiashvili

Happy for you to move forward with this 👍🏽

@Okroshiashvili
Copy link
Contributor

Okroshiashvili commented Mar 1, 2023

Thanks @solegalli ☺️

What's the resolution of this issue? Can we consider it as done and close?

@solegalli
Copy link
Collaborator Author

Why close it? we need to add the functionality first.

@Okroshiashvili
Copy link
Contributor

Where we need to add this functionality?

@Okroshiashvili
Copy link
Contributor

What about to create utility module and put this function there?

Whenever we need to profile class method we import and decorate that method.

Is that okay?

@solegalli
Copy link
Collaborator Author

Yes, that'll work. Let's create a "profiling" module. So everybody knows what the module is about and what to find there.

@Okroshiashvili
Copy link
Contributor

One more question, please.

Where we want to have that module? In root directory or somewhere else?

@solegalli
Copy link
Collaborator Author

In the root directory sounds OK to me.

@Okroshiashvili
Copy link
Contributor

@solegalli

Regarding the memory profiling, I've researched possible libraries and most of them seems either too outdated or not so outdated. Then I found this profiler from bloomberg.

What I tested so far it seems quite good and actively supported. What do you think? Should we bring it into the feature_engine project?

Since memory profiling will be for only devs then it should not be an issue 🤔

@solegalli
Copy link
Collaborator Author

This might be a totally ignorant question, but why do we need alternative profilers? does cprofile not provide what we need?

@Okroshiashvili
Copy link
Contributor

cprofile is not able to do memory profiling. It only can do speed profiling.

Python has built in tracemalloc for memory profiling. However, its API is a bit inconvenient.

What I suggested to use for memory profiling, namely memray is much better in terms of API and easy to use and is actively maintained.

As I remember you asked me to have both speed and memory profiler. It's just matter of decision either we need to have these profilers or not. I just tried to have the best possible profilers.

I will update docs to let and other devs know how to use these profilers.

@solegalli
Copy link
Collaborator Author

Thank you for the explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants