Improve logging logic to improve/fix GPU performance #2252

Open
1 task
strickvl opened this issue Jan 10, 2024 · 5 comments
Labels: bug, good first issue

Comments

@strickvl
Contributor

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

Contact Details [Optional]

[email protected]

What happened?

Users have reported a significant drop in GPU utilization (from 95% to 2%) after upgrading ZenML from version 0.32.1 to 0.44.2. This issue was observed while deploying pipelines on GCP VertexAI. Investigations suggest that the performance bottleneck is due to the logging mechanism, especially when using progress bars like tqdm. It appears that logging, particularly frequent updates from progress bars, is substantially slowing down the processing speed.

Task Description

Investigate and optimize the logging logic in ZenML, particularly for scenarios involving high GPU usage. The goal is to ensure that the logging process, including progress bars, does not adversely affect the GPU performance and overall speed of pipeline execution.

Expected Outcome

  • ZenML should maintain high GPU utilization without being impacted by the logging process.
  • Users should be able to use progress bars and other logging tools without experiencing a significant slowdown in processing.
  • Modifications should be made to allow users to control the frequency and verbosity of logs to balance between logging needs and performance.

Steps to Implement

  • Analyze the current logging mechanism and identify how it interacts with GPU-intensive processes.
  • Develop solution(s) to optimize logging, particularly when progress bars are used, to reduce their impact on GPU and overall performance.
  • Implement configurable settings for users to control the logging behavior, such as limiting log frequency or verbosity.
  • Thoroughly test the changes in scenarios with high GPU usage to ensure that the logging optimizations are effective.
  • Update documentation to guide users on how to configure logging settings for optimal performance.
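
One way to limit log frequency, as the steps above suggest, is a throttling filter on a standard library logging handler. This is only a minimal sketch, not ZenML's implementation: `ThrottleFilter` and its `min_interval` parameter are hypothetical names made up for illustration.

```python
import logging
import time


class ThrottleFilter(logging.Filter):
    """Hypothetical filter: keep at most one low-severity record per interval.

    Records at WARNING level or above always pass, so important messages
    are never dropped. Everything below WARNING is rate limited.
    """

    def __init__(self, min_interval: float, clock=time.monotonic):
        super().__init__()
        self.min_interval = min_interval
        self._clock = clock  # injectable so the filter can be tested
        self._last_emit = None

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self.min_interval:
            return False  # too soon after the last record we kept
        self._last_emit = now
        return True
```

It could be attached with `handler.addFilter(ThrottleFilter(min_interval=5.0))`. Note that this drops intermediate progress updates rather than delaying them, which is usually acceptable for progress bars but not for ordinary log lines.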

Note that part of the solution might be to expose these global variables / constants better in settings via environment variables.
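
As a rough sketch of what exposing such a constant through an environment variable could look like: the variable name `EXAMPLE_STEP_LOGS_STORAGE_INTERVAL_SECONDS` and the default of 15 seconds below are assumptions for illustration only, not ZenML's actual setting names or values.

```python
import os

# Hypothetical default and variable name, used only for this sketch.
DEFAULT_INTERVAL_SECONDS = 15.0
ENV_VAR_NAME = "EXAMPLE_STEP_LOGS_STORAGE_INTERVAL_SECONDS"


def get_log_storage_interval(env=os.environ, name=ENV_VAR_NAME,
                             default=DEFAULT_INTERVAL_SECONDS) -> float:
    """Read the flush interval from the environment.

    Falls back to `default` when the variable is unset, is not a number,
    or is not strictly positive.
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

Falling back silently keeps a typo in the environment from crashing a long-running pipeline. Logging a warning on invalid input would be a reasonable alternative.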

Additional Context

This issue is critical for users leveraging ZenML for GPU-intensive tasks, as efficient GPU utilization is key to performance in these scenarios. The solution should provide a balance between informative logging and optimal resource utilization.

Code of Conduct

  • I agree to follow this project's Code of Conduct

@strickvl added the bug and good first issue labels on Jan 10, 2024

@nida-imran173

@strickvl I'm interested in working on this issue. Can I take it up?

@strickvl
Contributor Author

Sure thing, @nida-imran173! I'll assign it to you. Let us know if you have any questions. Most basic things should be answered in our CONTRIBUTING.md document.

@nida-imran173

Hi @strickvl,

After analyzing the code in 'logging', I've identified a few potential areas that could be causing the reported drop in GPU utilization. Here are the key points:

  1. The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system.
  2. Depending on the logging frequency and the size of the buffer, it could impact performance. If logging occurs too frequently, it may lead to increased file I/O operations, potentially affecting performance.
  3. The remove_ansi_escape_codes function uses a regular expression (re.compile) to remove ANSI escape codes. If this function is called frequently or processes large amounts of data, it might impact performance.
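
To illustrate the three points above, here is a minimal sketch of a writer that strips ANSI codes with a pattern compiled once and batches lines so the expensive write happens rarely. It is not the ZenML code: `BufferedLogWriter`, `sink`, `max_lines` and `max_age` are hypothetical names, and the regular expression only covers CSI sequences (colours, cursor movement), not every escape type.

```python
import re
import time

# Compiled once at import time rather than on every call.
# Matches CSI sequences: ESC [ parameter bytes, intermediate bytes, final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


class BufferedLogWriter:
    """Hypothetical writer that batches lines before calling `sink`.

    `sink` is any callable taking one string, e.g. a function that appends
    to a log file. It is called when the buffer holds `max_lines` lines or
    when `max_age` seconds have passed since the last flush.
    """

    def __init__(self, sink, max_lines=100, max_age=15.0, clock=time.monotonic):
        self._sink = sink
        self.max_lines = max_lines
        self.max_age = max_age
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def write(self, line: str) -> None:
        self._buffer.append(ANSI_CSI.sub("", line))
        too_many = len(self._buffer) >= self.max_lines
        too_old = self._clock() - self._last_flush >= self.max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._sink("".join(self._buffer))  # one I/O call per batch
            self._buffer.clear()
        self._last_flush = self._clock()
```

The age check only runs when a new line arrives, so a caller would still need to call `flush()` at the end of a step to avoid losing the last batch.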

I would greatly appreciate your guidance and any specific insights you might have on tackling this issue. If there are additional aspects I should consider or if you have any preferences regarding the approach, please let me know.

@strickvl
Contributor Author

So the first thing I'd say would be to reproduce the issue, i.e. run a step with logging turned on (which is the default). Then compare against a run where you either update the STEP_LOGS_STORAGE_INTERVAL_SECONDS env variable or disable step logs.

When someone is running in a GPU-enabled environment, we could potentially have different behaviour. Also, it isn't yet clear to me why logs within a GPU-enabled environment are slower, beyond maybe the task itself generating logs at a certain frequency. So in short, I think we'll need to dive a bit deeper into the problem.

@htahir1
Contributor

htahir1 commented Jan 23, 2024

@strickvl @nida-imran173 I would just add to this discussion that I think the primary reason for GPU performance degradation is exactly as Nida already said:

The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system. Depending on the logging frequency and the size of the buffer, it could impact performance. If logging occurs too frequently, it may lead to increased file I/O operations, potentially affecting performance.

I would try to tackle this issue first. Basically, I'd run some tests to see how this can affect performance. A very simple test could be to run a pipeline that trains a model using PyTorch or TensorFlow. These libraries produce progress bars that are then logged and cause a slowdown. Once we've verified this, we can work on a fix all together by brainstorming strategies.

But first things first, as @strickvl said, we need a test in place where we can measure things.
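
Before involving a full training pipeline, a tiny micro-benchmark can show how much the flush frequency alone matters. The sketch below is an assumption-laden stand-in: it writes progress-bar-like lines to a local temporary file and forces them to disk every `flush_every` updates. It does not use ZenML or a remote artifact store, so absolute numbers will differ from what happens on VertexAI. `time_writes` and `PROGRESS_LINE` are hypothetical names.

```python
import os
import tempfile
import time

# A line shaped like a progress bar update. Carriage return instead of
# newline, since progress bars usually redraw the same terminal line.
PROGRESS_LINE = "epoch 1: 50%|#####     | 500/1000\r"


def time_writes(n_updates: int, flush_every: int):
    """Write `n_updates` lines, forcing them to disk every `flush_every` lines.

    Returns (elapsed_seconds, bytes_written).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "step.log")
        start = time.perf_counter()
        with open(path, "w") as fh:
            for i in range(1, n_updates + 1):
                fh.write(PROGRESS_LINE)
                if i % flush_every == 0:
                    fh.flush()
                    os.fsync(fh.fileno())  # push the data to the disk
        elapsed = time.perf_counter() - start
        size = os.path.getsize(path)
    return elapsed, size
```

Comparing `time_writes(500, flush_every=1)` with `time_writes(500, flush_every=500)` gives a first impression of the cost per flush. The same comparison could then be repeated inside a real step with step logs enabled and disabled.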
