
Fix/analytics connection hang #141

Merged: 14 commits into zenml-io:develop on Feb 15, 2024

Conversation

@marwan37 (Contributor) commented Feb 8, 2024

Describe changes

I implemented a workaround that disables retries for sending analytics data, in response to issue #130. This keeps the CLI responsive even when the Segment analytics API is unreachable due to network failures or blocking mechanisms such as PiHole.

  • Updated src/mlstacks/analytics/client.py to set max_retries to 0, which can be adjusted as needed.
  • Wrapped the analytics_context.track call in a try/except block within the track_event function, so the return False statement is reachable for error handling and logging exceptions. A sketch of the overall shape of the change follows below.
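
For orientation, here is a minimal sketch of the shape of these two changes. It reuses names from src/mlstacks/analytics/client.py (track_event, MLStacksAnalyticsContext), but the body is illustrative rather than the exact diff:

```python
# Illustrative sketch only -- not the exact diff from this PR.
# MLStacksAnalyticsContext is defined in the surrounding module.
from logging import getLogger

import segment.analytics as analytics

logger = getLogger(__name__)

analytics.max_retries = 0  # fail fast instead of retrying with backoff


def track_event(event, metadata=None):
    """Track an analytics event, returning False on any failure."""
    if metadata is None:
        metadata = {}
    metadata.setdefault("event_success", True)
    try:
        with MLStacksAnalyticsContext() as analytics_context:
            return bool(
                analytics_context.track(event=event, properties=metadata)
            )
    except Exception as e:
        logger.debug("Analytics tracking failed: %s", e)
    return False
```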

Reference to Documentation

The Segment Python library documentation does not explicitly mention a configurable max_retries option, but the source code exposes a max_retries property in segment/analytics/client.py.

Testing

To simulate an unreachable analytics API domain at api.segment.io, I redirected all requests to this domain to 127.0.0.1 by adding the following entry in the /etc/hosts file on my Mac:

127.0.0.1 api.segment.io

Following this setup, I tested the deploy, breakdown, output, and destroy CLI commands and confirmed that each exits immediately, without hanging, after a single attempt to reach the analytics API domain.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation
    accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing a branch to develop.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    change)
  • Other (add details above)

Summary by CodeRabbit

  • New Features
    • Introduced a new configuration option max_retries for enhanced control over analytics tracking.
  • Refactor
    • Improved the reliability of event tracking in the analytics module with better error handling.

coderabbitai bot commented Feb 8, 2024

Important: Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

This update introduces a new configuration parameter max_retries to the analytics module, initially set to 0. It also enhances the track_event function with error handling to manage exceptions during analytics tracking, aiming to improve reliability and user experience by preventing the application from hanging due to unreachable analytics services.

Changes

File | Summary
--- | ---
.../analytics/client.py | Added max_retries config parameter; updated track_event with try/except for exceptions

Related issues

  • Failed connections to analytics API block CLI indefinitely #130: This PR potentially addresses the issue by introducing error handling in analytics tracking, which could prevent the CLI from hanging when the analytics service is unreachable. The addition of max_retries may also relate to the objectives of improving reliability and user experience by ensuring core operations are not affected by analytics failures.


@strickvl added the bug (Something isn't working) label on Feb 8, 2024
@strickvl (Contributor) commented Feb 8, 2024

Thank you for this PR! One thing you'll probably need to do is rebase on the develop branch. If your branch is already branched off develop, you can just click the 'edit' button at the top right of the header and update the base branch so GitHub knows the PR will be merged into develop and not main (which it's currently set to do).

[Screenshot: CleanShot 2024-02-08 at 20 17 25]

Thanks!

@strickvl (Contributor) commented Feb 8, 2024

You can also see the CI is showing a linting error, so please run the formatting and linting scripts locally before pushing the fix.

@marwan37 changed the base branch from main to develop on February 8, 2024, 19:49
@marwan37 (Contributor, Author) commented Feb 8, 2024

I've fixed the issues flagged in the CI and ran the format and lint scripts again. This was indeed branched off develop already, so the edit button at the top allowed me to update it. Thanks!

@strickvl requested review from bcdurak and removed the request for safoinme on February 8, 2024, 19:55
@strickvl (Contributor) commented Feb 8, 2024

@coderabbitai review

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: .coderabbit.yaml

Commits: files that changed from the base of the PR, between 992fff1 and 256be0d.
Files selected for processing (1):
  • src/mlstacks/analytics/client.py (2 hunks)

@strickvl (Contributor) left a comment

Looks mostly all good to me. The only thing I'd suggest is making it a debug log, as there's no need to disrupt the user experience over this.

Co-authored-by: Alex Strick van Linschoten <[email protected]>
@marwan37 (Contributor, Author) commented Feb 8, 2024

Great suggestion, @strickvl. I've committed the change. Thanks for your input. Regarding the other CodeRabbit suggestion, I'm assuming it's safe to ignore as the app uses a single instance of the analytics client?

@strickvl (Contributor) commented Feb 8, 2024

@marwan37 yep sometimes the rabbit has good suggestions or catches small things. Today, not so much :) You can ignore them.

@strickvl (Contributor) left a comment

Nice work! This looks good to go from my side. I'll let my colleague @bcdurak give it a review as he originally worked on some of this analytics code.

@strickvl (Contributor) commented Feb 8, 2024

@marwan37 sorry, the line I suggested seems to have been too long. I'd break up that logger statement, and then I'll rerun the CI.

@marwan37 (Contributor, Author) commented Feb 8, 2024

@strickvl, I've adjusted the logger statement. Thanks for pointing that out.

@strickvl self-requested a review on February 9, 2024, 12:10
@bcdurak (Contributor) left a comment

I left some comments on the changes. Feel free to ask any questions if anything is unclear.

```diff
@@ -34,6 +34,7 @@
 logger = getLogger(__name__)

 analytics.write_key = "tU9BJvF05TgC29xgiXuKF7CuYP0zhgnx"
+analytics.max_retries = 0
```
@bcdurak (Contributor) commented:

Lowering this number makes a lot of sense. Considering the timeout is set to 15 seconds by default, even in the case of a timeout it should never take that long. Perhaps 3 or 5 retries might be more suitable here.

@marwan37 (Contributor, Author) commented:

Hello @bcdurak, I appreciate the review and the insights. It's worth noting that with the default settings the CLI hung for ~10 minutes before exiting, likely due to exponential backoff.

I just tested max_retries at lower values, and here's what I got:

  • 3: 5 seconds
  • 5: 20 seconds
  • 6: 40 seconds

Setting it to 5 sounds like a balanced approach, as you suggested. I'll make the change.
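
These timings line up with exponential backoff. As a rough sanity check, here is a back-of-envelope sketch; the base-2 delay schedule (1s, 2s, 4s, ...) is an assumption about the client's retry behavior, and jitter plus per-request time would explain why the measured values fall below the bound:

```python
# Rough upper bound on cumulative retry delay, ASSUMING base-2
# exponential backoff (1s, 2s, 4s, ...) between attempts. The real
# client may also apply jitter, so observed times land below this.
def worst_case_backoff(max_retries: int) -> int:
    """Sum of back-off delays across all retries, ignoring request time."""
    return sum(2**attempt for attempt in range(max_retries))


for retries in (3, 5, 6):
    print(f"max_retries={retries}: up to ~{worst_case_backoff(retries)}s")
# max_retries=3: up to ~7s
# max_retries=5: up to ~31s
# max_retries=6: up to ~63s
```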

"""
if metadata is None:
metadata = {}

metadata.setdefault("event_success", True)

with MLStacksAnalyticsContext() as analytics_context:
return bool(analytics_context.track(event=event, properties=metadata))
try:
@bcdurak (Contributor) commented:

This is where it gets a bit tricky.

The MLStacksAnalyticsContext is implemented as a context manager. Any exception raised during the execution of code within its scope is handled by its __exit__ method. This way, we ensure that the actual execution cannot fail even if something goes wrong in the analytics, so this try-except is already covered by it.

You may have seen some error messages already, though. The reason is that, by default, the Segment analytics Python package uses something called a Consumer, which creates separate threads to upload and send out the events. By their nature, any calls happening within these threads are outside the scope of our context manager. However, if something goes wrong, they handle it the same way with a try-catch and emit an error log right here, which you may have seen already.

I see two solutions if you would like to get rid of the error logs (a sketch of the second follows below):

  1. You can disable the consumer paradigm by setting analytics.sync_mode to True. This way, the events are sent out by the main thread and MLStacksAnalyticsContext does all the error handling. However, in the case of an unresponsive server, this blocks the main thread for (analytics.max_retries + 1) * analytics.timeout seconds, so it is not ideal.
  2. You can disable the logger for the Segment analytics package and implement a custom analytics.on_error method to handle the same error message as a debug message.

Personally, I would recommend the second solution.
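
A minimal sketch of that second option, assuming the segment-analytics-python package: the "segment" logger name and the on_error(error, items) signature are assumptions to verify against the installed library version:

```python
# Sketch of the recommended approach -- the "segment" logger name and
# the on_error(error, items) signature are assumptions about the library.
import logging

import segment.analytics as analytics

logger = logging.getLogger(__name__)

# Silence the library's own error output from its consumer threads...
logging.getLogger("segment").setLevel(logging.CRITICAL)


def on_analytics_error(error, items):
    # ...and surface delivery failures as a quiet debug message instead.
    logger.debug("Failed to send analytics events: %s", error)


analytics.on_error = on_analytics_error
```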

@marwan37 (Contributor, Author) commented:

Thank you for the thorough explanation of MLStacksAnalyticsContext and the behavior of Segment's consumer threads. These issues were noted in the original GitHub ticket, which led me to initially consider a threading-based solution to the main problem. However, as you mentioned, they are outside the scope of mlstacks' analytics client.

Given the nuances you've outlined, I agree that the second solution is more appropriate. I'll work on implementing the analytics.on_error method with non-disruptive error handling.

@marwan37 (Contributor, Author) commented Feb 9, 2024

I've implemented the custom on_error handler as recommended, including corresponding unit tests to verify its functionality. If there are any further adjustments or tests you think should be included, please let me know.

Edit: To resolve the test suite failures during the CI, I've revised the test to generate a unique write_key per session, and reverted to the original key and configurations post-tests.

Edit 2: Circling back, it turns out the initial fix using a unique write_key didn't quite tackle the problem. I simplified the test to examine log outputs directly and removed the use of mocking. Thanks for bearing with me.
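
For illustration, a log-based test of the kind described might look like the following pytest sketch; on_analytics_error is a hypothetical stand-in for the handler name this PR actually uses, and the asserted message text is likewise illustrative:

```python
# Hypothetical pytest sketch; `on_analytics_error` stands in for the
# handler added in this PR, and the message text is illustrative.
import logging

# from mlstacks.analytics.client import on_analytics_error  # hypothetical path


def test_on_error_logs_debug_message(caplog):
    with caplog.at_level(logging.DEBUG):
        on_analytics_error(ConnectionError("connection refused"), items=[])
    assert "Failed to send analytics events" in caplog.text
```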

@strickvl requested a review from bcdurak on February 9, 2024, 20:47
@strickvl (Contributor) commented:

@marwan37 I cloned your fork and checked out your branch, but when running this I still get the following output when testing locally:

[Screenshot: CleanShot 2024-02-13 at 16 21 52]

i.e. it doesn't seem to pick up your custom error at all. Looking at their codebase, it's not clear to me whether on_error is actually doing what we think it's doing... @bcdurak, any thoughts? This, for example, is the place where the log I'm getting is generated.

@strickvl self-requested a review on February 13, 2024, 15:45
@marwan37 (Contributor, Author) commented:

@strickvl, yes, I had those errors too. My apologies, I should have documented that behavior earlier. These errors seem to be handled internally by Segment and won't trigger the custom on_error; the handler appears designed primarily for errors that occur after the retry logic has concluded. Without directly modifying the Segment library, I couldn't find a way to gain more control over those logs, but I'd be happy to explore further if needed.

@strickvl (Contributor) left a comment

Just a couple of changes will fix this so it works the way we want. We can disable the Segment logger itself (using the suggested code) so that we don't see the logged output we can't control at all. We tested on our end that on_error is indeed triggering and outputting what we want when logging verbosity is set to DEBUG.

@strickvl (Contributor) commented:

@marwan37 can you run the format / linting script on the files on your end and push any changes that get made?

@marwan37 force-pushed the fix/analytics-connection-hang branch 2 times, most recently from d37cda0 to 4766b7f, on February 14, 2024, 14:07
@bcdurak (Contributor) left a comment

@marwan37 The changes look good. Thank you so much for your contribution 😃

@strickvl merged commit b64ccd1 into zenml-io:develop on Feb 15, 2024. 34 checks passed.