
Large slowdowns in Python cold start times #195

Open
huonw opened this issue Jan 9, 2024 · 4 comments
Labels
bug Something isn't working

Comments


huonw commented Jan 9, 2024

Description

We recently upgraded our New Relic layers, and this appears to have caused a massive slowdown in our cold start times: the time from AWS's START log line to our code executing went from 100-200ms to 5-10s. The extra time is entirely attributable to New Relic.

For instance, we don't see this slowdown if we set the lambda to call our handler directly (not going via newrelic_lambda_wrapper.handler + NEW_RELIC_LAMBDA_HANDLER).

We're in ap-southeast-2 (Sydney), and are using layers like arn:aws:lambda:ap-southeast-2:451483290750:layer:NewRelicPython310:17. Comparing layer versions (i.e. changing the trailing 17) on a lambda with 256MB of memory, we find the following start times:

| version | start time (s) |
| --- | --- |
| 17 | 5.5-5.6 |
| 16 | 5.5-5.6 |
| 15 | 6.8-7.3 |
| 14 | 7.2-7.5 |
| 13 | 10.0-10.1 |
| ... | (similar) |
| 8 | ~10 |
| 7 | ~1.4 |
| ... | (similar) |
| 2 | ~1.5 |
| 1 | 0.1-0.2 |

In summary:

  • 1 -> 2: slow-down
  • 7 -> 8: slow-down
  • 13 -> 14: speed-up
  • 15 -> 16: speed-up

The final result is still a massive slowdown from where it used to be, and it causes real problems for us. For instance, Lambda hooks for Cognito have a strict 5 second time limit, and the initialisation time alone exceeds that, even without running any real code. (We do use provisioned concurrency, but we want our Lambdas to remain functional even if a spike in load exceeds the provisioning.)

> When Amazon Cognito calls your Lambda function, it must respond within 5 seconds. If it doesn't and if the call can be retried, Amazon Cognito retries the call. After three unsuccessful attempts, the function times out. You can't change this five-second timeout value.

(https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-identity-pools-working-with-aws-lambda-triggers.html#important-lambda-considerations)

Steps to Reproduce

  1. Create a large lambda (sorry, it's quite a lot of effort to narrow down the specifics; our lambdas include boto3 and sqlalchemy) with a New Relic layer, and have it print something as soon as the "real" code starts executing
  2. Invoke it
  3. BUG: observe there's a long time between AWS's START and the printing of the real code
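The repro above can be sketched as a minimal handler (a hypothetical stand-in for our real code; the only point is that the first line of user code prints a timestamp as soon as it runs):

```python
import time

# Runs at import time, i.e. during Lambda init; compare its timestamp
# with the INIT_START line in CloudWatch.
print(f"imported at {time.time():.3f}")

def handler(event, context):
    # First line of "real" code; compare its timestamp with the
    # START RequestId line to see the instrumentation overhead.
    print(f"Starting function... at {time.time():.3f}")
    return {"ok": True}
```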

Expected Behaviour

Start-up times like they used to be: a few milliseconds, not a few seconds.

Relevant Logs / Console output

Using one of the layers between version 8 and 13, with a lambda that has print("Starting function...") at the top of the real handler, the CloudWatch logs include something like:

```
2024-01-09T12:28:56.259+11:00	START RequestId: ad0b4e84-98db-46e3-96c6-f9f28b20e2b2 Version: $LATEST
2024-01-09T12:28:57.272+11:00	[NR_EXT] Sent 1/1 New Relic function log batches successfully with certainty in 517.487ms (516ms to transmit 0.3kB).
2024-01-09T12:29:06.353+11:00	Starting function...
```

Note how the start line is at 28:56.259 while the print happens at 29:06.353, more than 10s later.

Your Environment

AWS Lambda, Python 3.10, layer versions above.

Additional context

I imagine this may be caused by changes to the underlying agents, but I can't work out how to find the version of the agent that's included in a particular layer, e.g. it doesn't appear to be logged by the init routine.
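For what it's worth, if the layer ships the agent as a normally installed distribution (an assumption, as is the distribution name `newrelic`), the bundled version could be read at runtime with stdlib tooling, e.g. from the handler module:

```python
from importlib import metadata

def bundled_agent_version():
    """Return the installed New Relic agent version string, or None if
    no distribution is installed under the assumed name."""
    try:
        return metadata.version("newrelic")  # distribution name is an assumption
    except metadata.PackageNotFoundError:
        return None

print(bundled_agent_version())
```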

https://forum.newrelic.com/s/hubtopic/aAXPh0000000cJpOAI/new-relic-python-lambda-layers-performance-has-regressed-massively

@huonw huonw added the bug Something isn't working label Jan 9, 2024

huonw commented Jan 9, 2024

As a workaround, we might have to downgrade to arn:aws:lambda:ap-southeast-2:451483290750:layer:NewRelicPython310:1, but that seems like it would be quite old. Are there change logs for each layer version so we can work out the risks?


huonw commented Jan 16, 2024

For a really simple lambda:

```python
print("imported")

def handler(*args):
    print("started")
```

The numbers are a fair bit smaller, and the regression boundaries are somewhat different (regressions at v1 -> v2 and v15 -> v16), but they are still bad:

| version | time from AWS START to print("started") (s) |
| --- | --- |
| 17 | 1.5-1.6 |
| 16 | 1.5-1.6 |
| 15 | 0.5-0.6 |
| ... | (similar) |
| 2 | 0.5-0.6 |
| 1 | 0.2 |

There's also ~0.6s of init time overhead on a cold start, from INIT_START to print("imported"), which seems to be constant across all versions.

That is, a cold start of a lambda with New Relic instrumentation seems to spend a fair chunk of time (>2s) on the New Relic overhead.
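A rough way to measure this gap without instrumentation-specific knowledge, as a sketch: capture a monotonic timestamp at module import and report the elapsed time on the first invocation. Whatever runs between importing the user module and actually invoking the handler (the wrapper, in the instrumented case) shows up in the gap:

```python
import time

_import_ts = time.monotonic()  # captured when this module is imported
_first_call = True

def handler(event, context):
    global _first_call
    if _first_call:
        _first_call = False
        # On a cold start this gap includes everything that ran between
        # importing our module and invoking the handler.
        print(f"import-to-invoke gap: {time.monotonic() - _import_ts:.3f}s")
    return "started"
```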

pavelm10 commented Jan 17, 2024

Observing the same slowdown (from a few hundred milliseconds to 8-15 seconds) for arn:aws:lambda:eu-central-1:451483290750:layer:NewRelicPython311:11


huonw commented Feb 5, 2024

I've been communicating with New Relic support, and they suggested setting `NEW_RELIC_PACKAGE_REPORTING_ENABLED: false` (https://docs.newrelic.com/docs/apm/agents/python-agent/configuration/python-agent-configuration/#package-reporting-enabled) to bypass this (new) code: https://github.com/newrelic/newrelic-python-agent/blame/2b14392a19517a20012d281fbaaedfc2497f4fc3/newrelic/core/environment.py#L207-L245

This makes a big difference for us: the start time is now more like 0.1s, rather than 1.5s (for the reduced lambda) or 5s (for our real ones).
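For anyone else applying the workaround: the variable is ordinary Lambda configuration, so it can be set wherever the function's environment is defined. A sketch as an AWS SAM template fragment (the function name here is hypothetical):

```yaml
Resources:
  MyInstrumentedFunction:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          # Skips the agent's per-package version scan at startup
          NEW_RELIC_PACKAGE_REPORTING_ENABLED: "false"
```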
