
Replica Function support (Lambda@Edge) #364

Open
iDVB opened this issue Oct 26, 2018 · 13 comments

Comments

@iDVB
Contributor

iDVB commented Oct 26, 2018

I'm still in the process of testing out what exactly the issue might be, but I have a good hunch that the service does not know how to deal with edge function logs.

Likely this is because they're located in the region closest to the end user (the edge location). This means that when an end user makes a request near Oregon, the logs for that Lambda will end up in the Oregon region.

Additionally, the log group prefix is different as well, in that it's:
/aws/lambda/us-east-1.demo-stack-proxy-dev-originRequest
vs. the traditional:
/aws/lambda/demo-stack-proxy-dev-originRequest

I thought I might be able to shim around the issue by putting MONITORED=0 on all the replica functions... but it appears (from the console, anyway) that you can't tag replica functions.

I'm not sure how you can even programmatically detect whether they are replica functions... looking into that now.

UPDATE:
You can't see the function via aws lambda list-functions, but you can see it with aws lambda get-function directly. However, there does not seem to be any replica-identifying trait other than that all replica function names start with us-east-1., regardless of which region a developer usually uses. Edge function masters always get created in us-east-1.

I'm going to start looking into your source to see if I can just add a check for us-east-1. as a function name prefix and exclude those for now. Later, maybe there is even a way to get these logging by adding subscription filters to the region-local log groups.
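A prefix check along those lines could be sketched as follows (helper names and the regex are my own, not from the codebase; a broad region pattern like this could in principle false-positive on an ordinary function whose name happens to look like a region prefix):

```javascript
// Edge replica functions seem identifiable only by the "<region>." prefix on
// the function name, e.g. "us-east-1.demo-stack-proxy-dev-originRequest".
const EDGE_NAME_RE = /^([a-z]{2}-[a-z]+-\d)\.(.+)$/;

// True when the function name carries a region prefix (i.e. is a replica).
const isEdgeReplica = (functionName) => EDGE_NAME_RE.test(functionName);

// Strip the region prefix to recover the master function name.
const masterFunctionName = (functionName) => {
  const match = EDGE_NAME_RE.exec(functionName);
  return match === null ? functionName : match[2];
};
```

With helpers like these, the monitor could simply skip any function where isEdgeReplica returns true as a first step.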

@simlu
Collaborator

simlu commented Oct 26, 2018

So I take it deploying this monitor to all regions that could have edge function executions doesn't work (yet)? Targeting regional function log groups shouldn't be hard.

Why do you want to disable monitoring for those functions? Is it more broken than simply not working?

What issues are you seeing? As I said, this service works great for us, but we are not currently using edge Lambda functions.

@iDVB
Contributor Author

iDVB commented Oct 26, 2018

No; as-is, it seems to be fully broken. I think the dummy custom resource that initially calls those subscribe functions is failing, causing the deploy to fail and roll back. Maybe this has to do with it trying to locate a replica's associated log group and failing?

Deploying to all regions might work, but it's a bit overkill in my mind, especially since the set of regions "likely to have edge functions" is all of them. That's how they work: they get replicated to all edge locations and sit there, ready for whenever an end user visits from that area. So it would be impossible to predict.

I'm going to dive into the code a bit more and see if I can see exactly what the issue might be. But let me know if you have any ideas. I'm very happy to test them.

@simlu
Collaborator

simlu commented Oct 26, 2018

I'd love to see what the CloudFormation template / resource Lambda function logs look like (you need to be quick to find them before everything is torn down).

Also, I'm sure @MacMcIrish has some input here on what would be interesting to look at!

@iDVB
Contributor Author

iDVB commented Oct 26, 2018

AFK for 1 hr, then I'll look for you.
Also, I think I can alter these log references and have it check for replica log groups across regions:

logGroupNamePrefix: `/aws/lambda/${f.FunctionName}`
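Building on that prefix, a cross-region lookup could derive the replica log group name from the master function name (a sketch; the helper name is my own, and it assumes the us-east-1. naming observed above holds for all replicas):

```javascript
// Replica log groups live in each execution region but keep the master
// region's "us-east-1." prefix in their name (per the naming observed above).
const edgeLogGroupName = (functionName) =>
  `/aws/lambda/us-east-1.${functionName}`;

// A cross-region scan would then construct one CloudWatchLogs client per
// region and query this prefix in each, along the lines of:
//   new AWS.CloudWatchLogs({ region }).describeLogGroups({
//     logGroupNamePrefix: edgeLogGroupName(f.FunctionName)
//   }).promise()
```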

@MacMcIrish
Contributor

@iDVB Failing on the post-deploy hook via a dummy resource is a frustrating issue that I plan to make much easier to understand :) Up until now I was the only one suffering from this, so I'm sorry to put you through my pain. Fortunately this is only an issue on the very first deploy of the stack, since when the initial deploy fails, the teardown of the stack also deletes the log group.

To examine the log group before the teardown happens:

  1. Open the Cloudwatch Management Console in the targeted region
  2. Deploy the stack
  3. As the stack is deploying, refresh the page until /aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-subscribe and /aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-set-retention appear. Open the log groups immediately.
  4. Once logs are attached to this log group, you can find the error output explaining why the custom resource failed to deploy.

Once you have this info, it'll be a lot easier to track down the reason that this is failing to stabilise on deploy. Looking forward to hearing back from you!

@iDVB
Contributor Author

iDVB commented Oct 26, 2018

Thanks! Actually, it seems to successfully send the error to Rollbar... so I CAN see that it's due to

ResourceNotFoundException: The specified log group does not exist.

which led me to hypothesize that it's unable to find the replica log groups.

Testing your steps to catch the logs though...
OUTPUT:

START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:36:58.132Z	NOTTELLING	Response body:
{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
    "PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
    "StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
    "RequestId": "NOTTELLING",
    "LogicalResourceId": "TriggerLambdaSubscribeResource",
    "Data": {}
}

2018-10-26T17:36:58.370Z	NOTTELLING	Status code: 200
2018-10-26T17:36:58.370Z	NOTTELLING	Status message: OK
2018-10-26T17:36:58.607Z	NOTTELLING
{
    "errorMessage": "The specified log group does not exist.",
    "errorType": "ResourceNotFoundException",
    "stackTrace": [
        "Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
        "IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
        "emitNone (events.js:111:20)",
        "IncomingMessage.emit (events.js:208:7)"
    ]
}

END RequestId: NOTTELLING
REPORT RequestId: NOTTELLING	Duration: 2351.20 ms	Billed Duration: 2400 ms Memory Size: 256 MB	Max Memory Used: 57 MB	
START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:37:13.071Z	NOTTELLING	Response body:
{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
    "PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
    "StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
    "RequestId": "NOTTELLING",
    "LogicalResourceId": "TriggerLambdaSubscribeResource",
    "Data": {}
}

2018-10-26T17:37:13.529Z	NOTTELLING	Status code: 200
2018-10-26T17:37:13.529Z	NOTTELLING	Status message: OK
2018-10-26T17:37:13.711Z	NOTTELLING
{
    "errorMessage": "The specified log group does not exist.",
    "errorType": "ResourceNotFoundException",
    "stackTrace": [
        "Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
        "IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
        "emitNone (events.js:111:20)",
        "IncomingMessage.emit (events.js:208:7)"
    ]
}

@iDVB
Contributor Author

iDVB commented Oct 26, 2018

Looks like I can see pretty much the same thing in the Rollbar logs.

@simlu
Collaborator

simlu commented Oct 26, 2018

@iDVB Looks like this is exactly the problem that you anticipated. There are two things to do here:
(1) Fix the deploy: if the log group doesn't exist, it shouldn't subscribe. It should also look for the us-east-1.* log group and subscribe to that one if it exists.
(2) Make it so that CloudTrail-triggered subscription happens for log groups that are prefixed with a region.

Note: As part of (2) it might be desirable to also prevent subscription if already subscribed (that's #361), but that could also happen later.
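An idempotency check for the "already subscribed" case could be as small as this (a sketch; the helper name is my own, and it assumes the handler already has the subscriptionFilters array from CloudWatch Logs' describeSubscriptionFilters):

```javascript
// Given the subscriptionFilters array returned by describeSubscriptionFilters,
// decide whether our filter is already attached so we can skip re-subscribing.
const alreadySubscribed = (subscriptionFilters, ourFilterName) =>
  subscriptionFilters.some((f) => f.filterName === ourFilterName);
```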

@iDVB
Contributor Author

iDVB commented Oct 26, 2018

@MacMcIrish @simlu Any reason lambda.js isn't using AWS.Request.promise().then() rather than creating your own promises? https://aws.amazon.com/blogs/developer/support-for-promises-in-the-sdk/
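For reference, the two styles side by side (a sketch; the function names are my own, and the lambda argument is assumed to be an AWS.Lambda client from the v2 SDK):

```javascript
// Current pattern: wrap the callback API in a hand-rolled Promise.
const listFunctionsManual = (lambda) => new Promise((resolve, reject) => {
  lambda.listFunctions({}, (err, data) => (err ? reject(err) : resolve(data)));
});

// With built-in SDK support: every AWS.Request exposes .promise().
const listFunctionsBuiltIn = (lambda) => lambda.listFunctions({}).promise();
```

Both return a promise for the same response object, so callers shouldn't notice the difference.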

@simlu
Collaborator

simlu commented Oct 26, 2018

@iDVB Not really. That should probably be changed, provided the tests keep passing. I've already done that in other places last night.

@iDVB
Contributor Author

iDVB commented Oct 29, 2018

Working on a PR for this now...
#371

@simlu
Collaborator

simlu commented Nov 5, 2018

Task outline:

(1)
CloudTrail needs to be receiving events from all regions.
This is necessary so we can take action based on edge Lambda function log group creation.

(2)
The subscribe and set-retention functions need to be updated to handle "CloudTrail log group events" and "non-CloudTrail log group events" differently. For CloudTrail events, only the log group that triggered the invocation should be handled; otherwise we need to do an "expensive" check-all call.
This is necessary so we do not overload the system with requests; otherwise this could get expensive and we would run into throttling.

(3)
Handle edge log group subscription correctly: when an edge log group gets created, we need to check whether it is an edge log group and then check the master function for tags etc. This should be relatively easy.


I would start by figuring out what the Lambda event for the CloudTrail log group creation looks like (i.e. how much meta information is available).
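For the routing in (2), a first sketch of distinguishing the two invocation kinds might look like this (the helper name is my own, and the event shape is assumed to be the CloudWatch Events "AWS API Call via CloudTrail" format; this is exactly what should be verified by logging the real event):

```javascript
// Returns the created log group's name for a CloudTrail-delivered
// CreateLogGroup invocation, or null when the handler should fall back to
// the "expensive" check-all scan (e.g. the custom-resource trigger).
const extractCreatedLogGroup = (event) => {
  const detail = event.detail;
  if (detail !== undefined
    && detail.eventName === 'CreateLogGroup'
    && detail.requestParameters !== undefined) {
    return detail.requestParameters.logGroupName;
  }
  return null;
};
```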

@iDVB
Contributor Author

iDVB commented Nov 5, 2018

Going to test what we can see in the CT event by letting it trigger a function with
console.log(JSON.stringify(event, null, 2)) in it.
