feat: Add Alarms and end to end X-Ray
charles-marion committed Sep 24, 2024
1 parent 03744fc commit cf9ca30
Showing 27 changed files with 1,210 additions and 403 deletions.
2 changes: 1 addition & 1 deletion cli/magic-config.ts
@@ -831,7 +831,7 @@ async function processCreateOptions(options: any): Promise<void> {
{
type: "confirm",
name: "advancedMonitoring",
message: "Do you want to enable custom metrics and advanced monitoring?",
message: "Do you want to enable custom metrics, alarms and XRay?",
initial: options.advancedMonitoring || false,
},
{
16 changes: 8 additions & 8 deletions docs/.vitepress/config.mts
@@ -51,24 +51,24 @@ export default defineConfig({
{
text: 'Documentation',
items: [
{ text: 'Custom Public Domain', link: '/documentation/custom-public-domain' },
{ text: 'Private Chatbot', link: '/documentation/private-chatbot' },
{ text: 'AppSync', link: '/documentation/appsync' },
{ text: 'CloudFront Geo Restriction', link: '/documentation/cf-geo-restriction' },
{
text: 'Cognito Federation', items: [
{ text: 'Cognito Overview', link: '/documentation/cognito/overview' },
{ text: 'Keycloak SAML example', link: '/documentation/cognito/keycloak-saml' },
{ text: 'Keycloak OIDC example', link: '/documentation/cognito/keycloak-oidc' },
]
},
{ text: 'Model Requirements', link: '/documentation/model-requirements' },
{ text: 'Self-hosted models', link: '/documentation/self-hosted-models' },
{ text: 'Inference Script', link: '/documentation/inference-script' },
{ text: 'Custom Public Domain', link: '/documentation/custom-public-domain' },
{ text: 'Document Retrieval', link: '/documentation/retriever' },
{ text: 'AppSync', link: '/documentation/appsync' },
{ text: 'Inference Script', link: '/documentation/inference-script' },
{ text: 'Model Requirements', link: '/documentation/model-requirements' },
{ text: 'Precautions', link: '/documentation/precautions' },
{ text: 'Private Chatbot', link: '/documentation/private-chatbot' },
{ text: 'SageMaker Schedule', link: '/documentation/sagemaker-schedule' },
{ text: 'CloudFront Geo Restriction', link: '/documentation/cf-geo-restriction' },
{ text: 'Security', link: '/documentation/vulnerability-scanning' },
{ text: 'Precautions', link: '/documentation/precautions' }
{ text: 'Self-hosted models', link: '/documentation/self-hosted-models' },
]
}
],
27 changes: 27 additions & 0 deletions docs/documentation/monitoring.md
@@ -0,0 +1,27 @@
# Monitoring
By default, the project creates an [Amazon CloudWatch Dashboard](https://console.aws.amazon.com/cloudwatch). The dashboard is built with the [cdk-monitoring-constructs](https://github.com/cdklabs/cdk-monitoring-constructs) library, and you should adjust the metrics it tracks to match your project's needs.

The dashboard is defined in `lib/monitoring/index.ts`.

During the configuration step, the advanced settings let you enable advanced monitoring, which does the following:
* [Enables AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html), which collects traces that you can view by opening the [Trace Map](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-servicemap.html) in the CloudWatch console.
* Generates a custom metric for each LLM model used through Amazon Bedrock, allowing you to track token usage. These metrics are available in the dashboard and are created using [CloudWatch metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html).
* Creates sample CloudWatch Alarms.

***Cost***: Be mindful of the costs associated with AWS resources. Enabling advanced monitoring [adds custom metrics and alarms](https://aws.amazon.com/cloudwatch/pricing/) and [AWS X-Ray traces](https://aws.amazon.com/xray/pricing/).

## Recommended changes (Advanced monitoring)

### Receive alerts
The default setup monitors key resources such as the error rates of the APIs and the dead letter queues (a non-empty queue means the processing of LLM requests failed). All these alarms can be viewed in the Amazon CloudWatch console.

The state of these alarms is monitored by a [composite alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html), which sends an event to an SNS topic if any alarm is active.
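
To illustrate what "any alarm is active" means for a composite alarm, here is a minimal sketch of the rule expression CloudWatch evaluates. The helper `buildAnyAlarmRule` and the alarm names are hypothetical and only for illustration; in this project the actual rule is generated by the CDK code, not written by hand.

```typescript
// Hypothetical helper: builds a composite alarm rule expression that is
// true when ANY of the given child alarms is in the ALARM state.
function buildAnyAlarmRule(alarmNames: string[]): string {
  if (alarmNames.length === 0) {
    throw new Error("At least one alarm name is required");
  }
  // CloudWatch rule syntax: ALARM("name") combined with OR / AND / NOT.
  return alarmNames.map((alarmName) => `ALARM("${alarmName}")`).join(" OR ");
}

// Example alarm names are assumptions, not the names used by the project.
const exampleRule = buildAnyAlarmRule([
  "ApiErrorRate",
  "OutgoingMessagesDlqNotEmpty",
]);
```

Replacing `OR` with `AND` would make the composite alarm fire only when every child alarm is active, which is usually too strict for alerting.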

To receive notifications, add a [subscription](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html) (manually or in `lib/monitoring/index.ts`) to the topic listed in the CloudFormation output `CompositeAlarmTopicOutput`.
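
In this commit the stack output is set from the topic's `topicName`, while SNS subscription calls generally expect a topic ARN. The sketch below shows one way to derive the ARN from the output value, assuming the standard SNS ARN layout. The `TopicLocation` type, the `toTopicArn` helper, and all example values are hypothetical.

```typescript
// Hypothetical description of where the topic lives.
interface TopicLocation {
  partition: string; // e.g. "aws"
  region: string; // region the stack was deployed to
  accountId: string; // 12-digit AWS account ID
  topicName: string; // value of the CompositeAlarmTopicOutput output
}

// Builds an SNS topic ARN: arn:<partition>:sns:<region>:<account>:<topic-name>
function toTopicArn(location: TopicLocation): string {
  const { partition, region, accountId, topicName } = location;
  return `arn:${partition}:sns:${region}:${accountId}:${topicName}`;
}

// All values below are placeholders, not real resources.
const exampleTopicArn = toTopicArn({
  partition: "aws",
  region: "us-east-1",
  accountId: "123456789012",
  topicName: "example-composite-alarm-topic",
});
```

The resulting ARN can then be used when creating the subscription manually, for example in the SNS console or with `aws sns subscribe --topic-arn <arn> --protocol email --notification-endpoint <address>`.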

### Update alarms and their thresholds
The alarms listed in `lib/monitoring/index.ts` are examples and should be updated to match your project's needs. Refer to the [cdk-monitoring-constructs project](https://github.com/cdklabs/cdk-monitoring-constructs) for how to add or update alarms.

### Review AWS X-Ray sampling
Consider updating the default [AWS X-Ray sampling rules](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html) to control how much trace data is recorded.
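
To get a feel for how a sampling rule affects the volume of recorded traces, the sketch below estimates the number of requests sampled per second. It assumes the documented X-Ray model: a reservoir of requests recorded every second, plus a fixed rate applied to the requests beyond the reservoir. The `SamplingRule` type and `estimateSampledPerSecond` function are hypothetical names for illustration, and the result is an approximation rather than a billing calculation.

```typescript
// Hypothetical, simplified view of an X-Ray sampling rule.
interface SamplingRule {
  reservoirSize: number; // requests per second that are always recorded
  fixedRate: number; // fraction (0 to 1) of the remaining requests recorded
}

// Estimates how many requests are recorded in one second of traffic.
function estimateSampledPerSecond(
  requestsPerSecond: number,
  rule: SamplingRule
): number {
  const fromReservoir = Math.min(requestsPerSecond, rule.reservoirSize);
  const beyondReservoir = Math.max(0, requestsPerSecond - rule.reservoirSize);
  return fromReservoir + beyondReservoir * rule.fixedRate;
}

// The default X-Ray rule records the first request each second
// and 5% of any additional requests.
const defaultRule: SamplingRule = { reservoirSize: 1, fixedRate: 0.05 };
const sampledAt101Rps = estimateSampledPerSecond(101, defaultRule);
```

Lowering `fixedRate` reduces trace volume (and cost) at high traffic, while the reservoir keeps at least some traces flowing when traffic is low.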

6 changes: 5 additions & 1 deletion docs/guide/deploy.md
@@ -83,6 +83,8 @@ You have:

## Deployment

Before you start, please read the [precautions](../documentation/precautions.md) and [security](../documentation/vulnerability-scanning.md) pages.

**Step 1.** Clone the repository.

```bash
@@ -178,7 +180,9 @@ REACT_APP_URL=https://dxxxxxxxxxxxxx.cloudfront.net pytest integtests/user_inter

## Monitoring

Once the deployment is complete, a [CloudWatch Dashboard](https://console.aws.amazon.com/cloudwatch) will be available in the selected region to monitor the usage of the resources.
Once the deployment is complete, an [Amazon CloudWatch Dashboard](https://console.aws.amazon.com/cloudwatch) will be available in the selected region to monitor the usage of the resources.

For more information, please refer to [the monitoring page](../documentation/monitoring.md).


## Run user interface locally
1 change: 0 additions & 1 deletion integtests/chatbot-api/embedding_test.py
@@ -13,4 +13,3 @@ def test_calculate(client: AppSyncClient, default_embed_model, default_provider)

assert len(result) == 1
assert len(result[0].get("vector")) == 1536
assert result[0].get("vector")[0] == 0.03729608149230709
22 changes: 20 additions & 2 deletions lib/aws-genai-llm-chatbot-stack.ts
@@ -220,7 +220,7 @@ export class AwsGenAILLMChatbotStack extends cdk.Stack {
}

const monitoringStack = new cdk.NestedStack(this, "MonitoringStack");
new Monitoring(monitoringStack, "Monitoring", {
const monitoringConstruct = new Monitoring(monitoringStack, "Monitoring", {
prefix: props.config.prefix,
advancedMonitoring: props.config.advancedMonitoring === true,
appsycnApi: chatBotApi.graphqlApi,
@@ -243,6 +243,7 @@ export class AwsGenAILLMChatbotStack extends cdk.Stack {
"/aws/lambda/" + (r as lambda.Function).functionName
);
}),
cloudFrontDistribution: userInterface.cloudFrontDistribution,
cognito: {
userPoolId: authentication.userPool.userPoolId,
clientId: authentication.userPoolClient.userPoolClientId,
@@ -265,18 +266,34 @@ export class AwsGenAILLMChatbotStack extends cdk.Stack {
ragFunctionProcessing: [
...(ragEngines ? [ragEngines.dataImport.rssIngestorFunction] : []),
],
ragStateMachineProcessing: [
ragImportStateMachineProcessing: [
...(ragEngines
? [
ragEngines.dataImport.fileImportWorkflow,
ragEngines.dataImport.websiteCrawlingWorkflow,
]
: []),
],
ragEngineStateMachineProcessing: [
...(ragEngines
? [
ragEngines.auroraPgVector?.createAuroraWorkspaceWorkflow,
ragEngines.openSearchVector?.createOpenSearchWorkspaceWorkflow,
ragEngines.kendraRetrieval?.createKendraWorkspaceWorkflow,
ragEngines.deleteDocumentWorkflow,
ragEngines.deleteWorkspaceWorkflow,
]
: []),
],
});

if (monitoringConstruct.compositeAlarmTopic) {
new cdk.CfnOutput(this, "CompositeAlarmTopicOutput", {
key: "CompositeAlarmTopicOutput",
value: monitoringConstruct.compositeAlarmTopic.topicName,
});
}

/**
* CDK NAG suppression
*/
@@ -306,6 +323,7 @@ export class AwsGenAILLMChatbotStack extends cdk.Stack {
`/${this.stackName}/ChatBotApi/RestApi/GraphQLApiHandler/ServiceRole/Resource`,
`/${this.stackName}/ChatBotApi/RestApi/GraphQLApiHandler/ServiceRole/DefaultPolicy/Resource`,
`/${this.stackName}/ChatBotApi/Realtime/Resolvers/lambda-resolver/ServiceRole/Resource`,
`/${this.stackName}/ChatBotApi/Realtime/Resolvers/lambda-resolver/ServiceRole/DefaultPolicy/Resource`,
`/${this.stackName}/ChatBotApi/Realtime/Resolvers/outgoing-message-handler/ServiceRole/Resource`,
`/${this.stackName}/ChatBotApi/Realtime/Resolvers/outgoing-message-handler/ServiceRole/DefaultPolicy/Resource`,
`/${this.stackName}/IdeficsInterface/MultiModalInterfaceRequestHandler/ServiceRole/DefaultPolicy/Resource`,
11 changes: 11 additions & 0 deletions lib/chatbot-api/appsync-ws.ts
@@ -5,6 +5,7 @@ import {
Function as LambdaFunction,
LayerVersion,
LoggingFormat,
Tracing,
Runtime,
} from "aws-cdk-lib/aws-lambda";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";
@@ -23,6 +24,7 @@ interface RealtimeResolversProps {
readonly shared: Shared;
readonly api: appsync.GraphqlApi;
readonly logRetention?: number;
readonly advancedMonitoring?: boolean;
}

export class RealtimeResolvers extends Construct {
@@ -47,7 +49,9 @@ export class RealtimeResolvers extends Construct {
handler: "index.handler",
description: "Appsync resolver handling LLM Queries",
runtime: Runtime.PYTHON_3_11,
tracing: props.advancedMonitoring ? Tracing.ACTIVE : Tracing.DISABLED,
environment: {
...props.shared.defaultEnvironmentVariables,
SNS_TOPIC_ARN: props.topic.topicArn,
},
logRetention: props.logRetention,
@@ -64,11 +68,18 @@ export class RealtimeResolvers extends Construct {
__dirname,
"functions/outgoing-message-appsync/index.ts"
),
bundling: {
externalModules: ["aws-xray-sdk-core", "@aws-sdk"],
},
layers: [powertoolsLayerJS],
handler: "index.handler",
description: "Sends LLM Responses to Appsync",
runtime: Runtime.NODEJS_18_X,
loggingFormat: LoggingFormat.JSON,
tracing: props.advancedMonitoring ? Tracing.ACTIVE : Tracing.DISABLED,
logRetention: props.logRetention,
environment: {
...props.shared.defaultEnvironmentVariables,
GRAPHQL_ENDPOINT: props.api.graphqlUrl,
},
vpc: props.shared.vpc,
9 changes: 9 additions & 0 deletions lib/chatbot-api/functions/outgoing-message-appsync/index.ts
@@ -11,11 +11,16 @@ import type {
SQSBatchResponse,
} from "aws-lambda";
import { graphQlQuery } from "./graphql";
import * as AWSXRay from "aws-xray-sdk-core";

// Configure the context missing strategy to do nothing
AWSXRay.setContextMissingStrategy(() => {});

const processor = new BatchProcessor(EventType.SQS);
const logger = new Logger();

const recordHandler = async (record: SQSRecord): Promise<void> => {
const segment = AWSXRay.getSegment(); //returns the facade segment
const payload = record.body;
if (payload) {
const item = JSON.parse(payload);
@@ -44,7 +49,11 @@ const recordHandler = async (record: SQSRecord): Promise<void> => {
}
`;
//logger.info(query);
const subsegment = segment?.addNewSubsegment("AppSync - Publish Response");
subsegment?.addMetadata("sessionId", req.data.sessionId);
await graphQlQuery(query);
subsegment?.close();

//logger.info(resp);
}
};
@@ -1 +1,2 @@
pydantic==2.4.0
aws_xray_sdk==2.14.0
7 changes: 4 additions & 3 deletions lib/chatbot-api/index.ts
@@ -79,11 +79,11 @@ export class ChatBotApi extends Construct {
],
},
logConfig: {
fieldLogLevel: appsync.FieldLogLevel.ALL,
retention: RetentionDays.ONE_WEEK,
fieldLogLevel: appsync.FieldLogLevel.INFO,
retention: props.config.logRetention ?? RetentionDays.ONE_WEEK,
role: loggingRole,
},
xrayEnabled: true,
xrayEnabled: props.config.advancedMonitoring === true,
visibility: props.config.privateWebsite
? appsync.Visibility.PRIVATE
: appsync.Visibility.GLOBAL,
@@ -104,6 +104,7 @@ export class ChatBotApi extends Construct {
...props,
api,
logRetention: props.config.logRetention,
advancedMonitoring: props.config.advancedMonitoring,
});

this.resolvers.push(realtimeBackend.resolvers.sendQueryHandler);
4 changes: 3 additions & 1 deletion lib/chatbot-api/rest-api.ts
@@ -52,7 +52,9 @@ export class ApiResolvers extends Construct {
architecture: props.shared.lambdaArchitecture,
timeout: cdk.Duration.minutes(10),
memorySize: 512,
tracing: lambda.Tracing.ACTIVE,
tracing: props.config.advancedMonitoring
? lambda.Tracing.ACTIVE
: lambda.Tracing.DISABLED,
logRetention: props.config.logRetention ?? logs.RetentionDays.ONE_WEEK,
loggingFormat: lambda.LoggingFormat.JSON,
layers: [props.shared.powerToolsLayer, props.shared.commonLayer],
41 changes: 40 additions & 1 deletion lib/chatbot-api/websocket-api.ts
@@ -4,6 +4,7 @@ import * as iam from "aws-cdk-lib/aws-iam";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as xray from "aws-cdk-lib/aws-xray";
import { Construct } from "constructs";

import { Shared } from "../shared";
@@ -18,6 +19,7 @@ interface RealtimeGraphqlApiBackendProps {
readonly userPool: UserPool;
readonly api: appsync.GraphqlApi;
readonly logRetention?: number;
readonly advancedMonitoring?: boolean;
}

export class RealtimeGraphqlApiBackend extends Construct {
@@ -32,7 +34,43 @@ export class RealtimeGraphqlApiBackend extends Construct {
) {
super(scope, id);
// Create the main Message Topic acting as a message bus
const messagesTopic = new sns.Topic(this, "MessagesTopic");
const messagesTopic = new sns.Topic(this, "MessagesTopic", {
tracingConfig: props.advancedMonitoring
? sns.TracingConfig.ACTIVE
: sns.TracingConfig.PASS_THROUGH,
});

if (props.advancedMonitoring) {
// https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sns.html#xray-services-sns-configuration
const stack = cdk.Stack.of(scope);
new xray.CfnResourcePolicy(this, "SNSResourcePolicy", {
policyName: "SNSResourcePolicy",
policyDocument: JSON.stringify(
new iam.PolicyDocument({
statements: [
new iam.PolicyStatement({
effect: iam.Effect.ALLOW,
principals: [new iam.ServicePrincipal("sns.amazonaws.com")],
actions: [
"xray:PutTraceSegments",
"xray:GetSamplingRules",
"xray:GetSamplingTargets",
],
resources: ["*"],
conditions: {
StringEquals: {
"aws:SourceAccount": stack.account,
},
StringLike: {
"aws:SourceArn": `arn:${stack.partition}:sns:${stack.region}:${stack.account}:*`,
},
},
}),
],
})
),
});
}

const deadLetterQueue = new sqs.Queue(this, "OutgoingMessagesDLQ", {
enforceSSL: true,
@@ -66,6 +104,7 @@ export class RealtimeGraphqlApiBackend extends Construct {
shared: props.shared,
api: props.api,
logRetention: props.logRetention,
advancedMonitoring: props.advancedMonitoring,
});

// Route all outgoing messages to the websocket interface queue
4 changes: 3 additions & 1 deletion lib/model-interfaces/idefics/index.ts
@@ -58,7 +58,9 @@ export class IdeficsInterface extends Construct {
handler: "index.handler",
layers: [props.shared.powerToolsLayer, props.shared.commonLayer],
architecture: props.shared.lambdaArchitecture,
tracing: lambda.Tracing.ACTIVE,
tracing: props.config.advancedMonitoring
? lambda.Tracing.ACTIVE
: lambda.Tracing.DISABLED,
timeout: cdk.Duration.minutes(lambdaDurationInMinutes),
memorySize: 1024,
logRetention: props.config.logRetention ?? logs.RetentionDays.ONE_WEEK,
4 changes: 3 additions & 1 deletion lib/model-interfaces/langchain/index.ts
@@ -38,7 +38,9 @@ export class LangChainInterface extends Construct {
description: "Langchain request handler",
runtime: props.shared.pythonRuntime,
architecture: props.shared.lambdaArchitecture,
tracing: lambda.Tracing.ACTIVE,
tracing: props.config.advancedMonitoring
? lambda.Tracing.ACTIVE
: lambda.Tracing.DISABLED,
timeout: cdk.Duration.minutes(15),
memorySize: 1024,
logRetention: props.config.logRetention ?? logs.RetentionDays.ONE_WEEK,