- Infrastructure overview
- Using the logs
- git hooks
- Testing
- Deploying
- Database management
- One-off tasks
- Test Loading Commands
- How messages are queued and sent
- Writing public APIs
- API Usage
- Queues and tasks
- Notify.gov
- Pull Requests
- Code Reviews
- Run Book
- Data Storage Policies & Procedures
- Troubleshooting
A diagram of the system is available in our compliance repo.
Notify is a Flask application running on cloud.gov, which also brokers access to a PostgreSQL database and Redis store.
In addition to the Flask app, Notify uses Celery to manage the task queue. Celery stores tasks in Redis.
Application, infrastructure, and compliance work is spread across several repositories:
- notifications-api for the API app
- notifications-admin for the Admin UI app
- notifications-utils for common library functions
In addition to terraform directories in the api and admin apps above:
- usnotify-ssb A supplemental service broker that provisions SES and SNS for us
- ttsnotify-brokerpak-sms The brokerpak defining SNS (SMS sending)
- datagov-brokerpak-smtp The brokerpak defining SES
- cg-egress-proxy The caddy proxy that allows external API calls
- us-notify-compliance for OSCAL control documentation and diagrams
We use Terraform to manage our infrastructure, providing consistent setups across the environments.
Our Terraform configurations manage components via cloud.gov. This means that the configurations should work out of the box if you are using a Cloud Foundry platform, but will not work for setups based on raw AWS.
There are several remote services required for local development:
- S3
- SES
- SNS
Credentials for these services are created by running:
cd terraform/development
./run.sh
in both the api and admin repositories.

This will append credentials to your `.env` file. You will need to manually clean up any prior runs from that file if you run that command again.
You can remove your development infrastructure by running `./run.sh -d`.

`./reset.sh` can be used to import your development infrastructure information on a new computer or new working tree when the old Terraform state file was not transferred.
`./reset.sh -u USER_TO_OFFBOARD` can be used to import another user's development resources in order to clean them up. Steps for use:

- Temporarily move your existing Terraform state file aside so it is not overwritten.
- Run `./reset.sh -u USER_TO_OFFBOARD`.
- Answer no to the prompt about creating missing resources.
- Run `./run.sh -u USER_TO_OFFBOARD -d` to fully remove the rest of that user's resources.
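Putting those steps together, a minimal sketch of an offboarding session (assuming Terraform's default local state file name, `terraform.tfstate`) might look like:

```bash
cd terraform/development

# Set your own state aside first so the import doesn't overwrite it
mv terraform.tfstate terraform.tfstate.mine

./reset.sh -u USER_TO_OFFBOARD   # answer 'no' to the prompt about creating missing resources
./run.sh -u USER_TO_OFFBOARD -d  # destroy the remainder of their resources

# Restore your own state
mv terraform.tfstate.mine terraform.tfstate
```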
The cloud.gov environment is configured with Terraform. See the `terraform` folder to learn about that.
In addition to services provisioned through cloud.gov, we have several services provisioned via supplemental service brokers in AWS. Our AWS services are currently located in several regions using Studio-controlled AWS accounts.
To send messages, we use Amazon Web Services SNS and SES. In addition, we use AWS Pinpoint to provision and manage phone numbers, short codes, and long codes for sending SMS.
In SNS, we have 3 topics for SMS receipts. These are not currently functional, so senders won't know the status of messages.
Through Pinpoint, the API needs at least one number so that the application itself can send SMS for authentication codes.
The API also has access to AWS S3 buckets for storing CSVs of messages and contact lists. It does not access a third S3 bucket that stores agency logos.
We are using New Relic for application monitoring and error reporting. When requesting access to New Relic, ask to be added to the Benefits-Studio subaccount.
- Join the GSA GitHub org
- Get permissions for the repos via GitHub teams
- Get access to the cloud.gov org and spaces
- Set up a user account on the staging site
- Get access to AWS, if necessary
- Get access to New Relic, if necessary
- Create the local `.env` file by copying `sample.env` and running `./run.sh` within the `terraform/development` folder (see these docs)
- Run through the local setup process
- Review the system diagram
- Do stuff!
Upon completion, an admin should update 🔒the permissions and access tracker.
These steps are required for new cloud.gov environments. Local development borrows SES & SNS infrastructure from the `notify-staging` cloud.gov space, so these steps are not required for new developers.
Steps for deploying production from scratch. These can be updated for a new cloud.gov environment by subbing out `prod` or `production` for your desired environment within the steps.
- Deploy API app:
  - Update `terraform-production.yml` and `deploy-prod.yml` to point to the correct space and git branch.
  - Ensure that the `domain` module is commented out in `terraform/production/main.tf`.
  - Run the CI/CD pipeline on the `production` branch by opening a PR from `main` to `production`.
  - Create any necessary DNS records (check the `notify-api-ses-production` service credentials for instructions) within https://github.com/18f/dns.
  - Follow the "Steps to prepare SES" below.
  - (Optional) If using a public API route, uncomment the `domain` module and re-trigger a deploy.
- Deploy Admin app:
  - Update `terraform-production.yml` and `deploy-prod.yml` to point to the correct space and git branch.
  - Ensure that the `api_network_route` and `domain` modules are commented out in `terraform/production/main.tf`.
  - Run the CI/CD pipeline on the `production` branch by opening a PR from `main` to `production`.
  - Create DNS records for the `domain` module within https://github.com/18f/dns.
  - Uncomment the `api_network_route` and `domain` modules and re-trigger a deploy.
- After the first deploy of the application with the SSB-brokered SES service completes:
  - Log into the SES console and navigate to the SNS subscription page.
  - Select "Request confirmation" for any subscriptions still in the "Pending Confirmation" state.
  - Find and replace instances in the repo of "testsender", "testreceiver", and "dispostable.com" with your origin and destination email addresses, which you verified in step 1 above.
TODO: create env vars for these origin and destination email addresses for the root service, and create new migrations to update postgres seed fixtures
This should be complete for all regions Notify.gov has been deployed to or is currently planned to be deployed to.
- Visit the SNS console for the region you will be sending from. Notes:
  - SNS settings are per-region, so each environment must have its own region.
  - Pinpoint and SNS have confusing regional availability, so ensure both are available before submitting any requests.
- Choose `Text messaging (SMS)` from the sidebar.
- Click the `Exit SMS Sandbox` button and submit the support request. This request should take at most a day to complete. Be sure to request a higher sending limit at the same time.
- Go to the Pinpoint console for the same region you are using SNS in.
- In the lefthand sidebar, go to `SMS and Voice` (bottom) and choose `Phone Numbers`.
- Under `Number Settings` choose `Request Phone Number`.
- Choose Toll-free number, tick SMS, untick Voice, choose `transactional`, hit next and then `request`.
- Select `Toll-free registrations` and `Create registration`.
- Select the number you just created and then `Register existing toll-free number`.
- Complete and submit the form. Approval usually takes about 2 weeks.
- See the run book for information on how to set those numbers.
Example answers for toll-free registration form
If you're using the `cf` CLI, you can run `cf logs notify-api-ENV` and/or `cf logs notify-admin-ENV` to stream logs in real time. Add `--recent` to get the last few logs, though logs often move pretty quickly.
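For example, to stream or fetch recent logs from the staging API app:

```bash
cf logs notify-api-staging           # stream logs in real time
cf logs notify-api-staging --recent  # dump the most recent logs and exit
```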
For general log searching, the cloud.gov Kibana instance is powerful, though quite complex to get started with. For shortcuts to errors, some team members have New Relic access.
The links below will open a filtered view with logs from both applications, which can then be filtered further. However, for the links to work, you need to paste them into the URL bar while already logged into and viewing the Kibana page. If not, you'll just be redirected to the generic dashboard.
- Production: https://logs.fr.cloud.gov/app/discover#/view/218a6790-596d-11ee-a43a-090d426b9a38
- Demo: https://logs.fr.cloud.gov/app/discover#/view/891392a0-596e-11ee-921a-1b6b2f4d89ed
- Staging: https://logs.fr.cloud.gov/app/discover#/view/73d7c820-596e-11ee-a43a-090d426b9a38
Once in the view, you'll likely want to adjust the time range in the upper right of the page.
We're using `pre-commit` to manage hooks in order to automate common tasks or easily-missed cleanup. It's installed as part of `make bootstrap` and is limited to this project's virtualenv.
To run the hooks in advance of a `git` operation, use `poetry run pre-commit run`. For running across the whole codebase (useful after adding a new hook), use `poetry run pre-commit run --all-files`.
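For instance:

```bash
poetry run pre-commit run              # run hooks against currently staged files
poetry run pre-commit run --all-files  # run hooks across the whole codebase
```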
The configuration is stored in `.pre-commit-config.yaml`. In that config, there are links to the repos from which the hooks are pulled, so hop through there if you want a detailed description of what each one is doing.
We do not maintain any hooks in this repository.
# install dependencies, etc.
make bootstrap
# Create test database
createdb test_notification_api
make test
This will run:
- flake8 for code styling
- isort for import styling
- pytest for the test suite
On GitHub, in addition to these tests, we run:
- bandit for code security
- pip-audit for dependency vulnerabilities
- OWASP for dynamic scanning
We're using GitHub Actions. See /.github for the configuration.
In addition to commit-triggered scans, the `daily_checks.yml` workflow runs the relevant dependency audits, static scan, and/or dynamic scans at 10am UTC each day. Developers will be notified of failures in daily scans by GitHub notifications.
Within GitHub Actions, several scans take place every day to ensure security and compliance.
`daily_checks.yml` runs `pip-audit`, `bandit`, and OWASP scans to ensure that any newly found vulnerabilities do not impact Notify. Failures should be addressed quickly, as they will also block the next attempted deploy.
`drift.yml` checks the deployed infrastructure against the expected configuration. A failure here is a flag to check audit logs for unexpected access and/or behavior, and potentially to destroy and re-deploy the application. Destruction and redeployment of all underlying infrastructure is an extreme remediation and should only be attempted after ensuring that a good database backup is in hand.
If you're checking out the system locally, you may want to create a user quickly.
poetry run flask command create-test-user
This will run an interactive prompt to create a user, and then mark that user as active. Use a real mobile number if you want to log in, as the SMS auth code will be sent here.
- Run `make run-flask` from within the dev container.
- On your host machine, run:
docker run -v $(pwd):/zap/wrk/:rw --network="notify-network" -t owasp/zap2docker-weekly zap-api-scan.py -t http://dev:6011/docs/openapi.yml -f openapi -c zap.conf
The equivalent command if you are running the API locally:
docker run -v $(pwd):/zap/wrk/:rw -t owasp/zap2docker-weekly zap-api-scan.py -t http://host.docker.internal:6011/docs/openapi.yml -f openapi -c zap.conf -r report.html
In order to run end-to-end (E2E) tests, which are managed and handled in the admin project, a bit of extra configuration needs to be accounted for here on the API side as well. These instructions are in the README as they are necessary for project setup, and they're copied here for reference.
In the `.env` file, you should see this section:
#############################################################
# E2E Testing
[email protected]
NOTIFY_E2E_TEST_PASSWORD="don't write secrets to the sample file"
You can leave the email address alone or change it to something else to your liking.
You should absolutely change the `NOTIFY_E2E_TEST_PASSWORD` environment variable to something else, preferably a lengthy passphrase.
With those two environment variables set, the database migrations will run properly and an E2E test user will be ready to go for use in the admin project.
Note: Whatever you set these two environment variables to, you'll need to match their values on the admin side. Please see the admin README and documentation for more details.
Feature flagging is now implemented in the Admin application to allow conditional enabling of features. The current setup uses environment variables, which can be configured via the command line with Cloud Foundry (CF). These settings should be defined in each relevant .yml file and committed to source control.
To adjust a feature flag, update the corresponding environment variable and redeploy as needed. This setup provides flexibility for enabling or disabling features without modifying the core application code.
Specifics on the commands can be found in the Admin Feature Flagging readme.
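As a hedged illustration only (the flag name below is a placeholder, not a real Notify flag; real names and values live in the deploy-config `.yml` files and the Admin Feature Flagging readme), toggling a flag with the CF CLI might look like:

```bash
# FEATURE_SOMETHING_ENABLED is a hypothetical flag name for illustration
cf set-env notify-admin-staging FEATURE_SOMETHING_ENABLED "True"
cf restage notify-admin-staging   # restage so the new value takes effect
```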
The API has 3 deployment environments, all of which deploy to cloud.gov:
- Staging, which deploys from `main`
- Demo, which deploys from `production`
- Production, which deploys from `production`

Configurations for these are located in the `deploy-config` folder. This setup is duplicated for the front end.
To trigger a new deploy, create a pull request from `main` to `production` in GitHub. This PR typically has release notes highlighting major and minor changes in the deployment. For help preparing this, sorting closed pull requests by "recently updated" will show all PRs merged since the last production deploy.
Deployment to staging runs via the base deployment action on GitHub, which pulls credentials from GitHub's secrets store in the staging environment.
Deployment to demo runs via the demo deployment action on GitHub, which pulls credentials from GitHub's secrets store in the demo environment.
Deployment to production runs via the production deployment action on GitHub, which pulls credentials from GitHub's secrets store in the production environment.
The action that we use deploys using a rolling strategy, so all deployments should have zero downtime.
In the event that a deployment includes a Terraform change, that change will run before any code is deployed to the environment. Each environment has its own Terraform GitHub Action to handle that change.
Failures in any of these GitHub workflows will be surfaced in the pull request related to the code change; in the case of `checks.yml`, they actively prevent the PR from being merged. Failure in the Terraform workflow will not actively prevent the PR from being merged, but reviewers should not approve a PR with a failing Terraform plan.
The API app runs in a restricted egress space. This allows direct communication to cloud.gov-brokered services, but not to other APIs that we require.
As part of the deploy, we create an egress proxy application that allows traffic out of our application to a select list of allowed domains.
Update the allowed domains by updating `deploy-config/egress_proxy/notify-api-<env>.allow.acl` and deploying an updated version of the application through the normal deploy process.
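For example, to allow a new outbound domain in staging (assuming the ACL file lists one allowed domain per line; the domain here is hypothetical):

```bash
echo "api.example.gov" >> deploy-config/egress_proxy/notify-api-staging.allow.acl
git add deploy-config/egress_proxy/notify-api-staging.allow.acl
git commit -m "Allow egress to api.example.gov"
# then deploy through the normal process
```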
For an environment variable to make its way into the cloud.gov environment, it must end up in the `manifest.yml` file. Based on the deployment approach described above, there are 2 ways for this to happen.

Because secrets are pulled from GitHub, they must be passed from our action to the deploy action and then placed into `manifest.yml`. This means that they should be in 4 places:
- The GitHub secrets store
- The deploy action, in the `env` section, using the format `{secrets.SECRET_NAME}`
- The deploy action, in the `push_arguments` section, using the format `--var SECRET_NAME="$SECRET_NAME"`
- The manifest, using the format `SECRET_NAME: ((SECRET_NAME))`
Public env vars make up the configuration in `deploy-config`. These are pulled in together by the `--vars-file` line in the deploy action. To add or update one, it should be in 2 places:

- The relevant YAML file in `deploy-config`, using the format `var_name: value`
- The manifest, using the format `((var_name))`
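A sketch of how both kinds of variables reach the app at deploy time (the app and secret names here are placeholders):

```bash
# Public vars come from the --vars-file; each secret is forwarded as its own --var flag.
# manifest.yml then references them as ((var_name)) and ((SECRET_NAME)).
cf push notify-api --vars-file deploy-config/staging.yml --var SECRET_NAME="$SECRET_NAME"
```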
In addition to the environment variable management, there may be some additional application initialization that needs to be accounted for. This can include the following:
- Setting other environment variables that directly require information about the host environment the application will run in, as opposed to being managed by the `manifest.yml` file or a user-provided service.
- Running app-initializing scripts that directly require host environment information prior to starting the application itself.
These initialization steps are taken care of in the `.profile` file, which we use to set a couple of host environment-specific environment variables.
There is a sandbox space, complete with terraform and a `deploy-config/sandbox.yml` file, available for experimenting with infrastructure changes without going through the full CI/CD cycle each time.
Rules for use:
- Ensure that no other developer is using the environment, as there is nothing stopping changes from overwriting each other.
- Clean up when you are done (see the consolidated sketch after this list):
  - `terraform destroy` from within the `terraform/sandbox` directory will take care of the provisioned services
  - Delete the apps and routes shown in `cf apps` by running `cf delete APP_NAME -r`
  - Delete the space deployer you created by following the instructions within `terraform/sandbox/secrets.auto.tfvars`
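A minimal cleanup sketch, assuming the usual sandbox app names (check `cf apps` for the actual names in your case):

```bash
cd terraform/sandbox
terraform destroy                 # tears down the provisioned services

cf delete notify-api-sandbox -r   # delete apps and their routes
cf delete notify-admin-sandbox -r
```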
If this is the first time you have used Terraform in this repository, you will first have to hook your copy of Terraform up to our remote state. Follow Retrieving existing bucket credentials.
⚓ The Admin app depends upon the API app, so set up the API first.
- Set up services (check Terraform troubleshooting if you encounter problems):

  cd terraform/sandbox
  ../create_service_account.sh -s notify-sandbox -u <your-name>-terraform -m > secrets.auto.tfvars
  terraform init
  terraform plan
  terraform apply
Note that you'll have to do this for both the API and the Admin. Once this is complete we shouldn't have to do it again (unless we're setting up a new sandbox environment).
To deploy either the API or the Admin apps to the sandbox, the process is largely the same, but the Admin requires a bit of additional work.
- Make sure you are in the API project's root directory.
- Authenticate with cloud.gov on the command line: `cf login -a api.fr.cloud.gov --sso`
- Run `./scripts/deploy_to_sandbox.sh` from the project root directory.
At this point your cloud.gov target org and space will change to the `notify-sandbox` environment and the application will be pushed for deployment.

The script does a few things to make sure the deployment flows smoothly with minimal work on your part:
- Sets the target org and space in cloud.gov for you.
- Creates a `requirements.txt` file for the Python dependencies so that the deployment picks up the dependencies properly.
- Pushes the application with the correct environment variables set, based on what is supplied by the `deploy-config/sandbox.yml` file.
- Start a poetry shell as a shortcut to load `.env` file variables by running `poetry shell`. (You'll have to restart this any time you change the file.)
- Output a requirements.txt file: `poetry export --without-hashes --format=requirements.txt > requirements.txt`
- Ensure you are using the correct Cloud Foundry target: `cf target -o gsa-tts-benefits-studio -s notify-sandbox`
- Deploy the application: `cf push --vars-file deploy-config/sandbox.yml --var NEW_RELIC_LICENSE_KEY=$NEW_RELIC_LICENSE_KEY`

  The real `push` command has more var arguments than the single one above. Get their values from a Notify team member.

- Visit the URL(s) of the app you just deployed
In Notify, several aspects of the system are loaded into the database via migration. This means that application setup requires loading and overwriting historical data in order to arrive at the current configuration.
Here are notes about what is loaded into which tables, and some plans for how we might manage that in the future.
Flask does not seem to have a great way to squash migrations, but rather wants you to recreate them from the DB structure. This means it's easy to recreate the tables, but hard to recreate the initial data.
A diagram of Notify's data model is available in our compliance repo.
Create a migration:
flask db migrate
Trim any auto-generated stuff down to what you want, and manually rename the file to be in numerical order. We should only have one migration branch.
Running migrations locally:
flask db upgrade
This should happen automatically on cloud.gov, but if you need to run a one-off migration for some reason:
cf run-task notifications-api-staging --command "flask db upgrade" --name db-upgrade
There is a Flask command to wipe user-created data (users, services, etc.).
The command should stop itself if it's run in a production environment, but, you know, please don't run it in a production environment.
Running locally:
flask command purge_functional_test_data -u <functional tests user name prefix>
Running on cloud.gov:
cf run-task notify-api --command "flask command purge_functional_test_data -u <functional tests user name prefix>"
For these, we're using Flask commands, which live in `/app/commands.py`.

This includes things that might be one-time operations! If we're running it on production, it should be a Flask command. Using a command allows the operation to be tested, both with pytest and with trial runs in staging.
To see information about available commands, you can get a list with:
poetry run flask command
Appending `--help` to any command will give you more information about parameters.
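For example:

```bash
poetry run flask command create-test-user --help
```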
To run a command on cloud.gov, use this format:
cf run-task CLOUD-GOV-APP --command "YOUR COMMAND HERE" --name YOUR-COMMAND-NAME
NOTE: Do not include `poetry run` in the command you provide for `cf run-task`! cloud.gov is already aware of the Python virtual environment and Python dependencies; it's all handled through the Python buildpack we use to deploy the application.
For example, if you want to update the templates in one of the remote environments after a change to the JSON file, you would run this:
cf run-task CLOUD-GOV-APP --command "flask command update-templates" --name YOUR-COMMAND-NAME
Here's more documentation about Cloud Foundry tasks.
(Note: to obtain the CLOUD-GOV-APP name, run `cf apps` and find the name of the app for the tier you are targeting.)
To promote a user to platform admin:

cf run-task <CLOUD_GOV_APP from cf apps, see above> --command "flask command promote-user-to-platform-admin --user-email-address="

To update templates:

cf run-task <CLOUD_GOV_APP from cf apps, see above> --command "flask command update-templates"
All commands use the `-g` or `--generate` option to determine how many instances to load into the db. The option is required and will always default to 1. An example: `flask command add-test-users-to-db -g 6` will generate 6 random users and insert them into the db.
- `add-test-organizations-to-db`
- `add-test-services-to-db`
- `add-test-jobs-to-db`
- `add-test-notifications-to-db`
- `add-test-users-to-db` (extra options include `-s` or `--state` and `-d` or `--admin`)
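A couple of illustrative runs (the counts here are arbitrary; check each command's `--help` for the options it accepts):

```bash
poetry run flask command add-test-organizations-to-db -g 2
poetry run flask command add-test-users-to-db -g 6
```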
Services used during message-send flow:
- AWS S3
- AWS SNS
- AWS Cloudwatch
- Redis
- PostgreSQL
There are several ways for notifications to come into the API.
- Messages sent through the API enter through `app/notifications/post_notifications.py`
- One-off messages and CSV uploads both enter from the UI through `app/job/rest.py:create_job`
API messages come in one at a time and end up at `persist_notification`, which writes to the database, and `provider_tasks.deliver_sms`, which enqueues the sending.

One-off messages and batch messages both upload a CSV, which is first stored in S3 and queued as a `Job`. When the job runs, it iterates through the rows from `tasks.py:process_row`, running `tasks.py:save_sms` (email notifications branch off through `tasks.py:save_email`) to write to the db with `persist_notification` and begin the process of delivering the notification to the provider through `provider_tasks.deliver_sms`. The exit point to the provider is in `send_to_providers.py:send_sms`.
Most of the API endpoints in this repo are for internal use. These are all defined within top-level folders under `app/` and tend to have the structure `app/<feature>/rest.py`.
Public APIs are intended for use by services and are all located under `app/v2/` to distinguish them from internal endpoints. Originally we did have a "v1" public API, where we tried to reuse / expose existing internal endpoints. The needs for public APIs are sufficiently different that we decided to separate them out. Any "v1" endpoints that remain are now purely internal and no longer exposed to services.
New and existing APIs should be documented within openapi.yml. Tools to help with editing this file:
Here are some pointers for how we write public API endpoints.
Example: `app/v2/inbound_sms/get_inbound_sms.py`

This helps keep the file size manageable, but does mean a bit more work to register each endpoint if we have many that are related. Note that internal endpoints are grouped differently: in large `rest.py` files.
Example:
from flask import Blueprint

from app.v2.errors import register_errors

# One blueprint per group of related endpoints, namespaced under /v2
v2_notification_blueprint = Blueprint("v2_notifications", __name__, url_prefix='/v2/notifications')

# Attach the public API error handlers (defined in app/v2/errors.py)
register_errors(v2_notification_blueprint)
Note that the error handling set up by `register_errors` (defined in `app/v2/errors.py`) for public API endpoints is different to that for internal endpoints (defined in `app/errors.py`).
Example: Ruby Client adapter to get template by ID.
All our clients should fully support all of our public APIs.
Each adapter should be documented in each client (example). We should also document each public API endpoint in our generic API docs (example). Note that internal endpoints are not documented anywhere.
This is done as part of registering the blueprint in `app/__init__.py`, e.g.
# Require auth on every request to this blueprint
post_letter.before_request(requires_auth)
application.register_blueprint(post_letter)
To make life easier, the UK API client libraries are compatible with Notify and the UK API Documentation is applicable.
For a usage example, see our Python demo.
An API key can be created at https://HOSTNAME/services/YOUR_SERVICE_ID/api/keys. This is the same API key that is referenced as `USER_API_TOKEN` below.
Internal-only documentation for exploring the API using Postman
An OpenAPI specification file can be found at https://notify-staging.app.cloud.gov/docs/openapi.yml.
See writing-public-apis.md for links to tools to make it easier to use the OpenAPI spec within VSCode.
On a mac, run:
The admin UI token is required for any of the `internal-api`-tagged methods. To create one and copy it to your pasteboard, run:

flask command create-admin-jwt | tail -n 1 | pbcopy
A user token is required for any of the `external-api`-tagged methods. To create one and copy it to your pasteboard, run:

flask command create-user-jwt --token=<USER_API_TOKEN> | tail -n 1 | pbcopy
Because JWT tokens expire so quickly, the development server can be set to allow tokens older than 30 seconds:
env ALLOW_EXPIRED_API_TOKEN=1 make run-flask
The API puts tasks into Celery queues for dispatch.
There are a bunch of queues:
- priority tasks
- database tasks
- send sms tasks
- send email tasks
- research mode tasks
- reporting tasks
- job tasks
- retry tasks
- notify internal tasks
- service callbacks
- service callbacks retry
- letter tasks
- sms callbacks
- antivirus tasks
- save api email tasks
- save api sms tasks
And these tasks:
- check for missing rows in completed jobs
- check for services with high failure rates or sending to tv numbers
- check if letters still in created
- check if letters still pending virus check
- check job status
- create fake letter response file
- create nightly billing
- create nightly billing for day
- create nightly notification status
- create nightly notification status for service and day
- delete email notifications
- delete inbound sms
- delete invitations
- delete letter notifications
- delete notifications for service and type
- delete notifications older than retention
- delete sms notifications
- delete verify codes
- deliver email
- deliver sms
- process incomplete jobs
- process job
- process returned letters list
- process ses result
- process virus scan error
- process virus scan failed
- raise alert if letter notifications still sending
- raise alert if no letter ack file
- record daily sorted counts
- remove letter jobs
- remove sms email jobs
- replay created notifications
- run scheduled jobs
- save api email
- save api sms
- save daily notification processing time
- save email
- save letter
- save sms
- send complaint
- send delivery status
- send inbound sms
- switch current sms provider on slow delivery
- tend providers back to middle
- timeout sending notifications
- update billable units for letter
- update letter notifications statuses
- update letter notifications to error
- update letter notifications to sent
- update validation failed for templated letter
For tasks that should happen before other stuff, there's a priority queue. Platform admins can set templates to use this queue.
Currently, this queue doesn't do anything special. If the normal queue is very busy, it's possible that this queue will be faster merely because it's shorter. By the same logic, a busy priority queue is likely to be slower than the normal queue.
After scheduling some tasks, run celery beat to get them moving:
make run-celery-beat
Notify.gov is a service being developed by the TTS Public Benefits Studio to increase the availability of SMS and email notifications to Federal, State, and Local Benefits agencies.
Agencies that sign up will be able to create and use personalized message templates for sending notifications to members of the public regarding their benefits. These could include reminders about upcoming enrollment deadlines and tasks, or information about upcoming appointments, events, or services.
The templates are sent by the agency using one of two methods:
- using the Notify.gov API to send a message to a given recipient with given personalization values
- using the Notify.gov website to upload a CSV file of recipients and their personalization values, one row per message
Notify.gov comprises two applications, both running on cloud.gov:
- Admin, a Flask website running on the python_buildpack which hosts agency user-facing UI
- API, a Flask application running on the python_buildpack hosting the Notify.gov API
Notify.gov utilizes several cloud.gov-provided services:
- S3 buckets for temporary file storage
- Elasticache (Redis) for caching data and enqueueing background tasks
- RDS (PostgreSQL) for system data storage
Notify.gov also provisions and uses two AWS services via a supplemental service broker:

- SNS (SMS sending)
- SES (email sending)

For further details of the system and how it connects to supporting services, see the application boundary diagram.
Changes are made to our applications via pull requests, which show a diff (the before and after state of all proposed changes in the code) of the work done for that particular branch. We use pull requests as the basis for working on Notify.gov and modifying the application over time for improvements, bug fixes, new features, and more.
There are several things that make for a good and complete pull request:
- An appropriate and descriptive title
- A detailed description of what's being changed, including any outstanding work (TODOs)
- A list of security considerations, which contains information about anything we need to be mindful of from a security compliance perspective
- The proper labels, assignee, code reviewer, and other project metadata set
When you first open a pull request, start off by making sure the metadata for it is in place:
- Provide an appropriate and descriptive title for the pull request
- Link the pull request to its corresponding issue (must be done after creating the pull request itself)
- Assign yourself as the author
- Attach the appropriate labels to it
- Set it to be on the Notify.gov project board
- Select one or more reviewers from the team, or mark the pull request as a draft depending on its current state
- If the pull request is a draft, please be sure to add reviewers and mark it ready for review once it is ready
Please enter a clear description of your proposed changes and what the expected outcome(s) is/are. If there are complex implementation details within the changes, this is a great place to explain those details using plain language.
This should include:
- Links to issues that this PR addresses (especially if more than one)
- Screenshots or screen captures of any visible changes, especially for UI work
- Dependency changes
If there are any caveats, known issues, follow-up items, etc., make a quick note of them here as well, though more details are probably warranted in the issue itself in this case.
If you're opening a draft PR, it might be helpful to list any outstanding work, especially if you're asking folks to take a look before it's ready for full review. In this case, create a small checklist with the outstanding items:
- TODO item 1
- TODO item 2
- TODO item ...
Please think about the security compliance aspect of your changes and what the potential impacts might be.
NOTE: Please be mindful of sharing sensitive information here! If you're not sure of what to write, please ask the team first before writing anything here.
Relevant details could include (and are not limited to) the following:
- Handling secrets/credential management (or specifically calling out that there is nothing to handle)
- Any adjustments to the flow of data in and out the system, or even within it
- Connecting or disconnecting any external services to the application
- Handling of any sensitive information, such as PII
- Handling of information within log statements or other application monitoring services/hooks
- The inclusion of a new external dependency or the removal of an existing one
- ... (anything else relevant from a security compliance perspective)
There are some cases where there are no security considerations to be had, e.g., updating our documentation with publicly available information. In those cases it is fine to simply put something like this:
- None; this is a documentation update with publicly available information.
This way it shows that we still gave this section consideration and that nothing happens to apply in this scenario.
When conducting a code review there are several things to keep in mind to ensure a quality and valuable review. Remember, we're trying to improve Notify.gov as best we can; it does us no good if we do not double-check that our work meets our standards, especially before it goes out the door!

It also does us no good if we do not treat each other with mutual respect and consideration; if there are mistakes or oversights found in a pull request, or even just suggestions for alternative ways of approaching something, these become learning opportunities for all parties involved, in addition to modeling positive behavior and practices for the public and broader open source community.
Given this basis of approaching code reviews, here are some general guidelines and suggestions for how to approach a code review from the perspectives of both the reviewer and the author.
When performing a code review, please be curious and critical while also being respectful and appreciative of the work submitted. Code reviews are a chance to check that things meet our standards and provide learning opportunities. They are not places for belittling or disparaging someone's work or approach to a task, and absolutely not the person(s) themselves.
That said, any responses to the code review should also be respectful and considerate. Remember, this is a chance to not only improve our work and the state of Notify.gov, it's also a chance to learn something new!
Note: If a response is condescending, derogatory, disrespectful, etc., please do not hesitate to either speak with the author(s) directly about this or reach out to a team lead/supervisor for additional help to rectify the issue. Such behavior and lack of professionalism is not acceptable or tolerated.
When performing a code review, it is helpful to keep the following guidelines in mind:
- Be on the lookout for any sensitive information and/or leaked credentials, secrets, PII, etc.
- Ask and call out things that aren't clear to you; it never hurts to double check your understanding of something!
- Check that things are named descriptively and appropriately and call out anything that is not.
- Check that comments are present for complex areas when needed.
- Make sure the pull request itself is properly prepared - it has a clear description, calls out security concerns, and has the necessary labels, flags, issue link, etc., set on it.
- Do not be shy about using the suggested changes feature in GitHub pull request comments; this can help save a lot of time!
- Do not be shy about marking a review with the
Request Changes
status - yes, it looks big and red when it shows up, but this is completely fine and not to be taken as a personal mark against the author(s) of the pull request!
Additionally, if you find yourself making a lot of comments and/or end up having several concerns about the overall approach, it will likely be helpful to schedule time to speak with the author(s) directly and talk through everything. This can save folks a lot of misunderstanding and back-and-forth!
When receiving a code review, please remember that someone took the time to look over all of your work with a critical eye to make sure our standards are being met and that we're producing the best quality work possible. It's completely fine if there are specific changes requested and/or other parts are sent back for additional work!
That said, the review should also be respectful, helpful, and a learning opportunity where possible. Remember, this is a chance to not only improve your work and the state of Notify.gov, it's also a chance to learn something new!
Note: If a review is condescending, derogatory, disrespectful, etc., please do not hesitate to either speak with the reviewer(s) directly about this or reach out to a team lead/supervisor for additional help to rectify the issue. Such behavior and lack of professionalism is not acceptable or tolerated.
When going over a review, it may be helpful to keep these perspectives in mind:
- Approach the review with an open mind, curiosity, and appreciation.
- If anything the reviewer(s) mentions is unclear to you, please ask for clarification and engage them in further dialogue!
- If you disagree with a suggestion or request, please say so and engage in an open and respectful dialogue to come to a mutual understanding of what the appropriate next step(s) should be - accept the change, reject the change, take a different path entirely, etc.
- If there are no issues with any suggested edits or requested changes, make the necessary adjustments and let the reviewer(s) know when the work is ready for review again.
Additionally, if you find yourself responding to a lot of things and questioning the feedback received throughout much of the code review, it will likely be helpful to schedule time to speak with the reviewer(s) directly and talk through everything. This can save folks a lot of misunderstanding and back-and-forth!
Policies and Procedures needed before and during Notify.gov Operations. Many of these policies are taken from the Notify.gov System Security & Privacy Plan (SSPP).
Any changes to policies and procedures defined both here and in the SSPP must be kept in sync, and should be done collaboratively with the System ISSO and ISSM to ensure that the security of the system is maintained.
- Alerts, Notifications, Monitoring
- Restaging Apps
- Deploying to Production
- Smoke-testing the App
- Simulated bulk send testing
- Configuration Management
- DNS Changes
- Known Gotchas
- User Account Management
- SMS Phone Number Management
Operational alerts are posted to the #pb-notify-alerts Slack channel. Please join this channel and enable push notifications for all messages whenever you are on call.
NewRelic is being used for monitoring the application. NewRelic Dashboard can be filtered by environment and API, Admin, or Both.
Cloud.gov Logging is used to view and search application and platform logs.
In addition to the application logs, there are several tables in the application that store useful information for audit logging purposes:
- the `events` table
- the various `*_history` tables
Our apps must be restaged whenever cloud.gov releases updates to buildpacks. Cloud.gov will send email notifications whenever buildpack updates affect a deployed app.
Restaging the apps rebuilds them with the new buildpack, enabling us to take advantage of whatever bugfixes or security updates are present in the new buildpack.
There are two GitHub Actions that automate this process. Each are run manually and must be run once for each environment to enable testing any changes in staging before running within demo and production environments.
When `notify-api-<env>`, `notify-admin-<env>`, `egress-proxy-notify-api-<env>`, and/or `egress-proxy-notify-admin-<env>` need to be restaged:

- Navigate to the Restage apps GitHub Action
- Click the `Run workflow` button to open a popup
- Leave `Use workflow from` on its default of `Branch: main`
- Select the environment you need to restage from the dropdown
- Click `Run workflow` within the popup
- Repeat for other environments
When `ssb-sms` and/or `ssb-smtp` need to be restaged:

- Navigate to the SSB Restage apps GitHub Action
- Click the `Run workflow` button to open a popup
- Leave `Use workflow from` on its default of `Branch: main`
- Select the environment (either `staging` or `production`) you need to restage from the dropdown
- Click `Run workflow` within the popup
- Repeat for other environments
When `ssb-devel-sms` and/or `ssb-devel-smtp` need to be restaged:

- Navigate to the SSB Restage apps GitHub Action
- Click the `Run workflow` button to open a popup
- Leave `Use workflow from` on its default of `Branch: main`
- Select the `development` environment from the dropdown
- Click `Run workflow` within the popup
Deploying to production involves 3 steps that must be done in order, and can be done for just the API, just the Admin, or both at the same time:
- Create a new pull request in GitHub that merges the `main` branch into the `production` branch; be sure to provide details about what is in the release!
- Create a new release tag and generate release notes; publish it with the `Set as a pre-release` checkbox checked at first, then update it to the latest release after the deploy is finished and successful.
- Review and approve the pull request(s) for the production deployment.
Additionally, you may have to monitor the GitHub Actions as they take place to troubleshoot and/or re-run failed jobs.
This is done entirely in GitHub. First, go to the pull requests section of the API and/or Admin repository, then click on the `New pull request` button.

In the screen that appears, change the `base: main` target branch on the left side of the arrow to `base: production` instead. You want to merge all of the latest changes in `main` to the `production` branch. After you've made the switch, click on the `Create pull request` button.
When the pull request details page appears, you'll need to set a few things:

- Title: `<current date> Production Deploy`, e.g., `9/9/2024 Production Deploy`
- Description: feel free to copy from a previous production deploy PR; note that you'll have to change the links to the release notes if applicable!
- Labels: `Engineering`
- Author: set to yourself
- Reviewers: assign folks or the @notify-contributors team

Please link it to the project board as well, then click on the `Create pull request` button to finalize it all.
On the main page of the repository, click on the small heading that says `Releases` on the right to get to the release listing page. Once there, click on the `Draft a new release` button.

You'll first have to choose a tag or create a new one: use the current date as the tag name, e.g., `9/9/2024`. Keep the target set to `main` and then click on the `Generate release notes` button.

Add a title in the format of `<current date> Production Deploy`, e.g., `9/9/2024 Production Deploy`.

Lastly, uncheck the `Set as the latest release` checkbox and check the `Set as a pre-release` checkbox instead.

Once everything is complete, click on the `Publish release` button and then link to the new release notes in the corresponding production deploy pull request.
When everything is good to go, two people will need to approve the pull request for merging into the `production` branch. Once they do, merge the pull request.
At this point everything is mostly automatic. The deploy will update both the `demo` and `production` environments. Once the deploys are done and successful, go back into the pre-release release notes, switch the checkboxes to turn it into the latest release, and save the change.
Sometimes a deploy will fail, and you will have to look at the GitHub Action deployment logs to see the cause. In many cases it will be an out-of-memory error caused by the two environments deploying at the same time. Whenever the successful deploy is finished, re-run the failed jobs in the other deployment action.
Once the deploys are finished it's also a good idea to just poke around the site to make sure things are working fine and as expected!
To ensure that notifications are passing through the application properly, the following steps can be taken to ensure all parts are operating correctly:
- Send yourself a password reset email. This will verify SES integration. The email can be deleted once received if you don't wish to change your password.
- Log into the app. This will verify SNS integration for a one-off message.
- Upload a CSV and schedule send for the soonest time after "Now". This will verify S3 connections as well as scheduler and worker processes are running properly.
Assuming that you have followed all steps to set up localstack successfully (see docs/localstack.md), do the following:
- Create an SMS template that requires no inputs from the user (i.e., the csv file will only have phone numbers)
- Uncomment the test `test_generate_csv_for_bulk_testing` in `app/test_utils.py`
- Run `make test` on this project. This will generate the csv file for the bulk test.
- If you are not a platform admin for your service when you run locally, do the following:
  - `psql -d notification_api`
  - `update users set platform_admin='t';`
  - `\q`
  - Sign out
  - Sign in
  - Go to settings and set the organization for your service to 'Broadcast services' (scroll down to platform admin)
  - Go to settings and set your service to 'live' (scroll down to platform admin)
- Run your app 'locally', i.e., run `make run-procfile` on this project and `make run-flask` on the admin project
- Sign in. Verify you are running with localstack, i.e., you do NOT receive a text message on sign-in; instead, you see your authentication code in green in the api logs
- Go to send messages, upload your csv file, and send your 100,000 messages
Also known as: How to move code from my machine to production
- All changes must be made in a feature branch and opened as a PR targeting the `main` branch.
- All PRs must be approved by another developer
- PRs to the `main` and `production` branches must be merged by someone with the `Administrator` role.
role. - PR documentation includes a Security Impact Analysis
- PRs that will impact the Security Posture must be approved by the Notify.gov ISSO.
- Any PRs waiting for approval should be talked about during daily Standup meetings.
- Changes are deployed to the `staging` environment after a successful `checks.yml` run on the `main` branch. Branch Protections prevent pushing directly to `main`.
- Changes are deployed to the `demo` and `production` environments after merging `main` into `production`. Branch Protections prevent pushing directly to `production`.
- Changes are deployed to the `staging` and `production` environments after merging to the `main` branch. The `staging` deployment must be successful before `production` is attempted. Branch Protections prevent pushing directly to `main`.
- A new release is created by pushing a tag to the repository on the `main` branch.
- To include the new version in released SSB code, create a PR in the `usnotify-ssb` repo updating the version in use in `app-setup-sms.sh`
- To include new versions of the SMTP brokerpak in released SSB code, create a PR in the `usnotify-ssb` repo updating the version in use in `app-setup-smtp.sh`
US_Notify Administrators are responsible for ensuring that remediations for vulnerabilities are implemented. Response times vary based on the level of vulnerability as follows:
- Critical (Very High) - 15 days
- High - 30 days
- Medium - 90 days
- Low - 180 days
- Informational - 365 days (depending on the analysis of the issue)
Notify.gov DNS records are maintained within the 18f/dns repository. To create new DNS records for notify.gov or any subdomains:
- Update the `notify.gov.tf` terraform to update or create the new records within Route53 and push the branch to the 18f/dns repository.
- Open a PR.
- Verify that the plan output within CircleCI creates the records that you expect.
- Request a PR review from the 18F/tts-tech-portfolio team
- Once the PR is approved and merged, verify that the apply step happened correctly within CircleCI
- Head to https://github.com/GSA/notifications-api/actions/workflows/daily_checks.yml
- Open the most recent scan (it should be today's)
- Scroll down to "Artifacts", click to download the .zip of OWASP ZAP results
- Rename it to `api_zap_scan_DATE.zip` and add it to 🔒 https://drive.google.com/drive/folders/1CFO-hFf9UjzU2JsZxdZeGRfw-a47u7e1
- Click any of the jobs to open the logs
- In the top right of the logs, click the gear icon
- Select "Download log archive" to download a .zip of the test output for all jobs
- Rename it to `api_static_scan_DATE.zip` and add it to 🔒 https://drive.google.com/drive/folders/1dSe9H7Ag_hLfi5hmQDB2ktWaDwWSf4_R
- Repeat for https://github.com/GSA/notifications-admin/actions/workflows/daily_checks.yml
- Start the API locally: `make run-procfile`
- In a separate terminal tab, navigate to the API project and run `poetry run flask command generate-salt`
- A random secret will appear in the tab
- Go to GitHub → Settings → Secrets and variables → Actions in the admin project and find the DANGEROUS_SALT secret for staging. Open it, paste in the result of step 3, and save. Repeat for the API project, for staging.
- Repeat steps 3 and 4 for demo
- Repeat steps 3 and 4 for production
The important thing is to use the same secret for Admin and API on each tier; i.e., you only generate three secrets in total.
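In other words, a rotation session might look like this, with each invocation producing the value for one tier, shared between Admin and API:

```bash
poetry run flask command generate-salt   # -> DANGEROUS_SALT for staging
poetry run flask command generate-salt   # -> DANGEROUS_SALT for demo
poetry run flask command generate-salt   # -> DANGEROUS_SALT for production
```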
- Problem: Creating or deleting service keys is failing. SSB logs reference failing to verify a certificate: valid for `GUID A` but not for `GUID B`.
  - Solution: Restage the SSB apps using the restage apps action.
- Problem: When deploying a new environment, a race condition prevents SNS topic subscriptions from being successfully verified on the AWS side.
  - Solution: Manually re-request subscription confirmation from the AWS Console.
Important policies:
- Infrastructure Accounts and Application Platform Administrators must be approved by the System Owner (Amy) before creation, but people with the `Administrator` role can actually do the creation and role assignments.
- At least one agency partner must act as the `User Manager` for their service, with permissions to manage their team according to their agency's policies and procedures.
- All users must utilize `.gov` email addresses.
- SpaceDeployer credentials must be rotated within 14 days of anyone with SpaceDeveloper cloud.gov access leaving the team.
- A user report must be created annually (see AC-2(j)). `make cloudgov-user-report` can be used to create a full report of all cloud.gov users.
| Role Name | System | Permissions | Who | Responsibilities |
|---|---|---|---|---|
| Administrator | GitHub | Admin | PBS Fed | Approve & merge PRs into `main` and `production` |
| Administrator | AWS | `NotifyAdministrators` IAM UserGroup | PBS Fed | Read audit logs, verify & fix any AWS service issues within the production AWS account |
| Administrator | Cloud.gov | `OrgManager` | PBS Fed | Manage cloud.gov roles and permissions. Access to production spaces |
| DevOps Engineer | Cloud.gov | `SpaceManager` | PBS Fed or Contractor | Access to non-production spaces |
| DevOps Engineer | AWS | `NotifyAdministrators` IAM UserGroup | PBS Fed or Contractor | Access to non-production AWS accounts to verify & fix any AWS issues in the lower environments |
| Engineer | GitHub | Write | PBS Fed or Contractor | Write code & issues, submit PRs |
| Role Name | Permissions | Who | Responsibilities |
|---|---|---|---|
| Platform Administrator | `platform_admin` | PBS Fed | Administer system settings within Notify.gov across Services |
| User Manager | `MANAGE_USERS` | Agency Partner | Manage service team members |
| User | any except `MANAGE_USERS` | Agency Partner | Use Notify.gov |
| Role Name | System | Permissions | Notes |
|---|---|---|---|
| Cloud.gov Service Account | Cloud.gov | `OrgManager` and `SpaceDeveloper` | Creds stored in GitHub Environment secrets within the api and admin app repos |
| SSB Deployment Account | AWS | `IAMFullAccess` | Creds stored in GitHub Environment secrets within the usnotify-ssb repo |
| SSB Cloud.gov Service Account | Cloud.gov | `SpaceDeveloper` | Creds stored in GitHub Environment secrets within the usnotify-ssb repo |
| SSB AWS Accounts | AWS | `sms_broker` or `smtp_broker` IAM role | Creds created and maintained by usnotify-ssb terraform |
See Infrastructure Overview for information about SMS phone numbers in AWS.
Once you have a number, it must be set in the app in one of two ways:
- For the default phone number, to be used by Notify itself for OTP codes and as the default from number for services, set the phone number as the `AWS_US_TOLL_FREE_NUMBER` ENV variable in the environment you are creating
- For service-specific phone numbers, set the phone number in the Service's `Text message senders` in the settings tab.
- +18447952263 - in use as default number. Notify's OTP messages and trial service messages are sent from this number (Also the number for the live service: Federal Test Service)
- +18447891134 - Montgomery County / Ride On
- +18888402596 - Norfolk / DHS
- +18555317292 - Washington State / DHS
- +18889046435 - State Department / Consular Affairs
- +18447342791
- +18447525067
- +18336917230
- +18335951552
- +18333792033
- +18338010522
For a full list of phone numbers in trial and production, team members can access a tracking list here.
- name
- email_address
- mobile_number
- email_address
- email_address
No db data is PII, but each job has a csv file in s3 containing phone numbers and personalization data.
- to
- normalized_to
- _personalization [2]
- phone_prefix [3]
- phone_prefix [3]
- content [2]
- user_number
- data (contains user IP addresses) [1]

[1] Users and invited users are Federal, State, or Local government employees or contractors. Members of the general public are not users of the system.
[2] Field-level encryption is used on these fields. Details on encryption schemes and algorithms can be found in SC-28(1).
[3] Probably not PII; this is the country code of the phone.
Seven (7) days by default. Each service can be set with a custom policy via `ServiceDataRetention` by a Platform Admin. The `ServiceDataRetention` setting applies per-service and per-message type, and controls both entries in the `notifications` table as well as csv contact files uploaded to s3.

Data cleanup is controlled by several tasks in the `nightly_tasks.py` file, kicked off by Celery Beat.
Ask the user to provide the csv file name: either the csv file they uploaded, or the one that is autogenerated when they do a one-off send and is visible in the UI.
Starting with the admin logs, search for this file name. When you find it, the log line should have the file name linked to the job_id and the csv file location. Save both of these.
In the api logs, search by job_id. Either you will see evidence of the job failing and retrying over and over (in which case search for a stack trace using timestamp), or you will ultimately get to a log line that links the job_id to a message_id. In this case, now search by message_id. You should be able to find the actual result from AWS, either success or failure, with hopefully some helpful info.
If you need to view the questionable csv file on production, run the following command:
cf run-task notify-api-production --command "flask command download-csv-file-by-name -f <file location found in admin logs>"
Locally, just do:

poetry run flask command download-csv-file-by-name <file location in admin logs>
- Either send a message and capture the csv file name, or get a csv file name from a user
- Using the log tool at logs.fr.cloud.gov, use filters to limit what you're searching on (cf.app is 'notify-admin-production' for example) and then search with the csv file name in double quotes over the relevant time period (last 5 minutes if you just sent a message, or else whatever time the user sent at)
- When you find the log line, you should also find the job_id and the s3 file location. Save these somewhere.
- To get the csv file contents, you can run the command above. This command currently prints to the notify-api log, so after you run the command, you need to search in notify-api-production for the last 5 minutes with the logs sorted by timestamp. The contents of the csv file unfortunately appear on separate lines so it's very important to sort by time.
- If you want to see where the message actually failed, search with cf.app is notify-api-production using the job_id that you saved in step #3. If you get far enough, you might see one of the log lines has a message_id. If you see it, you can switch and search on that, which should tell you what happened in AWS (success or failure).
During `cf push` you may see:

For application 'notify-api-sandbox': Routes cannot be mapped to destinations in different spaces
👻 This indicates a ghost route squatting on a route you need to create. In the cloud.gov web interface, check for incomplete deployments. They might be holding on to a route. Delete them. Also, check the list of routes (from the CloudFoundry icon in the left sidebar) for routes without an associated app. If they look like a route your app would need to create, delete them.
After pushing the Admin app, you might see this in the logs:

{"name": "app", "levelname": "ERROR", "message": "API unknown failed with status 503 message Request failed", "pathname": "/home/vcap/app/app/__init__.py", ...
And you would also see this in the Admin web UI:

Sorry, we can't deliver what you asked for right now.
This indicates that the Admin and API apps are unable to talk to each other because of either a missing route or a missing network policy. The apps require container-to-container networking to communicate. List `cf network-policies`; you should see one connecting API and Admin on port 61443. If not, you can create one manually:
cf add-network-policy notify-admin-sandbox notify-api-sandbox --protocol tcp --port 61443
This error, encountered after `cf push`, indicates you may be using the wrong Cloud Foundry target:

For application 'notify-api-sandbox': Service instance 'notify-api-rds-sandbox' not found

Run `cf target -o gsa-tts-benefits-studio -s notify-sandbox` before pushing to the sandbox.
Note: better to search on space 'notify-production' rather than specifically for admin or api
- #notify-admin-1200 (job cache regeneration)
- #notify-admin-1505 (general login issues)
- #notify-admin-1701 (wrong sender phone number)
- #notify-admin-1859 (job is created with created_at being the wrong time)