split log-cache from doppler, use syslog ingress #949
Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀
Making this change for a few reasons:
- The scaling needs of dopplers and log-cache are often different, so grouping them together can be problematic. Dopplers are limited to ~40 instances and some high traffic foundations need larger log-cache instance groups.
- Syslog ingress eliminates the load on dopplers and traffic controllers to get envelopes to log-cache. This increases the load slightly on diego cells, and eliminates significant load on dopplers/tc's.

It's recommended after deploying this change to evaluate the memory allocated to doppler nodes and switch them to compute heavy instances and deploy log-cache to high memory instances.
They didn't seem to be used and would need to be updated to work with the separate log cache instance group.
Thanks @mkocher & @rroberts2222. This looks good overall. I just had one change to request regarding the deprecated experimental ops-files.
operations/experimental/use-logcache-syslog-ingress-windows2019.yml
We had made these ops files no-ops in an earlier commit; here we are removing them.
LGTM
Noticed that cc defaults to using doppler as the stats server, which is breaking stats. We'll push a fix.
Some other operators may not want to move to the syslog ingress model, and to accommodate them it is probably a good idea to include some ops files that would restore the previous log-cache-nozzle / RLP ingress.
Need to update the ops files tests
…pler

As part of the latest cf-deployment, [log-cache is no longer nested under doppler](cloudfoundry/cf-deployment#949)

Pipeline failure caused by the change:
```
operation [0] in ops-files/log-cache-reduce-memory.yml failed': Expected to find exactly one matching array item for path '/instance_groups/name=doppler/jobs/name=log-cache' but found 0
```
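The failure above comes from an ops-file path that still points at the doppler instance group. As a rough illustration of the kind of path change needed, here is a minimal sketch that assumes the ops-file has already been parsed into a list of dicts. `repoint_log_cache_paths` is a hypothetical helper for illustration, not part of cf-deployment or the BOSH CLI.

```python
# Path of the log-cache job before and after the split described in #949.
OLD_PREFIX = "/instance_groups/name=doppler/jobs/name=log-cache"
NEW_PREFIX = "/instance_groups/name=log-cache/jobs/name=log-cache"


def repoint_log_cache_paths(ops):
    """Return a copy of the ops list with log-cache job paths moved
    from the doppler instance group to the log-cache instance group."""
    updated = []
    for op in ops:
        op = dict(op)  # shallow copy so the caller's list is not mutated
        path = op.get("path", "")
        if path.startswith(OLD_PREFIX):
            remainder = path[len(OLD_PREFIX):]
            # Only rewrite the log-cache job itself, not other jobs whose
            # names merely start with "log-cache" (e.g. log-cache-nozzle).
            if remainder == "" or remainder[0] in "/?":
                op["path"] = NEW_PREFIX + remainder
        updated.append(op)
    return updated
```

Paths for other jobs on the doppler instance group are left untouched, so any that also moved in the split would still need to be reviewed by hand.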
- In #949 Log Cache was split out from the doppler instance group to its own log-cache instance group
- Log Cache was also configured to use syslog ingress by default, rather than the previous behaviour which was to use the Reverse Log Proxy
- Operators who had previously used the experimental ops-file to opt into syslog ingress (operations/experimental/use-logcache-syslog-ingress.yml) would already have had the `log_cache_syslog_tls` credential in their CredHub
- When these operators attempted to upgrade to v18.0.0 the certificate was not re-generated by default, leading to a mismatch between the new service name and the existing certificate
- Specify `update_mode: converge` so that the certificate is re-generated and the syslog agent will be able to send logs to the log cache syslog server

Fixes:
```
failed to write to log-cache.service.cf.internal:6067, retrying in 8.192s, err: x509: certificate is valid for q-s3.doppler.default.cf.bosh, doppler.service.cf.internal, not log-cache.service.cf.internal
```
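The mismatch in that error can be illustrated with a small check of a hostname against a certificate's subject alternative names. This is a simplified sketch of the matching rule (exact match, or a wildcard covering one leftmost label), not the actual x509 verification code used by the syslog agent. `cert_covers_host` is a hypothetical name, and the SAN list is copied from the error message above.

```python
def cert_covers_host(sans, host):
    """Return True if any SAN matches host exactly, or via a leading
    wildcard that stands in for exactly one leftmost DNS label."""
    host = host.lower()
    for san in sans:
        san = san.lower()
        if san == host:
            return True
        if san.startswith("*."):
            # Drop the leftmost label of the host and compare the rest.
            _, _, rest = host.partition(".")
            if rest and rest == san[2:]:
                return True
    return False


# SANs on the certificate that was not re-generated during the upgrade.
old_sans = ["q-s3.doppler.default.cf.bosh", "doppler.service.cf.internal"]
new_service_name = "log-cache.service.cf.internal"
```

With these values, `cert_covers_host(old_sans, new_service_name)` is false, which is why the certificate has to be re-generated to include the new service name.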
This PR can cause significant log downtime if operators are not prepared for it when they first upgrade to a release it's shipped in (v18.0.0+). There has been a lot of conversation in Slack about this, so I wanted to repost some of it here for posterity. In order to minimize log downtime, operators should deploy twice: once to make the new log cache service alias available to every VM, and again to actually deploy the change. The deployment order should be something along the lines of:
To be safe, it's recommended to initially scale your new Log Cache up to the same instance count as the Doppler VMs that you previously had, and then to look at metrics after the deploy in order to scale down both your Log Cache and your Doppler footprint.
Note that with this change, the syslog agents running on the diego cells and other VMs need to be able to talk to the Log Cache Syslog Server on port 6067. Operators that are running diego cells within isolation segments may have to adjust their firewall rules.
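One way to sanity-check the firewall rules is a plain TCP connection test from a diego cell. The sketch below assumes Python is available on the VM, and `can_reach` is a hypothetical helper name. It only confirms that the port accepts connections and does not exercise the TLS handshake.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, and False on refusal, timeout, or DNS failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False
```

For example, `can_reach("log-cache.service.cf.internal", 6067)` run from a cell in an isolation segment should return `True` once the firewall rules allow the traffic.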
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using syslog-ingress.

Context: [cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)
Making this change for a few reasons:
- The scaling needs of dopplers and log-cache are often different, so grouping them together can be problematic. Dopplers are limited to ~40 instances and some high traffic foundations need larger log-cache instance groups.
- Syslog ingress eliminates the load on dopplers and traffic controllers to get envelopes to log-cache. This increases the load slightly on diego cells, and eliminates significant load on dopplers/tc's.

It's recommended after deploying this change to evaluate the memory allocated to doppler nodes and switch them to compute heavy instances and deploy log-cache to high memory instances.
Please take a moment to review the questions before submitting the PR
Has a cf-deployment including this change passed cf-acceptance-tests?
Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.
How should this change be described in cf-deployment release notes?
Log Cache is now deployed separately from Doppler on its own instance group. Operators should consider scaling the memory on Doppler nodes down and using high memory Log Cache nodes. Operators should notice reduced CPU usage on Doppler & Traffic Controller and a slight increase in CPU usage by Syslog Forwarder.
Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?
Does this PR make a change to an experimental or GA'd feature/component?
Please provide Acceptance Criteria for this change?
- `bosh vms` should show a log-cache instance group.
- `bosh ssh log-cache` should be a vm running log-cache and assorted processes.
- `bosh ssh doppler` should show a vm not running log cache when executing `sudo monit summary`.
What is the level of urgency for publishing this change?
Tag your pair, your PM, and/or team!
@rroberts2222