Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically create partitions for alert executions #398

Merged
merged 36 commits into from
Jun 20, 2020

Conversation

cwbriones
Copy link
Contributor

@cwbriones cwbriones commented Jun 8, 2020

Introduce a DailyPartition manager actor that runs within the DatabaseAlertExecutionRepository, periodically attempting to create partitions for the alert executions table that was added in #380. Notably I've introduced another EbeanServer which uses the DDL role (metrics_dba), since that Role is necessary to create tables.

I've included unit tests for the actor while the DB testing is covered by the existing integration tests.

@BrandonArp
Copy link
Member

Have we setup the Akka pools for these actors? You're updating connection pool sizes, but if the Akka actor's thread pools aren't big enough (they usually assume non-blocking), then you're gonna block before the connection open.

@cwbriones
Copy link
Contributor Author

Is the concern because I block on actor start whereas we ordinarily would let that happen asynchronously? I'm not sure I entirely understand the issue.

@cwbriones cwbriones marked this pull request as ready for review June 12, 2020 01:21
Copy link
Contributor Author

@cwbriones cwbriones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional context for reviewers.

@@ -1165,7 +1165,6 @@
<groupId>org.flywaydb</groupId>
<artifactId>flyway-play_${scala.package.version}</artifactId>
<version>${flyway.play.version}</version>
<scope>runtime</scope>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope changed because I need to block on the flyway-play PlayInitializer from within MainModule to guarantee that migrations have run.

@@ -27,7 +27,7 @@ ALTER ROLE metrics_app WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLIC

CREATE ROLE metrics_dba LOGIN;
ALTER ROLE metrics_dba WITH PASSWORD 'metrics_dba_password';
ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 6;
ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 15;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 15:

I bumped the connection pool to be the same size as what we have for metrics_dml, and set the size to be the same in tests. That is 5 + 5 = 10 connections. There's also an additional connection flyway creates with this role. That's a total of 11. I then just bumped it to 15.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Flyway creates (for some reason) 2 connections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any change required here then? that would be 12 which is still below the 15 limit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, just trying to spread the knowledge.

-- Table - Text - The name of the parent table.
-- Start - Date - The beginning date of the time range, inclusive.
-- End - Date - The end date of the time range, exclusive.
CREATE OR REPLACE FUNCTION create_daily_partition(schema TEXT, tablename TEXT, start_date DATE, end_date DATE)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content is exactly the same as the previous function, just formatted in a nicer way and with an added schema parameter

@cwbriones cwbriones changed the title Create partitions Periodically create partitions for alert executions Jun 12, 2020
@BrandonArp
Copy link
Member

The concern regarding thread pools is because you seem to need to adjust the connection pools. If we're now seeing enough database traffic to warrant tuning the connection pools, we'll also need to tweak the thread pools (requests are served on Akka actors). It hadn't been a problem up until now so we've just kinda ignored it. But it seemed like you were hitting some limitation. And I just wanted to call out that just bumping the connection pool size would likely be insufficient.

@vjkoskela
Copy link
Member

I believe the issue here is that the code is now using the DDL connection pool to manage the partition tables. Previously, this connection pool was only used by Flyway and tuned to prevent accidental usage (by limiting its size). Now that we are intentionally using this pool for dynamically extending the partitions, the default pool size needs to be increased.

@cwbriones
Copy link
Contributor Author

Now that we are intentionally using this pool for dynamically extending the partitions, the default pool size needs to be increased.

Yes, this was my reasoning for increasing the connection pool size. It is likely overkill to have it the same size as the DML pool since the only additional traffic will be from the single partition-creating actor. The request pool was unchanged because the additional traffic is from an internal actor rather than generated via external requests.

@cwbriones
Copy link
Contributor Author

Some additional context that I provided in slack:

The context is that when we execute jobs within metrics portal, we store the results in some execution table. For alerts this was added in #380 . Since we expect alerts to evaluate twice a minute, thats ~3000 executions per alert per day. Because of daily volume and the fact that we really only care about alert executions in the current day, the table is partitioned by day - but Postgres does not automatically create these tables when the day rolls over.

So this PR adds an actor that periodically wakes up and creates the tables if they don’t exist at some scheduled time.

.setParameter(3, startDate)
.setParameter(4, endDate);

sql.findOneOrEmpty().orElseThrow(() -> new IllegalStateException("Expected a single empty result."));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a top level message handler. Throwing the exception here will just result in an uncaught message handler exception in Akka. Seems like we could do better handling, logging and instrumenting this failure. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the top-level handler - the actual handler is the execute overload with no parameters. This has the actual DB logic but the other code adds logging/metrics.

It is the case though that the wrapper code will not handle this illegal state exception which would still cause the problem you're mentioning albeit for different reasons so I could probably handle that.

@@ -27,7 +27,7 @@ ALTER ROLE metrics_app WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLIC

CREATE ROLE metrics_dba LOGIN;
ALTER ROLE metrics_dba WITH PASSWORD 'metrics_dba_password';
ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 6;
ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 15;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Flyway creates (for some reason) 2 connections.

@@ -441,6 +473,23 @@ public EbeanServer get() {
}
}

private static final class MetricsPortalDDLEbeanServerProvider implements Provider<EbeanServer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my edification: you wrote this Provider<...> that you bind() up above, but you also wrote a @Provides private method. Why? (If it was a purely stylistic difference, I'd have expected you to use only one.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's purely stylistic in this case, the behavior would be identical. I wrote a SIC for this to keep it consistent with the other EBeanServer provider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally though I do prefer @Provides methods because there's less boilerplate.

.toCompletableFuture()
.get();
public static void stop(final ActorRef ref, final Duration timeout) throws ExecutionException, InterruptedException {
Patterns.gracefulStop(ref, timeout).toCompletableFuture().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth the method? Especially, since it actually has no logic specific to this actor. Moreover, you could use PatternCS (assuming we're on a new enough version of Akka in MP) which should simplify the call to stop (I believe).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can inline this; I had a method to be consistent with the start call but since that was removed there's no need for the consistency.

As far as the APIs used, PatternsCS and Patterns were merged in Akka 2.5.19 (we're running 2.5.20).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked and MAD is running Akka 2.5.16 so it makes sense that using Patterns works here but not over there.

* interrupted for other reasons.
*/
public static void stop(final ActorRef ref, final Duration timeout) throws ExecutionException, InterruptedException {
Patterns.gracefulStop(ref, timeout).toCompletableFuture().get();
public static void ensurePartitionExistsForDate(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely without merit, and not as concerned, but at least may be simpler with PatternsCS. I believe we also doubled up on the timeouts with this pattern in MAD and CAGG; e.g.

ArpNetworking/metrics-aggregator-daemon@158884e#diff-fed66f6be12dfd97b6fb62212dd0d6ebR52

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above PatternsCS and Patterns are essentially the same API for this version of Akka - but I'll add the timeouts. I'm pretty sure I'm the one that asked you to double up on that code to begin with 😄

.matchEquals(EXECUTE, msg -> execute())
.match(CreateForRange.class, msg -> {
final Status.Status resp = execute(msg.getStart(), msg.getEnd());
if (!getSender().equals(getSelf())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. Is this what you were looking for?

https://doc.akka.io/docs/akka/current/typed/interaction-patterns.html

See: "Ignoring replies"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly what I was looking for, but that's Akka 2.6.XX using typed actors so it's not available here 😢

private void tick() {
recordCounter("tick", 1);

final Instant now = _clock.instant();
if (_schedule.nextRun(_lastRun).map(run -> run.isBefore(now)).orElse(true)) {
getSelf().tell(EXECUTE, getSelf());
final LocalDate startDate = ZonedDateTime.ofInstant(now, _clock.getZone()).toLocalDate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a local timezone here? What happens if two boxes have different timezones? Would this work equally well just fixed to UTC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a local time zone, it is UTC regardless of host. I initialized the clock to Clock.systemUTC(). I can be explicit here and just specify UTC a second time.


private void updateCache(final LocalDate start, final LocalDate end) {
LocalDate date = start;
while (!date.equals(end)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As these are local dates, I believe they won't be equal if they have different locales (zones) even if adjusted it's the same point in time. Is this what we want for comparison? (related: see comment about zones above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the concern, it's not applicable here because LocalDate instances do not have timezone information - it's essentially just day/month/year

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm treating all time in here as UTC which is why the zone offset information was dropped)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants