Periodically create partitions for alert executions #398

cwbriones · 2020-06-08T22:36:16Z

Introduce a DailyPartition manager actor that runs within the DatabaseAlertExecutionRepository, periodically attempting to create partitions for the alert executions table that was added in #380. Notably I've introduced another EbeanServer which uses the DDL role (metrics_dba), since that Role is necessary to create tables.

I've included unit tests for the actor while the DB testing is covered by the existing integration tests.

Admin user at runtime = 5, admin user in tests = 5, migration connection = 1. Total possible connections is 11, so let's do 15.

BrandonArp · 2020-06-11T23:21:14Z

Have we set up the Akka pools for these actors? You're updating connection pool sizes, but if the Akka actors' thread pools aren't big enough (they usually assume non-blocking), then you're gonna block before the connection opens.

cwbriones · 2020-06-12T01:20:49Z

Is the concern because I block on actor start whereas we ordinarily would let that happen asynchronously? I'm not sure I entirely understand the issue.

cwbriones

Some additional context for reviewers.

cwbriones · 2020-06-12T01:23:40Z

pom.xml

@@ -1165,7 +1165,6 @@
      <groupId>org.flywaydb</groupId>
      <artifactId>flyway-play_${scala.package.version}</artifactId>
      <version>${flyway.play.version}</version>
-      <scope>runtime</scope>


Scope changed because I need to block on the flyway-play PlayInitializer from within MainModule to guarantee that migrations have run.

cwbriones · 2020-06-12T01:25:05Z

main/postgres/initdb.d/init.sql.1

@@ -27,7 +27,7 @@ ALTER ROLE metrics_app WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLIC

 CREATE ROLE metrics_dba LOGIN;
 ALTER ROLE metrics_dba WITH PASSWORD 'metrics_dba_password';
-ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 6;
+ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 15;


Why 15:

I bumped the connection pool to be the same size as what we have for metrics_dml, and set the size to be the same in tests. That is 5 + 5 = 10 connections. There's also an additional connection flyway creates with this role. That's a total of 11. I then just bumped it to 15.

nit: Flyway creates (for some reason) 2 connections.

Any change required here then? that would be 12 which is still below the 15 limit.

Nope, just trying to spread the knowledge.
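The arithmetic in this thread can be checked with a small sketch. The `ConnectionBudget` class and the pool labels are hypothetical and not part of the PR; the numbers are the ones quoted above (5 at runtime, 5 in tests, 2 for Flyway, role limit 15).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: sums the connections that may be
// opened with the metrics_dba role and compares the sum to the role's limit.
final class ConnectionBudget {
    private ConnectionBudget() {}

    // Total number of connections across all pools using the role.
    static int total(final Map<String, Integer> pools) {
        return pools.values().stream().mapToInt(Integer::intValue).sum();
    }

    // True if every pool can be fully used without exceeding the role limit.
    static boolean fits(final Map<String, Integer> pools, final int roleLimit) {
        return total(pools) <= roleLimit;
    }

    // Figures quoted in the review thread (labels are illustrative).
    static Map<String, Integer> quotedPools() {
        final Map<String, Integer> pools = new LinkedHashMap<>();
        pools.put("runtime DDL pool", 5);
        pools.put("test DDL pool", 5);
        pools.put("flyway", 2);
        return pools;
    }
}
```

With these figures the total is 12, which fits under the new limit of 15 but would not have fit under the previous limit of 6.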

cwbriones · 2020-06-12T01:25:55Z

conf/db/migration/metrics_portal_ddl/V22__replace_alert_execution_function.sql

+--    Table - Text - The name of the parent table.
+--    Start - Date - The beginning date of the time range, inclusive.
+--    End   - Date - The end date of the time range, exclusive.
+CREATE OR REPLACE FUNCTION create_daily_partition(schema TEXT, tablename TEXT, start_date DATE, end_date DATE)


Content is exactly the same as the previous function, just formatted in a nicer way and with an added schema parameter

BrandonArp · 2020-06-12T15:52:08Z

The concern regarding thread pools is because you seem to need to adjust the connection pools. If we're now seeing enough database traffic to warrant tuning the connection pools, we'll also need to tweak the thread pools (requests are served on Akka actors). It hadn't been a problem up until now so we've just kinda ignored it. But it seemed like you were hitting some limitation. And I just wanted to call out that just bumping the connection pool size would likely be insufficient.

vjkoskela · 2020-06-13T17:45:23Z

I believe the issue here is that the code is now using the DDL connection pool to manage the partition tables. Previously, this connection pool was only used by Flyway and tuned to prevent accidental usage (by limiting its size). Now that we are intentionally using this pool for dynamically extending the partitions, the default pool size needs to be increased.

cwbriones · 2020-06-16T21:38:37Z

Now that we are intentionally using this pool for dynamically extending the partitions, the default pool size needs to be increased.

Yes, this was my reasoning for increasing the connection pool size. It is likely overkill to have it the same size as the DML pool since the only additional traffic will be from the single partition-creating actor. The request pool was unchanged because the additional traffic is from an internal actor rather than generated via external requests.

cwbriones · 2020-06-17T01:40:13Z

Some additional context that I provided in Slack:

The context is that when we execute jobs within metrics portal, we store the results in an execution table. For alerts this was added in #380. Since we expect alerts to evaluate twice a minute, that's ~3000 executions per alert per day (2 × 60 × 24 = 2880). Because of the daily volume and the fact that we really only care about alert executions in the current day, the table is partitioned by day, but Postgres does not automatically create these tables when the day rolls over.

So this PR adds an actor that periodically wakes up and creates the tables if they don’t exist at some scheduled time.
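A minimal sketch of the idea, using a plain `ScheduledExecutorService` rather than the Akka actor the PR actually adds. `PartitionCreator`, `PeriodicPartitionJob`, and the `lookaheadDays` parameter are hypothetical names chosen for illustration; the callback stands in for the SQL call to `create_daily_partition`.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical callback standing in for the database call.
interface PartitionCreator {
    void createDailyPartitions(LocalDate startInclusive, LocalDate endExclusive);
}

// Hypothetical stand-in for the actor: wakes up on a fixed interval and asks
// for partitions covering [today, today + lookaheadDays) in UTC.
final class PeriodicPartitionJob {
    private final Clock clock;
    private final int lookaheadDays;
    private final PartitionCreator creator;

    PeriodicPartitionJob(final Clock clock, final int lookaheadDays, final PartitionCreator creator) {
        this.clock = clock;
        this.lookaheadDays = lookaheadDays;
        this.creator = creator;
    }

    // One wake-up. Assumes the creator is idempotent (creates only what is missing).
    void runOnce() {
        final LocalDate today = clock.instant().atZone(ZoneOffset.UTC).toLocalDate();
        creator.createDailyPartitions(today, today.plusDays(lookaheadDays));
    }

    // Starts the periodic schedule; interval must be positive. The caller
    // owns the returned executor and should shut it down.
    ScheduledExecutorService start(final Duration interval) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                runOnce();
            } catch (final RuntimeException e) {
                // An uncaught exception would cancel all future runs.
                System.err.println("partition creation failed: " + e);
            }
        }, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Injecting the `Clock` keeps the date logic testable with `Clock.fixed(...)`, which is also how the actor in this PR is described as being unit tested.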

vjkoskela · 2020-06-17T23:45:06Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

+                .setParameter(3, startDate)
+                .setParameter(4, endDate);
+
+        sql.findOneOrEmpty().orElseThrow(() -> new IllegalStateException("Expected a single empty result."));


This is a top level message handler. Throwing the exception here will just result in an uncaught message handler exception in Akka. Seems like we could do better handling, logging and instrumenting this failure. Thoughts?

This is not the top-level handler - the actual handler is the execute overload with no parameters. This has the actual DB logic but the other code adds logging/metrics.

It is the case though that the wrapper code will not handle this illegal state exception which would still cause the problem you're mentioning albeit for different reasons so I could probably handle that.
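One way the wrapper could absorb this failure is sketched below. `GuardedExecution` is a hypothetical class, not the PR's code: it uses `java.util.logging` and plain counters as stand-ins for whatever logging and metrics the project's wrapper actually uses.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper: runs the DB step and converts any runtime failure
// (including IllegalStateException) into a logged, counted result instead of
// letting it escape the message handler.
final class GuardedExecution {
    private static final Logger LOGGER = Logger.getLogger(GuardedExecution.class.getName());

    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    // Returns true if the step completed, false if it threw.
    boolean run(final Runnable dbStep) {
        try {
            dbStep.run();
            successes.incrementAndGet();
            return true;
        } catch (final RuntimeException e) {
            failures.incrementAndGet();
            LOGGER.log(Level.WARNING, "Failed to create daily partitions", e);
            return false;
        }
    }

    long successCount() {
        return successes.get();
    }

    long failureCount() {
        return failures.get();
    }
}
```

Returning a boolean (or a status object) lets the caller decide whether to retry on the next tick rather than relying on supervision to restart the actor.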

vjkoskela · 2020-06-17T23:47:52Z

main/postgres/initdb.d/init.sql.1

@@ -27,7 +27,7 @@ ALTER ROLE metrics_app WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLIC

 CREATE ROLE metrics_dba LOGIN;
 ALTER ROLE metrics_dba WITH PASSWORD 'metrics_dba_password';
-ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 6;
+ALTER ROLE metrics_dba WITH NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION CONNECTION LIMIT 15;


nit: Flyway creates (for some reason) 2 connections.

speezepearson · 2020-06-17T23:36:04Z

app/global/MainModule.java

@@ -441,6 +473,23 @@ public EbeanServer get() {
        }
    }

+    private static final class MetricsPortalDDLEbeanServerProvider implements Provider<EbeanServer> {


Just for my edification: you wrote this Provider<...> that you bind() up above, but you also wrote a @Provides private method. Why? (If it was a purely stylistic difference, I'd have expected you to use only one.)

Yes, it's purely stylistic in this case; the behavior would be identical. I wrote a static inner class (SIC) for this to keep it consistent with the other EbeanServer provider.

Generally though I do prefer @Provides methods because there's less boilerplate.

vjkoskela · 2020-06-19T23:21:46Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

-                .toCompletableFuture()
-                .get();
+    public static void stop(final ActorRef ref, final Duration timeout) throws ExecutionException, InterruptedException {
+        Patterns.gracefulStop(ref, timeout).toCompletableFuture().get();


Is this worth the method? Especially since it has no logic specific to this actor. Moreover, you could use PatternsCS (assuming we're on a new enough version of Akka in MP), which should simplify the call to stop (I believe).

I can inline this; I had a method to be consistent with the start call but since that was removed there's no need for the consistency.

As far as the APIs used, PatternsCS and Patterns were merged in Akka 2.5.19 (we're running 2.5.20).

I just looked and MAD is running Akka 2.5.16 so it makes sense that using Patterns works here but not over there.

vjkoskela · 2020-06-19T23:23:45Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

     * interrupted for other reasons.
     */
-    public static void stop(final ActorRef ref, final Duration timeout) throws ExecutionException, InterruptedException {
-        Patterns.gracefulStop(ref, timeout).toCompletableFuture().get();
+    public static void ensurePartitionExistsForDate(


Not entirely without merit, and not as concerned, but at least may be simpler with PatternsCS. I believe we also doubled up on the timeouts with this pattern in MAD and CAGG; e.g.

ArpNetworking/metrics-aggregator-daemon@158884e#diff-fed66f6be12dfd97b6fb62212dd0d6ebR52

As mentioned above PatternsCS and Patterns are essentially the same API for this version of Akka - but I'll add the timeouts. I'm pretty sure I'm the one that asked you to double up on that code to begin with 😄
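As I read it, "doubling up" means bounding the blocking wait with the same timeout that was given to the stop request, so the caller cannot hang if the returned stage never completes. A JDK-only sketch, where `BoundedAwait` is a hypothetical helper and the `CompletionStage` stands in for the result of `Patterns.gracefulStop(ref, timeout)`:

```java
import java.time.Duration;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: waits for a stage, but never longer than the timeout.
final class BoundedAwait {
    private BoundedAwait() {}

    static <T> T await(final CompletionStage<T> stage, final Duration timeout)
            throws ExecutionException, InterruptedException, TimeoutException {
        // get(long, TimeUnit) throws TimeoutException if the stage is still
        // incomplete when the timeout elapses.
        return stage.toCompletableFuture().get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

Applied to the diff above, the call would look roughly like `BoundedAwait.await(Patterns.gracefulStop(ref, timeout), timeout)`.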

vjkoskela · 2020-06-19T23:27:13Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

-                .matchEquals(EXECUTE, msg -> execute())
+                .match(CreateForRange.class, msg -> {
+                    final Status.Status resp = execute(msg.getStart(), msg.getEnd());
+                    if (!getSender().equals(getSelf())) {


Hrm. Is this what you were looking for?

https://doc.akka.io/docs/akka/current/typed/interaction-patterns.html

See: "Ignoring replies"

This is exactly what I was looking for, but that's Akka 2.6.XX using typed actors so it's not available here 😢

vjkoskela · 2020-06-19T23:54:11Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

    private void tick() {
        recordCounter("tick", 1);

        final Instant now = _clock.instant();
        if (_schedule.nextRun(_lastRun).map(run -> run.isBefore(now)).orElse(true)) {
-            getSelf().tell(EXECUTE, getSelf());
+            final LocalDate startDate = ZonedDateTime.ofInstant(now, _clock.getZone()).toLocalDate();


Do we really need a local timezone here? What happens if two boxes have different timezones? Would this work equally well just fixed to UTC?

This is not a local time zone, it is UTC regardless of host. I initialized the clock to Clock.systemUTC(). I can be explicit here and just specify UTC a second time.
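The difference between the two options can be seen in a short sketch. `UtcDates` is a hypothetical class: `fromClockZone` mirrors the expression in the diff, and `inUtc` is the explicit variant offered in the reply.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper comparing the two ways of deriving "today".
final class UtcDates {
    private UtcDates() {}

    // Same derivation as the diff: the zone comes from the clock, so the
    // result is a UTC date only if the clock was built with a UTC zone.
    static LocalDate fromClockZone(final Clock clock) {
        return ZonedDateTime.ofInstant(clock.instant(), clock.getZone()).toLocalDate();
    }

    // Explicit variant: always UTC, regardless of how the clock was built.
    static LocalDate inUtc(final Clock clock) {
        return ZonedDateTime.ofInstant(clock.instant(), ZoneOffset.UTC).toLocalDate();
    }
}
```

With `Clock.systemUTC()` both methods agree. They diverge only if someone later injects a clock with a different zone, which is the case the explicit variant guards against.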

vjkoskela · 2020-06-19T23:55:02Z

app/com/arpnetworking/metrics/portal/alerts/impl/DailyPartitionCreator.java

+
+    private void updateCache(final LocalDate start, final LocalDate end) {
+        LocalDate date = start;
+        while (!date.equals(end)) {


As these are local dates, I believe they won't be equal if they have different locales (zones) even if adjusted it's the same point in time. Is this what we want for comparison? (related: see comment about zones above)

If I understand the concern, it's not applicable here because LocalDate instances do not have timezone information - it's essentially just day/month/year

(I'm treating all time in here as UTC which is why the zone offset information was dropped)
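A sketch of the half-open date iteration, for reference. `DateRanges` is a hypothetical class; the diff excerpt shows only the loop header, so the one-day step is an assumption. The sketch uses `isBefore` instead of `!equals`, which is a swapped-in choice: it also terminates if `start` is accidentally after `end`.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: enumerates the dates a cache update would cover.
final class DateRanges {
    private DateRanges() {}

    // Every date in [start, end). Empty if start is not before end.
    static List<LocalDate> datesBetween(final LocalDate start, final LocalDate end) {
        final List<LocalDate> dates = new ArrayList<>();
        // LocalDate carries no zone, so the comparison is purely on
        // year/month/day.
        for (LocalDate date = start; date.isBefore(end); date = date.plusDays(1)) {
            dates.add(date);
        }
        return dates;
    }
}
```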

cwbriones added 17 commits May 27, 2020 19:07

WIP

abbbe67

change function

d2ce1d3

rewrite query so that it's more readable

e2d8ea5

create and bind actor

9275fa9

tests pass

f118544

propagate the exception from within the actor

b1fc6db

missing Named annotation

f0ab80a

Merge branch 'master' into create_partitions

dd711da

rework guice bindings

929aa31

stop creating on 'open'

2b14dff

moved impl into repository, use ddl user

2b097fe

fix startup issue by specifying a ddl ebean datasource

f015f2d

tests finally pass 😭

6f180c5

require migrations in ebean server provider

fd09528

bump the connection pool, add unit tests

85ebe05

suppress fb warning

ac21546

tweak pool sizes again

ed68e90

Admin user at runtime = 5 Admin user in tests = 5 Migration connection = 1 Total possible connections is 11 so let's do 15

cwbriones added 4 commits June 11, 2020 16:35

additional checks in tests

e09c1d5

revert pool size change

e398c9e

Revert "revert pool size change"

cd9b68f

This reverts commit e398c9e.

unused import

5c837de

cwbriones marked this pull request as ready for review June 12, 2020 01:21

cwbriones changed the title from "Create partitions" to "Periodically create partitions for alert executions" Jun 12, 2020

Merge branch 'master' into create_partitions

ed117e2

Merge branch 'master' into create_partitions

07e8921

Merge branch 'master' into create_partitions

9ab2d01

Merge branch 'master' into create_partitions

1919870

vjkoskela requested changes Jun 17, 2020

address comments

7167b29

speezepearson reviewed Jun 18, 2020

cwbriones added 8 commits June 18, 2020 12:06

use akka.actor.Status instead of mananging the future ourselves

2cb6e0b

feedback

d6a5afb

Ensure partitions exist on-demand when they are needed, with a cache in front. Separate ticking from execution.

test fixes

5ee82ea

checkstyle

395bc09

method should be private

a7ff3f6

Merge branch 'master' into create_partitions

4ca92f2

Merge branch 'master' into create_partitions

b18deac

Merge branch 'master' into create_partitions

466bc04

vjkoskela approved these changes Jun 19, 2020

cwbriones added 2 commits June 19, 2020 17:14

Merge remote-tracking branch 'upstream/master' into create_partitions

aa6d0b4

feedback part 2

96be3c9

cwbriones merged commit e8cf917 into ArpNetworking:master Jun 20, 2020

cwbriones mentioned this pull request Jun 20, 2020

EBeanServer should treat dates it receives as UTC #419

Open

speezepearson mentioned this pull request Jun 22, 2020

reduce Ebean admin pool size #421

Merged
