Add batch creation logic for the reminder service #3413
Conversation
Force-pushed from 19f4f0a to 8fd38d5
ALTER TABLE repositories ADD COLUMN reminder_last_sent TIMESTAMP;

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
I'm not sure if enabling extensions should be part of migration files.
This is a reasonable way to do it, but I'd prefer to have less postgres magic rather than more. It's not clear that we need to fetch a random valid repository at startup -- we can simply start with a random UUID and march forward on valid repository IDs from there.
If we did need to get a random repository, I'd be inclined to simply use:
SELECT * FROM repositories
WHERE id > gen_random_uuid()
LIMIT 1;
(plus a little bit to fetch the lowest-numbered repository if that query returns zero rows.)
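For illustration, a minimal Go sketch of that fallback, assuming hypothetical store methods GetRepositoryAfterID and GetFirstRepository (neither exists in this PR):
// Hypothetical sketch: fall back to the sequentially-first repository when
// nothing sorts after the generated UUID.
repo, err := store.GetRepositoryAfterID(ctx, randomID)
if errors.Is(err, sql.ErrNoRows) {
    repo, err = store.GetFirstRepository(ctx)
}
if err != nil {
    return nil, err
}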
suggestion: I think we can get rid of the GetRandomRepository query by calculating a random UUID on the application side and passing it straight to ListEligibleRepositoriesAfterID. Added benefits would be:
- we have one less query
- one less step in the process
- the whole process becomes deterministic (and easier to test)
If we want to start at a random point, fetching a valid random repo is required. If we can start at any point, then sequential iteration would do the job.
If we don't fetch a valid repo / generate a random uuid, then we can potentially have a case where the generated uuid is out of range, and we have to either generate a uuid again or start from the beginning (which defeats the purpose of random generation)
It all comes down to whether we want to start from a valid random point or not.
sqlc.yaml
Outdated
- db_type: 'pg_catalog.interval'
  go_type: 'github.com/jackc/pgtype.Interval'
- db_type: 'pg_catalog.interval'
  go_type: 'github.com/jackc/pgtype.NullInterval'
  nullable: true
Another option would be to use make_interval(secs => {{ duration.Seconds() }}).
Another option would simply be to pass a string through here, rather than a time.Duration. Given that this is a configuration constant, I'd prefer to pull in fewer dependencies to support it.
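A hedged sketch of what the string-based approach could look like (the config field name here is an assumption, not the PR's actual config):
// Build a Postgres interval literal from a time.Duration on the application
// side, so the sqlc parameter can stay a plain string.
minElapsed := cfg.RecurrenceConfig.MinElapsed // e.g. 1 * time.Hour
interval := fmt.Sprintf("%d seconds", int64(minElapsed.Seconds()))
// interval == "3600 seconds", which Postgres accepts as interval input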
Done.
I missed whether it was discussed, but should we also have a way to disable reminders for certain repos by filtering them out of the batch creation logic?
A user-configurable option? Why would someone want to do that? To preserve their rate limit? I don't think reminder should be exposed to end users; it's just background reconciliation to keep the system up to date.
Small suggestion as I keep looking at the code.
database/query/repositories.sql
Outdated
-- name: GetRandomRepository :one
SELECT * FROM repositories
TABLESAMPLE SYSTEM_ROWS(1);
Hi @Vyom-Yadav, thank you for your great work! 🙇
While I don't think that adding tsm_system_rows would be a problem, I'm not convinced it's the right tool in this scenario. If I got the idea right, here we're trying to get a row at random to use its ID as a cursor, and all subsequent queries would be based on WHERE r.id > $1, which is OK and is most likely guaranteed to go by index.
I was wondering: why not get rid of the additional dependency on tsm_system_rows and simply generate one "first" random uuid on the application side and then use that to start? If I got the procedure right, we would also be able to remove this statement, as we control randomness from the outside.
"first" random uuid on the application side and then use that use that to start?
There is no way that I know of to generate a random uuid in some range (I'd rather not play with that). If we generate on the application side, then we can end up with an out-of-range uuid, which leads to simple sequential iteration.
I believe we're using v4 UUIDs, which are random, except for about 7 bits in the middle of the string. (It looks like entity_event.go uses v1 UUIDs...). In particular, the default for the id field on repositories is gen_random_uuid(), which is a v4 UUID.
Good to see you back! I'm going to encourage you to be a little less precise here in favor of ensuring there's an upper bound on the amount of work that we put into the system at any given time, which I think is the more critical piece for background operation.
database/query/repositories.sql
Outdated
SELECT r.* FROM repositories r
INNER JOIN rule_evaluations re ON re.repository_id = r.id
INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE r.id > $1
GROUP BY r.id
HAVING MIN(rde.last_updated) + sqlc.arg('min_elapsed')::interval < NOW()
ORDER BY r.id
LIMIT sqlc.narg('limit')::bigint;
I can see where this is coming from -- I'm also worried about the cost of this query on the database side. With a fairly simple amount of data, the query ends up being a lot more expensive than a full table scan on the database side (e.g. full scans of the tables are about 40-ish cost in this example).
Digging in a bit, I ran explain on a small database:
EXPLAIN SELECT r.* FROM repositories r
INNER JOIN rule_evaluations re ON re.repository_id = r.id
INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE r.id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid
GROUP BY r.id
HAVING MIN(rde.last_updated) + interval '1 hour' < NOW()
ORDER BY r.id
LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Limit (cost=57.74..57.76 rows=10 width=338)
-> Sort (cost=57.74..57.80 rows=24 width=338)
Sort Key: r.id
-> HashAggregate (cost=55.94..57.22 rows=24 width=338)
Group Key: r.id
Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
-> Hash Join (cost=32.24..54.65 rows=258 width=346)
Hash Cond: (rde.rule_eval_id = re.id)
-> Seq Scan on rule_details_eval rde (cost=0.00..17.80 rows=780 width=24)
-> Hash (cost=30.12..30.12 rows=169 width=354)
-> Hash Join (cost=13.66..30.12 rows=169 width=354)
Hash Cond: (re.repository_id = r.id)
-> Seq Scan on rule_evaluations re (cost=0.00..15.10 rows=510 width=32)
-> Hash (cost=12.75..12.75 rows=73 width=338)
-> Seq Scan on repositories r (cost=0.00..12.75 rows=73 width=338)
Filter: (id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid)
The fundamental limit here seems to be that since we don't have an index on rde.last_updated, we'll always incur a sequential scan on rule_details_eval (one of our larger tables).
It's possible on a large database, we'd see different query planning, but I worry that the layers of indirection and aggregation will fundamentally make this query hard to plan efficiently. At the worst case, we might have a database with 100M rows, and at any given query, there are only 100 rule_details_eval rows that are older than our threshold. In this scenario, we'd end up needing to scan the 100M rows on each query for work, finding 10 repositories, and then repeating.
My gut feeling is that it's more important to have a steady amount of load on the database (many light queries) than to get exactly as much work as possible in any particular iteration. One way to do this would be to use a sub-select to limit the number of repositories considered in the query (e.g. SELECT ... FROM (SELECT * FROM repositories WHERE id > $1 ORDER BY id LIMIT 50) AS r ...). Another option would simply be to do the work outside of a single SQL query, e.g. SELECT * FROM repositories WHERE id > $1 LIMIT 30, and then process each repository after that.
Using the sub-select with limit on repositories with my sample query and a limit of 20, I see a 25-30% reduction in query time, but I suspect the behavior is stronger with larger databases.
Given that we have reminder_last_sent, it might also be reasonable to simply use that -- we'd have a one-time extra batch of revisits when this rolls out, and then the query could be:
SELECT * FROM (SELECT * FROM repositories WHERE id > $1 LIMIT 50)
WHERE reminder_last_sent < NOW() - sqlc.arg('min_elapsed')::interval
ORDER BY id;
Agreed on the statement analysis. The current query (without the sub-select) guarantees that we get limit rows (if they are there). With a sub-select, the selected rows might not be eligible, i.e. last_updated was recent.
Agreed with the point that this will be a single time-consuming query rather than a steady-load form of query. To keep things (and mocking) simpler on the application side, I'd go with a sub-query rather than two different queries, i.e. selection and filtration.
Just using reminder_last_sent isn't a good parameter IMO. We can have a case where a reminder was sent 24h ago but the repo was recently updated (edge based), so this would result in extra load on the server (reconciling the repo).
internal/reminder/reminder.go
Outdated
}
err := r.restoreCursorState(ctx)

randomRepo, err := r.store.GetRandomRepository(ctx)
Why not simply r.repositoryCursor = uuid.Random()?
It may generate an out-of-range uuid, which will result in sequential iteration. There is nothing wrong with sequential iteration, but I coded it in a way in which we start from a random valid uuid (theoretically speaking, it is possible to get the first uuid as the random uuid from the db, which will result in sequential iteration).
This point of generating a random valid uuid may not be that important, so I'm willing to generate a random uuid on the application side if that's better (in terms of simplicity).
I'm not sure what an out-of-range uuid is -- I'd meant uuid.NewRandom(), which generates a valid (v4) UUID.
So, there is one edge case here: we may generate a UUID which is larger than any of the UUIDs in the database (at which point, we should start at the sequentially-first UUID). I believe that we're using v4 (random) UUIDs, so we should end up with a roughly-even distribution of UUIDs choosing at random. (If we were using time-sorted UUIDs, picking a random UUID would likely either start at the beginning or the end of the sequence.)
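A minimal sketch of that application-side approach, assuming github.com/google/uuid (the wrap-around handling is illustrative):
// Pick a random v4 UUID as the starting cursor; if a later query finds no
// repository sorting after it, the cursor wraps to uuid.Nil (the zero UUID)
// and iteration continues from the sequentially-first row.
cursor, err := uuid.NewRandom()
if err != nil {
    cursor = uuid.Nil // vanishingly rare; degrade to sequential iteration
}
r.repositoryCursor = cursor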
internal/reminder/reminder.go
Outdated
if err != nil {
    // Non-fatal error, if we can't restore the cursor state, we'll start from scratch.
    logger.Error().Err(err).Msg("error restoring cursor state")
    return nil, err
This means reminder will exit with an empty database (say, when setting up for the first time).
Yes, it should be:
if err != nil && !errors.Is(err, sql.ErrNoRows) {
    return nil, err
}
internal/reminder/reminder.go
Outdated
"repoListCursor": r.repoListCursor,

// Update the reminder_last_sent for each repository to export as metrics
for _, repo := range repos {
    logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
Can you include both the repo and the old value of reminder_last_sent as structured fields in the logs? This could be an easy way to see what the actual delay on updates is.
Done.
internal/reminder/reminder.go
Outdated
Limit: sql.NullInt64{
    Int64: int64(r.cfg.RecurrenceConfig.BatchSize),
    Valid: true,
},
Why make this nullable?
Bad copy paste, changed it.
internal/reminder/reminder.go
Outdated
// Only fetch additional repositories if we are under the limit
if len(repos) < r.cfg.RecurrenceConfig.BatchSize {
    additionalRepos, err = r.getAdditionalRepos(ctx, repos)
    if err != nil {
        return nil, err
    }

    repos, intersectionPoint = r.mergeRepoBatch(repos, additionalRepos)
}
This feels complicated -- it feels like we should call ListEligibleRepositoriesAfterID and then use the last returned element of that query to set the cursor for the next run (checking RepositoryExistsAfterID and setting the cursor to the zero UUID if no more repos exist).
internal/reminder/reminder.go
Outdated
// There may be an intersection between the two sets
// If there is an intersection, then we need to update the cursor to the last fetched
// non-common repository
I'm not sure I understand how this happens, unless there are only e.g. 4 repos eligible in the whole database.
internal/reminder/reminder.go
Outdated
intersectionPoint := -1
var additionalRepos []db.Repository

// Only fetch additional repositories if we are under the limit
I don't think it's a requirement that we find exactly BatchSize repos, just that we don't exceed BatchSize repos in one pass. We're trying to upper-bound the amount of background work added to the system.
I think this will get a lot simpler if we just do one pass:
repos, err := r.store.ListEligibleRepositoriesAfterID(ctx, {...})
if err != nil {
    return nil, err
}
// Don't actually ignore the error here...
if len(repos) == 0 || !RepositoryExistsAfterID(ctx, repos[len(repos)-1].ID) {
    r.repositoryCursor = uuid.UUID{}
} else {
    r.repositoryCursor = repos[len(repos)-1].ID
}
return repos, nil
internal/reminder/reminder.go
Outdated
// non-common repository
intersectionPoint := -1
for i, repo := range additionalRepos {
    if reposSet.Has(repo) {
This is using repo as a hashable object, but we've done two different queries, so the repos might differ in some minutiae like updated_at and be included in the batch twice. Given that we know that the two lists are sorted, we could also march through additionalRepos comparing less-than the first item in repos and find the intersection point without needing to use the set.
But, as mentioned, I think we can avoid a lot of this code if we're willing to not exactly fill each reminder batch.
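A sketch of that sorted-march idea, comparing raw UUID bytes instead of hashing whole rows (the helper function itself is hypothetical):
// Both slices are sorted by ID, so the first element of repos bounds where
// additionalRepos (fetched from the start of the table) begins to overlap.
func intersectionPoint(repos, additionalRepos []db.Repository) int {
    if len(repos) == 0 {
        return -1
    }
    first := repos[0].ID
    for i, repo := range additionalRepos {
        if bytes.Compare(repo.ID[:], first[:]) >= 0 {
            return i // this and later entries are already covered by repos
        }
    }
    return -1
}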
Ahh, I missed this, but now we fetch only once, so this isn't required.
Force-pushed from e273a44 to efed08c
@evankanderson, I updated the PR to fetch only once, and …
- conn, err := grpc.DialContext(ctx, endpoint, opts...)
+ conn, err := grpc.NewClient(endpoint, opts...)
Haven't verified the changes that came with protoc-gen-go v1.34.1
This is looking pretty close -- just two concerns:
- It looks like LATERAL is a lot slower than just a standard inner join. /shrug
- I think your logic for when to loop back to the beginning isn't quite right.
database/query/repositories.sql
Outdated
ORDER BY id
LIMIT sqlc.arg('limit')::bigint
) r
JOIN LATERAL (
It looks like LATERAL converts this from a HashAggregate of a NestedLoop to a NestedLoop that runs an Aggregate inside:
EXPLAIN SELECT r.*
FROM (
SELECT *
FROM repositories
ORDER BY id
LIMIT 10
) r
JOIN LATERAL (
SELECT MIN(rde.last_updated) AS min_last_updated
FROM rule_evaluations re
INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE re.repository_id = r.id
) sub ON sub.min_last_updated + interval '1h' < NOW()
ORDER BY r.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=31.75..318.77 rows=10 width=338)
-> Limit (cost=0.14..2.30 rows=10 width=338)
-> Index Scan using repositories_pkey on repositories (cost=0.14..47.45 rows=220 width=338)
-> Aggregate (cost=31.61..31.63 rows=1 width=8)
Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
-> Nested Loop (cost=8.12..31.60 rows=5 width=8)
-> Bitmap Heap Scan on rule_evaluations re (cost=7.97..15.08 rows=3 width=16)
Recheck Cond: (repository_id = repositories.id)
-> Bitmap Index Scan on rule_evaluations_results_name_lower_idx (cost=0.00..7.97 rows=3 width=0)
Index Cond: (repository_id = repositories.id)
-> Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde (cost=0.15..5.50 rows=1 width=24)
Index Cond: (rule_eval_id = re.id)
(12 rows)
vs
EXPLAIN SELECT r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
FROM (
SELECT *
FROM repositories
ORDER BY id
LIMIT 10
) r
JOIN rule_evaluations AS re on re.repository_id = r.id
JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE re.repository_id = r.id
GROUP BY r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
HAVING MIN(rde.last_updated) + interval '1h' < NOW() ORDER BY r.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=28.33..28.36 rows=13 width=146)
Sort Key: r.id
-> HashAggregate (cost=27.39..28.09 rows=13 width=146)
Group Key: r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
-> Nested Loop (cost=2.67..26.39 rows=40 width=154)
-> Hash Join (cost=2.52..19.79 rows=26 width=162)
Hash Cond: (re.repository_id = r.id)
-> Seq Scan on rule_evaluations re (cost=0.00..15.10 rows=510 width=32)
-> Hash (cost=2.40..2.40 rows=10 width=146)
-> Subquery Scan on r (cost=0.14..2.40 rows=10 width=146)
-> Limit (cost=0.14..2.30 rows=10 width=338)
-> Index Scan using repositories_pkey on repositories (cost=0.14..47.45 rows=220 width=338)
-> Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde (cost=0.15..0.25 rows=1 width=24)
Index Cond: (rule_eval_id = re.id)
(15 rows)
While there are more steps in the non-lateral query plan, it seems to plan and execute substantially faster than the LATERAL version. In an environment with more rows, this is the difference between EXPLAIN ANALYZE taking 30ms and 2ms of execution time.
internal/reminder/reminder.go
Outdated
// Update the reminder_last_sent for each repository to export as metrics
for _, repo := range repos {
    logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
Zerolog supports structured logging, which allows us to later easily query and extract (for example) log lines that reference a specific repository id:
- logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
+ logger.Debug().Str("repo", repo.ID.String()).Time("previously", repo.ReminderLastSent.Time).
+     Msg("updating reminder_last_sent")
Done
internal/reminder/reminder.go
Outdated
logger.Debug().Msgf("previous reminder_last_sent: %s", repo.ReminderLastSent.Time)
err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
if err != nil {
    logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
- logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
+ logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")
Done
if len(repos) == 0 {
    r.repositoryCursor = uuid.Nil
Since we're filtering the repos we get back from the batch in the database, it's possible that we read e.g. 50 repos that all got filtered. I think you want to check RepositoryExistsAfterID here.
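A short sketch of that check on the empty-batch path (the signature is assumed from the query name in this PR):
// Even when every repo in the batch was filtered out, only wrap the cursor
// when nothing actually exists after it; otherwise keep marching forward.
exists, err := r.store.RepositoryExistsAfterID(ctx, r.repositoryCursor)
if err != nil {
    return nil, err
}
if !exists {
    r.repositoryCursor = uuid.Nil
}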
I missed this branch 😞 . I updated the logic. Hopefully I haven't missed any logical branches this time.
internal/reminder/reminder.go
Outdated
logger.Error().Err(err).Msgf("unable to check if repository exists after cursor: %s", r.repositoryCursor)
logger.Info().Msg("resetting cursor to zero uuid")
I'd tend to put this as one message in the logs, rather than two separate messages. Especially when searching structured logs (say, for the last 48 hours across 8 or 10 servers), it can be hard to "scroll" forward or backward to the next log line from a particular server.
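For example, a sketch of the combined entry (exact fields are a judgment call):
// One structured entry instead of two adjacent log lines:
logger.Error().Err(err).
    Str("cursor", r.repositoryCursor.String()).
    Msg("unable to check for repositories after cursor, resetting cursor to zero uuid")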
Done
Force-pushed from ef39a5d to 6c92820
database/query/profile_status.sql
Outdated
-- ListOldestRuleEvaluationsByRepositoryId has casts in the select statement as sqlc generates incorrect types.
-- Though repository_id doesn't have a non-null constraint, it always has a value in the database.
-- The cast after MIN is required due to a known bug in sqlc: https://github.com/sqlc-dev/sqlc/issues/1965

-- name: ListOldestRuleEvaluationsByRepositoryId :many
SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
FROM rule_evaluations re
INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
WHERE re.repository_id = ANY (sqlc.arg('repository_ids')::uuid[])
GROUP BY re.repository_id;
Can re.repository_id be null? I was under the assumption that minder needs a repo to function on.
Also
EXPLAIN SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
FROM rule_evaluations re
INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
WHERE re.repository_id = ANY (array['de0b2ad2-bc90-4126-b0a2-63abc1cce808','81b2a2ce-85fb-4528-a43e-7eef02f596f2','7e0813ce-6201-48f0-b136-bf08be8efcb9']::uuid[])
GROUP BY re.repository_id;
GroupAggregate  (cost=37.19..37.36 rows=8 width=24)
  Group Key: re.repository_id
  ->  Sort  (cost=37.19..37.22 rows=12 width=24)
        Sort Key: re.repository_id
        ->  Hash Join  (cost=17.11..36.97 rows=12 width=24)
              Hash Cond: (rde.rule_eval_id = re.id)
              ->  Seq Scan on rule_details_eval rde  (cost=0.00..17.80 rows=780 width=24)
              ->  Hash  (cost=17.01..17.01 rows=8 width=32)
                    ->  Seq Scan on rule_evaluations re  (cost=0.00..17.01 rows=8 width=32)
                          Filter: (repository_id = ANY ('{de0b2ad2-bc90-4126-b0a2-63abc1cce808,81b2a2ce-85fb-4528-a43e-7eef02f596f2,7e0813ce-6201-48f0-b136-bf08be8efcb9}'::uuid[]))
It does a sequential scan on both tables. I can understand the sequential scan on the rule_evaluations table, as there is no index on re.repository_id. But after obtaining the re.ids associated with re.repository_id, why does it need to perform a sequential scan on rule_details_eval rde? There is an index on rde.rule_eval_id.
repository_id can be null for resources which aren't associated with a repository (for example, we're working on a DockerHub provider that wouldn't have git repos). I think it's fine for Reminder not to work with those yet.
It looks like we don't have a relevant index on repository_id:
\d rule_evaluations
                Table "public.rule_evaluations"
     Column      |   Type   | Collation | Nullable |      Default
-----------------+----------+-----------+----------+-------------------
 id              | uuid     |           | not null | gen_random_uuid()
 entity          | entities |           | not null |
 profile_id      | uuid     |           | not null |
 rule_type_id    | uuid     |           | not null |
 repository_id   | uuid     |           |          |
 artifact_id     | uuid     |           |          |
 pull_request_id | uuid     |           |          |
 rule_name       | text     |           | not null |
Indexes:
    "rule_evaluations_pkey" PRIMARY KEY, btree (id)
    "rule_evaluations_results_name_lower_idx" UNIQUE, btree (profile_id, lower(rule_name), repository_id, COALESCE(artifact_id, '00000000-0000-0000-0000-000000000000'::uuid), entity, rule_type_id, COALESCE(pull_request_id, '00000000-0000-0000-0000-000000000000'::uuid)) NULLS NOT DISTINCT
This somewhat surprises me, as I'd have expected the foreign key constraint on repository_id to create an index. We should probably add indexes to support these -- feel free to only add the one for repository_id right now. (This may actually have been the proximate cause of a lot of the other queries having screwy cost estimates -- adding the index changed the explain from a Seq Scan + Filter to a Bitmap Heap Scan(Recheck Cond) + Bitmap Index Scan(Index Cond).)
rule_details_eval seems to be fine:
Indexes:
    "rule_details_eval_pkey" PRIMARY KEY, btree (id)
    "idx_rule_detail_eval_ids" UNIQUE, btree (rule_eval_id)
Added an index on rule_evaluations(repository_id). The index will be created in concurrent mode, as I thought blocking rule_evaluations writes isn't a good option.
@evankanderson To address the problem of updating the cursor properly, I have split the fetching and filtering queries. Now repositories are fetched unconditionally (to update the cursor) and later filtered on the application side.
Force-pushed from 6c92820 to dc04b46
This is getting really close!
I'd actually missed the "query might return zero repos but we still need to advance the cursor" issue -- nice catch!
err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
if err != nil {
    logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")
    return []error{err}
It looks like this accumulates a list of errors by the signature, but this code does an early-return. Do you want to accumulate the errors, or just return err?
This function will largely change in the next PR, i.e. connecting minder and reminder. A slice of errors is returned as we can get errors while creating messages, sending them, etc. Again, we can discuss this in the next PR. (See the sendReminders function in the old PR.)
idToLastUpdatedMap := make(map[uuid.UUID]time.Time)
for _, oldestRuleEval := range oldestRuleEvals {
    idToLastUpdatedMap[oldestRuleEval.RepositoryID] = oldestRuleEval.OldestLastUpdated
}

for _, repo := range repos {
    if oldestRuleEval, ok := idToLastUpdatedMap[repo.ID]; ok &&
        oldestRuleEval.Add(r.cfg.RecurrenceConfig.MinElapsed).Before(time.Now()) {
Why do this two-step loop here? It feels like either passing in the cutoff to the query (i.e. SELECT ... WHERE rde.last_updated < $1 AND re.repository_id = ANY(...) GROUP BY re.repository_id, or the equivalent HAVING), or looping through the results once should be sufficient:
cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, evalTime := range oldestRuleEvals {
    if evalTime.OldestLastUpdated.Before(cutoff) {
        eligibleRepos = append(eligibleRepos, repo)
    }
}
ListOldestRuleEvaluationsByRepositoryId only queries rule_evaluations and rule_details_eval, so it returns a slice of:
type ListOldestRuleEvaluationsByRepositoryIdRow struct {
    RepositoryID      uuid.UUID `json:"repository_id"`
    OldestLastUpdated time.Time `json:"oldest_last_updated"`
}
We have to iterate over repos as the repo object isn't returned.
Ah, that seems like it's worth a comment about how we're doing a bunch of transforms of types to fit into the sqlc-generated code (rather than for some other reason, like performance or thread-safety).
Also, a slight preference for:
cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, repo := range repos {
    if t, ok := idToLastUpdate[repo.ID]; ok && t.Before(cutoff) {
        ....
    }
}
This has two benefits:
- You can fit the condition on one line (shortening the time var that lives for a single line, removing Map from the name of the map, and doing the date math on a separate line).
- You only do the date math once, and only fetch time.Now() once, rather than using a slightly different time for each check.
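A self-contained sketch combining both points (variable names follow the discussion above; eligibleRepos is an assumption):
// Compute the cutoff once, build the lookup map, then filter in one pass.
cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
idToLastUpdate := make(map[uuid.UUID]time.Time, len(oldestRuleEvals))
for _, e := range oldestRuleEvals {
    idToLastUpdate[e.RepositoryID] = e.OldestLastUpdated
}
eligibleRepos := make([]db.Repository, 0, len(repos))
for _, repo := range repos {
    // A repo is eligible when its oldest rule evaluation predates the cutoff.
    if t, ok := idToLastUpdate[repo.ID]; ok && t.Before(cutoff) {
        eligibleRepos = append(eligibleRepos, repo)
    }
}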
Force-pushed from 413869a to 7afc72f
Signed-off-by: Vyom-Yadav <[email protected]>
Force-pushed from 7afc72f to 2542103
I'm going to merge and then send a PR for the cleanup in …
Summary
Part - 2 #2262
Algorithm: use a random uuid as the starting point. Only the starting point is random; batch creation becomes sequential after one end of the table is reached. If, after reaching the end of the table, the iteration point were selected randomly again, coverage of all repos wouldn't be guaranteed. A condensed sketch of one iteration follows.
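The sketch below is illustrative only: the query name and the wrap check are simplified relative to the actual PR, which also checks RepositoryExistsAfterID before wrapping.
// Fetch a batch after the cursor, advance (or wrap) the cursor, then filter
// eligibility on the application side.
repos, err := store.ListRepositoriesAfterID(ctx, cursor, batchSize)
if err != nil {
    return err
}
if len(repos) == 0 {
    cursor = uuid.Nil // end of table; continue sequentially from the start
} else {
    cursor = repos[len(repos)-1].ID // iteration stays sequential from here on
}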