
Add batch creation logic for the reminder service #3413

Merged: 1 commit merged into mindersec:main on Jun 11, 2024

Conversation

@Vyom-Yadav (Collaborator) commented May 23, 2024

Summary

Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.

Part 2 of #2262

Algorithm:

  1. Generate a random v4 UUID as the starting point.
  2. Iterate and pick repos until the end of the table is reached.
  3. After the end of the table is reached, start iterating from the first row of the table.

Only the starting point is random; batch creation proceeds sequentially from there. If the iteration point were selected randomly again after reaching the end of the table, coverage of all repos wouldn't be guaranteed.
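
(For illustration, here is a minimal Go sketch of this cursor scheme. The store interface, type, and method names below are hypothetical stand-ins, not minder's actual sqlc-generated API:)

package reminder

import (
	"context"

	"github.com/google/uuid"
)

// Repo and Store are hypothetical stand-ins used only to illustrate the
// cursor scheme described in the PR summary.
type Repo struct{ ID uuid.UUID }

type Store interface {
	// Returns up to limit repositories with id > cursor, ordered by id.
	ListRepositoriesAfterID(ctx context.Context, cursor uuid.UUID, limit int) ([]Repo, error)
}

// nextBatch starts from the current cursor (initially a random v4 UUID) and
// marches forward by id. A short batch means the end of the table was
// reached, so the cursor wraps to uuid.Nil, which sorts before every real
// id, and the next pass resumes from the first row.
func nextBatch(ctx context.Context, s Store, cursor uuid.UUID, batchSize int) ([]Repo, uuid.UUID, error) {
	repos, err := s.ListRepositoriesAfterID(ctx, cursor, batchSize)
	if err != nil {
		return nil, cursor, err
	}
	if len(repos) < batchSize {
		return repos, uuid.Nil, nil
	}
	return repos, repos[len(repos)-1].ID, nil
}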

Change Type

Mark the type of change your PR introduces:

  • Bug fix (resolves an issue without affecting existing features)
  • Feature (adds new functionality without breaking changes)
  • Breaking change (may impact existing functionalities or require documentation updates)
  • Documentation (updates or additions to documentation)
  • Refactoring or test improvements (no bug fixes or new functionality)

Testing

Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.

Review Checklist:

  • Reviewed my own code for quality and clarity.
  • Added comments to complex or tricky code sections.
  • Updated any affected documentation.
  • Included tests that validate the fix or feature.
  • Checked that related changes are merged.

@Vyom-Yadav Vyom-Yadav requested a review from a team as a code owner May 23, 2024 16:00
@coveralls commented May 23, 2024

Coverage Status

Changes unknown when pulling dc04b46 on Vyom-Yadav:addReminderLogic into stacklok:main.

@Vyom-Yadav Vyom-Yadav marked this pull request as draft May 23, 2024 18:16
@Vyom-Yadav Vyom-Yadav force-pushed the addReminderLogic branch 4 times, most recently from 19f4f0a to 8fd38d5 Compare May 25, 2024 09:52
Comment on lines 15 to 17
ALTER TABLE repositories ADD COLUMN reminder_last_sent TIMESTAMP;

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
Collaborator Author:

I'm not sure if enabling extensions should be part of migration files.

Member:

This is a reasonable way to do it, but I'd prefer to have less postgres magic rather than more. It's not clear that we need to fetch a random valid repository at startup -- we can simply start with a random UUID and march forward on valid repository IDs from there.

If we did need to get a random repository, I'd be inclined to simply use:

SELECT * FROM repositories
WHERE id > gen_random_uuid()
LIMIT 1;

(plus a little bit to fetch the lowest-numbered repository if that query returns zero rows.)
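
(For concreteness, that fallback could be the following sketch, assuming the same repositories table:)

-- Hypothetical fallback: if the random probe above returns zero rows,
-- wrap around and take the lowest-id repository instead.
SELECT * FROM repositories
ORDER BY id ASC
LIMIT 1;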

Contributor:

suggestion: I think we can get rid of the GetRandomRepository query by calculating a random UUID on the application side and passing it straight to ListEligibleRepositoriesAfterID. Added benefits would be:

  • we have one less query
  • one less step in the process
  • the whole process becomes deterministic (and easier to test)

Collaborator Author:

If we want to start at a random point, fetching a valid random repo is required. If we can start at any point, then sequential iteration would do the job.

If we generate a random UUID on the application side instead of fetching a valid repo, we can end up with an out-of-range UUID, and we'd have to either generate a UUID again or start from the beginning (which defeats the purpose of random generation).

It all comes down to whether we want to start from a valid random point or not.

sqlc.yaml Outdated
Comment on lines 28 to 33
- db_type: 'pg_catalog.interval'
go_type: 'github.com/jackc/pgtype.Interval'
- db_type: 'pg_catalog.interval'
go_type: 'github.com/jackc/pgtype.NullInterval'
nullable: true

Member:

Another option would be to use make_interval(secs => {{ duration.Seconds() }})

Another option would simply be to pass a string through here, rather than a time.Duration. Given that this is a configuration constant, I'd prefer to pull in fewer dependencies to support it.
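
(As a sketch of the make_interval variant -- the parameter name is illustrative -- the threshold could be passed from Go as duration.Seconds() and the interval built in SQL:)

-- Hypothetical: take the threshold as plain seconds (a float) so no Go
-- interval type is needed; make_interval constructs the interval in SQL.
SELECT * FROM repositories
WHERE reminder_last_sent + make_interval(secs => sqlc.arg('min_elapsed_seconds')::float) < NOW();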

Collaborator Author:

Done.

@Vyom-Yadav Vyom-Yadav marked this pull request as ready for review May 25, 2024 09:55
@rdimitrov (Member) left a comment:

I missed if it was discussed but should we also have a way to disable reminder for certain repos by filtering repos out of the batch creation logic?

@Vyom-Yadav (Collaborator Author):

I missed if it was discussed but should we also have a way to disable reminder for certain repos by filtering repos out of the batch creation logic?

User configurable option?

Why would someone want to do that? To preserve their rate limit? I don't think reminder should be exposed to end users; it's just background reconciliation to keep the system up to date.

@blkt (Contributor) left a comment:

Small suggestion as I keep looking at the code.

Comment on lines 19 to 22
-- name: GetRandomRepository :one
SELECT * FROM repositories
TABLESAMPLE SYSTEM_ROWS(1);

Contributor:

Hi @Vyom-Yadav, thank you for your great work! 🙇

While I don't think adding tsm_system_rows would be a problem, I'm not convinced it's the right tool in this scenario.

If I got the idea right, here we're trying to get a row at random to use its ID as a cursor, and all subsequent queries would be based on WHERE r.id > $1, which is OK and most likely guaranteed to go by index.

I was wondering: why not get rid of the additional dependency on tsm_system_rows and simply generate one "first" random UUID on the application side and use that to start? If I got the procedure right, we would also be able to remove this statement, as we control randomness from the outside.

Collaborator Author:

"first" random uuid on the application side and then use that use that to start?

There is no way that I know that generates a random uuid in some range (I'd rather not play with that). If we generate on the application side, then we can end up with an out-of-range uuid, which leads to simple sequential iteration.

Member:

I believe we're using v4 UUIDs, which are random, except for about 7 bits in the middle of the string. (It looks like entity_event.go uses v1 UUIDs...). In particular, the default for the id field on repositories is gen_random_uuid(), which is a v4 UUID.
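
(For reference, a minimal application-side equivalent, assuming the github.com/google/uuid package:)

package main

import (
	"fmt"

	"github.com/google/uuid"
)

func main() {
	// uuid.NewRandom returns a v4 (random) UUID -- the same variant that
	// Postgres's gen_random_uuid() produces. uuid.New() is the
	// panic-on-error shorthand for the same thing.
	cursor, err := uuid.NewRandom()
	if err != nil {
		panic(err)
	}
	fmt.Println("starting cursor:", cursor)
}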

@evankanderson (Member) left a comment:

Good to see you back! I'm going to encourage you to be a little less precise here in favor of ensuring there's an upper bound on the amount of work that we put into the system at any given time, which I think is the more critical piece for background operation.


Comment on lines 53 to 60
SELECT r.* FROM repositories r
INNER JOIN rule_evaluations re ON re.repository_id = r.id
INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE r.id > $1
GROUP BY r.id
HAVING MIN(rde.last_updated) + sqlc.arg('min_elapsed')::interval < NOW()
ORDER BY r.id
LIMIT sqlc.narg('limit')::bigint;
Member:

I can see where this is coming from -- I'm also worried about the cost of this query on the database side. With a fairly simple amount of data, the query ends up being a lot more expensive than a full table scan on the database side (e.g. full scans of the tables are about 40-ish cost in this example).

Digging in a bit, I ran explain on a small database:

EXPLAIN SELECT r.* FROM repositories r
  INNER JOIN rule_evaluations re ON re.repository_id = r.id
  INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE r.id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid
GROUP BY r.id
HAVING MIN(rde.last_updated) + interval '1 hour' < NOW()
ORDER BY r.id
LIMIT 10;
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Limit  (cost=57.74..57.76 rows=10 width=338)
   ->  Sort  (cost=57.74..57.80 rows=24 width=338)
         Sort Key: r.id
         ->  HashAggregate  (cost=55.94..57.22 rows=24 width=338)
               Group Key: r.id
               Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
               ->  Hash Join  (cost=32.24..54.65 rows=258 width=346)
                     Hash Cond: (rde.rule_eval_id = re.id)
                     ->  Seq Scan on rule_details_eval rde  (cost=0.00..17.80 rows=780 width=24)
                     ->  Hash  (cost=30.12..30.12 rows=169 width=354)
                           ->  Hash Join  (cost=13.66..30.12 rows=169 width=354)
                                 Hash Cond: (re.repository_id = r.id)
                                 ->  Seq Scan on rule_evaluations re  (cost=0.00..15.10 rows=510 width=32)
                                 ->  Hash  (cost=12.75..12.75 rows=73 width=338)
                                       ->  Seq Scan on repositories r  (cost=0.00..12.75 rows=73 width=338)
                                             Filter: (id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid)

The fundamental limit here seems to be that since we don't have an index on rde.last_updated, we'll always incur a sequential scan on rule_details_eval (one of our larger tables).

It's possible that on a large database we'd see different query planning, but I worry that the layers of indirection and aggregation will fundamentally make this query hard to plan efficiently. In the worst case, we might have a database with 100M rows where, at any given query, only 100 rule_details_eval rows are older than our threshold. In that scenario, we'd end up scanning the 100M rows on each query looking for work, finding 10 repositories, and then repeating.

My gut feeling is that it's more important to have a steady amount of load on the database (many light queries) than to get exactly as much work as possible in any particular iteration. One way to do this would be to use a sub-select to limit the number of repositories considered in the query (e.g. SELECT ... FROM (SELECT * FROM repositories WHERE id > $1 ORDER BY id LIMIT 50) AS r ...). Another option would simply be to do the work outside of a single SQL query, e.g. SELECT * FROM repositories WHERE id > $1 LIMIT 30, and then processing each repository after that.

Using the sub-select with limit on repositories with my sample query and a limit of 20, I see a 25-30% reduction in query time, but I suspect the behavior is stronger with larger databases.
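
(A sketch of the sub-select variant, with an illustrative limit; selecting only r.id keeps the GROUP BY valid when r is a subquery rather than a base table:)

SELECT r.id FROM (
  SELECT id FROM repositories
  WHERE id > $1
  ORDER BY id
  LIMIT 50
) AS r
INNER JOIN rule_evaluations re ON re.repository_id = r.id
INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
GROUP BY r.id
HAVING MIN(rde.last_updated) + sqlc.arg('min_elapsed')::interval < NOW()
ORDER BY r.id;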

Member:

Given that we have reminder_last_sent, it might also be reasonable to simply use that -- we'd have a one-time extra batch of revisits when this rolls out, and then the query could be:

SELECT * FROM (SELECT * FROM repositories WHERE id > $1 LIMIT 50) AS r
WHERE reminder_last_sent < NOW() - sqlc.arg('min_elapsed')::interval
ORDER BY id;

Collaborator Author:

Agreed on the statement analysis. The current query (without the sub-select) guarantees that we get limit rows (if they are there). With a sub-select, selected rows might not be eligible, i.e. their last_updated is recent.

Agreed that this is a single time-consuming query rather than a steady load of queries. To keep things (and mocking) simpler on the application side, I'd go with a sub-query rather than two different queries, i.e. selection and filtering.

Just using reminder_last_sent isn't a good parameter IMO. We can have a case where a reminder was sent 24h ago but the repo was recently updated (edge-based), so this would result in extra load on the server (reconciling the repo).

}
err := r.restoreCursorState(ctx)

randomRepo, err := r.store.GetRandomRepository(ctx)
Member:

Why not simply r.repositoryCursor = uuid.Random()?

Collaborator Author:

It may generate an out-of-range UUID, which would result in sequential iteration. There is nothing wrong with sequential iteration, but I coded it so that we start from a random valid UUID (theoretically, it's possible to get the first row's UUID as the random one from the db, which would also result in sequential iteration).

Starting from a random valid UUID may not be that important, so I'm willing to generate a random UUID on the application side if that's better (in terms of simplicity).

Member:

I'm not sure what an out-of-range uuid is -- I'd meant uuid.NewRandom(), which generates a valid (v4) UUID.

So, there is one edge case here: we may generate a UUID which is larger than any of the UUIDs in the database (at which point, we should start at the sequentially-first UUID). I believe that we're using v4 (random) UUIDs, so we should end up with a roughly even distribution of UUIDs when choosing at random. (If we were using time-sorted UUIDs, picking a random UUID would likely either start at the beginning or the end of the sequence.)

if err != nil {
// Non-fatal error, if we can't restore the cursor state, we'll start from scratch.
logger.Error().Err(err).Msg("error restoring cursor state")
return nil, err
Member:

This means reminder will exit with an empty database (say, when setting up for the first time).

Collaborator Author:

Yes, it should be:

if err != nil && !errors.Is(err, sql.ErrNoRows) {
  return nil, err
}

"repoListCursor": r.repoListCursor,
// Update the reminder_last_sent for each repository to export as metrics
for _, repo := range repos {
logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
Member:

Can you include both the repo and the old value of reminder_last_sent as structured fields in the logs? This could be an easy way to see what the actual delay on updates is.

Collaborator Author:

Done.

Comment on lines 180 to 183
Limit: sql.NullInt64{
Int64: int64(r.cfg.RecurrenceConfig.BatchSize),
Valid: true,
},
Member:

Why make this nullable?

Collaborator Author:

Bad copy-paste, changed it.

Comment on lines 194 to 177
// Only fetch additional repositories if we are under the limit
if len(repos) < r.cfg.RecurrenceConfig.BatchSize {
additionalRepos, err = r.getAdditionalRepos(ctx, repos)
if err != nil {
return nil, err
}

repos, intersectionPoint = r.mergeRepoBatch(repos, additionalRepos)
}
Member:

This feels complicated -- it feels like we should call ListEligibleRepositoriesAfterID and then use the last returned element of that query to set the cursor for the next run (checking RepositoryExistsAfterID and setting the cursor to the zero UUID if no more repos exist).

Comment on lines 254 to 256
// There may be an intersection between the two sets
// If there is an intersection, then we need to update the cursor to the last fetched
// non-common repository
Member:

I'm not sure I understand how this happens, unless there are only e.g. 4 repos eligible in the whole database.

intersectionPoint := -1
var additionalRepos []db.Repository

// Only fetch additional repositories if we are under the limit
Member:

I don't think it's a requirement that we find exactly BatchSize repos, just that we don't exceed BatchSize repos in one pass. We're trying to upper-bound the amount of background work added to the system.

I think this will get a lot simpler if we just do one pass:

repos, err := r.store.ListEligibleRepositoriesAfterID(ctx, {...})
if err != nil {
  return nil, err
}

// Don't actually ignore the error here...
if len(repos) == 0 || !RepositoryExistsAfterID(ctx, repos[len(repos)-1].ID) {
  r.repositoryCursor = uuid.UUID{}
} else {
  r.repositoryCursor = repos[len(repos)-1].ID
}

return repos, nil

// non-common repository
intersectionPoint := -1
for i, repo := range additionalRepos {
if reposSet.Has(repo) {
Member:

This is using repo as a hashable object, but we've done two different queries, so the repos might differ in some minutiae like updated_at and be included in the batch twice. Given that we know that the two lists are sorted, we could also march through additionalRepos comparing less-than the first item in repos and find the intersection point without needing to use the set.

But, as mentioned, I think we can avoid a lot of this code if we're willing to not exactly fill each reminder batch.
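
(A sketch of that pointer-march, assuming both slices are sorted ascending by ID, as the queries' ORDER BY id guarantees; type and function names are illustrative, and as noted below the single-fetch rework made this moot:)

package reminder

import (
	"bytes"

	"github.com/google/uuid"
)

type Repo struct{ ID uuid.UUID }

// mergeBatches appends to repos only the prefix of additional that sorts
// strictly below repos[0]; anything at or above that point was already
// covered by the first fetch, so no set membership check is needed.
func mergeBatches(repos, additional []Repo) []Repo {
	if len(repos) == 0 {
		return additional
	}
	cut := len(additional)
	for i, r := range additional {
		// uuid.UUID is a [16]byte; bytewise comparison matches Postgres's
		// uuid ordering.
		if bytes.Compare(r.ID[:], repos[0].ID[:]) >= 0 {
			cut = i
			break
		}
	}
	return append(repos, additional[:cut]...)
}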

Collaborator Author:

Ahh, I missed this, but now we fetch only once, so this isn't required.

@Vyom-Yadav Vyom-Yadav force-pushed the addReminderLogic branch 2 times, most recently from e273a44 to efed08c Compare June 3, 2024 06:44
@Vyom-Yadav (Collaborator Author):

@evankanderson, I updated the PR to fetch only once, and batch_size will only be used as the upper limit. The only thing remaining is to decide how the random starting-point UUID is generated.

conn, err := grpc.DialContext(ctx, endpoint, opts...)
conn, err := grpc.NewClient(endpoint, opts...)
Collaborator Author:

I haven't verified the changes that came with protoc-gen-go v1.34.1.
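
(For context on the swap above, a minimal usage sketch with an illustrative endpoint; unlike DialContext, grpc.NewClient does not dial eagerly -- the connection is established lazily on the first RPC or an explicit conn.Connect():)

package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "localhost:8090" is a placeholder target.
	conn, err := grpc.NewClient("localhost:8090",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}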

@evankanderson (Member) left a comment:

This is looking pretty close -- just two concerns:

  1. It looks like LATERAL is a lot slower than just a standard inner join. /shrug
  2. I think your logic for when to loop back to the beginning isn't quite right.


ORDER BY id
LIMIT sqlc.arg('limit')::bigint
) r
JOIN LATERAL (
Member:

It looks like LATERAL converts this from a HashAggregate of a NestedLoop to a NestedLoop that runs an Aggregate inside:

EXPLAIN SELECT r.*
FROM (
  SELECT *
  FROM repositories
  ORDER BY id
  LIMIT 10
) r
JOIN LATERAL (
  SELECT MIN(rde.last_updated) AS min_last_updated
  FROM rule_evaluations re
  INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
  WHERE re.repository_id = r.id
) sub ON sub.min_last_updated + interval '1h' < NOW()
ORDER BY r.id;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=31.75..318.77 rows=10 width=338)
   ->  Limit  (cost=0.14..2.30 rows=10 width=338)
         ->  Index Scan using repositories_pkey on repositories  (cost=0.14..47.45 rows=220 width=338)
   ->  Aggregate  (cost=31.61..31.63 rows=1 width=8)
         Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
         ->  Nested Loop  (cost=8.12..31.60 rows=5 width=8)
               ->  Bitmap Heap Scan on rule_evaluations re  (cost=7.97..15.08 rows=3 width=16)
                     Recheck Cond: (repository_id = repositories.id)
                     ->  Bitmap Index Scan on rule_evaluations_results_name_lower_idx  (cost=0.00..7.97 rows=3 width=0)
                           Index Cond: (repository_id = repositories.id)
               ->  Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde  (cost=0.15..5.50 rows=1 width=24)
                     Index Cond: (rule_eval_id = re.id)
(12 rows)

vs

EXPLAIN SELECT r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
FROM (
  SELECT *
  FROM repositories
  ORDER BY id
  LIMIT 10
) r
JOIN rule_evaluations AS re on re.repository_id = r.id
JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
  WHERE re.repository_id = r.id
GROUP BY r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
HAVING MIN(rde.last_updated) + interval '1h' < NOW()
ORDER BY r.id;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=28.33..28.36 rows=13 width=146)
   Sort Key: r.id
   ->  HashAggregate  (cost=27.39..28.09 rows=13 width=146)
         Group Key: r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
         Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
         ->  Nested Loop  (cost=2.67..26.39 rows=40 width=154)
               ->  Hash Join  (cost=2.52..19.79 rows=26 width=162)
                     Hash Cond: (re.repository_id = r.id)
                     ->  Seq Scan on rule_evaluations re  (cost=0.00..15.10 rows=510 width=32)
                     ->  Hash  (cost=2.40..2.40 rows=10 width=146)
                           ->  Subquery Scan on r  (cost=0.14..2.40 rows=10 width=146)
                                 ->  Limit  (cost=0.14..2.30 rows=10 width=338)
                                       ->  Index Scan using repositories_pkey on repositories  (cost=0.14..47.45 rows=220 width=338)
               ->  Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde  (cost=0.15..0.25 rows=1 width=24)
                     Index Cond: (rule_eval_id = re.id)
(15 rows)

While there are more steps in the non-lateral query plan, it seems to plan and execute substantially faster than the LATERAL version. In an environment with more rows, this is the difference between EXPLAIN ANALYZE taking 30ms and 2ms of execution time.


// Update the reminder_last_sent for each repository to export as metrics
for _, repo := range repos {
logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
Member:

Zerolog supports structured logging, which allows us to later easily query and extract (for example) log lines that reference a specific repository id:

Suggested change
logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
logger.Debug().Str("repo", repo.ID.String()).Time("previously", repo.ReminderLastSent.Time).
Msg("updating reminder_last_sent")

Collaborator Author:

Done

logger.Debug().Msgf("previous reminder_last_sent: %s", repo.ReminderLastSent.Time)
err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
if err != nil {
logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
Member:

Suggested change
logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")

Collaborator Author:

Done

Comment on lines +188 to +217
if len(repos) == 0 {
r.repositoryCursor = uuid.Nil
Member:

Since we're filtering the repos we get back from the batch in the database, it's possible that we read e.g. 50 repos that all got filtered. I think you want to check RepositoryExistsAfterID here.

Collaborator Author:

I missed this branch 😞. I updated the logic. Hopefully I haven't missed any logical branches this time.
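
(A sketch of the corrected branch structure, with hypothetical names: the cursor must advance past the last fetched repository even when every fetched repo was filtered out, and it wraps only when nothing exists beyond that point:)

package reminder

import (
	"context"

	"github.com/google/uuid"
)

type Repository struct{ ID uuid.UUID }

type store interface {
	RepositoryExistsAfterID(ctx context.Context, id uuid.UUID) (bool, error)
}

// advanceCursor takes the raw fetched batch (before eligibility filtering).
// An empty fetch, or no rows beyond the last fetched one, wraps the cursor
// to uuid.Nil so the next pass starts from the beginning of the table.
func advanceCursor(ctx context.Context, s store, fetched []Repository) (uuid.UUID, error) {
	if len(fetched) == 0 {
		return uuid.Nil, nil
	}
	last := fetched[len(fetched)-1].ID
	exists, err := s.RepositoryExistsAfterID(ctx, last)
	if err != nil {
		return uuid.Nil, err
	}
	if !exists {
		return uuid.Nil, nil
	}
	return last, nil
}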

Comment on lines 203 to 204
logger.Error().Err(err).Msgf("unable to check if repository exists after cursor: %s", r.repositoryCursor)
logger.Info().Msg("resetting cursor to zero uuid")
Member:

I'd tend to put this as one message in the logs, rather than two separate messages. Especially when searching structured logs (say, for the last 48 hours across 8 or 10 servers), it can be hard to "scroll" forward or backward to the next log line from a particular server.

Collaborator Author:

Done

@Vyom-Yadav Vyom-Yadav force-pushed the addReminderLogic branch 3 times, most recently from ef39a5d to 6c92820 Compare June 5, 2024 18:02
Comment on lines 77 to 85
-- ListOldestRuleEvaluationsByRepositoryId has casts in select statement as sqlc generates incorrect types.
-- Though repository_id doesn't have a non-null constraint, it always has a value in the database.
-- cast after MIN is required due to a known bug in sqlc: https://github.com/sqlc-dev/sqlc/issues/1965

-- name: ListOldestRuleEvaluationsByRepositoryId :many
SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
FROM rule_evaluations re
INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
WHERE re.repository_id = ANY (sqlc.arg('repository_ids')::uuid[])
GROUP BY re.repository_id;
Collaborator Author:

Can re.repository_id be null? I was under the assumption that minder needs a repo to function on.

Collaborator Author:

Also

EXPLAIN SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
        FROM rule_evaluations re
               INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
        WHERE re.repository_id = ANY (array['de0b2ad2-bc90-4126-b0a2-63abc1cce808','81b2a2ce-85fb-4528-a43e-7eef02f596f2','7e0813ce-6201-48f0-b136-bf08be8efcb9']::uuid[])
        GROUP BY re.repository_id;

GroupAggregate  (cost=37.19..37.36 rows=8 width=24)
  Group Key: re.repository_id
  ->  Sort  (cost=37.19..37.22 rows=12 width=24)
        Sort Key: re.repository_id
        ->  Hash Join  (cost=17.11..36.97 rows=12 width=24)
              Hash Cond: (rde.rule_eval_id = re.id)
              ->  Seq Scan on rule_details_eval rde  (cost=0.00..17.80 rows=780 width=24)
              ->  Hash  (cost=17.01..17.01 rows=8 width=32)
                    ->  Seq Scan on rule_evaluations re  (cost=0.00..17.01 rows=8 width=32)
"                          Filter: (repository_id = ANY ('{de0b2ad2-bc90-4126-b0a2-63abc1cce808,81b2a2ce-85fb-4528-a43e-7eef02f596f2,7e0813ce-6201-48f0-b136-bf08be8efcb9}'::uuid[]))"

It does a sequential scan on both tables. I can understand the sequential scan on rule_evaluations table as there is no index on re.repository_id. But after obtaining re.ids associated with re.repository_id, why does it need to perform a sequential scan on rule_details_eval rde? There is an index on rde.rule_eval_id.

Member:

repository_id can be null for resources which aren't associated with a repository (for example, we're working on a DockerHub provider that wouldn't have git repos). I think it's fine for Reminder not to work with those yet.

It looks like we don't have a relevant index on repository_id:

\d rule_evaluations
                    Table "public.rule_evaluations"
     Column      |   Type   | Collation | Nullable |      Default      
-----------------+----------+-----------+----------+-------------------
 id              | uuid     |           | not null | gen_random_uuid()
 entity          | entities |           | not null | 
 profile_id      | uuid     |           | not null | 
 rule_type_id    | uuid     |           | not null | 
 repository_id   | uuid     |           |          | 
 artifact_id     | uuid     |           |          | 
 pull_request_id | uuid     |           |          | 
 rule_name       | text     |           | not null | 
Indexes:
    "rule_evaluations_pkey" PRIMARY KEY, btree (id)
    "rule_evaluations_results_name_lower_idx" UNIQUE, btree (profile_id, lower(rule_name), repository_id, COALESCE(artifact_id, '00000000-0000-0000-0000-000000000000'::uuid), entity, rule_type_id, COALESCE(pull_request_id, '00000000-0000-0000-0000-000000000000'::uuid)) NULLS NOT DISTINCT

This somewhat surprises me, as I'd have expected the foreign key constraint on repository_id to create an index. We should probably add indexes to support these -- feel free to only add the one for repository_id right now. (This may actually have been the proximate cause of a lot of the other queries having screwy cost estimates -- adding the index changed the explain from a Seq Scan + Filter to a Bitmap Heap Scan (Recheck Cond) + Bitmap Index Scan (Index Cond).)

rule_details_eval seems to be fine:

Indexes:
    "rule_details_eval_pkey" PRIMARY KEY, btree (id)
    "idx_rule_detail_eval_ids" UNIQUE, btree (rule_eval_id)

Collaborator Author:

Added an index on rule_evaluations(repository_id). The index will be created in concurrent mode, since blocking writes to rule_evaluations isn't a good option.
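
(The migration shape would be roughly as follows; the index name is illustrative:)

-- CONCURRENTLY builds the index without taking a lock that blocks writes to
-- rule_evaluations; note that it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_rule_evaluations_repository_id
    ON rule_evaluations (repository_id);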

@Vyom-Yadav (Collaborator Author):

@evankanderson To address the problem of updating the cursor properly, I have split the fetching and filtering queries. Now, repositories are fetched unconditionally (to update the cursor) and then filtered on the application side.

@evankanderson (Member) left a comment:

This is getting really close!

I'd actually missed the "query might return zero repos but we still need to advance the cursor" issue -- nice catch!


err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
if err != nil {
logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")
return []error{err}
Member:

It looks like this accumulates a list of errors by the signature, but this code does an early-return. Do you want to accumulate the errors, or just return err?

Collaborator Author:

This function will largely change in the next PR, i.e. connecting minder and reminder. A slice of errors is returned because we can get errors while creating messages, sending them, etc. Again, we can discuss this in the next PR. (See the sendReminders function in the old PR.)

Comment on lines +198 to +205
idToLastUpdatedMap := make(map[uuid.UUID]time.Time)
for _, oldestRuleEval := range oldestRuleEvals {
idToLastUpdatedMap[oldestRuleEval.RepositoryID] = oldestRuleEval.OldestLastUpdated
}

for _, repo := range repos {
if oldestRuleEval, ok := idToLastUpdatedMap[repo.ID]; ok &&
oldestRuleEval.Add(r.cfg.RecurrenceConfig.MinElapsed).Before(time.Now()) {
Member:

Why do this two-step loop here? It feels like either passing in the cutoff to the query (i.e. SELECT ... WHERE rde.last_updated < $1 AND re.repository_id = ANY(...) GROUP BY re.repository_id or the equivalent HAVING), or looping through the results once should be sufficient:

cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, evalTime := range oldestRuleEvals {
  if evalTime.OldestLastUpdated.Before(cutoff) {
    eligibleRepos = append(eligibleRepos, repo)
  }
}

Collaborator Author:

ListOldestRuleEvaluationsByRepositoryId only queries rule_evaluations and rule_details_eval, so it returns a slice of:

type ListOldestRuleEvaluationsByRepositoryIdRow struct {
	RepositoryID      uuid.UUID `json:"repository_id"`
	OldestLastUpdated time.Time `json:"oldest_last_updated"`
}

We have to iterate over repos as the repo object isn't returned.

Member:

Ah, that seems like it's worth a comment about how we're doing a bunch of transforms of types to fit into the sqlc-generated code (rather than for some other reason, like performance or thread-safety).

Also, a slight preference for:

cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, repo := range repos {
  if t, ok := idToLastUpdate[repo.ID]; ok && t.Before(cutoff) {
    ....
  }
}

This has two benefits:

  1. You can fit the condition on one line (shortening the time var that lives for a single line, removing Map from the name of the map, and doing the date math on a separate line).
  2. You only do the date math once, and only fetch time.Now() once, rather than using a slightly different time for each check.

@Vyom-Yadav Vyom-Yadav force-pushed the addReminderLogic branch 2 times, most recently from 413869a to 7afc72f Compare June 11, 2024 07:59

@evankanderson (Member):

I'm going to merge and then send a PR for the cleanup in getEligibleRepositories to avoid needing to do another go-round. 😁

@evankanderson evankanderson merged commit a2b09c7 into mindersec:main Jun 11, 2024
23 checks passed