Unique latest grouped records (#3)
* Unique latest grouped records

* fixes
ka8725 authored Jul 13, 2023
1 parent 976b699 commit 797e749
Showing 2 changed files with 179 additions and 0 deletions.
179 changes: 179 additions & 0 deletions _posts/2023-07-13-select-unique-latest-grouped-records-from-db.md
@@ -0,0 +1,179 @@
---
layout: post
title: "Select unique latest grouped records from DB"
headline: "Select unique latest grouped records from DB"
modified: 2023-07-13 17:26:54 +0200
description: "Learn a technique that allows you to build a recent records block in a Ruby on Rails application."
tags: [sql, active_record]
featured_post: false
toc: true
image: unique-grouped.jpg
---

Nowadays, almost every Ruby on Rails application has a so-called recent records block.
This block usually shows statistics or a list of recent items in the project, selected by some criteria over the last few days.
It can be something like "top 10 products", "the most popular projects", or "the most relevant apartments". Read this blog post to learn how to efficiently build the data for these blocks using SQL window functions in a Ruby on Rails app.

## Recent records - the task overview

Assume that you've got an app that has a `Project` model. It has many `ratings`. The `Rating` model stores a value from 1 to 5 that a user assigns to a project.

To understand the task, it's enough to look at the table definition:

```sql
db-# \d ratings
                                          Table "public.ratings"
   Column    │              Type              │ Collation │ Nullable │               Default
═════════════╪════════════════════════════════╪═══════════╪══════════╪═════════════════════════════════════
 id          │ bigint                         │           │ not null │ nextval('ratings_id_seq'::regclass)
 reviewer_id │ bigint                         │           │          │
 reviewee_id │ bigint                         │           │          │
 rating      │ integer                        │           │          │
 review      │ text                           │           │          │
 project_id  │ bigint                         │           │          │
 created_at  │ timestamp(6) without time zone │           │ not null │
 updated_at  │ timestamp(6) without time zone │           │ not null │
Indexes:
    "ratings_pkey" PRIMARY KEY, btree (id)
```

This table has the following data:

```sql
db-# select id, reviewer_id, reviewee_id, rating, project_id, created_at from ratings;
 id │ reviewer_id │ reviewee_id │ rating │ project_id │         created_at
════╪═════════════╪═════════════╪════════╪════════════╪════════════════════════════
  8 │           9 │          10 │      5 │         27 │ 2022-12-05 21:46:01.583185
 12 │           7 │           6 │      5 │         26 │ 2022-12-23 14:35:11.047002
 13 │           6 │           7 │      5 │         26 │ 2022-12-23 14:36:48.366411
 18 │           9 │          10 │      5 │         39 │ 2023-03-01 23:27:52.68548
 19 │          10 │           9 │      5 │         39 │ 2023-03-01 23:28:32.880234
 20 │           9 │          10 │      5 │         86 │ 2023-03-01 23:35:15.564763
(6 rows)
```

Our task is to return the latest rating per project. So the resulting records should be these:

```sql
 id │ reviewer_id │ reviewee_id │ rating │ project_id │         created_at
════╪═════════════╪═════════════╪════════╪════════════╪════════════════════════════
  8 │           9 │          10 │      5 │         27 │ 2022-12-05 21:46:01.583185
 13 │           6 │           7 │      5 │         26 │ 2022-12-23 14:36:48.366411
 19 │          10 │           9 │      5 │         39 │ 2023-03-01 23:28:32.880234
 20 │           9 │          10 │      5 │         86 │ 2023-03-01 23:35:15.564763
(4 rows)
```

Note that each project_id now appears only once. For the projects that had several ratings (project_id = 26 and 39), only the most recent rating remains.

ActiveRecord's query methods and plain Ruby alone don't offer an efficient way to solve this task: the straightforward approach is to load the records and pick the latest ones in memory. But SQL can solve it inside the database with the [window function](https://www.postgresql.org/docs/current/tutorial-window.html){:rel="nofollow" target="_blank"} technique.

## How window function with row number partition works

The idea is the following: we number the records from 1 to N within each group of records that share the same value of our search criterion (here, `project_id`). The most recent record in a group gets 1, and older ones get higher numbers. A record that has no duplicates gets 1.

For example:

```sql
 id │ project_id │ reviewee_id │         created_at         │ row_number
════╪════════════╪═════════════╪════════════════════════════╪════════════
  8 │         27 │          10 │ 2022-12-05 21:46:01.583185 │          1
 12 │         26 │           6 │ 2022-12-23 14:35:11.047002 │          2
 13 │         26 │           7 │ 2022-12-23 14:36:48.366411 │          1
 18 │         39 │          10 │ 2023-03-01 23:27:52.68548  │          2
 19 │         39 │           9 │ 2023-03-01 23:28:32.880234 │          1
 20 │         86 │          10 │ 2023-03-01 23:35:15.564763 │          1
(6 rows)
```

The ratings with id = 8 and 20 get row number 1 because each of them is the only rating of its project (27 and 86). The projects with id = 26 and 39 have two ratings each, so their rows get row numbers 1 and 2: the most recent rating of a project gets 1, and the older one gets 2.
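To make the numbering rule concrete, here is a minimal plain-Ruby sketch of what `row_number() over (partition by project_id order by created_at desc)` computes. It only illustrates the semantics on the sample rows. The `RATINGS` constant and the `with_row_numbers` method are names made up for this example, and in a real app the numbering should stay in the database.

```ruby
require "time"

# Sample rows copied from the ratings table above (only the relevant columns).
RATINGS = [
  { id: 8,  project_id: 27, created_at: Time.parse("2022-12-05 21:46:01.583185 UTC") },
  { id: 12, project_id: 26, created_at: Time.parse("2022-12-23 14:35:11.047002 UTC") },
  { id: 13, project_id: 26, created_at: Time.parse("2022-12-23 14:36:48.366411 UTC") },
  { id: 18, project_id: 39, created_at: Time.parse("2023-03-01 23:27:52.68548 UTC") },
  { id: 19, project_id: 39, created_at: Time.parse("2023-03-01 23:28:32.880234 UTC") },
  { id: 20, project_id: 86, created_at: Time.parse("2023-03-01 23:35:15.564763 UTC") }
].freeze

# Hypothetical helper: emulates row_number() over (partition by ... order by ... desc).
def with_row_numbers(rows, partition_by:, order_by:)
  rows
    .group_by { |row| row[partition_by] }            # partition by <partition_by>
    .flat_map do |_key, group|
      group
        .sort_by { |row| row[order_by] }.reverse     # order by <order_by> desc
        .each_with_index
        .map { |row, index| row.merge(row_number: index + 1) }
    end
end

numbered = with_row_numbers(RATINGS, partition_by: :project_id, order_by: :created_at)
numbered.sort_by { |row| row[:created_at] }.each do |row|
  puts "id=#{row[:id]} project_id=#{row[:project_id]} row_number=#{row[:row_number]}"
end
```

For the sample data, ratings 12 and 18 get row number 2 and all the others get 1, which matches the table above.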

## Use subselect to filter correct results

If we keep only the rows with row number 1, we get the required result. A window function can't be referenced in the `where` clause of the same query, though, because window functions are evaluated after `where`. If the numbered rows were a table, we could filter them with a plain `where` clause. For example, a view (virtual table) could be created for that. But we will keep it simple and use a subselect: the inner `select` returns the numbered data as above, and the outer `select` filters it down to the correct result.

But first, let's see how to write the SQL statement that assigns the row number using the already mentioned **window function**:

```sql
db-# select
  id,
  project_id,
  reviewee_id,
  created_at,
  row_number() over (partition by project_id order by created_at desc)
from ratings
order by created_at;

 id │ project_id │ reviewee_id │         created_at         │ row_number
════╪════════════╪═════════════╪════════════════════════════╪════════════
  8 │         27 │          10 │ 2022-12-05 21:46:01.583185 │          1
 12 │         26 │           6 │ 2022-12-23 14:35:11.047002 │          2
 13 │         26 │           7 │ 2022-12-23 14:36:48.366411 │          1
 18 │         39 │          10 │ 2023-03-01 23:27:52.68548  │          2
 19 │         39 │           9 │ 2023-03-01 23:28:32.880234 │          1
 20 │         86 │          10 │ 2023-03-01 23:35:15.564763 │          1
(6 rows)
```

The `row_number() over (partition by project_id order by created_at desc)` expression is a window function call that numbers the records from 1 to N within each group. In this case, the groups (partitions) are formed by `project_id`, and the rows inside each group are sorted by `created_at desc`.

Running this query in the DB console produces the result above.

Wrap this `select` with another `select` and filter only rows with number = 1:

```sql
db-# select
  id,
  reviewer_id,
  reviewee_id,
  rating,
  project_id,
  created_at
from (
  select *, row_number() over (partition by project_id order by created_at desc)
  from ratings
) as ratings
where row_number = 1
-- sort in the outer query: the order of a subselect is not guaranteed to be kept
order by created_at;

 id │ reviewer_id │ reviewee_id │ rating │ project_id │         created_at
════╪═════════════╪═════════════╪════════╪════════════╪════════════════════════════
  8 │           9 │          10 │      5 │         27 │ 2022-12-05 21:46:01.583185
 13 │           6 │           7 │      5 │         26 │ 2022-12-23 14:36:48.366411
 19 │          10 │           9 │      5 │         39 │ 2023-03-01 23:28:32.880234
 20 │           9 │          10 │      5 │         86 │ 2023-03-01 23:35:15.564763
(4 rows)
```

Voila, we've got what we want!
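As a sanity check, the expected result can also be reproduced in plain Ruby on the sample data. Note that this is a different technique from the window function: it groups the rows in memory and takes the newest row of each group. It needs every row loaded first, so treat it as a sketch for tests or tiny data sets only. `SAMPLE_ROWS` and `latest_per_group` are hypothetical names used just for this example.

```ruby
require "time"

# Sample rows copied from the ratings table above (only the relevant columns).
SAMPLE_ROWS = [
  { id: 8,  project_id: 27, created_at: Time.parse("2022-12-05 21:46:01.583185 UTC") },
  { id: 12, project_id: 26, created_at: Time.parse("2022-12-23 14:35:11.047002 UTC") },
  { id: 13, project_id: 26, created_at: Time.parse("2022-12-23 14:36:48.366411 UTC") },
  { id: 18, project_id: 39, created_at: Time.parse("2023-03-01 23:27:52.68548 UTC") },
  { id: 19, project_id: 39, created_at: Time.parse("2023-03-01 23:28:32.880234 UTC") },
  { id: 20, project_id: 86, created_at: Time.parse("2023-03-01 23:35:15.564763 UTC") }
].freeze

# Hypothetical helper: the newest row of every group, sorted by time.
def latest_per_group(rows, group_key:, time_key:)
  rows
    .group_by { |row| row[group_key] }                          # one bucket per project
    .map { |_key, group| group.max_by { |row| row[time_key] } } # newest row in the bucket
    .sort_by { |row| row[time_key] }                            # same order as the SQL output
end

latest = latest_per_group(SAMPLE_ROWS, group_key: :project_id, time_key: :created_at)
puts latest.map { |row| row[:id] }.inspect
```

The ids it returns for the sample data are 8, 13, 19 and 20, the same rows the SQL query selects.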

## Use ActiveRecord.from to return the results as Ruby objects

Since we've got the SQL query, it's easy to port it to ActiveRecord and eventually get a list of Ruby objects. We will use ActiveRecord's `from` method to write the subselect and `where` to keep only the rows with row number 1:

```ruby
> Rating
  .select("id, reviewer_id, reviewee_id, rating, project_id, created_at")
  .from("(select *, row_number() over (partition by project_id order by created_at desc) from ratings group by project_id, reviewee_id, created_at, id order by created_at) as ratings")
  .where(row_number: 1)

Rating Load (41.9ms) SELECT id, reviewer_id, reviewee_id, rating, project_id, created_at FROM (select *, row_number() over (partition by project_id order by created_at desc) from ratings group by project_id, reviewee_id, created_at, id order by created_at) as ratings WHERE "ratings"."row_number" = $1 [["row_number", 1]]
=> [#<Rating:0x000000011190d410 id: 8, reviewer_id: 9, reviewee_id: 10, rating: 5, project_id: 27, created_at: Mon, 05 Dec 2022 21:46:01.583185000 UTC +00:00>,
#<Rating:0x000000011190d348 id: 13, reviewer_id: 6, reviewee_id: 7, rating: 5, project_id: 26, created_at: Fri, 23 Dec 2022 14:36:48.366411000 UTC +00:00>,
#<Rating:0x000000011190d280 id: 19, reviewer_id: 10, reviewee_id: 9, rating: 5, project_id: 39, created_at: Wed, 01 Mar 2023 23:28:32.880234000 UTC +00:00>,
#<Rating:0x000000011190d1b8 id: 20, reviewer_id: 9, reviewee_id: 10, rating: 5, project_id: 86, created_at: Wed, 01 Mar 2023 23:35:15.564763000 UTC +00:00>]
```
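If several models need the same kind of query, the subselect string could be generated instead of written by hand. The following is only a sketch under stated assumptions: `numbered_subselect` and `IDENTIFIER` are hypothetical names, not a Rails API, and the helper assumes that table and column names are trusted constants from your own code, because they are interpolated into SQL. It builds the numbered subselect without the inner `group by` and `order by`.

```ruby
# Only plain identifiers are accepted, because the names are interpolated into SQL.
IDENTIFIER = /\A[a-z_][a-z0-9_]*\z/i

# Hypothetical helper: returns the subselect string to pass to `from`.
def numbered_subselect(table:, partition_by:, order_by:)
  [table, partition_by, order_by].each do |name|
    raise ArgumentError, "unsafe identifier: #{name.inspect}" unless name.to_s.match?(IDENTIFIER)
  end

  window = "row_number() over (partition by #{partition_by} order by #{order_by} desc)"
  "(select *, #{window} from #{table}) as #{table}"
end

puts numbered_subselect(table: "ratings", partition_by: "project_id", order_by: "created_at")

# Possible usage in a Rails app (assumes the Rating model from this post):
#   Rating
#     .from(numbered_subselect(table: "ratings", partition_by: "project_id", order_by: "created_at"))
#     .where(row_number: 1)
#     .order(:created_at)
```

The subselect keeps the alias equal to the table name, so ActiveRecord's generated `"ratings"."row_number"` condition still resolves against it.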

You can run this experiment yourself in this [demo app](https://github.com/widefix/demo-fast-sql){:rel="nofollow" target="_blank"}.

## Conclusion

A solid understanding of advanced SQL, such as window functions, lets you build features like the recent records block in a Ruby on Rails application so that they stay performant.

If you like this article and would like to see more examples of how SQL can improve your software development life, read these articles:

- [Make your Ruby on Rails app 80x faster with SQL](https://blog.widefix.com/importance-sql-for-rails-experts/){:rel="nofollow" target="_blank"}
- [Financial plan on PostgreSQL](https://blog.widefix.com/financial-plan-on-postgresql/){:rel="nofollow" target="_blank"}
- [Financial plan on Rails](https://blog.widefix.com/financial-plan-on-rails/){:rel="nofollow" target="_blank"}
- [From Single drop-down to Multiple check-boxes](https://blog.widefix.com/from-single-dd-to-multiple-checkboxes/){:rel="nofollow" target="_blank"}
- [Efficient algorithm to check dates overlap](https://blog.widefix.com/date-ranges-overlap/){:rel="nofollow" target="_blank"}

Have a good day ahead and happy coding!
Binary file added images/unique-grouped.jpg