Greatest-Per-Group Query in PostgreSQL Using DISTINCT ON
This is part 1 of 4 in a series on how to use Greatest Per Group queries in Ruby on Rails to avoid the N + 1 query problem:
- When and how to use Greatest Per Group queries (with PostgreSQL)
- Rails implementation using custom query methods
- Rails implementation using a scoped
has_one
association - Rails implementation using a view backed model
Everything has been tested with Rails 4.2 and PostgreSQL. See demonstration app and it's source code
When to use Greatest Per Group queries
Something that I find myself needing to do frequently in my Ruby on Rails apps is to get the largest/smallest/newest/oldest/whateverest element in a model's has_many
association. So, if my app has a Author
model which has_many :posts
, I might want to show the title of an author's most recent post when I'm displaying other information about the author.
If I'm only showing a single author in my view, this is easy to do without encountering a performance problem. I would just do something like this:
@post = @author.posts.order(created_at: :desc).first
@post.title
That generates two queries: one to load the author and another to load the most recent post. Pretty simple.
But if I want to show a list of authors along with the title of each author's most recent post, then we have to do something different. If we try the same approach, we can generate a lot of queries: one to load the list authors and then another N queries as we loop over each author and load their most recent post. This is the dreaded N + 1 query problem.
We really only want to fire off two queries: one to load the authors and then one more to load just their most recent posts. I am calling this a Greatest Per Group query because it is a special case of a Greatest N Per Group query, where N = 1. And it doesn't have to be the greatest per group, it could be the largest, smallest, newest, oldest, or whatever-est per group. The point is that you are trying to rank the records on the many side of a one-to-many relationship and only return a single record.
Okay, now that we know what a Greatest Per Group query is and where it's useful, let's figure out how they work.
How Greatest Per Group queries work
The good news is that SQL's reason for existence is to enable exactly this kind of thing. What we want to do is JOIN
our authors
table to a table of posts
which only has one post for each author: the most recent post. So we know the overall structure will look like this:
SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN newest_posts ON newest_posts.author_id = authors.id
That's a fairly straightforward join. In this case we're using a LEFT JOIN
since we want to list all authors, even if they have never written a post. What we glossed over, though, was that newest_posts
table. The problem is that we don't have a newest_posts
table. So how are we going to join a table that doesn't exist?
One of the great things about SQL is that it allows you to use the results of a query as a temporary table that you can join to. So, we're going to create a temporary newest_posts
table and then join it to our authors
table. First, though, we need to figure out how to generate a table that only has the most recent post for each author.
Your first thought might be that this is a perfect place to use GROUP BY
. That was my first thought, too. But GROUP BY
won't work for our purposes here because – in Postgres, at least – it limits what columns you can select to the column(s) you're grouping by and columns created by aggregate functions like MIN
, MAX
, AVG
, SUM
, and COUNT
. So a query like this will return an error:
SELECT posts.*
FROM posts
GROUP BY author_id
ORDER BY created_at DESC
The best we can do with GROUP BY
is something like this:
SELECT author_id, MAX(created_at) AS newest_post
FROM posts
GROUP BY author_id
That will return a result that looks like this:
author_id | newest_post
----------+------------
1 | 2015-08-07
2 | 2015-11-15
3 | 2014-04-23
4 | 2015-01-18
This could be useful if all we want to do is show when the last post was created. But we want to show the title of the newest post's, so this isn't very useful to us. That means GROUP BY
is out as an option, at least in Postgres. So, what are our options?
Well, Postgres has this marvelous thing called DISTINCT ON
. You can go read the documentation for the full run down, but the short description is this: DISTINCT ON
will only return the first row for each group of rows that share a unique combination of your distinct column(s). So you can use ORDER BY
to sort your results and that will determine which row DISTINCT ON
returns. This allows you to get only the newest, oldest, largest, smallest, etc. records for an association. Plus, unlike GROUP BY
, it allows you to SELECT
any and all columns that you desire.
Here's how we get a list of each of the most recent posts per author using DISTINCT ON
:
SELECT DISTINCT ON (author_id) *
FROM posts
ORDER BY author_id, created_at DESC
That's it. Fairly simple when it's all said and done. There are a couple of things which tripped me up the first time I used DISTINCT ON
, though, so I'll list them here:
- There is no comma directly after the
DISTINCT ON (author_id)
part of theSELECT
statement. But there are commas after that if you're selecting more than one column. - The first (leftmost) column in the
ORDER BY
clause has to be the distinct column –author_id
in our case. This is kinda goofy, since I think it's an implementation detail leaking out, but it's a thing that Postgres will make you do. - The remaining items in the
ORDER BY
clause work as you normally expect and are what determine whether you get the newest, oldest, largest, smallest, or whateverest record, so pay attention to them.
Okay, let's put it all together now. So, we have our overall query:
SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN newest_posts ON newest_posts.author_id = authors.id
And we know that the newest_posts
table that we're joining to the authors
table is actually going to be a temporary table that we generate with this query:
SELECT DISTINCT ON (author_id) *
FROM posts
ORDER BY author_id, created_at DESC
But how do we actually do that? Well, we wrap the query in parentheses, which makes it a subquery, then we give it a name with AS newest_posts
, and then we join it. Here's what that looks like:
SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN (SELECT DISTINCT ON (author_id) *
FROM posts
ORDER BY author_id, created_at DESC)
AS newest_posts
ON newest_posts.author_id = authors.id
This will return a result that looks like this:
author_id | first_name | last_name | newest_post_title
----------+------------+-----------+------------------
1 | Jane | Austen | Persuasion
2 | Charles | Dickens | Oliver Twist
3 | J.D. | Salinger | Nine Stories
Okay, now we know what our SQL should look like. But how do we put that into our Rails app in a way that's useful? Well, there are three approaches that I will cover:
- Using custom query methods (quick and dirty)
- Using a scoped
has_one
association (clean and elegant) - Using a view backed model (when all else fails)