Greatest-Per-Group Query in PostgreSQL Using DISTINCT ON

This is part 1 of 4 in a series on how to use Greatest Per Group queries in Ruby on Rails to avoid the N + 1 query problem:

When and how to use Greatest Per Group queries (with PostgreSQL)
Rails implementation using custom query methods
Rails implementation using a scoped has_one association
Rails implementation using a view backed model

Everything has been tested with Rails 4.2 and PostgreSQL. See demonstration app and it's source code

When to use Greatest Per Group queries

Something that I find myself needing to do frequently in my Ruby on Rails apps is to get the largest/smallest/newest/oldest/whateverest element in a model's has_many association. So, if my app has a Author model which has_many :posts, I might want to show the title of an author's most recent post when I'm displaying other information about the author.

If I'm only showing a single author in my view, this is easy to do without encountering a performance problem. I would just do something like this:

@post = @author.posts.order(created_at: :desc).first
@post.title

That generates two queries: one to load the author and another to load the most recent post. Pretty simple.

But if I want to show a list of authors along with the title of each author's most recent post, then we have to do something different. If we try the same approach, we can generate a lot of queries: one to load the list authors and then another N queries as we loop over each author and load their most recent post. This is the dreaded N + 1 query problem.

We really only want to fire off two queries: one to load the authors and then one more to load just their most recent posts. I am calling this a Greatest Per Group query because it is a special case of a Greatest N Per Group query, where N = 1. And it doesn't have to be the greatest per group, it could be the largest, smallest, newest, oldest, or whatever-est per group. The point is that you are trying to rank the records on the many side of a one-to-many relationship and only return a single record.

Okay, now that we know what a Greatest Per Group query is and where it's useful, let's figure out how they work.

How Greatest Per Group queries work

The good news is that SQL's reason for existence is to enable exactly this kind of thing. What we want to do is JOIN our authors table to a table of posts which only has one post for each author: the most recent post. So we know the overall structure will look like this:

SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN newest_posts ON newest_posts.author_id = authors.id

That's a fairly straightforward join. In this case we're using a LEFT JOIN since we want to list all authors, even if they have never written a post. What we glossed over, though, was that newest_posts table. The problem is that we don't have a newest_posts table. So how are we going to join a table that doesn't exist?

One of the great things about SQL is that it allows you to use the results of a query as a temporary table that you can join to. So, we're going to create a temporary newest_posts table and then join it to our authors table. First, though, we need to figure out how to generate a table that only has the most recent post for each author.

Your first thought might be that this is a perfect place to use GROUP BY. That was my first thought, too. But GROUP BY won't work for our purposes here because – in Postgres, at least – it limits what columns you can select to the column(s) you're grouping by and columns created by aggregate functions like MIN, MAX, AVG, SUM, and COUNT. So a query like this will return an error:

SELECT posts.*
FROM posts
GROUP BY author_id
ORDER BY created_at DESC

The best we can do with GROUP BY is something like this:

SELECT author_id, MAX(created_at) AS newest_post
FROM posts
GROUP BY author_id

That will return a result that looks like this:

author_id | newest_post
----------+------------
        1 | 2015-08-07
        2 | 2015-11-15
        3 | 2014-04-23
        4 | 2015-01-18

This could be useful if all we want to do is show when the last post was created. But we want to show the title of the newest post's, so this isn't very useful to us. That means GROUP BY is out as an option, at least in Postgres. So, what are our options?

Well, Postgres has this marvelous thing called DISTINCT ON. You can go read the documentation for the full run down, but the short description is this: DISTINCT ON will only return the first row for each group of rows that share a unique combination of your distinct column(s). So you can use ORDER BY to sort your results and that will determine which row DISTINCT ON returns. This allows you to get only the newest, oldest, largest, smallest, etc. records for an association. Plus, unlike GROUP BY, it allows you to SELECT any and all columns that you desire.

Here's how we get a list of each of the most recent posts per author using DISTINCT ON:

SELECT DISTINCT ON (author_id) *
FROM posts
ORDER BY author_id, created_at DESC

That's it. Fairly simple when it's all said and done. There are a couple of things which tripped me up the first time I used DISTINCT ON, though, so I'll list them here:

There is no comma directly after the DISTINCT ON (author_id) part of the SELECT statement. But there are commas after that if you're selecting more than one column.
The first (leftmost) column in the ORDER BY clause has to be the distinct column – author_id in our case. This is kinda goofy, since I think it's an implementation detail leaking out, but it's a thing that Postgres will make you do.
The remaining items in the ORDER BY clause work as you normally expect and are what determine whether you get the newest, oldest, largest, smallest, or whateverest record, so pay attention to them.

Okay, let's put it all together now. So, we have our overall query:

SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN newest_posts ON newest_posts.author_id = authors.id

And we know that the newest_posts table that we're joining to the authors table is actually going to be a temporary table that we generate with this query:

SELECT DISTINCT ON (author_id) *
FROM posts
ORDER BY author_id, created_at DESC

But how do we actually do that? Well, we wrap the query in parentheses, which makes it a subquery, then we give it a name with AS newest_posts, and then we join it. Here's what that looks like:

SELECT authors.*, newest_posts.title AS newest_post_title
FROM authors
LEFT JOIN (SELECT DISTINCT ON (author_id) *
           FROM posts
           ORDER BY author_id, created_at DESC)
           AS newest_posts
ON newest_posts.author_id = authors.id

This will return a result that looks like this:

author_id | first_name | last_name | newest_post_title
----------+------------+-----------+------------------
        1 | Jane       | Austen    | Persuasion
        2 | Charles    | Dickens   | Oliver Twist
        3 | J.D.       | Salinger  | Nine Stories

Okay, now we know what our SQL should look like. But how do we put that into our Rails app in a way that's useful? Well, there are three approaches that I will cover:

Using custom query methods (quick and dirty)
Using a scoped has_one association (clean and elegant)
Using a view backed model (when all else fails)

Greatest-Per-Group Query in PostgreSQL Using DISTINCT ON

Richard Jones

Richard Jones

When to use Greatest Per Group queries

How Greatest Per Group queries work

Extending rails.vim Navigation Commands with projectionist.vim

Webcal Links in Rails with the icalendar Gem

Greatest-Per-Group Query in Rails Using a View Backed Model

Greatest-Per-Group Query in Rails Using Custom Query Methods

How I Record Programming Talks