Optimize review request queries and remove the distinct flag.

Review Request #13435 — Created Nov. 27, 2023 and submitted

Information

Review Board
release-5.0.x

Reviewers

This is the culmination of all the recent work going into query
refactoring. This change disables the distinct flag for review
requests by default, instead reworking queries to avoid the conditions
in which duplicate rows would be returned.

The cause of the duplicate rows was due to certain JOINs on the query.
In particular, on LEFT OUTER JOINs (though it's also possible with some
INNER JOINs). When multiple matches could appear on the right of the
join (e.g., a user is on more than one listed accessible review group,
or is also listed as a reviewer directly), multiple rows would be
returned covering each of those matches.

Using distinct was a bandaid over that, but it carried with it some
major performance issues. Notably, when trying to retrieve the total
count of results, Django would generate a COUNT(*) ... query with the
review request query as a subquery, and in the case of MySQL, this could
result in a temporary table with a full row scan to get the total count.

We now take a new approach. Instead of JOINs, we're using explicit
subqueries for these lookups, but as an EXISTS(...) condition. This
can be optimized fairly well, and in testing with significant numbers of
review requests that would result in many duplicates, performance is on
par with or better than the previous approach across all supported
databases.

By removing distinct, we also get to promote the COUNT(*) queries,
avoiding the subquery and helping the database better optimize these.

Another notable change is that the default sorting is now different. We
used to sort by Last Updated, then by username, then by summary. This
was from the very early days of Review Board, when we had a very plain
dashboard with no sorting options. It's not necessary anymore, and it
slows down most operations (we have to JOIN on auth_user for the
username). Plus, the Last Updated timestamp is likely going to be a
unique value anyway, especially with millisecond precision. So we now
simplify the default sort by only sorted by this timestamp.

This will still need additional testing in our own production servers
and with select customers experiencing issues before this is released,
but it should result in large performance improvements overall.

All unit tests pass.

Ran through extensive performance testing with the new queries,
measuring the average time for 10 queries on a database of 500,000
review requests with hundreds to thousands of accessible repositories
and review groups, across multiple versions of MySQL and Postgres.

In my tests, the new subqueries, at worst, had the same performance
profiles as the JOINs with DISTINCT. At best, they were significantly
faster.

Summary ID
Optimize review request queries and remove the distinct flag.
This is the culmination of all the recent work going into query refactoring. This change disables the `distinct` flag for review requests by default, instead reworking queries to avoid the conditions in which duplicate rows would be returned. The cause of the duplicate rows was due to certain JOINs on the query. In particular, on LEFT OUTER JOINs (though it's also possible with some INNER JOINs). When multiple matches could appear on the right of the join (e.g., a user is on more than one listed accessible review group, or is also listed as a reviewer directly), multiple rows would be returned covering each of those matches. Using `distinct` was a bandaid over that, but it carried with it some major performance issues. Notably, when trying to retrieve the total count of results, Django would generate a `COUNT(*) ...` query with the review request query as a subquery, and in the case of MySQL, this could result in a temporary table with a full row scan to get the total count. We now take a new approach. Instead of JOINs, we're using explicit subqueries for these lookups, but as an `EXISTS(...)` condition. This can be optimized fairly well, and in testing with significant numbers of review requests that would result in many duplicates, performance is on par with or better than the previous approach across all supported databases. By removing `distinct`, we also get to promote the `COUNT(*)` queries, avoiding the subquery and helping the database better optimize these. Another notable change is that the default sorting is now different. We used to sort by Last Updated, then by username, then by summary. This was from the very early days of Review Board, when we had a very plain dashboard with no sorting options. It's not necessary anymore, and it slows down most operations (we have to JOIN on `auth_user` for the username). Plus, the Last Updated timestamp is likely going to be a unique value anyway, especially with millisecond precision. So we now simplify the default sort by only sorted by this timestamp. This will still need additional testing in our own production servers and with select customers experiencing issues before this is released, but it should result in large performance improvements overall.
2449d42c989b0f4beed9653ebc45277889ed47c1
Description From Last Updated

Add docs for this arg.

maubinmaubin
david
  1. Ship It!
  2. 
      
maubin
  1. 
      
  2. Show all issues

    Add docs for this arg.

  3. 
      
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to release-5.0.x (4ec2fd7)