Optimize review request queries and remove the distinct flag.
Review Request #13435 — Created Nov. 27, 2023 and submitted — Latest diff uploaded
This is the culmination of all the recent work going into query
refactoring. This change disables thedistinct
flag for review
requests by default, instead reworking queries to avoid the conditions
in which duplicate rows would be returned.The cause of the duplicate rows was due to certain JOINs on the query.
In particular, on LEFT OUTER JOINs (though it's also possible with some
INNER JOINs). When multiple matches could appear on the right of the
join (e.g., a user is on more than one listed accessible review group,
or is also listed as a reviewer directly), multiple rows would be
returned covering each of those matches.Using
distinct
was a bandaid over that, but it carried with it some
major performance issues. Notably, when trying to retrieve the total
count of results, Django would generate aCOUNT(*) ...
query with the
review request query as a subquery, and in the case of MySQL, this could
result in a temporary table with a full row scan to get the total count.We now take a new approach. Instead of JOINs, we're using explicit
subqueries for these lookups, but as anEXISTS(...)
condition. This
can be optimized fairly well, and in testing with significant numbers of
review requests that would result in many duplicates, performance is on
par with or better than the previous approach across all supported
databases.By removing
distinct
, we also get to promote theCOUNT(*)
queries,
avoiding the subquery and helping the database better optimize these.Another notable change is that the default sorting is now different. We
used to sort by Last Updated, then by username, then by summary. This
was from the very early days of Review Board, when we had a very plain
dashboard with no sorting options. It's not necessary anymore, and it
slows down most operations (we have to JOIN onauth_user
for the
username). Plus, the Last Updated timestamp is likely going to be a
unique value anyway, especially with millisecond precision. So we now
simplify the default sort by only sorted by this timestamp.This will still need additional testing in our own production servers
and with select customers experiencing issues before this is released,
but it should result in large performance improvements overall.
All unit tests pass.
Ran through extensive performance testing with the new queries,
measuring the average time for 10 queries on a database of 500,000
review requests with hundreds to thousands of accessible repositories
and review groups, across multiple versions of MySQL and Postgres.In my tests, the new subqueries, at worst, had the same performance
profiles as the JOINs with DISTINCT. At best, they were significantly
faster.