• 
      

    Optimize review request queries and remove the distinct flag.

    Review Request #13435 — Created Nov. 27, 2023 and submitted — Latest diff uploaded

    Information

    Review Board
    release-5.0.x

    Reviewers

    This is the culmination of all the recent work going into query
    refactoring. This change disables the distinct flag for review
    requests by default, instead reworking queries to avoid the conditions
    in which duplicate rows would be returned.

    The cause of the duplicate rows was due to certain JOINs on the query.
    In particular, on LEFT OUTER JOINs (though it's also possible with some
    INNER JOINs). When multiple matches could appear on the right of the
    join (e.g., a user is on more than one listed accessible review group,
    or is also listed as a reviewer directly), multiple rows would be
    returned covering each of those matches.

    Using distinct was a bandaid over that, but it carried with it some
    major performance issues. Notably, when trying to retrieve the total
    count of results, Django would generate a COUNT(*) ... query with the
    review request query as a subquery, and in the case of MySQL, this could
    result in a temporary table with a full row scan to get the total count.

    We now take a new approach. Instead of JOINs, we're using explicit
    subqueries for these lookups, but as an EXISTS(...) condition. This
    can be optimized fairly well, and in testing with significant numbers of
    review requests that would result in many duplicates, performance is on
    par with or better than the previous approach across all supported
    databases.

    By removing distinct, we also get to promote the COUNT(*) queries,
    avoiding the subquery and helping the database better optimize these.

    Another notable change is that the default sorting is now different. We
    used to sort by Last Updated, then by username, then by summary. This
    was from the very early days of Review Board, when we had a very plain
    dashboard with no sorting options. It's not necessary anymore, and it
    slows down most operations (we have to JOIN on auth_user for the
    username). Plus, the Last Updated timestamp is likely going to be a
    unique value anyway, especially with millisecond precision. So we now
    simplify the default sort by only sorted by this timestamp.

    This will still need additional testing in our own production servers
    and with select customers experiencing issues before this is released,
    but it should result in large performance improvements overall.

    All unit tests pass.

    Ran through extensive performance testing with the new queries,
    measuring the average time for 10 queries on a database of 500,000
    review requests with hundreds to thousands of accessible repositories
    and review groups, across multiple versions of MySQL and Postgres.

    In my tests, the new subqueries, at worst, had the same performance
    profiles as the JOINs with DISTINCT. At best, they were significantly
    faster.

    Commits

    Files