Replace Lucene search for Review Requests with Haystack and Whoosh.

Review Request #4935 — Created Nov. 11, 2013 and submitted

Information

Review Board
master

Reviewers

Replace Lucene search for Review Requests with Haystack and Whoosh.

Full-text search with Lucene is a hassle to set up. Haystack and Whoosh are
pure python so it's easy to include them with Review Board and have full-text
search available out of the box. Regular maintenance of the search index is
still required for the search function to return up-to-date results.

These Review Request fields are indexed : id, summary, description, commit_id,
testing done, bugs, author (submitter's full name), submitter (username),
filenames of the files in the diffs (diff content is not indexed).

Enabled and disabled search in General Settings and changing index file path and results per page,
followed by rebuilding the index.

Search using the following fields:
id, summary, description, testing done, bugs, author, submitter, diff filenames.

Built and inspected Sphinx docs.

Built and updated the index through rb-site.

Unit tests passed.


Description From Last Updated

Hrm. Does this render to a form with a bunch of input elements? I think we want a more google-like …

daviddavid

What is this file used for, and how is it referenced? I don't see this filename in any of your …

daviddavid

Haystack loads HAYSTACK_CONNECTIONS from settings.py in its init.py. Overwriting HAYSTACK_CONNECTIONS through load_site_config() does not work for search because the new …

ED edwlee

This is a little bit confusing to me, because when I see "depends" I think "dependency". How about "An up-to-date …

daviddavid

Can we call this get_diff_filenames?

daviddavid

I'm not sure where the best place to put it is (here, in the view, in SearchQuerySet, etc), but at …

daviddavid

Revert this removed line.

daviddavid

How about just making filenames a set to begin with?

daviddavid

This should probably render a template with the error message instead of giving a (mostly) blank page. This text should …

daviddavid

Do you need to include this method if it doesn't do anything?

daviddavid

We should probably limit sqs here to some maximum number of results (maybe a couple hundred?). This could be a …

daviddavid

Including this in here makes haystack a hard dependency. We should probably include it in setup.py and change the docs/UI …

daviddavid

This was poorly localized to begin with, but it's worse now. How about using blocktrans instead? {% blocktrans %} <b>{{hits}}</b> …

daviddavid

It's probably ok to have an empty div in the case that there are no page links. I don't think …

daviddavid

The text in here should be inside {% trans %} tags.

daviddavid

Can we just depend on these as part of Review Board? I think, given that these are pure-Python, that it's …

chipx86chipx86

Can we move the import up around the others?

chipx86chipx86

If we make these dependencies, this whole function and all conditionals involving it can just go away.

chipx86chipx86

Mind adding a comment just explaining what we're doing here?

chipx86chipx86

Not sure I love having this here, but I understand why.. Can we rename this to get_all_diff_filenames?

chipx86chipx86

You can reduce the number of queries to 1 by doing: for filediff in FileDiff.objects.filter(diffset_history=self.diffset_history_id): In fact, you can probably …

chipx86chipx86

This is a start, but it's not enough. We also need to limit this only to those review requests that …

chipx86chipx86

This template is old, but the indentation for template tags and HTML should be independent. HTML should be indented relative …

chipx86chipx86

No spaces around variable names. So, {{hits}}, not {{ hits }}.

chipx86chipx86

No spaces. What is this file for, btw?

chipx86chipx86

Should be one statement, and immediately after the {% extends %}.

chipx86chipx86

Should apply indentation rules here too.

chipx86chipx86

Can you put a blank line above this?

chipx86chipx86

Does assigning these to self.results_per_page/max_search_results not work? Might as well also move this below the search_enable check.

daviddavid

Can you prefix this comment with "XXX:" to make it searchable as something we should fix in the future? This …

daviddavid

Hmm. Does SearchQuerySet order by relevance? This probably loses the ordering. We might want to turn it into a normal …

daviddavid
ED
ED
david
  1. Can you add some screenshots of what this looks like?

    1. Screenshot attached.

  2. reviewboard/reviews/forms.py (Diff revision 2)
     
     
     
     
     
     
     
     
     
     
    Show all issues

    Hrm. Does this render to a form with a bunch of input elements?

    I think we want a more google-like experience with just the single search box that can search everything. Part of that might require the query parsing, though...

    1. Yes it does (see attached screenshots). I agree that a single search box is better. This implementation is mostly for testing and getting myself familiar with Haystack. I am still looking for a suitable query parser -- Haystack only supports an implicit AND or OR, and explicit NOT (by prepending the search term with "-").

  3. reviewboard/settings.py (Diff revision 2)
     
     

    Hmm. We'll have to figure out where the right place to put this is for non-dev environments.

  4. reviewboard/templates/search/indexes/reviews/reviewrequest_text.txt (Diff revision 2)
     
     
     
     
     
     
     
     
     
     
    Show all issues

    What is this file used for, and how is it referenced? I don't see this filename in any of your code.

    1. This is the full-text template (i.e. fields used to create the full-text index) for ReviewRequestIndex. It follows the file path and name as per Haystack convention. Haystack references this file automatically so there is no need to refernence it explicitly.

  5. 
      
ED
ED
ED
  1. 
      
  2. reviewboard/admin/siteconfig.py (Diff revision 3)
     
     
     
     
     
     
     
    Show all issues

    Haystack loads HAYSTACK_CONNECTIONS from settings.py in its init.py. Overwriting HAYSTACK_CONNECTIONS through load_site_config() does not work for search because the new values are not reinitialized. See https://github.com/toastdriven/django-haystack/blob/master/haystack/init.py#L39

    1. This is the proper link: https://github.com/toastdriven/django-haystack/blob/master/haystack/init.py#L39

    2. Perhaps you could patch around this?

      apply_setting(...)
      haystack.connections = haystack.loading.ConnectionHandler(
          settings.HAYSTACK_CONNECTIONS)
      

      (I'm not sure if that works)

    3. Christian mentioned something similar to this. It seems to be working so far.

  3. 
      
ED
ED
ED
ED
david
  1. 
      
  2. Show all issues

    This is a little bit confusing to me, because when I see "depends" I think "dependency".

    How about "An up-to-date index is required to provide useful search results. See :ref:search-indexing for more information"

  3. reviewboard/reviews/models.py (Diff revision 4)
     
     
    Show all issues

    Can we call this get_diff_filenames?

  4. Show all issues

    I'm not sure where the best place to put it is (here, in the view, in SearchQuerySet, etc), but at some point we should definitely filter based on result.is_accessible_by(request.user)

    1. Should we even show the ID? It's already in the URL.

  5. 
      
ED
ED
ED
ED
david
  1. 
      
  2. Show all issues

    Revert this removed line.

  3. reviewboard/reviews/models.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    How about just making filenames a set to begin with?

  4. reviewboard/reviews/views.py (Diff revision 7)
     
     
     
    Show all issues

    This should probably render a template with the error message instead of giving a (mostly) blank page. This text should also me marked for localization.

  5. reviewboard/reviews/views.py (Diff revision 7)
     
     
     
     
     
     
    Show all issues

    Do you need to include this method if it doesn't do anything?

    1. This has been removed.

  6. reviewboard/settings.py (Diff revision 7)
     
     
    Show all issues

    Including this in here makes haystack a hard dependency. We should probably include it in setup.py and change the docs/UI to only talk about installing whoosh.

    1. I've included django-haystack and Whoosh in setup.py as hard dependencies, and taken them out of the installation docs.

  7. Show all issues

    This was poorly localized to begin with, but it's worse now. How about using blocktrans instead?

    {% blocktrans %}
    <b>{{hits}}</b> results for <b>{{query}}</b>
    {% endblocktrans %}
    
  8. Show all issues

    It's probably ok to have an empty div in the case that there are no page links. I don't think we need this check.

    1. We want to have this check because when the paginator is rendered, either 'Next' or 'Previous' (or both) should be a link, otherwise we shouldn't render it at all because there are no results found.

    2. Oh, I guess I misunderstood what's inside this. I'm not sure I'm thrilled with always showing "Previous" (or "Next") if there's no page--I'd rather hide them.

    3. I've updated the template to only show "Previous" or "Next" when there are pages associated them.

  9. reviewboard/templates/reviews/search.html (Diff revision 7)
     
     
     
     
    Show all issues

    The text in here should be inside {% trans %} tags.

  10. 
      
david
  1. Oh, one more thing.

  2. reviewboard/reviews/views.py (Diff revision 7)
     
     
     
    Show all issues

    We should probably limit sqs here to some maximum number of results (maybe a couple hundred?). This could be a big performance hit if someone searches for "a" or "the" and it comes back with 500000 results, and we fetch them all from the db to make this list.

    If we limit the list of results, we should include a small note on the results page telling the user that there are more potential hits but they need to make their query more specific.

    1. It seems that the number of database calls Django makes in this loop is in O(n), i.e. a call is made when is_accessible_by() is evaluated. Is it possible to reduce them to O(1)? I am having trouble figuring it out.

      I should mention that I'm encountering an apparent bug with either Whoosh or Haystack (likely Whoosh) where updating the search index with manage.py update_index causes duplicates to be returned in a SearchQuerySet. The number of duplicates is directly proportional to the number of times the index is updated. Others have reported this issue as well: https://bitbucket.org/mchaput/whoosh/issue/97/search-index-contains-a-lot-of-duplicatesThis.

  3. 
      
ED
ED
chipx86
  1. A handful of comments, but I also have another thought:

    We have some setups out there that are calling manage index, and I don't want these to break. As it is, upgrading to this release will cause errors. I think, ideally, we'd continue to have a management command called index that supported the old parameters, but called the new ones. We could mark this as deprecated for, say, 2.0, and then 2.1 or 3.0 or whatever could remove it, giving people time to migrate.

    1. I've restored the index command and modified it for Haystack.

  2. docs/manual/admin/installation/linux.txt (Diff revision 9)
     
     
     
    Show all issues

    Can we just depend on these as part of Review Board? I think, given that these are pure-Python, that it's worth just offering search out of the box.

  3. reviewboard/__init__.py (Diff revision 9)
     
     
     
    Show all issues

    Can we move the import up around the others?

  4. reviewboard/admin/checks.py (Diff revision 9)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    If we make these dependencies, this whole function and all conditionals involving it can just go away.

  5. reviewboard/admin/siteconfig.py (Diff revision 9)
     
     
     
    Show all issues

    Mind adding a comment just explaining what we're doing here?

  6. reviewboard/reviews/models.py (Diff revision 9)
     
     
    Show all issues

    Not sure I love having this here, but I understand why..

    Can we rename this to get_all_diff_filenames?

  7. reviewboard/reviews/models.py (Diff revision 9)
     
     
     
     
     
     
     
     
    Show all issues

    You can reduce the number of queries to 1 by doing:

    for filediff in FileDiff.objects.filter(diffset_history=self.diffset_history_id):
    

    In fact, you can probably simplify this whole thing even more by doing something like:

    q = FileDiff.objects.filter(diffset_history=self.diffset_history_id)
    return set(q.values_list('source_file', 'dest_file', flat=True))
    
  8. reviewboard/reviews/search_indexes.py (Diff revision 9)
     
     
     
    Show all issues

    This is a start, but it's not enough.

    We also need to limit this only to those review requests that are not private to private repositories or invite-only groups.

    Fortunately, we have some nice query methods that take care of this for you. Just change this to:

    return self.get_model().objects.public(
        status=None,
        extra_query=Q(status='P') | Q(status='S'))
    

    That should do the job.

    1. We are filtering things on the results side by what's accessible to the current user. Is that not good enough?

  9. reviewboard/templates/reviews/search.html (Diff revision 9)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    This template is old, but the indentation for template tags and HTML should be independent.

    HTML should be indented relative to their parent tags by one space.

    Template tags should be indented 1 relative to their parent template tag, but that indentation should occur within the tag body. So:

    {% block ... %}
    {%  if foo %}
    {%   for blah %}
    {%   endfor %}
    {%  endif %}
    {% endblock %}
    
  10. Show all issues

    No spaces around variable names. So, {{hits}}, not {{ hits }}.

  11. reviewboard/templates/search/indexes/reviews/reviewrequest_text.txt (Diff revision 9)
     
     
     
     
     
     
     
     
     
    Show all issues

    No spaces.

    What is this file for, btw?

    1. I explained the purpose of this file in a previous comment: https://reviews.reviewboard.org/r/4935/#comment12754. I'll add a comment to this file to explain what it's for.

  12. Show all issues

    Should be one statement, and immediately after the {% extends %}.

  13. reviewboard/templates/search/search_disabled.html (Diff revision 9)
     
     
     
     
     
     
     
    Show all issues

    Should apply indentation rules here too.

  14. 
      
ED
ED
ED
ED
ED
ED
david
  1. 
      
  2. reviewboard/reviews/views.py (Diff revision 15)
     
     
     
     
     
    Show all issues

    Does assigning these to self.results_per_page/max_search_results not work?

    Might as well also move this below the search_enable check.

    1. results_per_page is a class variable defined and referenced by Haystack's SearchView, which ReviewRequestSearchView sub-classes from, so it's simpler to keep it as a class variable. We could assign max_search_results as an instance variable, but I chose to assign it as a class variable to be consistent. Should I still make it self.max_search_results?

  3. reviewboard/reviews/views.py (Diff revision 15)
     
     
    Show all issues

    Can you prefix this comment with "XXX:" to make it searchable as something we should fix in the future?

    This comment should probably also move to be right above the result_ids = set(...) line.

  4. reviewboard/reviews/views.py (Diff revision 15)
     
     
    Show all issues

    Hmm. Does SearchQuerySet order by relevance? This probably loses the ordering.

    We might want to turn it into a normal list, and then call distinct() on the QuerySet to remove duplicates.

    1. SearchQuerySet's ordering isn't officially documented. From some web searches, it seems that it does not order by relevance, but we should preserve the ordering anyway.

  5. 
      
ED
ED
chipx86
  1. This looks really good.

    So, the duplicates are fixed. What else is left for this change? Tell me again what the issue is with the number of records?

    1. The only thing left is the potential performance issue when searching with common keywords (assuming they don't get filtered out by Whoosh) resulting in a large number of records being fetched from the database. SearchQuerySet does not provide a way to limit the number of results returned like QuerySet does with the slicing syntax before actually executing the query.

  2. reviewboard/reviews/search_indexes.py (Diff revisions 15 - 17)
     
     
     
    Show all issues

    Can you put a blank line above this?

  3. 
      
ED
ED
ED
ED
david
  1. Ship It!
  2. 
      
ED
Review request changed
Status:
Completed
Change Summary:
Pushed to master (6859662). Woohoo!