<p><strong><a href="http://django.cowhite.com/blog/mastering-search-in-django-postgres/">Mastering "Search" in Django - Postgres</a></strong> (CoWhite Software Blog, by srinath, 2017-07-18)</p>
<h3>Introduction</h3>
<p>
Although search is a simple concept, many of us developers skimp on it, and the resulting irrelevant results and illogical ranking degrade the overall user experience.
</p>
<p>
Today we will look into the full text search functionality that we can leverage when we use the Django-PostgreSQL combo. <strong>Please note that <em>full text search</em> is only supported with the PostgreSQL database backend, on Django 1.10 or newer</strong>.
</p>
<p><br />
<h3>Who is this tutorial for?</h3>
<p>
This tutorial is for developers with some background in Python and Django. It also assumes that you are familiar with QuerySets, aggregations, annotations and so on. Although I'll try to explain things from scratch, I suggest you <a href="https://docs.djangoproject.com/en/1.11/ref/models/querysets/">get acquainted with the QuerySet API reference</a>.
</p><br />
<h3>Simple <strong>Search</strong> lookup</h3>
<p>
Let's first understand what a lookup is. Field lookups are how you specify the meat of an SQL <strong>WHERE</strong> clause. They’re specified as keyword arguments to the <strong>QuerySet</strong> methods <code>filter()</code>, <code>exclude()</code> and <code>get()</code>. For example:
</p>
<p></p>
<div class="codehilite"><pre># Python Code
Student.objects.filter(name__contains='Smith') # --> SELECT ... WHERE name LIKE '%Smith%';
Student.objects.filter(age__gt=18) # --> SELECT ... WHERE age > 18;
</pre></div>
<p>Here <code>contains</code> and <code>gt</code> are lookups. Please visit the official <a href="https://docs.djangoproject.com/en/1.11/ref/models/querysets/#field-lookups">documentation</a> for the complete list of <a href="https://docs.djangoproject.com/en/1.11/ref/models/querysets/#field-lookups">Field Lookups</a>.
</p><br />
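<p>
To make the keyword convention concrete, here is a toy sketch of how a keyword such as <code>name__contains</code> can be split into a field path and a lookup name. It is only an illustration and not how Django's ORM is implemented; <code>KNOWN_LOOKUPS</code> and <code>split_lookup</code> are hypothetical names made up for this example.
</p>

```python
# Toy illustration only: this is NOT how Django's ORM is implemented.
# KNOWN_LOOKUPS is a small hypothetical subset chosen for the example.
KNOWN_LOOKUPS = {"exact", "contains", "gt", "gte", "lt", "lte", "search"}


def split_lookup(keyword):
    """Split a filter() keyword into (field path, lookup name)."""
    parts = keyword.split("__")
    # The last segment is a lookup only if it names one; otherwise the
    # whole keyword is a field path and the lookup defaults to "exact".
    if len(parts) > 1 and parts[-1] in KNOWN_LOOKUPS:
        lookup = parts.pop()
    else:
        lookup = "exact"
    return parts, lookup


print(split_lookup("name__contains"))
print(split_lookup("age__gt"))
print(split_lookup("author__name"))
```

<p>
The last call shows why the double underscore is also used to follow relations: <code>author__name</code> has no lookup suffix, so the whole keyword is treated as a field path with the default <code>exact</code> lookup.
</p>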
<h4>Search Lookup</h4>
<p>
For full text search the lookup is <code>search</code>. It is the simplest way to search for a single term against a single column in the database. For example:
</p>
<p></p>
<div class="codehilite"><pre># Python Code
Product.objects.filter(name__search='Shiny')
# Output:
# [<Product: Shiny Shoes>, <Product: Very Shiny LED>]
</pre></div>
<p></p>
<p>
<strong>Note:</strong> To use the <code>search</code> lookup, <code>'django.contrib.postgres'</code> must be in your <strong>INSTALLED_APPS</strong>.
</p><br />
<h3>Search Vectors</h3>
<p>
In the above example we could only search against a single column at once. To query against multiple columns, we would have to chain <code>filter()</code> calls or use <code>Q()</code> objects (<a href="https://docs.djangoproject.com/en/1.7/ref/models/queries/#q-objects">help!</a>). A more elegant solution is to use a <strong>SearchVector</strong>. Suppose we have the models <strong>Book</strong> and <strong>Author</strong> and want to search the book title and the author name simultaneously; we could use a <code>SearchVector</code> like so:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchVector</span>
<span class="n">Book</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">search</span><span class="o">=</span><span class="n">SearchVector</span><span class="p">(</span><span class="s">'title'</span><span class="p">,</span> <span class="s">'author__name'</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">search</span><span class="o">=</span><span class="s">'Arthur'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]</span>
</pre></div>
<p></p>
<p>
The arguments to <code>SearchVector</code> can be any <strong>Expression</strong> or the name of a field. Multiple arguments will be concatenated together using a space so that the search document includes them all.</p>
<p><code>SearchVector</code> objects can be combined together, allowing you to reuse them. For example:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchVector</span>
<span class="n">Book</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">search</span><span class="o">=</span><span class="n">SearchVector</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span> <span class="o">+</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'author__name'</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">search</span><span class="o">=</span><span class="s">'Arthur'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]</span>
</pre></div>
<p></p>
<p>
We can also assign weights to each of the vectors, effectively ranking them according to relevance, which we will discuss later in this post.
</p><br />
<h3>Search Queries</h3>
<p>
SearchQuery translates the terms the user provides into a search query object that the database compares to a search vector.
</p>
<p>
The advantage of using a <code>SearchQuery</code> is that, by default, all the words provided are passed through a stemming algorithm before matching terms are looked up. This gives much more relevant search results, since it accounts for the linguistic variations of a word. Another advantage is that <code>SearchQuery</code> objects can easily be combined logically using <code>&</code> (AND), <code>|</code> (OR), and <code>~</code> (NOT). We can apply <code>SearchQuery</code> to the above example as follows:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchVector</span><span class="p">,</span> <span class="n">SearchQuery</span>
<span class="n">Book</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">search</span><span class="o">=</span><span class="n">SearchVector</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span> <span class="o">+</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'author__name'</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">search</span> <span class="o">=</span> <span class="n">SearchQuery</span><span class="p">(</span><span class="s">'Arthur'</span><span class="p">)</span> <span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]</span>
<span class="c">#</span>
<span class="c">####################################################################</span>
<span class="c">#</span>
<span class="c"># could be combined logically as follows</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchVector</span><span class="p">,</span> <span class="n">SearchQuery</span>
<span class="n">Book</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">search</span><span class="o">=</span><span class="n">SearchVector</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span> <span class="o">+</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'author__name'</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">search</span> <span class="o">=</span> <span class="n">SearchQuery</span><span class="p">(</span><span class="s">'Arthur'</span><span class="p">)</span> <span class="o">&</span> <span class="n">SearchQuery</span><span class="p">(</span><span class="s">'Sherlock'</span><span class="p">))</span>
<span class="c"># Output:</span>
<span class="c"># [<Book: The Adventures of Sherlock Holmes>,]</span>
</pre></div>
<p></p><br />
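<p>
As a rough mental model of the logical combination, the sketch below maps the Python operators <code>&amp;</code>, <code>|</code> and <code>~</code> onto the <code>&amp;</code>, <code>|</code> and <code>!</code> operators of PostgreSQL's tsquery syntax. <code>ToyQuery</code> is a hypothetical stand-in written for this post; it is not Django's <code>SearchQuery</code> and does no stemming or escaping.
</p>

```python
class ToyQuery:
    """Hypothetical stand-in for a search query; NOT Django's SearchQuery."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # Python's & becomes tsquery's & (both terms must match).
        return ToyQuery("({} & {})".format(self.expr, other.expr))

    def __or__(self, other):
        # Python's | becomes tsquery's | (either term may match).
        return ToyQuery("({} | {})".format(self.expr, other.expr))

    def __invert__(self):
        # Python's ~ becomes tsquery's ! (the term must not match).
        return ToyQuery("!{}".format(self.expr))

    def __str__(self):
        return self.expr


combined = ToyQuery("arthur") & ~ToyQuery("sherlock")
print(combined)
```

<p>
Because every combination returns a new object of the same type, the pieces can be nested freely, which is the same property that lets real <code>SearchQuery</code> objects be built up step by step.
</p>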
<h3>Search Ranking</h3>
<p>
So far we have been laying the groundwork; now we get to the meat of the matter. <code>SearchRank</code> orders the results by their relevance to the search term(s) provided by the user. PostgreSQL provides a ranking function which takes into account how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur. The better the match, the higher the rank. To order by search rank, we do the following:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchQuery</span><span class="p">,</span> <span class="n">SearchRank</span><span class="p">,</span> <span class="n">SearchVector</span>
<span class="n">vector</span> <span class="o">=</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'name'</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">SearchQuery</span><span class="p">(</span><span class="s">'Stevens'</span><span class="p">)</span>
<span class="n">Author</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">rank</span><span class="o">=</span><span class="n">SearchRank</span><span class="p">(</span><span class="n">vector</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s">'-rank'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Author: Fisher Stevens>, <Author: W. Richard Stevens>, <Author: Robert Louis Stevenson>]</span>
</pre></div>
<p></p><br />
<h4>Assigning Weights to queries</h4>
<p>
Every field may not have the same relevance in a query, so you can set weights of various vectors before you combine them.
</p>
<p>
We can assign one of 4 weight levels to each <code>SearchVector</code> (A, B, C, or D), with A carrying the most weight by convention. The level is set through the <code>weight</code> argument of <code>SearchVector</code>.
</p>
<p>
In <code>SearchRank</code>, on the other hand, we can assign the actual numeric weights for each of the above-mentioned letters by passing the <code>weights</code> argument as a list of 4 floats. The values are given in the order D, C, B, A. The default weights (used if you do not pass the <code>weights</code> argument to <code>SearchRank</code>) are D->0.1, C->0.2, B->0.4, A->1.0.
</p>
<p>
Let's look at an example to understand better:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">SearchQuery</span><span class="p">,</span> <span class="n">SearchRank</span><span class="p">,</span> <span class="n">SearchVector</span>
<span class="n">vector</span> <span class="o">=</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'title'</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s">'A'</span><span class="p">)</span> <span class="o">+</span> <span class="n">SearchVector</span><span class="p">(</span><span class="s">'author__name'</span><span class="p">,</span> <span class="n">weight</span><span class="o">=</span><span class="s">'B'</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">SearchQuery</span><span class="p">(</span><span class="s">'Arthur'</span><span class="p">)</span>
<span class="n">Book</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">rank</span><span class="o">=</span><span class="n">SearchRank</span><span class="p">(</span><span class="n">vector</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">rank__gte</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s">'-rank'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Book: Arthur's Eyes>, <Book: The Adventures of Sherlock Holmes>]</span>
</pre></div>
<p></p>
<p>
To assign different values to the weight letters, pass the <code>weights</code> argument to <code>SearchRank</code>:
</p>
<p></p>
<div class="codehilite"><pre>rank = SearchRank(vector, query, weights=[0.2, 0.4, 0.6, 0.8])
Book.objects.annotate(
rank=rank
).filter(rank__gte=0.3).order_by('-rank')
</pre></div>
<p></p><br />
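<p>
The D, C, B, A ordering of the <code>weights</code> list is easy to get backwards, so here is a small pure-Python sketch that pairs each position with its letter. The helper <code>weights_by_label</code> is a hypothetical name used only for this illustration; it does not exist in Django or PostgreSQL.
</p>

```python
# Default numeric weights, listed in the order D, C, B, A.
DEFAULT_WEIGHTS = [0.1, 0.2, 0.4, 1.0]


def weights_by_label(weights=None):
    """Pair a 4-item weights list with the letters D, C, B, A (hypothetical helper)."""
    if weights is None:
        weights = DEFAULT_WEIGHTS
    if len(weights) != 4:
        raise ValueError("weights must contain exactly 4 numbers (D, C, B, A)")
    return dict(zip("DCBA", weights))


print(weights_by_label())
print(weights_by_label([0.2, 0.4, 0.6, 0.8]))
```

<p>
Read this way, the list <code>[0.2, 0.4, 0.6, 0.8]</code> used above gives 0.8 to the title (weight A) and 0.6 to the author name (weight B).
</p>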
<h3>Performance, Indexes and <code>SearchVectorField</code></h3>
<p>
No special database tweaking is needed to use any of the above functions per se. But full text search is a resource-hungry process, so you are likely to run into performance problems when searching more than a few hundred records. Keep in mind that a general-purpose database is not primarily designed for full text search.
</p>
<p>
To mitigate this to some extent, you can <a href="https://www.postgresql.org/docs/current/static/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX">create indexes for full text search</a>, as described in the PostgreSQL documentation.
</p>
<p>
Another method, which works even better, is adding a <code>SearchVectorField</code> to your model. You’ll need to keep it populated, with triggers for example, as described in the <a href="https://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS">PostgreSQL documentation</a>. You can then query the field as if it were an annotated <code>SearchVector</code>:
</p>
<p></p>
<div class="codehilite"><pre># Python Code
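from django.contrib.postgres.search import SearchVector
# Assumes Book has a SearchVectorField named search_vector
Book.objects.update(search_vector=SearchVector('title'))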
Book.objects.filter(search_vector='Arthur')
# Output:
# [<Book: Arthur's Eyes>]
</pre></div>
<p></p><br />
<h3>Trigram Similarity</h3>
<p>
A typo-tolerant method of searching is trigram similarity. It compares the number of trigrams, or three consecutive characters, shared between the search term(s) and the target text.
</p>
<p>
Unlike the other features, this one requires the <code>pg_trgm</code> extension to be activated on PostgreSQL first. You will find a <a href="https://stackoverflow.com/a/14294609">helping hand on the topic here</a>.
</p>
<p>
There are two complementary functions here: <code>TrigramSimilarity</code> and <code>TrigramDistance</code>. They essentially give the same information, but the former returns the amount of similarity and the latter the amount of difference, as you may have guessed. Let's see an example of each:
</p>
<p></p>
<div class="codehilite"><pre><span class="c"># Python Code</span>
<span class="kn">from</span> <span class="nn">django.contrib.postgres.search</span> <span class="kn">import</span> <span class="n">TrigramSimilarity</span><span class="p">,</span> <span class="n">TrigramDistance</span>
<span class="n">Author</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Katy Stevens'</span><span class="p">)</span>
<span class="n">Author</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Stephen Keats'</span><span class="p">)</span>
<span class="n">test</span> <span class="o">=</span> <span class="s">'Katie Stephens'</span>
<span class="c">##############################################################################</span>
<span class="c"># Similarity</span>
<span class="n">Author</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">similarity</span><span class="o">=</span><span class="n">TrigramSimilarity</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">test</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">similarity__gt</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s">'-similarity'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Author: Katy Stevens>, <Author: Stephen Keats>]</span>
<span class="c">##############################################################################</span>
<span class="c"># Distance</span>
<span class="n">Author</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span>
<span class="n">distance</span><span class="o">=</span><span class="n">TrigramDistance</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">test</span><span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">distance__lte</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s">'distance'</span><span class="p">)</span>
<span class="c"># Output:</span>
<span class="c"># [<Author: Katy Stevens>, <Author: Stephen Keats>]</span>
</pre></div>
</p>
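<p>
To get a feel for what a trigram comparison does, here is a pure-Python sketch of the idea. It assumes the rules described in the <code>pg_trgm</code> documentation (case is ignored, only alphanumeric characters count, and each word is padded with two spaces in front and one behind), and it scores similarity as shared trigrams divided by all distinct trigrams. It is an approximation for illustration, and the real extension may differ in details.
</p>

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, roughly following pg_trgm's rules."""
    found = set()
    # Ignore case and non-alphanumeric characters, and treat each word as
    # if it had two spaces in front of it and one space behind it.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            found.add(padded[i:i + 3])
    return found


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distance(a, b):
    """The complement of similarity, mirroring the idea of TrigramDistance."""
    return 1.0 - similarity(a, b)


test = "Katie Stephens"
for name in ("Katy Stevens", "Stephen Keats"):
    print(name, round(similarity(name, test), 3), round(distance(name, test), 3))
```

<p>
In this sketch both names score above 0.3 against the search term, with "Katy Stevens" slightly ahead, and each distance is simply one minus the similarity. That is why a similarity threshold of 0.3 corresponds to a distance threshold of 0.7.
</p>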