Mastering "Search" in Django - Postgres

(Comments)

Introduction

Although a simple concept, searching is something that many of us (developers) try skimping on and hence reduce the overall quality of user experience, due to irrelevant search results and illogical search ranking.

Today we will look into the Full text search functionality that we can leverage when we use the Django-PostGres combo. Please note that Full Text search is only supported if we use Postgres database backend with Django 1.10 or newer.


Who is this tutorial for?

This tutorial is for developers with some background with PYTHON and Django. Also it is assumed that you are familiar with Querysets, aggregators, annotations etc. Although I'll try explaining it from scratch, I would suggest you get acquainted with the Queryset API reference.


Simple Search lookup

Let's first understand what's a lookup. Field lookups are how you specify the meat of an SQL WHERE clause. They’re specified as keyword arguments to the QuerySet methods filter(), exclude() and get(). For eg:

# Python Code

Student.objects.filter(name__contains='Smith') # --> SELECT ... WHERE name LIKE '%Smith%';
Student.objects.filter(age__gt=18) # --> SELECT ... WHERE age > 18;

Here contains and gt are lookups. Please visit the official documentation for the complete list of Field Lookups.


Search Lookup

For Full Text Search the lookup is search. It is the simplest method to search a single term against a single column in the database. For eg:

# Python Code

Product.objects.filter(name__search='Shiny')

# Output:
# [<Product: Shiny Shoes>, <Product: Very Shiny LED>]

Note:To use the search lookup, 'django.contrib.postgres' must be in your INSTALLED_APPS.


Search Vectors

In the above example we could only search against a single column at once. To query against multiple columns, we had to chain filters() or use Q() (help!). But a more elegant solution is to use SearchVectors. Let us assume an example where we have the models Book and an Author and search among book title and author name simultaneously; we could use a SearchVector like so:

# Python Code

from django.contrib.postgres.search import SearchVector

Book.objects.annotate(
    search=SearchVector('title', 'author__name'),
).filter(search='Arthur')

# Output:
# [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]

The arguments to SearchVector can be any Expression or the name of a field. Multiple arguments will be concatenated together using a space so that the search document includes them all.

SearchVector objects can be combined together, allowing you to reuse them. For example:

# Python Code

from django.contrib.postgres.search import SearchVector

Book.objects.annotate(
    search=SearchVector('title') + SearchVector('author__name'),
).filter(search='Arthur')

# Output:
# [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]

We could also assign weight's to each of the vectors, thereby effectively ranking them acording to relevance, which we will be discussing later in the blog.


Search Queries

SearchQuery translates the terms the user provides into a search query object that the database compares to a search vector.

The advantage of using a SearchQuery is that by default all the words provided are passed through a stemming alogorithm, before looking for matching terms. Thus giving a much relevant search result accounting for the liguistic usage of that word. Another advantage of using SearchQuery is we can easily combined logically using & (AND), | (OR), and ~ (NOT). So we could try applying SearchQuery in above example as follows:

# Python Code

from django.contrib.postgres.search import SearchVector

Book.objects.annotate(
    search=SearchVector('title') + SearchVector('author__name'),
).filter(search = SearchQuery('Arthur') )

# Output:
# [<Book: The Adventures of Sherlock Holmes>, <Book: Arthur's Eyes>]

#
####################################################################
#

# could be combined logically as follows

from django.contrib.postgres.search import SearchVector, SearchQuery

Book.objects.annotate(
    search=SearchVector('title') + SearchVector('author__name'),
).filter(search = SearchQuery('Arthur') & SearchQuery('Sherlock'))

# Output:
# [<Book: The Adventures of Sherlock Holmes>,]


Search Ranking

So far we were beating around the bushes learning stuff to get to the meat of the matter. The SearchRank orders the results according to the its relevancy to the search term(s) provided by the user. PostgreSQL provides a ranking function which takes into account how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur. The ranking is higher with better match. To order by search ranking, we do the following

# Python Code

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

vector = SearchVector('name')
query = SearchQuery('Stevens')

Author.objects.annotate(
    rank=SearchRank(vector, query)
).order_by('-rank')

# Output:
# [<Author: Fisher Stevens>, <Author: W. Richard Stevens>, <Author: Robert Louis Stevenson>]


Assigning Weights to queries

Every field may not have the same relevance in a query, so you can set weights of various vectors before you combine them.

We can assign 4 levels of weights to each SearchVector [A, B ,C ,or D], with A being with the most weight, by convention. It is set by setting the argument weight in SearchVector.

In SearchRank on the other hand we can assign the actual weights for each of these above mentioned letters by passing the value of argument weights as a list of 4 float numbers. The order for assigning the values are D,C,B,A. The default weights(if you do not pass the weights argument in SearchRank) of these letters are D->0.1, C->0.2, B->0.4, A->1.0.

Let's look at an example to understand better:

# Python Code

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

vector = SearchVector('title', weight='A') + SearchVector('author__name', weight='B')
query = SearchQuery('Arthur')

Book.objects.annotate(
    rank=SearchRank(vector, query)
)filter(rank__gte=0.3).order_by('-rank')

# Output:
# [<Book: Arthur's Eyes>, <Book: The Adventures of Sherlock Holmes>]

For assigning the weights differently, we do the following

rank = SearchRank(vector, query, weights=[0.2, 0.4, 0.6, 0.8])

Book.objects.annotate(
    rank=rank
)filter(rank__gte=0.3).order_by('-rank')


Performance, Indexes and SearchVectorField

No special database tweaking is needed for using any of the above function per say. But since full text search being a reaource hungry and heavy process, it may have hiccups when searching more than a few hundred records. The fact to be noted is that Databases aren't actually designed for full text searches.

To mitigate this issue to some extent you could create indexes for full text search, which has been documented in the PostgreSQL documentation.

Other method which works even better is adding a SearchVectorField to your model. You’ll need to keep it populated with triggers, for example, as described in the PostgreSQL documentation. You can then query the field as if it were an annotated SearchVector:

# Python Code

Book.objects.update(search_vector=SearchVector('title'))
Book.objects.filter(search_vector='Arthur')

# Output:
# [<Book: Arthur's Eyes>]


Trigram Similarity

A typo-tolerant method of searching is trigram similarity. It compares the number of trigrams, or three consecutive characters, shared between the search term(s) and the target text.

But unlike other features, we have to make sure that an extention called pg_trgm extention on PostgreSQL is activated first. You would find a helping hand on the topic here.

There are two complementary functions here and they are TrigramSimilarity and TrigramDistance. They both essentially give the same information but one returns the amount of similarity and the latter the amount of difference as you may have guessed. Let's see an example of each

# Python Code

from django.contrib.postgres.search import TrigramSimilarity, TrigramDistance

Author.objects.create(name='Katy Stevens')
Author.objects.create(name='Stephen Keats')

test = 'Katie Stephens'

##############################################################################

# Similarity
Author.objects.annotate(
    similarity=TrigramSimilarity('name', test),
).filter(similarity__gt=0.3).order_by('-similarity')

# Output:
# [<Author: Katy Stevens>, <Author: Stephen Keats>]


##############################################################################

# Distance
Author.objects.annotate(
    distance=TrigramDistance('name', test),
).filter(distance__lte=0.3).order_by('distance')

# Output:
# [<Author: Katy Stevens>, <Author: Stephen Keats>]

Comments

Recent Posts

Archive

2021
2020
2019
2018
2017
2016
2015
2014

Tags

Authors

Feeds

RSS / Atom