In this article, we’ll develop a basic search application using Dancer and Sphinx. Sphinx is an open source search engine that’s fairly easy to use, but powerful enough to be deployed in high-traffic sites, such as Craigslist and Dailymotion.
In keeping with this year’s Dancer Advent Calendar trend, the example app will be built on Dancer 2, but it should work just as well with Dancer 1.
Alright, let’s get to work.
The Data
Our web application will be used to search through documents stored in a MySQL database. We’ll use a simple table with the following structure:
CREATE TABLE documents (
    id int NOT NULL AUTO_INCREMENT,
    title varchar(200) NOT NULL,
    contents_text text NOT NULL,
    contents_html text NOT NULL,
    PRIMARY KEY (id)
);
Each document has a unique ID, a title, and contents, stored both as plain text and as HTML. We need the two formats for different purposes: HTML will be used to display the document in the browser, while plain text will be fed to the indexing mechanism of the search engine (we do not want to index the HTML tags, obviously).
We can populate the database with any kind of document data; for my test version, I used a simple script to fill the database with POD documentation extracted from the Dancer distribution. The script is included at the end of this article, in case you’d like to use it yourself.
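The details of that script are not important here, but to illustrate how the two content columns are meant to be used, here is a minimal sketch of inserting a single row with DBI. The connection parameters are assumptions (they mirror the values used in the Sphinx configuration below), and the hard-coded document is just a stand-in; the real script derives the plain text and HTML from POD.

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Connection details are assumptions -- adjust them to your own setup.
my $dbh = DBI->connect('dbi:mysql:database=docs;host=localhost',
                       'user', 'hunter1', { RaiseError => 1 });

my $insert = $dbh->prepare(
    'INSERT INTO documents (title, contents_text, contents_html)
     VALUES (?, ?, ?)'
);

# One hard-coded document, just to show the two content columns.
$insert->execute(
    'Dancer::Plugin - helper for writing Dancer plugins',
    'Create plugins for Dancer',
    '<p>Create plugins for Dancer</p>',
);

$dbh->disconnect;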
Installation and Configuration of Sphinx
Sphinx can be installed pretty easily, using one of the pre-compiled .rpm or .deb packages, or the source tarball. These are available on the download page at SphinxSearch.com; grab the one that suits you and follow the installation instructions.
When Sphinx is installed, it needs to be configured before we can play with it. Its main configuration file is usually located at /etc/sphinx/sphinx.conf. For our purposes, a very basic setup will do; we’ll put the following in the sphinx.conf file:
source documents
{
    type        = mysql

    sql_host    = localhost
    sql_user    = user
    sql_pass    = hunter1
    sql_db      = docs
    sql_query   = \
        SELECT id, title, contents_text FROM documents
}

index test
{
    source          = documents
    charset_type    = utf-8
    path            = /usr/local/sphinx/data/test
}
This defines one source, which is what Sphinx uses to gather data, and one index, which will be created by processing the collected data and will then be queried when we perform searches. In our case, the source is the documents database that we just created. The sql_query directive defines the SELECT query that Sphinx will use to pull the data; it includes all the fields from the documents table except contents_html, since, as we said, HTML is not supposed to be indexed.
That’s all that we need to start using Sphinx. After we make sure the searchd daemon is running, we can proceed with indexing the data. We call indexer with the name of the index:
$ indexer test
It should spit out some information about the indexing operation, and when it’s done, we can do our first search:
$ search "plugin"
index 'test': query 'plugin ': returned 8 matches of 8 total in 0.002 sec
displaying matches:
1. document=19, weight=2713
2. document=44, weight=2694
3. document=20, weight=1713
4. document=2, weight=1672
5. document=1, weight=1640
6. document=13, weight=1640
7. document=27, weight=1601
8. document=28, weight=1601
Apparently, there are 8 documents in the Dancer documentation with the word plugin, and the one with the ID of 19 is the highest-ranking result. Let’s see which document that is:
mysql> SELECT title FROM documents WHERE id = 19;
+----------------------------------------------------+
| title                                              |
+----------------------------------------------------+
| Dancer::Plugin - helper for writing Dancer plugins |
+----------------------------------------------------+
It’s the documentation for Dancer::Plugin, and it makes total sense that this is the first result for the word plugin. Our Sphinx setup is thus ready, and we can get to the web application part of our little project.
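Before we do, here is a rough idea of how the application will talk to searchd from Perl. This is only a sketch, assuming the Sphinx::Search module from CPAN and a searchd daemon listening on its usual default port of 9312; the index name comes from the configuration above.

use strict;
use warnings;
use Sphinx::Search;

# Connect to the locally running searchd daemon.
my $sph = Sphinx::Search->new();
$sph->SetServer('localhost', 9312);
$sph->SetMatchMode(SPH_MATCH_ALL);

# Query the "test" index defined in sphinx.conf.
my $results = $sph->Query('plugin', 'test')
    or die $sph->GetLastError;

# Each match carries the document ID and weight, just like the output
# of the search command-line tool above.
for my $match (@{ $results->{matches} }) {
    printf "document=%d, weight=%d\n", $match->{doc}, $match->{weight};
}

In the web application, the document IDs returned by such a query will be used to pull the matching titles and HTML contents back out of the documents table for display.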