Search Engine Speedups
Finally, a breakthrough in building a faster search engine! For anyone else out there trying to do this stuff, save yourself some trouble and read on…
Background
I’m working with MySQL on a very modest setup. At the time of writing, my search engine has to deal with about half a million documents. My search queries were running very slow and putting strain on the server. It was easy to blame the hardware but I wasn’t convinced. Many search engines are dealing with billions of documents and queries, and run very quickly. They know something I don’t.
The Bottleneck
I eliminated anything that could complicate things and found the bottleneck was in simply selecting words from the page word index. This was very disappointing. On further inspection I found the actual query was quite fast, but what was really slow was loading the required index (like a pathway) to gain access to the correct records.
The Solution
The problem here is a bloated page word index. Time to sort it out…
- Get rid of stop words.
Any very common words (ie. a, an, the, this) are not really so important when it comes to search. Its better to filter them altogether. This will greatly reduce the amount of word being indexed. These are called “stop words”. There are many stop word lists available on the net. Just have a quick hunt around. - Get rid of anything else that’s not really needed.
My page word index was full of words which were really just gibberish. To identify these words I simply keep a count of how many pages a word appears on. Words with lower counts are probably rubbish words not worth indexing. - Break things down.
I found dividing my page word index into separate tables helped in a big way. Sure, I had more tables, but these tables had smaller indexes and the performance gain was instantly obvious. I split my page words by first letter or number. Perhaps on more popular letters or numbers I could break it down further. So I ended up with tables like: page_words_a, page_words_b, page_words_c and so on.
More Tips & Ideas
There are other ways of speeding things up.
- Keep commonly used data in smaller separate tables.
For example, if you have a table full of page data with all the page texts, and you intend to sort pages by date, put the dates in a separate table which refers back to the main page table. This keeps the record size small, which means better speed. - Accuracy vs speed.
Often your search doesn’t need to give perfect and accurate results. If a visitor is only looking at a first page of 10 results, there’s no point fetching and sorting thousands (or millions) of pages. Try to select page words which seem to be good candidates. Then work out those scores and sort. A bit of inaccuracy results in far faster search times and is much kinder on your server, which means you can handle more traffic. - Cache common search queries.
This one is pretty obvious. Any results from common searches can be periodically saved and used when required. No need to cache all results, maybe just the first few pages.
There’s probably plenty of other speedup tricks available. With a bit of creative thinking almost anything is possible.
An update on this. So far I have not implemented caching or sacrificed accuracy, and the speed is still really good with the above method.
I did something a bit unconventional. I opted to process some of the query in PHP. This has also given a good speed boost because I’m able to use a few tricks in joining data. The result is something which actually works with ridiculous improvements on what I had before.
bluebloomer
July 2, 2009 at 2:23 pm