Search Engine Speedups
Finally, a breakthrough in building a faster search engine! For anyone else out there trying to do this stuff, save yourself some trouble and read on…
Background
I’m working with MySQL on a very modest setup. At the time of writing, my search engine has to deal with about half a million documents. My search queries were running very slowly and putting strain on the server. It was easy to blame the hardware, but I wasn’t convinced. Many search engines deal with billions of documents and queries, and still run very quickly. They know something I don’t.
The Bottleneck
I eliminated anything that could complicate things and found the bottleneck was in simply selecting words from the page word index. This was very disappointing. On further inspection I found the actual query was quite fast, but what was really slow was loading the required index (like a pathway) to gain access to the correct records.
The Solution
The problem here is a bloated page word index. Time to sort it out…
- Get rid of stop words.
Very common words (e.g. a, an, the, this) are not really important when it comes to search. It’s better to filter them out altogether. This greatly reduces the number of words being indexed. These are called “stop words”. There are many stop word lists available on the net. Just have a quick hunt around.
- Get rid of anything else that’s not really needed.
My page word index was full of words which were really just gibberish. To identify these words I simply keep a count of how many pages a word appears on. Words with lower counts are probably rubbish words not worth indexing.
- Break things down.
I found dividing my page word index into separate tables helped in a big way. Sure, I had more tables, but these tables had smaller indexes and the performance gain was instantly obvious. I split my page words by first letter or number. Perhaps on more popular letters or numbers I could break it down further. So I ended up with tables like: page_words_a, page_words_b, page_words_c and so on.
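The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code running on my server: the stop word list is a tiny stand-in, `min_pages` is a made-up cut-off, and the `build_page_words` helper and its in-memory dictionaries are assumptions standing in for the real MySQL tables.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "this", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_page_words(pages, min_pages=2):
    """Return {table_name: {word: set(page_id)}} after pruning.

    pages: dict of page_id -> text.
    min_pages: hypothetical cut-off; words on fewer pages are dropped.
    """
    # Step 1: distinct words per page with the stop words removed.
    words_by_page = {
        page_id: set(tokenize(text)) - STOP_WORDS
        for page_id, text in pages.items()
    }
    # Step 2: count how many pages each word appears on.
    page_counts = Counter(w for words in words_by_page.values() for w in words)

    tables = {}
    for page_id, words in words_by_page.items():
        for word in words:
            if page_counts[word] < min_pages:
                continue  # probably a rubbish word, not worth indexing
            # Step 3: one table per first letter or digit.
            table = "page_words_" + word[0]
            tables.setdefault(table, {}).setdefault(word, set()).add(page_id)
    return tables


pages = {
    1: "The quick search engine",
    2: "A search index for the engine",
    3: "Search xqzv99 gibberish",
}
tables = build_page_words(pages)
```

With these three toy pages only “search” and “engine” survive, and they land in `page_words_s` and `page_words_e`. In a real database each of those names would be a separate table with its own, much smaller, index.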
More Tips & Ideas
There are other ways of speeding things up.
- Keep commonly used data in smaller separate tables.
For example, if you have a table full of page data with all the page texts, and you intend to sort pages by date, put the dates in a separate table which refers back to the main page table. This keeps the record size small, which means better speed.
- Accuracy vs speed.
Often your search doesn’t need to give perfect and accurate results. If a visitor is only looking at a first page of 10 results, there’s no point fetching and sorting thousands (or millions) of pages. Try to select page words which seem to be good candidates. Then work out those scores and sort. A bit of inaccuracy results in far faster search times and is much kinder on your server, which means you can handle more traffic.
- Cache common search queries.
This one is pretty obvious. Any results from common searches can be periodically saved and used when required. No need to cache all results, maybe just the first few pages.
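The accuracy vs speed tip could look roughly like this in Python. It is a sketch under assumptions: the `quick_search` function, the `max_candidates` cap and the idea that each word’s page list is already ordered with the most promising pages first are all mine for illustration.

```python
import heapq


def quick_search(postings, score, query_word, max_candidates=1000, top_n=10):
    """Score only a capped set of candidate pages and return the best few.

    postings: word -> list of page ids, assumed to be stored with the
              most promising pages first.
    score:    function(page_id) -> number, the full (expensive) score.
    """
    # Only look at the first max_candidates pages; the rest are ignored,
    # which is where the small loss of accuracy comes from.
    candidates = postings.get(query_word, [])[:max_candidates]
    # nlargest avoids sorting the whole candidate list.
    return heapq.nlargest(top_n, candidates, key=score)


postings = {"search": list(range(100))}


def closeness(page_id):
    """Toy score: pages near 5.2 rank highest."""
    return -abs(page_id - 5.2)


best = quick_search(postings, closeness, "search", max_candidates=20, top_n=3)
```

The visitor still gets a full first page of results, but the server never scores more than `max_candidates` pages per query.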
There are probably plenty of other speedup tricks available. With a bit of creative thinking almost anything is possible.
Quest For Speed
It’s been a good test getting lots of data into the system. However, making Bluebloomer run quickly has been tough…
Search
Search has been the toughest. I’m testing a variety of different approaches. Again it’s a trade-off between accuracy and speed, although at this stage even simple fetching of indexed single words is taking too long. Hardware and setup may be part of the problem here, though I’m not 100% sure.
Caching
I tested caching topic results. Caching all results was unwieldy and slow. What I’ve now done is limit the number of results in the cache and split them into different tables depending on filters and sorting. I haven’t tested this new method yet, but my hopes are high. I’m guessing much of Bluebloomer’s traffic will be visitors viewing topics. If the cache works for these, I should be able to save some strain on the server.
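For the curious, here is the general shape of that cache as a small Python sketch. It is not Bluebloomer’s actual code: the `TopicCache` class, its method names and the `max_results` limit are hypothetical, and plain dictionaries stand in for the separate database tables.

```python
class TopicCache:
    """Keep only the first few results per topic, split by filter and sort."""

    def __init__(self, max_results=50):
        self.max_results = max_results
        # One bucket per (filter, sort) pair, standing in for one table each.
        self._buckets = {}

    def put(self, topic, filter_name, sort_name, results):
        """Store a truncated copy of the results for this topic."""
        bucket = self._buckets.setdefault((filter_name, sort_name), {})
        bucket[topic] = list(results)[: self.max_results]

    def get(self, topic, filter_name, sort_name):
        """Return cached results, or None if nothing is cached."""
        return self._buckets.get((filter_name, sort_name), {}).get(topic)


cache = TopicCache(max_results=3)
cache.put("music", "all", "newest", range(10))
hit = cache.get("music", "all", "newest")
miss = cache.get("music", "all", "popular")
```

Each bucket stays small because it holds only one combination of filter and sorting, and only the first few results per topic.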
Background Server
I have added another server to Bluebloomer. This will help in a few ways. First, it gives me a place to back up the database. In addition, it can handle data from the crawlers, create topics, cache topics and do a few other odd jobs. Once again, this takes strain off the main public server.
All of this is taking some time to get sorted, but in my opinion any speed improvement is worth the effort. A slow site is not so much fun to use. Will keep trying.
Topics Magically Appearing
Woo hoo! It’s been tough getting meaningful topics to automatically appear across the Bluebloomer website. Yesterday it was a right mess. Who would blog about the topic “00 Pm” or “2 3”? Today, a different story. I made a minor alteration to the topic selecting formula and wow what a difference! Topics now make sense and are relevant to their parent topics.
I’m now experimenting with the sensitivity of the topic formula. I’m able to adjust how topical phrases need to be. At the moment I think there are too many topics coming up. It’s just trial and error really.
It’s not absolutely perfect, and there will need to be some human intervention. For example, it gets confused between “celebrities” and “celebrations”, or “Tiger Woods” turns up under the “Bush & Forests” section. But generally, it’s all working well. What a relief.
The next step is getting the hot topic selection formula right. Gonna take a bit of thought.

Topics and stories magically appear on the front page.
100,000 Blog Posts & Counting…
Well, no categories yet. But crawling is going well with about 100,000 blog entries indexed. Crawling is now consistent and reasonably up to date (within the day).
I didn’t go ahead with activating category creation and I’m glad about the decision. I realised I can shortcut things to save time and take much strain off the server. I don’t really want to talk specifics here as there are some parts of Bluebloomer I want to keep under wraps.
By the way WordPress bloggers, if you’ve been blogging recently, there’s a chance your blog may be on there.
I’ve not yet built a search by author tool, but if you type a few keywords from your posts in there, you may find something.
Spinning Plates
Had to ease up on the amount of blog crawling. The crawler looks ahead and gathers links, but there are only so many I can get through in a day. There are many tasks Bluebloomer has to deal with, e.g. page finding, crawling and indexing text, crawling images, phrase finding, category creation, caching… All these tasks vary in the time they need, so it’s a balancing act getting everything to work in harmony.
I’d say it’s going to take a few weeks of measuring to sort out the best balance. Also, as I add new crawlers I’ll need to readjust things to suit. But if the overall ratios are ok, generally things should be fine.
I’m hoping to activate category creation today. This means things will start appearing on the front page. We’ll see how we go. In the meantime gotta keep on crawling.
Faster Faster!
Yesterday I decided things were running a bit slow. When building search engines there are a few factors which really affect the speed.
- Word table size (all distinct words occurring in all pages)
- Page word table size (all words appearing on a page)
Bluebloomer’s tables were getting huge. So the last few days I’ve been optimizing.
One quick way of shrinking both tables is by using a more aggressive stop word filter. This filters out more commonly used words. Easy enough. Cool.
Bluebloomer had been storing word positions. The reason for this was to aid in phrase finding. This resulted in a massive page word table, not good. Yes, the phrase finding was perfectly accurate, but the cost was too high. So instead of being truly accurate, Bluebloomer now guesses which pages contain particular phrases and how important those phrases are to a page. The result is a greatly reduced page word table and faster search speeds.
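I’m not giving away the real formula, but to show what guessing without word positions can mean, here is one simple possibility in Python. Everything in it is an assumption made for illustration: the `guess_phrase_score` function, and the rule that a phrase is only as strong as its rarest word on the page.

```python
def guess_phrase_score(word_counts, phrase):
    """Guess how important a phrase is to a page from word counts alone.

    word_counts: dict of word -> number of times it appears on the page.
    Returns 0.0 if any word of the phrase is missing from the page.
    """
    words = phrase.lower().split()
    total = sum(word_counts.values())
    if not words or total == 0:
        return 0.0
    # The phrase cannot occur more often than its rarest word does.
    rarest = min(word_counts.get(w, 0) for w in words)
    return rarest / total


page = {"tiger": 4, "woods": 2, "golf": 4}
likely = guess_phrase_score(page, "Tiger Woods")
absent = guess_phrase_score(page, "tiger forest")
```

A guess like this will sometimes be wrong, since a page can contain both words without ever using them side by side. That is the accuracy being traded away for a much smaller page word table.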
So far that’s as far as I’ve got. I think the lesson here is that compromises must be made (unless you’re Google). Don’t let perfectionism get in the way of speed and usability.
Phrase Finding
Warning: this is a bit of a techy post.
I’ve been working on finding phrases from thousands (potentially millions) of blogs. This is no easy task, but I’ve tried two different approaches and the results on which performs better are interesting…
Method 1 – The rugged way
Crawler Machine:
When blog posts are crawled, phrases are extracted from the document. These phrases are indexed in the same way as single words. Indexing of phrases only applies to the blog post being crawled. This ends up with a large word index. A few tricks are used to help intelligently reduce the number of phrases found.
Master Machine:
Phrases that start appearing across many blog posts are then indexed against all other pages. This means the phrases are now indexed globally. From time to time these phrases need to be reindexed, but it becomes quicker as pages already indexed need not be checked again.
Many phrases will never show any promise. These can be cleaned out if they become too old. This frees up some room on the database and keeps things running quickly.
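Here is a rough Python sketch of method 1. It is simplified and full of assumptions: phrases are limited to two words, the stop word list is a toy one, `min_posts` is an invented threshold, and the “trick” for reducing phrases (skipping any that start or end with a stop word) is just one example of the kind of thing that can be done.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "this", "of", "and", "in", "at", "as"}


def extract_phrases(text):
    """Crawler side: return the set of two-word phrases in one post."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases = set()
    for left, right in zip(words, words[1:]):
        # Cheap trick to cut the volume: skip phrases touching a stop word.
        if left in STOP_WORDS or right in STOP_WORDS:
            continue
        phrases.add(left + " " + right)
    return phrases


def promote_phrases(posts, min_posts=2):
    """Master side: keep phrases seen in at least min_posts posts."""
    counts = Counter()
    for text in posts:
        counts.update(extract_phrases(text))
    return {phrase for phrase, n in counts.items() if n >= min_posts}


posts = [
    "Tiger Woods wins the open",
    "Fans cheer as Tiger Woods plays",
    "Woods and forests",
]
promoted = promote_phrases(posts)
```

Phrases that never reach the threshold are the ones that would eventually be cleaned out for being too old.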
Method 2 – The elegant way
Crawler Machine:
Blog posts are crawled but only single words are indexed. Crawling is quicker than method 1.
Master Machine:
Words which appear on many blogs are put through a phrase finding function. This takes some time, especially on popular words. It has to do some guessing to reduce time.
As before, phrases are indexed against all pages.
There is no need to clean the database out. The index is smaller and only includes promising phrases. Very tidy, very elegant.
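Method 2 can be sketched in Python too. Again this is an illustration under assumptions, not the real function: `find_phrases_for_word` only looks at two-word phrases, and the “guessing” is modelled as examining just the first `sample_size` pages that contain the word.

```python
import re
from collections import Counter


def find_phrases_for_word(word, page_texts, sample_size=100, min_pages=2):
    """Master side: look for two-word phrases around one popular word.

    Only the first sample_size pages are examined; that sampling is the
    guessing step assumed here to keep the run time down.
    """
    counts = Counter()
    for text in page_texts[:sample_size]:
        words = re.findall(r"[a-z0-9]+", text.lower())
        seen = set()  # count each phrase once per page
        for left, right in zip(words, words[1:]):
            if word in (left, right):
                seen.add(left + " " + right)
        counts.update(seen)
    return {phrase for phrase, n in counts.items() if n >= min_pages}


page_texts = ["Tiger Woods wins", "Tiger Woods plays golf", "A tiger sleeps"]
found = find_phrases_for_word("tiger", page_texts)
```

Notice that all of this work happens on the master machine, and it has to re-read page text for every popular word.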
So which method works in the real world?
It may come as a surprise, but it appears that method 1 works better. More phrases are found, but this is done on the crawler machines, which takes a load off the master machine. The master machine has no trouble putting lots of info into the database, but it has far more trouble phrase finding. The gain method 2 has in crawl speed is more than offset by the time it takes to phrase find. The worst bit is that method 2 gets bogged down on the master machine.
When it comes to indexing phrases, method 1 tends to be quicker. The reason is that it can do a more accurate job of picking good phrases and simply ignoring the rest. Method 2 needs to use a bit of guessing to find good phrases, and this can actually lead to more phrases being indexed.
Method 1 requires a clean up phase. But this only takes a few moments.
Regarding method 1, having a master machine doing less work frees it up to do other tasks, e.g. being a database server. This is so important for real world applications.
So it looks like Bluebloomer will stay on the rugged method 1 path for now.
Final thought
Imagine indexing all possible phrases. Never throwing away a thing. No indexing would be required on the master machine, but the database would become extremely bloated and slow. The great thing about method 1 is it strikes a balance between accuracy, memory and speed.
Big Repetitive Blog = A Grinding Halt
Well yesterday seemed all good. Bluebloomer was busy scanning blogs so I took a rest, came back and trouble. The database was locked, my memory usage was through the roof, I had emails from my web host. Headache!
So I tried to find out what was going wrong. Turns out I had been a bit naive about things. This particular blog entry http://anshumanpandey.wordpress.com/2009/05/27/top-ten-war-machines/ was evil. It had a high word count, but that was ok. The evil part was the entire blog entry was duplicated inside a hidden layer. Because Bluebloomer does phrase matching this caused all sorts of trouble.
Really it’s my own fault though. I should have been a bit more careful. Things are now sorted, a new filter is in place to stop rogue blogs getting through and there was a silver lining as I found and fixed a few other bugs along the way.
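A filter of this kind can be quite small. The Python below is a sketch of one way to catch a post whose text is largely duplicated, not the exact filter now in place: the function names, the five-word shingle size and the `max_ratio` threshold are all assumptions.

```python
import re


def repetition_ratio(text, shingle_size=5):
    """Return the fraction of word shingles that repeat an earlier one."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    shingles = [
        tuple(words[i:i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    ]
    if not shingles:
        return 0.0
    return 1.0 - len(set(shingles)) / len(shingles)


def is_rogue_post(text, max_ratio=0.3):
    """Flag posts where too much of the text is repeated."""
    return repetition_ratio(text) > max_ratio


body = "one two three four five six seven eight nine ten"
doubled = body + " " + body  # like a post duplicated inside a hidden layer
```

Here `is_rogue_post(doubled)` is true while `is_rogue_post(body)` is false, so a post like the doubled one could be skipped before it ever reaches phrase matching.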
Hopefully things will be ok now and lots of new blogs will start appearing on the site.
Development Version
Bluebloomer is still officially in development. There is a website there, but things may change and updates will be inconsistent. Running a development version is great for testing though…
A few hours ago I removed everything from the site. Bluebloomer is now finding new blogs. My guess is it will add anywhere between 200,000 – 300,000 blog entries during this scan. As the day goes on these will begin to appear and subtopics will be automatically created (I hope).
There are more things which need to be sorted after all this and then a massive list of improvements to add at a later date.
I’m really itching to get a solid version up and running. Quite keen to get on with the marketing side of things.
…patience.

There's nothing there!

