Phrase Finding
Warning: this is a bit of a techy post.
I’ve been working on finding phrases in thousands (potentially millions) of blogs. This is no easy task, but I’ve tried two different approaches, and the results on which performs better are interesting…
Method 1 – The rugged way
Crawler Machine:
When blog posts are crawled, phrases are extracted from the document. These phrases are indexed, as are single words. Indexing of phrases only applies to the blog post being crawled. This ends up with a large word index. A few tricks are used to intelligently reduce the number of phrases found.
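To make the crawler step concrete, here is a minimal sketch of per-post phrase extraction. It is not the actual crawler code: the function name, the tiny stop-word list and the "skip phrases that start or end with a stop word" trick are all assumptions, standing in for whatever tricks a real crawler would use.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real crawler would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for"}

def extract_phrases(text, max_words=3):
    """Count candidate phrases of 2..max_words words found in one post."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(2, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Assumed trick: drop phrases that begin or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

phrases = extract_phrases("The search engine indexes the search engine results")
```

Even with the filter, one short sentence yields several candidates, which is why the per-post index grows quickly.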
Master Machine:
Phrases that start appearing across many blog posts are then indexed against all other pages. This means the phrases are now indexed globally. From time to time these phrases need to be reindexed, but it becomes quicker as pages already indexed need not be checked again.
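The promotion and reindexing steps could look roughly like the sketch below. The class name, the promote_after threshold and the naive substring match are assumptions made for illustration. The point is the bookkeeping: each phrase remembers which pages it has already been checked against, so a later reindex only scans new pages.

```python
from collections import defaultdict

class GlobalPhraseIndex:
    """Hypothetical master-side index; names and threshold are illustrative."""

    def __init__(self, promote_after=3):
        self.promote_after = promote_after
        self.posts_seen_in = defaultdict(set)  # phrase -> posts reported by crawlers
        self.global_index = defaultdict(set)   # phrase -> pages containing it
        self.checked = defaultdict(set)        # phrase -> pages already scanned

    def report(self, phrase, post_id):
        """Called when a crawler finds a phrase in a post."""
        self.posts_seen_in[phrase].add(post_id)

    def promoted(self):
        """Phrases seen in enough distinct posts to be indexed globally."""
        return [p for p, posts in self.posts_seen_in.items()
                if len(posts) >= self.promote_after]

    def reindex(self, pages):
        """Scan pages (page_id -> text); return how many page scans were needed."""
        scans = 0
        for phrase in self.promoted():
            for page_id, text in pages.items():
                if page_id in self.checked[phrase]:
                    continue  # already checked on an earlier run
                scans += 1
                self.checked[phrase].add(page_id)
                if phrase in text.lower():  # naive substring match
                    self.global_index[phrase].add(page_id)
        return scans
```

Running reindex twice over the same pages does no scanning the second time; only pages added since the last run cost anything.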
Many phrases will never show any promise. These can be cleaned out if they become too old. This frees up some room on the database and keeps things running quickly.
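A clean-out pass can be very small. The sketch below assumes each phrase is stored with the time it was first seen and the number of posts it has appeared in; the field layout, max_age and promote_after are hypothetical.

```python
def clean_out(phrases, now, max_age, promote_after):
    """Keep phrases that are promising or still young; drop the rest.

    phrases maps phrase -> (first_seen, post_count), with times in any
    consistent unit (days here).
    """
    return {
        phrase: (first_seen, post_count)
        for phrase, (first_seen, post_count) in phrases.items()
        if post_count >= promote_after or now - first_seen <= max_age
    }

phrases = {
    "search engine": (0, 5),  # old but popular: kept
    "old dud": (0, 1),        # old and unpopular: dropped
    "fresh dud": (90, 1),     # unpopular but still young: kept
}
kept = clean_out(phrases, now=100, max_age=30, promote_after=3)
```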
Method 2 – The elegant way
Crawler Machine:
Blog posts are crawled but only single words are indexed. Crawling is quicker than method 1.
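For comparison, single-word indexing on the crawler is about as simple as it gets. This is a sketch of a plain inverted index; the function name and post ids are made up for the example.

```python
import re
from collections import defaultdict

def index_words(index, post_id, text):
    """Record each distinct word of a post in a word -> post ids index."""
    for word in set(re.findall(r"[a-z0-9']+", text.lower())):
        index[word].add(post_id)

index = defaultdict(set)
index_words(index, "post-1", "Phrase finding is hard")
index_words(index, "post-2", "Finding phrases in blogs")
```

There is one entry per distinct word and no word combinations at all, which is why crawling is quicker here.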
Master Machine:
Words which appear on many blogs are put through a phrase finding function. This takes some time, especially on popular words. It has to do some guessing to reduce time.
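Here is one way a phrase finding function might work, as a sketch only. The "guessing" is modelled as looking at just a sample of the posts containing the word; that choice, along with the function name and the sample_size and min_posts parameters, is an assumption rather than a description of the real function.

```python
import re
from collections import Counter

def find_phrases(word, posts, sample_size=100, min_posts=2, max_words=3):
    """Find phrases containing `word` that occur in at least min_posts posts."""
    sample = posts[:sample_size]  # the guess: only inspect a sample of posts
    doc_freq = Counter()
    for text in sample:
        words = re.findall(r"[a-z0-9']+", text.lower())
        seen = set()  # count each phrase once per post
        for n in range(2, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if word in gram:
                    seen.add(" ".join(gram))
        doc_freq.update(seen)
    return {phrase for phrase, df in doc_freq.items() if df >= min_posts}

posts = [
    "I love my search engine",
    "a search engine is handy",
    "search for something else",
]
found = find_phrases("search", posts)
```

Every post in the sample has to be re-read and re-split for every popular word, which hints at why this step is slow on the master machine.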
As before, phrases are indexed against all pages.
There is no need to clean the database out. The index is smaller and only includes promising phrases. Very tidy, very elegant.
So which method works in the real world?
It may come as a surprise, but it appears that method 1 works better. More phrases are found, but this is done on the crawler machines, which takes load off the master machine. The master machine has no trouble putting lots of info into the database, but it has far more trouble phrase finding. The gain method 2 has in crawl speed is more than offset by the time spent phrase finding. The worst bit is that method 2 gets bogged down on the master machine.
When it comes to indexing phrases, method 1 tends to be quicker. The reason is that it can do a more accurate job of picking good phrases and simply ignoring the rest. Method 2 needs to use a bit of guessing to find good phrases, and this can actually lead to more phrases being indexed.
Method 1 requires a clean-up phase, but this only takes a few moments.
Regarding method 1, having the master machine do less work frees it up for other tasks, i.e. being a database server. This matters a great deal for real-world applications.
So it looks like Bluebloomer will stay on the rugged method 1 path for now.
Final thought
Imagine indexing all possible phrases. Never throwing away a thing. No indexing would be required on the master machine, but the database would become extremely bloated and slow. The great thing about method 1 is it strikes a balance between accuracy, memory and speed.