Sunday, August 27, 2006

Getting more results from your search engine

Most search engines limit the total number of results they return per search string. With Google it's 1000. With Yahoo!, if it hasn't changed lately, it's 5000.
A few days ago I was asked whether I knew of any workaround that makes it possible to get more results per search string. My first answer was "sorry, no can do". I did find a way to circumvent the Google API's limit on the number of queries that can be executed per day (which, coincidentally, is also 1000). That workaround is a side effect of my Google Image Search API. Yet it does not provide a means to get more than 1000 results per search string.
After giving it some thought, I came up with at least one way to increase the number of results per search string. It's not a very accurate solution, but it's better than nothing. The idea is to use the search engines that perform query refinement (some call it clustering of results). A good example is Ask Jeeves. When you perform a search on these engines, they also give you a list of suggestions to narrow or expand your search. That is, if you search for "apple", the narrowing suggestions are things like "Apple the fruit", "Facts about Apples", "Apple Tree", "Macintosh", etc.
When you work with this kind of engine (or with a "simple" search engine and one with narrowing capabilities together), you can start out by running the original search (apple) and retrieving all the results available for it. Then you can iteratively retrieve the results for each of the narrowing suggestions as well (up to 1000 for each), and keep drilling down as much as you like. Of course, there will most probably be a substantial number of duplicates in the results, which you will have to handle. Also, the more you drill down, the farther you'll get from the original search string (this is known as query drift). Another problem is ranking: if your original search was "apple", how do you rank the results for "apple tree" against those for "Macintosh"? So this still raises quite a few questions. Yet, in the end, you can end up with a much larger number of results that are to some extent related to the original search.
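
To make the drill-down concrete, here is a minimal Python sketch. It assumes two hypothetical callables that you would have to supply yourself: `search(s)`, which returns the result URLs for a search string, and `suggest(s)`, which returns the engine's narrowing suggestions for it. Neither is a real API. The sketch handles duplicates by keeping only the first occurrence of each URL, and it caps the depth (`max_depth`, also my own invention) to keep query drift in check. It does not try to solve the ranking problem; it only records which search string found each result and at what depth.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Tuple


def expand_results(
    seed: str,
    search: Callable[[str], Iterable[str]],   # hypothetical: search string -> result URLs
    suggest: Callable[[str], Iterable[str]],  # hypothetical: search string -> narrowing suggestions
    max_depth: int = 2,                       # assumed cap to limit query drift
) -> Dict[str, Tuple[str, int]]:
    """Collect results for `seed` and its narrowing suggestions, breadth first.

    Returns a dict mapping each URL to (search string that first found it, depth).
    """
    found: Dict[str, Tuple[str, int]] = {}
    seen_strings = {seed}
    queue = deque([(seed, 0)])

    while queue:
        current, depth = queue.popleft()

        # Keep only the first (shallowest) occurrence of each URL.
        for url in search(current):
            found.setdefault(url, (current, depth))

        # Drill down into suggestions we have not searched for yet.
        if depth < max_depth:
            for narrower in suggest(current):
                if narrower not in seen_strings:
                    seen_strings.add(narrower)
                    queue.append((narrower, depth + 1))

    return found
```

Because the traversal is breadth first, a URL that shows up for both "apple" and "apple tree" stays attributed to "apple", the search string closest to the original. The recorded depth could serve as a crude input to whatever ranking scheme you settle on.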

You may ask - why would someone need more than 1000 results per search string? Besides, the further down the ranking you go, the less relevant the results are to the original search string. In most cases - you're right. Yet, for some research purposes, not only would you need more than 1000 results - you might even prefer these to the "good" results returned in the first few pages.

Can anyone come up with some other (better?) idea to work around this limitation? If you have an idea - please drop me a line!

(Note: I'm using the term "search string" to indicate a complete search, regardless of the number of results pages you get. The term "query" refers to what retrieves one single results page, since the query also includes the result index at which the results page should start. In other words, all "search results" for a single "search string" are retrieved by sending multiple "queries" - if you have 100 results in each results page, you need to execute approximately 10 "queries" to retrieve all the results Google provides for that "search string".)
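
The arithmetic in the note can be sketched in a few lines of Python. The function name and the zero-based start index are my own assumptions for illustration; a real engine may count from 1 or use a differently named parameter.

```python
import math
from typing import List


def query_offsets(max_results: int = 1000, page_size: int = 100) -> List[int]:
    """Start index of each "query" needed to cover one "search string".

    Assumes a zero-based start index (an assumption, not any engine's documented behavior).
    """
    pages = math.ceil(max_results / page_size)
    return [page * page_size for page in range(pages)]


offsets = query_offsets()  # 1000 results at 100 per page
print(len(offsets), offsets[0], offsets[-1])
```

With the defaults this yields 10 start indexes, from 0 to 900, one per "query".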
