Topic-Sensitive PageRank
Various link-based ranking strategies, including HITS and PageRank, have been developed recently for improving Web-search query results. One proposal uses a term as a bias set for influencing the PageRank computation, with the goal of returning terms for which a given page has a high reputation. An approach for enhancing search rankings by generating a PageRank vector for each possible query term was also proposed recently, with favorable results. However, that approach requires considerable processing time and storage, and is not easily extended to make use of user and query context. Our approach to biasing the PageRank computation is novel in its use of a small number of representative basis topics, taken from the Open Directory, in conjunction with a unigram language model used to classify the query and query context.
In our work we consider two scenarios. In the first, we assume a user with a specific information need issues a query to our search engine in the conventional way, by entering a query into a search box. In this scenario, we determine the topics most closely associated with the query, and use the appropriate topic-sensitive PageRank vectors for ranking the documents satisfying the query. This ensures that the "importance" scores reflect a preference for the link structure of pages that have some bearing on the query. As with ordinary PageRank, the topic-sensitive PageRank score can be used as part of a scoring function that takes into account other IR-based scores.
In the second scenario, we assume the user is viewing a document (for instance, browsing the Web or reading email), and selects a term from the document for which they would like more information. This notion of search in context has been discussed in prior work. For instance, if a query for "architecture" is performed by highlighting a term in a document discussing famous building architects, we would like the result to be different than if the query "architecture" is performed by highlighting a term in a document on CPU design. By selecting the appropriate topic-sensitive PageRank vectors based on the context of the query, we hope to provide more accurate search results. Note that even when a query is issued in the conventional way, without highlighting a term, the history of queries issued constitutes a form of query context. Yet another source of context comes from the user who submitted the query. For instance, the user's bookmarks and browsing history could be used in selecting the appropriate topic-sensitive rank vectors.
A summary of our approach follows. During the offline processing of the Web crawl, we generate 16 topic-sensitive PageRank vectors, each biased using URLs from a top-level category of the Open Directory Project (ODP). At query time, we calculate the similarity of the query (and, if available, the query or user context) to each of these topics. Then, instead of using a single global ranking vector, we take the linear combination of the topic-sensitive vectors, weighted using the similarities of the query (and any available context) to the topics.
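The offline step can be illustrated with a small sketch. It is not the authors' implementation: the four-page graph, the topic names, the damping value of 0.85, the iteration count, and the handling of pages with no out-links are all assumptions made for illustration. The only idea taken from the text is that the random jump is restricted to a topic's pages rather than spread over the whole graph.

```python
def topic_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Power iteration for PageRank with a topic-biased jump distribution.

    links: dict mapping each page to the list of pages it links to.
    topic_pages: pages belonging to the topic (hypothetical stand-in for
    the URLs listed under one ODP category).
    """
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    topic_set = set(topic_pages) & set(pages)
    if not topic_set:
        raise ValueError("topic_pages must contain at least one page in the graph")

    # Jump distribution: uniform over the topic's pages, zero elsewhere.
    bias = {p: (1.0 / len(topic_set) if p in topic_set else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * bias[p] for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links hands its rank
                # back to the topic's pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] * bias[q]
        rank = new_rank
    return rank


# Toy graph and topics, invented for this example.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
sports_rank = topic_pagerank(toy_links, {"a"})
tech_rank = topic_pagerank(toy_links, {"d"})
```

In this toy graph, page "d" has no in-links, so it receives rank only when the jump distribution includes it. Each topic therefore yields a different vector over the same graph, which is what lets the query-time step choose among them.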
By using a set of rank vectors, we are able to determine more accurately which pages are truly the most important with respect to a particular query or query context. Because the link-based computations are performed offline, during the preprocessing stage, the query-time costs are not much greater than those of the ordinary PageRank algorithm.
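To make the query-time cost concrete, here is a minimal sketch of the two query-time steps: weighting topics with a unigram model, then blending precomputed vectors. The term counts, the two topic names, the add-one smoothing, the uniform topic prior, and the rank values are hypothetical and chosen only for illustration. The precomputed vectors are hard-coded here to stand in for the output of the offline stage.

```python
import math


def topic_weights(query_terms, term_counts, vocabulary_size):
    """Estimate P(topic | query) with a unigram model.

    Assumptions: a uniform prior over topics and add-one smoothing.
    """
    log_scores = {}
    for topic, counts in term_counts.items():
        total = sum(counts.values())
        log_p = 0.0  # the uniform prior is a shared constant, so it is omitted
        for term in query_terms:
            log_p += math.log((counts.get(term, 0) + 1) / (total + vocabulary_size))
        log_scores[topic] = log_p

    # Normalize in a numerically stable way.
    peak = max(log_scores.values())
    unnormalized = {t: math.exp(s - peak) for t, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {t: v / norm for t, v in unnormalized.items()}


def blended_rank(weights, topic_vectors):
    """Linear combination of topic-sensitive rank vectors."""
    pages = set().union(*(v.keys() for v in topic_vectors.values()))
    return {
        p: sum(weights[t] * topic_vectors[t].get(p, 0.0) for t in weights)
        for p in pages
    }


# Hypothetical per-topic term counts (for instance, from category pages).
toy_counts = {
    "arts": {"architecture": 8, "building": 5, "design": 3},
    "computers": {"architecture": 4, "cpu": 9, "design": 3},
}
# Hypothetical precomputed topic-sensitive rank vectors.
toy_vectors = {
    "arts": {"p1": 0.7, "p2": 0.3},
    "computers": {"p1": 0.2, "p2": 0.8},
}

weights = topic_weights(["architecture", "cpu"], toy_counts, vocabulary_size=4)
scores = blended_rank(weights, toy_vectors)
```

With the context term "cpu" included, the "computers" topic receives most of the weight, so page "p2" outranks "p1" in the blended scores. The query-time work is a few probability lookups plus one weighted sum per candidate page, with no link analysis involved.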

