com.hrstc.lucene.queryexpansion
Class QueryExpansion

java.lang.Object
  extended by com.hrstc.lucene.queryexpansion.QueryExpansion

public class QueryExpansion
extends java.lang.Object

Implements Rocchio's query expansion algorithm with pseudo-relevance feedback.

Query expansion is the process of a search engine adding search terms to a user's weighted search. The intent is to improve precision and/or recall. The additional terms may be taken from a thesaurus. For example, a search for "car" may be expanded to: car cars auto autos automobile automobiles [foldoc.org]. For the options that can be configured through the properties file, see the constants listed under Field Summary.

Created on February 23, 2005, 5:29 AM

TODO: Yahoo started providing an API to query the web; it could be nice to add a Yahoo implementation as well.

Author:
Neil O. Rouben

Field Summary
private  org.apache.lucene.analysis.Analyzer analyzer
           
static java.lang.String DECAY_FLD
          How much a document's importance decays as its rank gets higher.
static java.lang.String DOC_NUM_FLD
          Number of documents to use
static java.lang.String DOC_SOURCE_FLD
          Indicates which source to use to obtain documents {google, local, null}
static java.lang.String DOC_SOURCE_GOOGLE
          Get documents from Google
static java.lang.String DOC_SOURCE_LOCAL
          Get documents from the local repository
private  java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms
           
private static java.util.logging.Logger logger
           
static java.lang.String METHOD_FLD
          Indicates which method to use for QE
private  java.util.Properties prop
           
static java.lang.String ROCCHIO_ALPHA_FLD
          Rocchio Params
static java.lang.String ROCCHIO_BETA_FLD
           
static java.lang.String ROCCHIO_METHOD
           
private  org.apache.lucene.search.Searcher searcher
           
private  org.apache.lucene.search.Similarity similarity
           
static java.lang.String TERM_NUM_FLD
          Number of terms to produce
 
Constructor Summary
QueryExpansion(org.apache.lucene.analysis.Analyzer analyzer, org.apache.lucene.search.Searcher searcher, org.apache.lucene.search.Similarity similarity, java.util.Properties prop)
          Creates a new instance of QueryExpansion
 
Method Summary
 org.apache.lucene.search.Query adjust(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTermsVector, java.lang.String queryStr, float alpha, float beta, float decay, int docsRelevantCount, int maxExpandedQueryTerms)
          Adjusts the term features of the documents: scales the query terms by alpha and the document terms by beta, and assigns weights/boosts to terms (tf*idf).
 java.util.Vector<org.apache.lucene.search.TermQuery> combine(java.util.Vector<org.apache.lucene.search.TermQuery> queryTerms, java.util.Vector<org.apache.lucene.search.TermQuery> docsTerms)
          Combines weights according to the expansion formula
 org.apache.lucene.search.Query expandQuery(java.lang.String queryStr, org.apache.lucene.search.Hits hits, java.util.Properties prop)
          Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + ( beta / relevantDocsCount ) * Sum ( rel docs vector )
 org.apache.lucene.search.Query expandQuery(java.lang.String queryStr, java.util.Vector<org.apache.lucene.document.Document> hits, java.util.Properties prop)
          Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + ( beta / relevantDocsCount ) * Sum ( rel docs vector )
 org.apache.lucene.search.TermQuery find(org.apache.lucene.search.TermQuery term, java.util.Vector<org.apache.lucene.search.TermQuery> terms)
          Finds the term that is equal to the given term
private  java.util.Vector<org.apache.lucene.document.Document> getDocs(java.lang.String query, org.apache.lucene.search.Hits hits, java.util.Properties prop)
          Gets documents that will be used in query expansion.
 java.util.Vector<org.apache.lucene.search.QueryTermVector> getDocsTerms(java.util.Vector<org.apache.lucene.document.Document> hits, int docsRelevantCount, org.apache.lucene.analysis.Analyzer analyzer)
          Extracts the terms of the documents and adds them to a vector in the same order
 java.util.Vector<org.apache.lucene.search.TermQuery> getExpandedTerms()
          Returns QueryExpansion.TERM_NUM_FLD expanded terms from the most recent query
private  void merge(java.util.Vector<org.apache.lucene.search.TermQuery> terms)
          Gets rid of duplicates by merging termQueries with equal terms
 org.apache.lucene.search.Query mergeQueries(java.util.Vector<org.apache.lucene.search.TermQuery> termQueries, int maxTerms)
          Merges termQueries into a single query.
 java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(org.apache.lucene.search.QueryTermVector termVector, float factor)
          Sets boost of terms.
 java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTerms, float factor, float decayFactor)
          Sets boost of terms.
private  void setExpandedTerms(java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

METHOD_FLD

public static final java.lang.String METHOD_FLD
Indicates which method to use for QE

See Also:
Constant Field Values

ROCCHIO_METHOD

public static final java.lang.String ROCCHIO_METHOD
See Also:
Constant Field Values

DECAY_FLD

public static final java.lang.String DECAY_FLD
How much a document's importance decays as its rank gets higher: decay = decay * rank. A value of 0 means no decay.

See Also:
Constant Field Values

DOC_NUM_FLD

public static final java.lang.String DOC_NUM_FLD
Number of documents to use

See Also:
Constant Field Values

TERM_NUM_FLD

public static final java.lang.String TERM_NUM_FLD
Number of terms to produce

See Also:
Constant Field Values

DOC_SOURCE_FLD

public static final java.lang.String DOC_SOURCE_FLD
Indicates which source to use to obtain documents {google, local, null}

See Also:
Constant Field Values

DOC_SOURCE_LOCAL

public static final java.lang.String DOC_SOURCE_LOCAL
Get documents from the local repository

See Also:
Constant Field Values

DOC_SOURCE_GOOGLE

public static final java.lang.String DOC_SOURCE_GOOGLE
Get documents from Google

See Also:
Constant Field Values

ROCCHIO_ALPHA_FLD

public static final java.lang.String ROCCHIO_ALPHA_FLD
Rocchio Params

See Also:
Constant Field Values

ROCCHIO_BETA_FLD

public static final java.lang.String ROCCHIO_BETA_FLD
See Also:
Constant Field Values

prop

private java.util.Properties prop

analyzer

private org.apache.lucene.analysis.Analyzer analyzer

searcher

private org.apache.lucene.search.Searcher searcher

similarity

private org.apache.lucene.search.Similarity similarity

expandedTerms

private java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms

logger

private static java.util.logging.Logger logger

Constructor Detail

QueryExpansion

public QueryExpansion(org.apache.lucene.analysis.Analyzer analyzer,
                      org.apache.lucene.search.Searcher searcher,
                      org.apache.lucene.search.Similarity similarity,
                      java.util.Properties prop)
Creates a new instance of QueryExpansion

Parameters:
similarity -
analyzer - used to parse documents to extract terms
searcher - used to obtain idf
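
The Properties object passed to the constructor and to expandQuery is keyed by the *_FLD constants listed above. As a minimal sketch of how such settings might be read with fallbacks, using only java.util.Properties: the key strings, default values, and helper names below are hypothetical placeholders, because the real keys are the values of the QueryExpansion constants (see Constant Field Values), which are not reproduced on this page.

```java
import java.util.Properties;

final class ExpansionSettingsSketch {
    // Hypothetical placeholder keys. In real use these would be
    // QueryExpansion.ROCCHIO_ALPHA_FLD, ROCCHIO_BETA_FLD and DOC_NUM_FLD.
    static final String ALPHA_KEY = "example.rocchio.alpha";
    static final String BETA_KEY = "example.rocchio.beta";
    static final String DOC_NUM_KEY = "example.doc.num";

    /** Reads a float setting, returning the fallback when the key is absent. */
    static float floatSetting(Properties prop, String key, float fallback) {
        String raw = prop.getProperty(key);
        return raw == null ? fallback : Float.parseFloat(raw.trim());
    }

    /** Reads an int setting, returning the fallback when the key is absent. */
    static int intSetting(Properties prop, String key, int fallback) {
        String raw = prop.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.setProperty(ALPHA_KEY, "1.0");
        prop.setProperty(DOC_NUM_KEY, "10");
        System.out.println(floatSetting(prop, ALPHA_KEY, 1.0f));
        System.out.println(floatSetting(prop, BETA_KEY, 0.75f)); // key absent, fallback used
        System.out.println(intSetting(prop, DOC_NUM_KEY, 5));
    }
}
```

The fallback values are illustrative only; this page does not state what defaults, if any, QueryExpansion applies.
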
Method Detail

expandQuery

public org.apache.lucene.search.Query expandQuery(java.lang.String queryStr,
                                                  org.apache.lucene.search.Hits hits,
                                                  java.util.Properties prop)
                                           throws java.io.IOException,
                                                  org.apache.lucene.queryParser.ParseException
Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + ( beta / relevantDocsCount ) * Sum ( rel docs vector )

Parameters:
queryStr - query that will be expanded
hits - hits from the original query to use for expansion
prop - properties that contain the values needed to perform the query; see the constants for field names and values
Returns:
expandedQuery
Throws:
java.io.IOException
org.apache.lucene.queryParser.ParseException
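
The formula above can be illustrated on plain term-weight maps. This is a sketch of the arithmetic only, not the class's actual code: the class name RocchioSketch, the expand helper, and the use of Map<String, Double> as a stand-in for Lucene term vectors are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RocchioSketch {
    /** Computes qm = alpha * query + (beta / docs.size()) * sum(docs), term by term. */
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> docs,
                                      double alpha, double beta) {
        Map<String, Double> expanded = new HashMap<>();
        // alpha * query
        for (Map.Entry<String, Double> e : query.entrySet()) {
            expanded.merge(e.getKey(), alpha * e.getValue(), Double::sum);
        }
        if (docs.isEmpty()) {
            return expanded; // no feedback documents: only the scaled query remains
        }
        // (beta / relevantDocsCount) * sum of the document vectors
        double perDoc = beta / docs.size();
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                expanded.merge(e.getKey(), perDoc * e.getValue(), Double::sum);
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("car", 1.0);
        List<Map<String, Double>> docs = List.of(
                Map.of("car", 2.0, "auto", 4.0),
                Map.of("automobile", 2.0));
        // With alpha = 1.0 and beta = 0.5: car = 1.5, auto = 1.0, automobile = 0.5
        System.out.println(expand(query, docs, 1.0, 0.5));
    }
}
```

Terms that occur only in the feedback documents ("auto", "automobile") enter the expanded query with a weight driven by beta alone, which is how the expansion adds new search terms.
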

getDocs

private java.util.Vector<org.apache.lucene.document.Document> getDocs(java.lang.String query,
                                                                      org.apache.lucene.search.Hits hits,
                                                                      java.util.Properties prop)
                                                               throws java.io.IOException
Gets the documents that will be used in query expansion: the number of documents indicated by QueryExpansion.DOC_NUM_FLD, taken from the source indicated by QueryExpansion.DOC_SOURCE_FLD.

Parameters:
query - query for which expansion is being performed
hits - hits to use in case QueryExpansion.DOC_SOURCE_FLD is not specified
prop - uses QueryExpansion.DOC_SOURCE_FLD to determine where to get docs
Returns:
the number of docs indicated by QueryExpansion.DOC_NUM_FLD from QueryExpansion.DOC_SOURCE_FLD
Throws:
java.io.IOException
com.google.soap.search.GoogleSearchFault

expandQuery

public org.apache.lucene.search.Query expandQuery(java.lang.String queryStr,
                                                  java.util.Vector<org.apache.lucene.document.Document> hits,
                                                  java.util.Properties prop)
                                           throws java.io.IOException,
                                                  org.apache.lucene.queryParser.ParseException
Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + ( beta / relevantDocsCount ) * Sum ( rel docs vector )

Parameters:
queryStr - query that will be expanded
hits - hits from the original query to use for expansion
prop - properties that contain the values needed to perform the query; see the constants for field names and values
Returns:
Throws:
java.io.IOException
org.apache.lucene.queryParser.ParseException

adjust

public org.apache.lucene.search.Query adjust(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTermsVector,
                                             java.lang.String queryStr,
                                             float alpha,
                                             float beta,
                                             float decay,
                                             int docsRelevantCount,
                                             int maxExpandedQueryTerms)
                                      throws java.io.IOException,
                                             org.apache.lucene.queryParser.ParseException
Adjusts the term features of the documents: scales the query terms by alpha and the document terms by beta, and assigns weights/boosts to terms (tf*idf).

Parameters:
docsTermsVector - terms of the top docsRelevantCount documents returned by the original query
queryStr - query that will be expanded
alpha - factor of the equation
beta - factor of the equation
docsRelevantCount - number of the top documents to assume to be relevant
maxExpandedQueryTerms - maximum number of terms in the expanded query
Returns:
expandedQuery with boost factors adjusted using Rocchio's algorithm
Throws:
java.io.IOException
org.apache.lucene.queryParser.ParseException

mergeQueries

public org.apache.lucene.search.Query mergeQueries(java.util.Vector<org.apache.lucene.search.TermQuery> termQueries,
                                                   int maxTerms)
                                            throws org.apache.lucene.queryParser.ParseException
Merges termQueries into a single query. In the future this method should probably live in the Query class. This is an awkward way of doing it, but the only merge method available is mergeBooleanQueries, so the method has to build a string of the form term1^boost1, term2^boost2 and then parse it into a query.

Parameters:
termQueries - term queries to merge
Returns:
query created from termQueries including boost parameters
Throws:
org.apache.lucene.queryParser.ParseException
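
The string-building step described above might look like the following sketch. It covers only the construction of the boosted query string, not the parsing. The class and helper names are hypothetical, the terms are separated by spaces here, and keeping the highest-boosted terms when truncating to maxTerms is an assumption; this page does not say how the real method orders or truncates terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BoostedQueryStringSketch {
    /** Builds "term1^boost1 term2^boost2 ..." from at most maxTerms entries. */
    static String toQueryString(Map<String, Float> boosts, int maxTerms) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(boosts.entrySet());
        // Assumption: highest boost first, so truncation keeps the strongest terms.
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < entries.size() && i < maxTerms; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(entries.get(i).getKey()).append('^').append(entries.get(i).getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("car", 1.5f);
        boosts.put("auto", 2.0f);
        boosts.put("automobile", 0.5f);
        System.out.println(toQueryString(boosts, 2)); // prints "auto^2.0 car^1.5"
    }
}
```

The resulting string would then be handed to a query parser. Terms containing query-syntax characters would need escaping first, which this sketch does not attempt.
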

getDocsTerms

public java.util.Vector<org.apache.lucene.search.QueryTermVector> getDocsTerms(java.util.Vector<org.apache.lucene.document.Document> hits,
                                                                               int docsRelevantCount,
                                                                               org.apache.lucene.analysis.Analyzer analyzer)
                                                                        throws java.io.IOException
Extracts the terms of the documents and adds them to a vector in the same order as the documents.

Parameters:
hits - documents from which to extract terms
docsRelevantCount - number of the top documents to assume to be relevant
analyzer - analyzer used to extract terms
Returns:
docsTerms; the docs must be in order
Throws:
java.io.IOException
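
To make the "same order" requirement concrete, here is a sketch that extracts per-document term frequencies while preserving rank order. It is not the Lucene-based implementation: lowercasing and whitespace splitting stand in for the Analyzer, plain strings stand in for Document objects, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DocTermsSketch {
    /** Term frequencies for one text; a lowercase whitespace split stands in for an Analyzer. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    /** One frequency map per document, for at most docsRelevantCount top documents, in rank order. */
    static List<Map<String, Integer>> docsTerms(List<String> rankedTexts, int docsRelevantCount) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (int i = 0; i < rankedTexts.size() && i < docsRelevantCount; i++) {
            result.add(termFrequencies(rankedTexts.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("Car car auto", "bike", "ignored third document");
        System.out.println(docsTerms(ranked, 2)); // only the top two documents are used
    }
}
```

Keeping the list index equal to the document's rank matters because the rank-based decay in setBoost depends on each document's position.
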

setBoost

public java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(org.apache.lucene.search.QueryTermVector termVector,
                                                                     float factor)
                                                              throws java.io.IOException
Sets the boost of terms: boost = weight = factor * (tf * idf)

Parameters:
termVector -
factor - adjustment factor (e.g. alpha or beta)
Throws:
java.io.IOException
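
The boost rule above, boost = factor * (tf * idf), can be sketched without Lucene as follows. The idf function is passed in as a parameter because the real class obtains idf from its Similarity and Searcher; the class name, method name, and map-based representation here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

final class TfIdfBoostSketch {
    /** Returns boost = factor * (tf * idf) for every term in termFreqs. */
    static Map<String, Double> boosts(Map<String, Integer> termFreqs,
                                      ToDoubleFunction<String> idf,
                                      double factor) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double weight = e.getValue() * idf.applyAsDouble(e.getKey()); // tf * idf
            result.put(e.getKey(), factor * weight);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("car", 3);
        tf.put("auto", 1);
        // Assumption for the demo: a constant idf of 2.0 for every term.
        System.out.println(boosts(tf, term -> 2.0, 0.5));
    }
}
```
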

setBoost

public java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTerms,
                                                                     float factor,
                                                                     float decayFactor)
                                                              throws java.io.IOException
Sets the boost of terms: boost = weight = factor * (tf * idf)

Parameters:
docsTerms -
factor - adjustment factor (e.g. alpha or beta)
Throws:
java.io.IOException
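
This page does not spell out how decayFactor enters the weight. One plausible reading of the DECAY_FLD description (decay = decay * rank, with 0 meaning no decay) is that the decay grows linearly with the document's rank and reduces the weight proportionally. The sketch below encodes that assumption; the zero-based rank, the (1 - decay) scaling, and the floor at zero are all illustrative choices, not confirmed behavior.

```java
final class RankDecaySketch {
    /**
     * Assumed rule: decay = decayFactor * rank, weight' = weight * (1 - decay),
     * floored at 0 so that low-ranked documents never contribute negatively.
     */
    static double decayedWeight(double weight, double decayFactor, int rank) {
        double decay = decayFactor * rank; // grows linearly with rank; rank 0 is the top hit
        return Math.max(0.0, weight * (1.0 - decay));
    }

    public static void main(String[] args) {
        for (int rank = 0; rank < 4; rank++) {
            System.out.println("rank " + rank + ": " + decayedWeight(4.0, 0.25, rank));
        }
    }
}
```

Under this reading, a decayFactor of 0 leaves every document's terms at full weight, matching the "0 - no decay" note on DECAY_FLD.
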

merge

private void merge(java.util.Vector<org.apache.lucene.search.TermQuery> terms)
Removes duplicates by merging termQueries with equal terms.

Parameters:
terms -
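
A sketch of duplicate merging on a simplified term type follows. The BoostedTerm class is a hypothetical stand-in for TermQuery, and summing the boosts of equal terms is an assumption; this page only says that duplicates are merged. The real method is void and presumably modifies the vector in place, whereas this sketch returns a new list for clarity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergeDuplicatesSketch {
    /** Hypothetical stand-in for a TermQuery: term text plus boost. */
    static final class BoostedTerm {
        final String text;
        final float boost;

        BoostedTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    /** Merges entries with equal term text by summing their boosts; keeps first-seen order. */
    static List<BoostedTerm> merge(List<BoostedTerm> terms) {
        Map<String, Float> summed = new LinkedHashMap<>();
        for (BoostedTerm t : terms) {
            summed.merge(t.text, t.boost, Float::sum);
        }
        List<BoostedTerm> merged = new ArrayList<>();
        for (Map.Entry<String, Float> e : summed.entrySet()) {
            merged.add(new BoostedTerm(e.getKey(), e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<BoostedTerm> terms = List.of(
                new BoostedTerm("car", 1.0f),
                new BoostedTerm("auto", 2.0f),
                new BoostedTerm("car", 0.5f));
        for (BoostedTerm t : merge(terms)) {
            System.out.println(t.text + "^" + t.boost);
        }
    }
}
```
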

combine

public java.util.Vector<org.apache.lucene.search.TermQuery> combine(java.util.Vector<org.apache.lucene.search.TermQuery> queryTerms,
                                                                    java.util.Vector<org.apache.lucene.search.TermQuery> docsTerms)
Combines weights according to the expansion formula.


find

public org.apache.lucene.search.TermQuery find(org.apache.lucene.search.TermQuery term,
                                               java.util.Vector<org.apache.lucene.search.TermQuery> terms)
Finds the term that is equal to the given term.

Returns:
the matching term, or null if not found

getExpandedTerms

public java.util.Vector<org.apache.lucene.search.TermQuery> getExpandedTerms()
Returns QueryExpansion.TERM_NUM_FLD expanded terms from the most recent query

Returns:

setExpandedTerms

private void setExpandedTerms(java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms)