QueryExpansion


com.hrstc.lucene.queryexpansion
Class QueryExpansion

java.lang.Object
  com.hrstc.lucene.queryexpansion.QueryExpansion

public class QueryExpansion
extends java.lang.Object

Implements Rocchio's pseudo-feedback query expansion algorithm.

Query Expansion - Adding search terms to a user's search. Query expansion is the process of a search engine adding search terms to a user's weighted search. The intent is to improve precision and/or recall. The additional terms may be taken from a thesaurus. For example, a search for "car" may be expanded to: car cars auto autos automobile automobiles [foldoc.org]. For the options that can be configured through the properties file, see the constant fields of this class.

Created on February 23, 2005, 5:29 AM

TODO: Yahoo has started providing an API to query the web; it would be nice to add a Yahoo implementation as well.

Author:: Neil O. Rouben

Field Summary
`private org.apache.lucene.analysis.Analyzer`	`analyzer`
`static java.lang.String`	`DECAY_FLD` How much the importance of a document decays as its rank gets higher.
`static java.lang.String`	`DOC_NUM_FLD` Number of documents to use
`static java.lang.String`	`DOC_SOURCE_FLD` Indicates what source to use to obtain documents {google, local, null}
`static java.lang.String`	`DOC_SOURCE_GOOGLE` Get documents from Google
`static java.lang.String`	`DOC_SOURCE_LOCAL` Get documents from the local repository
`private java.util.Vector<org.apache.lucene.search.TermQuery>`	`expandedTerms`
`private static java.util.logging.Logger`	`logger`
`static java.lang.String`	`METHOD_FLD` Indicates which method to use for QE
`private java.util.Properties`	`prop`
`static java.lang.String`	`ROCCHIO_ALPHA_FLD` Rocchio Params
`static java.lang.String`	`ROCCHIO_BETA_FLD`
`static java.lang.String`	`ROCCHIO_METHOD`
`private org.apache.lucene.search.Searcher`	`searcher`
`private org.apache.lucene.search.Similarity`	`similarity`
`static java.lang.String`	`TERM_NUM_FLD` Number of terms to produce

Constructor Summary
`QueryExpansion(org.apache.lucene.analysis.Analyzer analyzer, org.apache.lucene.search.Searcher searcher, org.apache.lucene.search.Similarity similarity, java.util.Properties prop)` Creates a new instance of QueryExpansion

Method Summary
`org.apache.lucene.search.Query`	`adjust(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTermsVector, java.lang.String queryStr, float alpha, float beta, float decay, int docsRelevantCount, int maxExpandedQueryTerms)` Adjusts the term features of the docs with alpha * query and beta, and assigns weights/boosts to terms (tf*idf).
`java.util.Vector<org.apache.lucene.search.TermQuery>`	`combine(java.util.Vector<org.apache.lucene.search.TermQuery> queryTerms, java.util.Vector<org.apache.lucene.search.TermQuery> docsTerms)` Combines weights according to the expansion formula
`org.apache.lucene.search.Query`	`expandQuery(java.lang.String queryStr, org.apache.lucene.search.Hits hits, java.util.Properties prop)` Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + (beta / relevantDocsCount) * Sum(rel docs vector)
`org.apache.lucene.search.Query`	`expandQuery(java.lang.String queryStr, java.util.Vector<org.apache.lucene.document.Document> hits, java.util.Properties prop)` Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + (beta / relevantDocsCount) * Sum(rel docs vector)
`org.apache.lucene.search.TermQuery`	`find(org.apache.lucene.search.TermQuery term, java.util.Vector<org.apache.lucene.search.TermQuery> terms)` Finds the term in `terms` that is equal to `term`
`private java.util.Vector<org.apache.lucene.document.Document>`	`getDocs(java.lang.String query, org.apache.lucene.search.Hits hits, java.util.Properties prop)` Gets documents that will be used in query expansion.
`java.util.Vector<org.apache.lucene.search.QueryTermVector>`	`getDocsTerms(java.util.Vector<org.apache.lucene.document.Document> hits, int docsRelevantCount, org.apache.lucene.analysis.Analyzer analyzer)` Extracts the terms of the documents and adds them to a vector in the same order
`java.util.Vector<org.apache.lucene.search.TermQuery>`	`getExpandedTerms()` Returns `QueryExpansion.TERM_NUM_FLD` expanded terms from the most recent query
`private void`	`merge(java.util.Vector<org.apache.lucene.search.TermQuery> terms)` Gets rid of duplicates by merging termQueries with equal terms
`org.apache.lucene.search.Query`	`mergeQueries(java.util.Vector<org.apache.lucene.search.TermQuery> termQueries, int maxTerms)` Merges `termQueries` into a single query.
`java.util.Vector<org.apache.lucene.search.TermQuery>`	`setBoost(org.apache.lucene.search.QueryTermVector termVector, float factor)` Sets boost of terms.
`java.util.Vector<org.apache.lucene.search.TermQuery>`	`setBoost(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTerms, float factor, float decayFactor)` Sets boost of terms.
`private void`	`setExpandedTerms(java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

METHOD_FLD

public static final java.lang.String METHOD_FLD

Indicates which method to use for QE

See Also:: Constant Field Values

ROCCHIO_METHOD

public static final java.lang.String ROCCHIO_METHOD

See Also:: Constant Field Values

DECAY_FLD

public static final java.lang.String DECAY_FLD

How much the importance of a document decays as its rank gets higher: decay = decay * rank. A value of 0 means no decay.

See Also:: Constant Field Values

DOC_NUM_FLD

public static final java.lang.String DOC_NUM_FLD

Number of documents to use

See Also:: Constant Field Values

TERM_NUM_FLD

public static final java.lang.String TERM_NUM_FLD

Number of terms to produce

See Also:: Constant Field Values

DOC_SOURCE_FLD

public static final java.lang.String DOC_SOURCE_FLD

Indicates what source to use to obtain documents {google, local, null}

See Also:: Constant Field Values

DOC_SOURCE_LOCAL

public static final java.lang.String DOC_SOURCE_LOCAL

Get documents from the local repository

See Also:: Constant Field Values

DOC_SOURCE_GOOGLE

public static final java.lang.String DOC_SOURCE_GOOGLE

Get documents from Google

See Also:: Constant Field Values

ROCCHIO_ALPHA_FLD

public static final java.lang.String ROCCHIO_ALPHA_FLD

Rocchio Params

See Also:: Constant Field Values

ROCCHIO_BETA_FLD

public static final java.lang.String ROCCHIO_BETA_FLD

See Also:: Constant Field Values
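The constants above name the keys that are looked up in the `java.util.Properties` object passed to this class. Their actual string values are listed under Constant Field Values and are not reproduced on this page, so the following is only a minimal sketch of reading such settings: the key strings, the fallback values, and the class name `QeSettingsSketch` are all hypothetical. Real code would presumably pass `QueryExpansion.ROCCHIO_ALPHA_FLD` and the other constants as keys instead.

```java
import java.util.Properties;

// Sketch only. The key strings are placeholders, NOT the real values of
// QueryExpansion.ROCCHIO_ALPHA_FLD and the other constants.
final class QeSettingsSketch {
    static final String ALPHA_KEY = "example.rocchio.alpha"; // hypothetical key
    static final String BETA_KEY = "example.rocchio.beta";   // hypothetical key
    static final String DECAY_KEY = "example.decay";         // hypothetical key
    static final String DOC_NUM_KEY = "example.doc.num";     // hypothetical key
    static final String TERM_NUM_KEY = "example.term.num";   // hypothetical key

    final float alpha;
    final float beta;
    final float decay;
    final int docNum;
    final int termNum;

    QeSettingsSketch(Properties prop) {
        // The fallback values are illustrative, not documented defaults.
        alpha = Float.parseFloat(prop.getProperty(ALPHA_KEY, "1.0"));
        beta = Float.parseFloat(prop.getProperty(BETA_KEY, "0.75"));
        decay = Float.parseFloat(prop.getProperty(DECAY_KEY, "0.0"));
        docNum = Integer.parseInt(prop.getProperty(DOC_NUM_KEY, "10"));
        termNum = Integer.parseInt(prop.getProperty(TERM_NUM_KEY, "20"));
    }

    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.setProperty(ALPHA_KEY, "0.5");
        QeSettingsSketch settings = new QeSettingsSketch(prop);
        System.out.println(settings.alpha + " " + settings.beta + " " + settings.docNum);
    }
}
```

Parsing all values once, up front, means a malformed number fails early with a `NumberFormatException` rather than in the middle of an expansion.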

prop

private java.util.Properties prop

analyzer

private org.apache.lucene.analysis.Analyzer analyzer

searcher

private org.apache.lucene.search.Searcher searcher

similarity

private org.apache.lucene.search.Similarity similarity

expandedTerms

private java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms

logger

private static java.util.logging.Logger logger

Constructor Detail

QueryExpansion

public QueryExpansion(org.apache.lucene.analysis.Analyzer analyzer,
                      org.apache.lucene.search.Searcher searcher,
                      org.apache.lucene.search.Similarity similarity,
                      java.util.Properties prop)

Creates a new instance of QueryExpansion

Parameters:: similarity -; analyzer - used to parse documents to extract terms; searcher - used to obtain idf

Method Detail

expandQuery

public org.apache.lucene.search.Query expandQuery(java.lang.String queryStr,
                                                  org.apache.lucene.search.Hits hits,
                                                  java.util.Properties prop)
                                           throws java.io.IOException,
                                                  org.apache.lucene.queryParser.ParseException

Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + (beta / relevantDocsCount) * Sum(rel docs vector)

Parameters:: queryStr - query that will be expanded; hits - hits from the original query to use for expansion; prop - properties that contain the values necessary to perform the query; see constants for field names and values
Returns:: expandedQuery
Throws:: java.io.IOException; org.apache.lucene.queryParser.ParseException
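To make the formula concrete, here is a minimal sketch of the Rocchio update on plain term-weight maps. It does not use Lucene and is not this class's implementation: the class name `RocchioSketch`, the method `expand`, and the map-based representation of vectors are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of qm = alpha * query + (beta / relevantDocsCount) * Sum(rel docs vector),
// with each vector represented as a term -> weight map (an assumption).
final class RocchioSketch {
    static Map<String, Float> expand(Map<String, Float> query,
                                     List<Map<String, Float>> relevantDocs,
                                     float alpha, float beta) {
        Map<String, Float> expanded = new HashMap<>();
        // alpha * query
        for (Map.Entry<String, Float> e : query.entrySet()) {
            expanded.merge(e.getKey(), alpha * e.getValue(), Float::sum);
        }
        if (relevantDocs.isEmpty()) {
            return expanded; // nothing to feed back; avoids dividing by zero
        }
        // (beta / relevantDocsCount) * sum of the relevant document vectors
        float docFactor = beta / relevantDocs.size();
        for (Map<String, Float> doc : relevantDocs) {
            for (Map.Entry<String, Float> e : doc.entrySet()) {
                expanded.merge(e.getKey(), docFactor * e.getValue(), Float::sum);
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, Float> query = new HashMap<>();
        query.put("car", 1.0f);
        Map<String, Float> doc = new HashMap<>();
        doc.put("car", 2.0f);
        doc.put("auto", 4.0f);
        List<Map<String, Float>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(expand(query, docs, 1.0f, 0.5f));
    }
}
```

In pseudo feedback the "relevant" documents are simply the top-ranked hits of the original query, so no user judgment is involved.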

getDocs

private java.util.Vector<org.apache.lucene.document.Document> getDocs(java.lang.String query,
                                                                      org.apache.lucene.search.Hits hits,
                                                                      java.util.Properties prop)
                                                               throws java.io.IOException

Gets the documents that will be used in query expansion: the number of docs indicated by QueryExpansion.DOC_NUM_FLD, taken from the source indicated by QueryExpansion.DOC_SOURCE_FLD.

Parameters:: query - query for which expansion is being performed; hits - hits to use in case QueryExpansion.DOC_SOURCE_FLD is not specified; prop - uses QueryExpansion.DOC_SOURCE_FLD to determine where to get docs
Returns:: the number of docs indicated by QueryExpansion.DOC_NUM_FLD from QueryExpansion.DOC_SOURCE_FLD
Throws:: java.io.IOException; com.google.soap.search.GoogleSearchFault

expandQuery

public org.apache.lucene.search.Query expandQuery(java.lang.String queryStr,
                                                  java.util.Vector<org.apache.lucene.document.Document> hits,
                                                  java.util.Properties prop)
                                           throws java.io.IOException,
                                                  org.apache.lucene.queryParser.ParseException

Performs Rocchio's query expansion with pseudo feedback: qm = alpha * query + (beta / relevantDocsCount) * Sum(rel docs vector)

Parameters:: queryStr - query that will be expanded; hits - documents from the original query to use for expansion; prop - properties that contain the values necessary to perform the query; see constants for field names and values
Returns:: expandedQuery
Throws:: java.io.IOException; org.apache.lucene.queryParser.ParseException

adjust

public org.apache.lucene.search.Query adjust(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTermsVector,
                                             java.lang.String queryStr,
                                             float alpha,
                                             float beta,
                                             float decay,
                                             int docsRelevantCount,
                                             int maxExpandedQueryTerms)
                                      throws java.io.IOException,
                                             org.apache.lucene.queryParser.ParseException

Adjusts the term features of the docs with alpha * query and beta, and assigns weights/boosts to terms (tf*idf).

Parameters:: docsTermsVector - terms of the top docsRelevantCount documents returned by the original query; queryStr - query that will be expanded; alpha - factor of the equation; beta - factor of the equation; docsRelevantCount - number of top documents to assume to be relevant; maxExpandedQueryTerms - maximum number of terms in the expanded query
Returns:: expandedQuery with boost factors adjusted using Rocchio's algorithm
Throws:: java.io.IOException; org.apache.lucene.queryParser.ParseException

mergeQueries

public org.apache.lucene.search.Query mergeQueries(java.util.Vector<org.apache.lucene.search.TermQuery> termQueries,
                                                   int maxTerms)
                                            throws org.apache.lucene.queryParser.ParseException

Merges termQueries into a single query. In the future this method should probably live in the Query class. This is an awkward way of doing it, but the only merge method available is mergeBooleanQueries, so the terms have to be written out as a string (term1^boost1, term2^boost2) that is then parsed into a query.

Parameters:: termQueries - term queries to merge
Returns:: query created from termQueries including boost parameters
Throws:: org.apache.lucene.queryParser.ParseException
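The string-building step described above can be sketched without Lucene as follows. This is an illustration under stated assumptions, not the method's actual code: the class `BoostedQueryStringSketch` and its method are hypothetical, terms are separated by spaces, and truncation to `maxTerms` keeps the highest-boosted terms first. The `term^boost` form itself is the standard Lucene query parser boost syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: serialize term -> boost pairs as "term1^boost1 term2^boost2",
// which a query parser could then turn back into a single query.
final class BoostedQueryStringSketch {
    static String toQueryString(Map<String, Float> termBoosts, int maxTerms) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(termBoosts.entrySet());
        // Assumption: keep the strongest terms when truncating to maxTerms.
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        StringBuilder sb = new StringBuilder();
        int limit = Math.min(maxTerms, entries.size());
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(entries.get(i).getKey()).append('^').append(entries.get(i).getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("car", 2.0f);
        boosts.put("auto", 0.5f);
        boosts.put("bus", 1.0f);
        System.out.println(toQueryString(boosts, 2));
    }
}
```

A production version would also need to escape query-syntax characters in the term text before parsing; the sketch assumes plain alphanumeric terms.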

getDocsTerms

public java.util.Vector<org.apache.lucene.search.QueryTermVector> getDocsTerms(java.util.Vector<org.apache.lucene.document.Document> hits,
                                                                               int docsRelevantCount,
                                                                               org.apache.lucene.analysis.Analyzer analyzer)
                                                                        throws java.io.IOException

Extracts the terms of the documents and adds them to a vector in the same order.

Parameters:: hits - documents from which to extract terms; docsRelevantCount - number of top documents to assume to be relevant; analyzer - analyzer used to extract terms
Returns:: docsTerms; docs must be in order
Throws:: java.io.IOException
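As a rough illustration of this extraction step, the sketch below builds one term-frequency map per document and keeps the documents in rank order. It is not the class's code: a lowercase split on non-word characters stands in for the Lucene `Analyzer`, plain strings stand in for `Document` objects, and the names `DocTermsSketch` and `docsTerms` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch: one term -> frequency map per document, in the same order as the input.
final class DocTermsSketch {
    static List<Map<String, Integer>> docsTerms(List<String> docTexts, int docsRelevantCount) {
        List<Map<String, Integer>> result = new ArrayList<>();
        // Only the top docsRelevantCount documents are assumed to be relevant.
        int limit = Math.min(docsRelevantCount, docTexts.size());
        for (int i = 0; i < limit; i++) {
            Map<String, Integer> termFrequencies = new LinkedHashMap<>();
            // Stand-in for an Analyzer: lowercase, then split on non-word characters.
            for (String token : docTexts.get(i).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    termFrequencies.merge(token, 1, Integer::sum);
                }
            }
            result.add(termFrequencies);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(docsTerms(Arrays.asList("Car car auto", "bus", "ignored"), 2));
    }
}
```

Preserving the order matters because the position of a document in the result is later used as its rank when decay is applied.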

setBoost

public java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(org.apache.lucene.search.QueryTermVector termVector,
                                                                     float factor)
                                                              throws java.io.IOException

Sets the boost of terms: boost = weight = factor * (tf * idf)

Parameters:: termVector -; factor - adjustment factor (e.g. alpha or beta)
Throws:: java.io.IOException
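The boost formula can be worked through with a small sketch. The idf shown here, log(numDocs / (docFreq + 1)) + 1, is the one used by Lucene's classic `DefaultSimilarity`; whether it applies in practice depends on the `Similarity` instance passed to the constructor, so treat it as an assumption. The class and method names are hypothetical.

```java
// Sketch of boost = factor * (tf * idf) for a single term.
final class TfIdfBoostSketch {
    // Assumption: idf as in Lucene's classic DefaultSimilarity.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // factor is the adjustment factor, e.g. alpha for query terms or beta for doc terms.
    static float boost(float factor, int termFrequency, int docFreq, int numDocs) {
        return factor * termFrequency * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in the text and in 9 of 10 indexed documents.
        System.out.println(boost(0.5f, 4, 9, 10));
    }
}
```

Terms that occur in few documents get a larger idf, so rare terms from the feedback documents receive a higher boost than common ones with the same frequency.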

setBoost

public java.util.Vector<org.apache.lucene.search.TermQuery> setBoost(java.util.Vector<org.apache.lucene.search.QueryTermVector> docsTerms,
                                                                     float factor,
                                                                     float decayFactor)
                                                              throws java.io.IOException

Sets the boost of terms: boost = weight = factor * (tf * idf)

Parameters:: docsTerms -; factor - adjustment factor (e.g. alpha or beta)
Throws:: java.io.IOException
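This overload additionally takes a decayFactor. The page only states "decay = decay * rank", with 0 meaning no decay, so how the decay is applied to the weight is not documented here. The sketch below assumes a zero-based rank and that the weight is reduced proportionally, weight - weight * decay; both are assumptions, as is the name `DecaySketch`.

```java
// Sketch: reduce a term weight according to the rank of the document it came from.
final class DecaySketch {
    static float decayedWeight(float weight, float decayFactor, int rank) {
        // decay = decay * rank, as described for DECAY_FLD; rank 0 is the top document.
        float decay = decayFactor * rank;
        // Assumption: the decay removes a proportional share of the weight.
        return weight - weight * decay;
    }

    public static void main(String[] args) {
        for (int rank = 0; rank < 3; rank++) {
            System.out.println(rank + " -> " + decayedWeight(2.0f, 0.25f, rank));
        }
    }
}
```

Under this assumption the weight becomes negative once decayFactor * rank exceeds 1, so the decay factor would need to be small relative to the number of documents used.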

merge

private void merge(java.util.Vector<org.apache.lucene.search.TermQuery> terms)

Gets rid of duplicates by merging termQueries with equal terms

Parameters:: terms -
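A minimal sketch of duplicate merging follows. The page does not say how the boosts of equal terms are combined, so summing them is an assumption here, and the types `MergeDuplicatesSketch` and `WeightedTerm` are hypothetical stand-ins for a vector of `TermQuery` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: collapse entries with equal term text into one entry.
final class MergeDuplicatesSketch {
    static final class WeightedTerm {
        final String text;
        final float boost;

        WeightedTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    static List<WeightedTerm> merge(List<WeightedTerm> terms) {
        // LinkedHashMap keeps the order in which each term was first seen.
        Map<String, Float> summed = new LinkedHashMap<>();
        for (WeightedTerm term : terms) {
            // Assumption: boosts of equal terms are added together.
            summed.merge(term.text, term.boost, Float::sum);
        }
        List<WeightedTerm> merged = new ArrayList<>();
        for (Map.Entry<String, Float> e : summed.entrySet()) {
            merged.add(new WeightedTerm(e.getKey(), e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<WeightedTerm> merged = merge(Arrays.asList(
                new WeightedTerm("car", 1.0f),
                new WeightedTerm("auto", 0.5f),
                new WeightedTerm("car", 0.25f)));
        for (WeightedTerm term : merged) {
            System.out.println(term.text + "^" + term.boost);
        }
    }
}
```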

combine

public java.util.Vector<org.apache.lucene.search.TermQuery> combine(java.util.Vector<org.apache.lucene.search.TermQuery> queryTerms,
                                                                    java.util.Vector<org.apache.lucene.search.TermQuery> docsTerms)

Combines weights according to the expansion formula

find

public org.apache.lucene.search.TermQuery find(org.apache.lucene.search.TermQuery term,
                                               java.util.Vector<org.apache.lucene.search.TermQuery> terms)

Finds the term in terms that is equal to term

Returns:: the matching term, or null if not found

getExpandedTerms

public java.util.Vector<org.apache.lucene.search.TermQuery> getExpandedTerms()

Returns QueryExpansion.TERM_NUM_FLD expanded terms from the most recent query

Returns:: the expanded terms from the most recent query

setExpandedTerms

private void setExpandedTerms(java.util.Vector<org.apache.lucene.search.TermQuery> expandedTerms)
