IndexingFilter (apache-nutch 1.23-SNAPSHOT API)
Package
org.apache.nutch.indexer
Interface IndexingFilter
All Superinterfaces:
Configurable
Pluggable
All Known Implementing Classes:
AnchorIndexingFilter
ArbitraryIndexingFilter
BasicIndexingFilter
CCIndexingFilter
FeedIndexingFilter
GeoIPIndexingFilter
JexlIndexingFilter
LanguageIndexingFilter
LinksIndexingFilter
MetadataIndexer
MimeTypeIndexingFilter
MoreIndexingFilter
RelTagIndexingFilter
ReplaceIndexer
StaticFieldIndexer
SubcollectionIndexingFilter
TLDIndexingFilter
URLMetaIndexingFilter
public interface IndexingFilter extends Pluggable, Configurable
Extension point for indexing. Lets plugins add metadata to the indexed
fields. All plugins found that implement this extension point are run
sequentially on the parse.
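The sequential behaviour described above can be sketched as follows. This is a minimal illustration, not Nutch code: Doc, Filter and runAll are hypothetical stand-ins so that the example compiles without Nutch or Hadoop on the classpath, and the real filter method takes more arguments. Stopping the chain on a null return is an assumption based on the documented meaning of null for the filter method (the document is discarded).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: simplified stand-ins, not the Nutch classes. */
class FilterChainSketch {

  /** Hypothetical stand-in for NutchDocument: field name mapped to its values. */
  static class Doc {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
  }

  /** Hypothetical stand-in for IndexingFilter, reduced to the document and the URL. */
  interface Filter {
    Doc filter(Doc doc, String url);
  }

  /** Applies the filters in order; each filter receives the previous filter's output. */
  static Doc runAll(List<Filter> filters, Doc doc, String url) {
    for (Filter f : filters) {
      doc = f.filter(doc, url);
      if (doc == null) {
        return null; // assumption: a discarded document is not passed to later filters
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    Filter addUrl = (doc, url) -> { doc.add("url", url); return doc; };
    Filter addLength = (doc, url) -> {
      doc.add("urlLength", String.valueOf(url.length()));
      return doc;
    };
    Doc result = runAll(Arrays.asList(addUrl, addLength), new Doc(), "https://example.org/");
    System.out.println(result.fields); // both fields are present, in filter order
  }
}
```

Because every filter works on the output of the one before it, the order in which filters run can change the final set of fields.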
Field Summary
Modifier and Type      Field         Description
static final String    X_POINT_ID    The name of the extension point.
Method Summary
Modifier and Type    Method    Description
NutchDocument    filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)    Adds fields or otherwise modifies the document that will be indexed for a parse.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Details
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Details
filter
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Adds fields or otherwise modifies the document that will be indexed for a
parse. Unwanted documents can be removed from indexing by returning a null
value.
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Returns:
modified (or a new) document instance, or null (meaning the
document should be discarded)
Throws:
IndexingException - if an error occurs during filtering
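The contract of filter described above (return the modified document, return null to discard it, or throw when filtering fails) might look like this in practice. The sketch is illustrative only: SketchDocument, SketchParse and SketchIndexingException are hypothetical, simplified stand-ins for NutchDocument, Parse and IndexingException so that the code compiles on its own, and the datum and inlinks parameters are left out for brevity. The discard rule (empty parse text) and the field names are assumptions chosen for the example, not behaviour of any shipped filter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: every type here is a simplified, hypothetical stand-in. */
class FilterContractSketch {

  /** Stand-in for IndexingException; checked, like the exception in the signature. */
  static class SketchIndexingException extends Exception {
    SketchIndexingException(String message) {
      super(message);
    }
  }

  /** Stand-in for NutchDocument: one value per field, for brevity. */
  static class SketchDocument {
    final Map<String, String> fields = new LinkedHashMap<>();
  }

  /** Stand-in for Parse: carries only a title and the extracted text. */
  static class SketchParse {
    final String title;
    final String text;

    SketchParse(String title, String text) {
      this.title = title;
      this.text = text;
    }
  }

  /** Follows the shape of filter(doc, parse, url, datum, inlinks) without datum and inlinks. */
  static SketchDocument filter(SketchDocument doc, SketchParse parse, String url)
      throws SketchIndexingException {
    if (parse == null) {
      // An error during filtering is reported with an exception.
      throw new SketchIndexingException("no parse available for " + url);
    }
    if (parse.text == null || parse.text.trim().isEmpty()) {
      return null; // unwanted document: null removes it from indexing
    }
    doc.fields.put("title", parse.title); // add fields to the document
    doc.fields.put("url", url);
    return doc; // hand the modified instance back to the caller
  }

  public static void main(String[] args) throws SketchIndexingException {
    SketchDocument kept =
        filter(new SketchDocument(), new SketchParse("Home", "Welcome"), "https://example.org/");
    System.out.println(kept.fields);

    SketchDocument dropped =
        filter(new SketchDocument(), new SketchParse("Empty", "  "), "https://example.org/empty");
    System.out.println(dropped == null); // true: the document was discarded
  }
}
```

A real implementation would additionally provide setConf and getConf, inherited from Configurable, and would be registered as a plugin for this extension point.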