tika-pipes - TIKA - Apache Software Foundation
tika-pipes
Created by Tim Allison, last modified by Nicholas DiPiazza on Jun 19, 2024
Security Warning
NOTE: The tika-pipes modules in combination with tika-server open potential security vulnerabilities if you do not carefully limit access to tika-server. If the tika-pipes modules are turned on, anyone with access to your tika-server has the read and write permissions of the tika-server, and they will be able to read data and to forward the parsed results to whatever you've configured. The tika-pipes modules for tika-server are intended to be run in tightly controlled networks. DO NOT use tika-pipes if your tika-server is exposed to the internet or if you do not carefully restrict access to tika-server.
Consider adding two-way TLS encryption to your client and server, a beta version of which is available in 2.4.0: TikaServer#SSL(Beta)
Overview
The tika-pipes modules enable fetching data from various sources, running the parse, and then emitting the output to various destinations. These modules are built around the RecursiveParserWrapper output model (the -J option in tika-app and the /rmeta endpoint in tika-server-standard). Users can specify the content format (text/html/body) and set limits (number of embedded files, maximum content length) via FetchEmitTuples. Further, users can add Metadata Filters to select and modify the metadata that is extracted during the parse before emitting the output.
We need to improve how to add dependencies. Very few of the fetchers/emitters are embedded in tika-app or tika-server-standard. For now, users can download the required jars from Maven Central (for example, the S3Emitter).
I JUST WANT EXAMPLES. SHOW ME THE EXAMPLES!!!
See below (tika-app examples) for fully worked examples of using tika-app to fetch from a local file share, parse, and send the output to Solr.
Fetchers
Fetchers allow users to specify sources of InputStream+Metadata pairs for the parsing process. Fetchers are currently enabled in tika-server-standard and in the async option (-a) in tika-app. With the exception of the FileSystemFetcher, users have to add the fetcher dependencies to their class path.
FileSystemFetcher
Class name: org.apache.tika.pipes.fetcher.fs.FileSystemFetcher
A FileSystemFetcher allows the user to specify a base directory in tika-config.xml; at parse time, the user then specifies the relative path for a file. This class is included in tika-core, and no external resources are required.
For example, a minimal tika-config.xml file for a FileSystemFetcher would be:
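The config block itself did not survive this export; a minimal sketch, assuming the Tika 2.x fetcher config layout (the fetcher name fsf and the basePath value are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <fetchers>
    <!-- the name is how clients refer to this fetcher (e.g. fetcherName: fsf) -->
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
</properties>
```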
HttpFetcher
Class name:
org.apache.tika.pipes.fetcher.http.HttpFetcher
The HttpFetcher requires that this dependency be on your class path:
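The collapsed source block is missing from this export; assuming the standard Tika module naming on Maven Central (the artifactId and version shown are assumptions — match the version to your Tika release), the dependency would look like:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-fetcher-http</artifactId>
  <version>2.9.2</version>
</dependency>
```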
S3Fetcher
Class name:
org.apache.tika.pipes.fetcher.s3.S3Fetcher
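The collapsed source block is missing here as well; a sketch of an S3Fetcher configuration (the fetcher name, region, bucket, and profile values are all placeholders, and the param names are assumptions — check the S3Fetcher source for the exact set):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>my-input-bucket</bucket>
        <!-- credentials are resolved from the named AWS profile -->
        <credentialsProvider>profile</credentialsProvider>
        <profile>my-profile</profile>
      </params>
    </fetcher>
  </fetchers>
</properties>
```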
GCSFetcher
Class name:
org.apache.tika.pipes.fetcher.gcs.GCSFetcher
AZBlobFetcher
Class name:
org.apache.tika.pipes.fetcher.azblob.AZBlobFetcher
MSGraphFetcher
Class name:
org.apache.tika.pipes.fetchers.microsoftgraph.MSGraphFetcher
Introduced in:
Emitters
The FileSystemEmitter requires the tika-serialization module and is not included in tika-core. However, it is bundled with tika-app and tika-server-standard. For the other emitters, users have to add the emitter dependencies to their class path.
FileSystemEmitter
A FileSystemEmitter allows the user to specify a base directory in tika-config.xml; at parse time, the user then specifies the relative path for the emitted .json file.
For example, a minimal tika-config.xml file for a FileSystemEmitter would be:
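The example config was lost in this export; a minimal sketch, mirroring the fetcher config layout (the emitter class name shown is an assumption; the name fse matches the emitter referenced in the /pipes curl example later on this page, and the basePath is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <emitters>
    <!-- the name is how clients refer to this emitter (e.g. "emitter":"fse") -->
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/my/base/extracts</basePath>
      </params>
    </emitter>
  </emitters>
</properties>
```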
S3Emitter
OpenSearchEmitter
SolrEmitter
PipesIterators
tbd
tika-app examples
From FileShare to FileShare
Process all files in a directory recursively and place the .json extracts in a parallel directory structure.
N.B. For the logging to work correctly in the async pipes parser, you have to use Tika >= 2.1.0.
1. Place the tika-app jar and any other dependencies in a bin/ directory.
2. Unzip this file (fs-to-fs-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step.
3. Open config/tika-config-fs-to-fs.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter). Update the corresponding element in the pipesiterator section and make sure that it matches what you specified in the fetcher section.
Commandline:
java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-fs.xml
From file list on FileShare to FileShare
The input is a list of relative paths to files (e.g. file-list.txt) on a file share, and the output is .json extract files on a file share.
N.B. For the logging to work correctly in the async pipes parser, you have to use Tika >= 2.1.0.
1. Place the tika-app jar and any other dependencies in a bin/ directory.
2. Unzip this file (file-list-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step and at the same level as file-list.txt.
3. Open config/tika-config-filelist.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter).
Commandline:
java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-filelist.xml
From Fileshare to Solr
These examples were tested with Solr 8.9.0 on Ubuntu in single-core mode (not cloud). These examples require Tika >= 2.1.0.
Index embedded files in a parent-child relationship
1. Create the collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
2. Set the schema with this file (solr-parent-child-schema.json):
curl -F 'data=@solr-parent-child-schema.json'
3. Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
4. Unzip this config/ directory (solr-parent-child-config.tgz) and put it at the same level as the bin/ directory.
5. Open config/tika-config-fs-to-solr.xml and update the elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
6. Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Treat each embedded file as a separate file
1. Create the collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
2. Set the schema with this file (solr-separate-docs-schema.json):
curl -F 'data=@solr-separate-docs-schema.json'
3. Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
4. Unzip this config/ directory (solr-separate-docs-config.tgz) and put it at the same level as the bin/ directory.
5. Open config/tika-config-fs-to-solr.xml and update the elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
6. Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Legacy mode, concatenate content from embedded files
1. Create the collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
2. Set the schema with this file (solr-concatenate-schema.json):
curl -F 'data=@solr-concatenate-schema.json'
3. Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
4. Unzip this config/ directory (solr-concatenate-config.tgz) and put it at the same level as the bin/ directory.
5. Open config/tika-config-fs-to-solr.xml and update the elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
6. Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
From Fileshare to OpenSearch
The following require Tika >= 2.1.0; they will not work with the 2.0.0 release. These examples were tested with OpenSearch 1.0.0 running in Docker on an Ubuntu host.
Index embedded files in a parent-child relationship
This option requires specification of the parent-child relationship in the mappings file. The parent is currently hardcoded to be container, and the embedded files are embedded. The OpenSearch emitter flattens relationships so that if there are deeply recursively embedded files, all embedded files are children of the single container/parent file; recursive relationships are not captured in the OpenSearch join relation. However, the embedded path is stored in the X-TIKA:embedded_resource_path metadata value, and the recursive relations can be reconstructed from that path.
1. Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
2. Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory.
3. Open config/tika-config-fs-to-opensearch.xml and update the elements in BOTH the fetcher and the pipesiterator to point to the directory that you want to index.
4. Curl these mappings (opensearch-parent-child-mappings.json) to OpenSearch:
curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type:application/json"
5. Run tika-app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Treat each embedded file as a separate file
1. Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
2. Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory.
3. Open config/tika-config-fs-to-opensearch.xml and update the elements in the fetcher and the pipesiterator to point to the directory that you want to index.
4. Curl these mappings (opensearch-mappings.json) to OpenSearch:
curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type: application/json"
5. Run tika-app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Legacy mode, concatenate content from embedded files
This emulates the legacy output from tika-app and the /tika endpoint in tika-server-standard. Note that this option hides exceptions from embedded files and metadata from embedded files. The key difference between this config and the "treat each embedded file as a separate file" config is the parseMode element in the pipesIterator.
1. Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
2. Unzip this file (opensearch-concatenate-config.tgz) and place the config/ directory at the same level as the bin/ directory.
3. Open config/tika-config-fs-to-opensearch.xml and update the elements in the fetcher and the pipesiterator to point to the directory that you want to index.
4. Curl these mappings (opensearch-mappings.json) to OpenSearch:
curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type:application/json"
5. Run tika-app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
tika-server
Fetchers in the classic tika-server endpoints
For the classic tika-server endpoints (/rmeta, /tika, /unpack, /meta), users specify fetcherName and fetchKey in the headers. This replaces enableFileUrl from tika-1.x. Note that enableUnsecureFeatures must still be set via the tika-config.xml:
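The config block referenced above is missing from this export; a sketch, assuming the tika-server 2.x server config section (the fetcher name and basePath are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <server>
    <!-- required for fetchers to be usable from the classic endpoints -->
    <enableUnsecureFeatures>true</enableUnsecureFeatures>
  </server>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
</properties>
```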
To parse /my/base/path1/path2/myfile.pdf:
curl -X PUT http://localhost:9998/tika --header "fetcherName: fsf" --header "fetchKey: path2/myfile.pdf"
If your file path has non-ASCII characters, you should specify the fetcherName and the fetchKey as query parameters in the request instead of in the headers:
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=中文.txt'
or, percent-encoded:
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=%E4%B8%AD%E6%96%87.txt'
The /pipes endpoint
This endpoint requires that at least one fetcher and one emitter be specified in the config file and that enableUnsecureFeatures be set to true. In the following example, we have source documents in /my/base/path1, and we want to write extracts to /my/base/extracts. Unlike with the classic endpoints, users send a JSON FetchEmitTuple to tika-server. For full documentation of this object see: FetchEmitTuple
To parse /my/base/path1/path2/myfile.pdf:
curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}' http://localhost:9998/pipes
Note: by default, the FileSystemEmitter automatically adds ".json" to the end of the emitKey.
The /async endpoint
The only difference in the /async handler is that you send a list of FetchEmitTuples:
curl -X POST -H "Content-Type: application/json" -d '[{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}]' http://localhost:9998/async
Modifying Docker to use the pipes modules
For examples of how to load the pipes modules with Docker, see: tika-pipes and Docker