Gary Sieling

Change Tika output format to text

Tika supports multiple output formats. The default is XHTML, which seems like an odd choice.

You can change it to text like so:

java -jar tika-app.jar -t

There are options for several other output formats. You’ll also need to decide whether you want the metadata from Office documents (-m, -j or -y) or only the content.

usage: java -jar tika-app.jar [option...] [file|port...]
    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachements into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                           whitespace, for better readability
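
As a rough sketch of how these flags might be used from a script, the helper below maps a format name to the matching option from the usage text and runs Tika on a single file. The names tika_flag, tika_convert and the TIKA_JAR variable are my own, made up for illustration; they are not part of Tika. It assumes java is on your PATH and that the jar lives at the path in TIKA_JAR.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tika. The jar location is an assumption;
# override it by setting TIKA_JAR before sourcing this file.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Map a friendly format name to the flag listed in the usage text above.
tika_flag() {
    case "$1" in
        xhtml)     echo "-x" ;;
        html)      echo "-h" ;;
        text)      echo "-t" ;;
        text-main) echo "-T" ;;
        metadata)  echo "-m" ;;
        json)      echo "-j" ;;
        *)         echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Run Tika on one file with the chosen format; the result goes to stdout.
tika_convert() {
    flag=$(tika_flag "$1") || return 1
    java -jar "$TIKA_JAR" "$flag" "$2"
}
```

With that sourced into your shell, something like tika_convert text report.docx > report.txt should write the plain text to a file, and tika_convert metadata report.docx should print only the metadata (report.docx is a placeholder file name).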