Tika supports multiple output formats. The default is XHTML, which seems like an odd choice.
You can switch to plain text with the -t option:
java -jar tika-app.jar -t [file]
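The same pattern works for the other output flags. As a minimal sketch, a small wrapper can map a readable format name to the matching flag from Tika's usage text. The names `TIKA_JAR`, `tika_flag` and `tika_extract` are hypothetical helpers for illustration, not part of Tika, and the jar is assumed to sit in the current directory unless `TIKA_JAR` says otherwise.

```shell
#!/bin/sh
# Hypothetical wrapper around the Tika command line; not part of Tika itself.
# Assumption: the jar is in the current directory unless TIKA_JAR is set.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Map a readable format name to the flag listed in Tika's usage text.
tika_flag() {
  case "$1" in
    xhtml)    echo "-x" ;;
    html)     echo "-h" ;;
    text)     echo "-t" ;;
    metadata) echo "-m" ;;
    json)     echo "-j" ;;
    *)        echo "unknown format: $1" >&2; return 1 ;;
  esac
}

# Run Tika on one file with the chosen format; output goes to stdout.
tika_extract() {
  flag=$(tika_flag "$1") || return 1
  java -jar "$TIKA_JAR" "$flag" "$2"
}
```

With the functions loaded, something like `tika_extract text report.docx > report.txt` would write the plain text to a file (`report.docx` is a placeholder name).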
There are options for several output formats, so you'll need to decide what you actually want, for example whether you need the metadata from Office documents or only the content. The full usage text:
usage: java -jar tika-app.jar [option...] [file|port...]

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachments into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                           whitespace, for better readability
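The -z and --extract-dir options from the list above can be combined so that attachments do not pile up in the current directory. The following is a rough sketch under a few assumptions: `TIKA_JAR`, `attachments_dir` and `extract_attachments` are hypothetical names, and the one-directory-per-document layout is just one possible convention.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of Tika.
# Assumption: the jar is in the current directory unless TIKA_JAR is set.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Build a per-document target directory: <root>/<file name without extension>
attachments_dir() {
  base=$(basename "$1")
  echo "$2/${base%.*}"
}

# Extract the attachments of one document into its own directory,
# using the -z and --extract-dir options from the usage text.
extract_attachments() {
  dir=$(attachments_dir "$1" "$2")
  mkdir -p "$dir" || return 1
  java -jar "$TIKA_JAR" -z --extract-dir="$dir" "$1"
}
```

For example, `extract_attachments mail/invoice.msg out` would ask Tika to place the attachments under `out/invoice` (both paths are placeholders).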