File Formats Supported by Tika
The following table shows the file formats Tika supports.
File format |
Package Library |
Class in Tika |
XML |
org.apache.tika.parser.xml |
XMLParser |
HTML |
org.apache.tika.parser.html and it uses Tagsoup Library |
HtmlParser |
MS-Office compound document Ole2 till 2007 ooxml 2007 onwards |
org.apache.tika.parser.microsoft
org.apache.tika.parser.microsoft.ooxml and it uses Apache Poi library |
OfficeParser(ole2)
OOXMLParser(ooxml) |
OpenDocument Format openoffice |
org.apache.tika.parser.odf |
OpenOfficeParser |
portable Document Format(PDF) |
org.apache.tika.parser.pdf and this package uses Apache PdfBox library |
PDFParser |
Electronic Publication Format (digital books) |
org.apache.tika.parser.epub |
EpubParser |
Rich Text format |
org.apache.tika.parser.rtf |
RTFParser |
Compression and packaging formats |
org.apache.tika.parser.pkg and this package uses Common compress library |
PackageParser and CompressorParser and its sub-classes |
Text format |
org.apache.tika.parser.txt |
TXTParser |
Feed and syndication formats |
org.apache.tika.parser.feed |
FeedParser |
Audio formats |
org.apache.tika.parser.audio and org.apache.tika.parser.mp3 |
AudioParser MidiParser Mp3- for mp3parser |
Imageparsers |
org.apache.tika.parser.jpeg |
JpegParser-for jpeg images |
Videoformats |
org.apache.tika.parser.mp4 and org.apache.tika.parser.video this
parser internally uses Simple Algorithm to parse flash video formats |
Mp4parser FlvParser |
java class files and jar files |
org.apache.tika.parser.asm |
ClassParser CompressorParser |
Mobxformat (email messages) |
org.apache.tika.parser.mbox |
MobXParser |
Cad formats |
org.apache.tika.parser.dwg |
DWGParser |
FontFormats |
org.apache.tika.parser.font |
TrueTypeParser |
executable programs and libraries |
org.apache.tika.parser.executable |
ExecutableParser |
No comments:
Post a Comment