Tika Document Type Detection
Document type detection is the process of identifying the type of a document. Each type is named by a media type: for example, text/plain represents a plain text file and image/jpeg represents a JPEG image file.
Tika detects the document type so that it can call the appropriate parser to extract content and metadata.
Tika describes document types using MIME (Multipurpose Internet Mail Extensions) media types.
The Internet Assigned Numbers Authority (IANA) maintains the registry of official top-level types and thousands of subtypes.
The following eight top-level media types are the ones most relevant here.
Top-level type | Description |
---|---|
text/* | Text-based documents such as HTML, CSS, CSV and plain text. |
image/* | Image subtypes such as JPEG, Portable Network Graphics (PNG) and GIF. |
audio/* | Music and other audio formats such as MP3 and Ogg audio. |
video/* | Video formats such as QuickTime and MP4. |
model/* | File formats for expressing physical or behavioral models in various domains, for example the VRML format used to express 3D models. |
application/* | Application-specific document formats that don’t necessarily fit any of the other top-level categories, for example PDF and Microsoft Word (application/msword) documents. |
message/* | Email and other message types sent over the internet and other networks. |
multipart/* | Container formats for related component documents. Like message/* types, multipart/* documents are typically messages transmitted over the network. |
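The type/subtype structure used in the table can be illustrated with a few lines of plain Java. This is only a sketch of how a media type string breaks down into its parts; the class and method names (MediaTypeParts, topLevelType, subtype) are made up for illustration and are not part of Tika.

```java
import java.util.Locale;

public class MediaTypeParts {

    // Returns the top-level type, e.g. "image" for "image/jpeg".
    public static String topLevelType(String mediaType) {
        return baseType(mediaType).split("/", 2)[0];
    }

    // Returns the subtype, e.g. "jpeg" for "image/jpeg".
    public static String subtype(String mediaType) {
        String[] parts = baseType(mediaType).split("/", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mediaType);
        }
        return parts[1];
    }

    // Drops any "; name=value" parameters and normalizes to lower case,
    // since media type names are case-insensitive.
    private static String baseType(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon < 0 ? mediaType : mediaType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] samples = {"text/plain", "image/jpeg", "application/msword"};
        for (String sample : samples) {
            System.out.println(sample + " -> top-level: " + topLevelType(sample)
                    + ", subtype: " + subtype(sample));
        }
    }
}
```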
Media Types in Tika
Media types identify the format of a file. They tell the computer which application to associate with which file.
Detecting media types accurately is one of Tika's core tasks.
Tika provides a Java API and class-level support for interacting with its MIME database.
Tika has its own media type registry that stores IANA-registered types and other known types that are being used in practice.
Tika uses the MediaType class to represent media types. Instances of this class are immutable and contain only the media type’s type/subtype pair and optional name=value parameters.
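As a rough illustration of the MediaType class, the sketch below parses a media type string and reads back its parts. It assumes tika-core is on the classpath; the class name MediaTypeExample and the sample string are chosen only for this example.

```java
import java.util.Map;

import org.apache.tika.mime.MediaType;

public class MediaTypeExample {
    public static void main(String[] args) {
        // Parse a type/subtype pair with one optional name=value parameter.
        MediaType html = MediaType.parse("text/html; charset=UTF-8");

        System.out.println("Type       : " + html.getType());
        System.out.println("Subtype    : " + html.getSubtype());

        // Parameters are exposed as a read-only map.
        Map<String, String> parameters = html.getParameters();
        System.out.println("Parameters : " + parameters);

        // getBaseType() returns a new instance without the parameters;
        // the original object is left unchanged because instances are immutable.
        System.out.println("Base type  : " + html.getBaseType());

        // Instances compare by value, so a parsed type equals the predefined constant.
        System.out.println("Is text/plain? "
                + MediaType.parse("text/plain").equals(MediaType.TEXT_PLAIN));
    }
}
```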
The following table lists some commonly used file extensions and their media types.
Extension | File Format | Media Type |
---|---|---|
.txt | Text document | text/plain |
.html | HTML page | text/html |
.xls | Microsoft Excel spreadsheet | application/vnd.ms-excel |
.jpg | JPEG image | image/jpeg |
.mp3 | MP3 audio | audio/mpeg |
.zip | Zip archive | application/zip |
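The mappings in the table can be checked with Tika's name-based detection. The sketch below passes file names to the detect(String) method of the Tika facade class, which looks only at the name and does not open any file. It assumes tika-core is on the classpath, and the file names are hypothetical.

```java
import org.apache.tika.Tika;

public class ExtensionLookup {
    public static void main(String[] args) {
        Tika tika = new Tika();

        // Hypothetical file names, one per extension in the table above.
        String[] names = {
            "notes.txt", "index.html", "budget.xls",
            "photo.jpg", "song.mp3", "backup.zip"
        };

        for (String name : names) {
            // Detection here relies on the file name extension only.
            System.out.println(name + " -> " + tika.detect(name));
        }
    }
}
```

Name-based detection is fast but trusts the extension, so a misnamed file is reported with the wrong type. Detecting from the file content is more reliable when the file itself is available.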
The Tika facade class provides a detect() method that returns the detected document type as a media type string. See an example.
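A minimal sketch of such an example is given here, assuming tika-core is on the classpath. The class name DetectFileType, the helper detectType and the temporary sample file are assumptions made for illustration; any existing text file could be used instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

public class DetectFileType {

    // Detects the media type of the given file using its content and name.
    public static String detectType(Path file) throws IOException {
        Tika tika = new Tika();
        return tika.detect(file.toFile());
    }

    public static void main(String[] args) throws IOException {
        // Create a small plain text file to detect (illustrative only).
        Path sample = Files.createTempFile("sample", ".txt");
        try {
            Files.write(sample, "Hello, Tika!\n".getBytes(StandardCharsets.UTF_8));
            System.out.println("File type : " + detectType(sample));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```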
Output:
File type : text/plain