Tika Extracting PDF File

by Online Tutorials Library

To extract content from a PDF file, Tika uses PDFParser, a class that extracts both content and metadata from a PDF document. The class is located in the org.apache.tika.parser.pdf package.

Its constructor and methods are listed in the tables below.

Tika PDFParser Constructor

| Constructor | Description |
| --- | --- |
| public PDFParser() | Creates an instance of this class. |

Tika PDFParser Methods

| Method | Description |
| --- | --- |
| public Set<MediaType> getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context. |
| public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | Parses a document stream into a sequence of XHTML SAX events. |
| public PDFParserConfig getPDFParserConfig() | Returns the PDFParserConfig used by this parser. |
| public void setPDFParserConfig(PDFParserConfig config) | Sets the PDFParserConfig used by this parser. |
| public void setEnableAutoSpace(boolean v) | If true, the parser estimates where spaces should be inserted between words. |
| public boolean getExtractAnnotationText() | Returns whether text in annotations is extracted. |
| public void setExtractAnnotationText(boolean v) | If true (the default), text in annotations is extracted. |
| public void setSuppressDuplicateOverlappingText(boolean v) | If true, the parser tries to remove duplicated text that overlaps the same region. |
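The setters in the table can be combined before parsing. The following is a minimal sketch, assuming the setters are available on PDFParser exactly as listed above; the class name PdfParserSetup and the helper newConfiguredParser are hypothetical names used only for illustration. Some Tika releases steer callers toward PDFParserConfig for these options, so check the Javadoc of the version in use.

```java
import org.apache.tika.parser.pdf.PDFParser;

// Hypothetical helper class, not part of Tika.
public class PdfParserSetup {

    /** Builds a PDFParser with the text-related options from the table switched on. */
    static PDFParser newConfiguredParser() {
        PDFParser parser = new PDFParser();
        // Ask the parser to estimate where spaces belong between words.
        parser.setEnableAutoSpace(true);
        // Include text found in annotations (described above as the default).
        parser.setExtractAnnotationText(true);
        // Try to drop text that is drawn more than once over the same region.
        parser.setSuppressDuplicateOverlappingText(true);
        return parser;
    }

    public static void main(String[] args) {
        PDFParser parser = newConfiguredParser();
        System.out.println("Configured parser: " + parser.getClass().getName());
    }
}
```

The configured parser is then used like any other PDFParser instance, by calling its parse method.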

Tika Extracting PDF File Example

In the following example, we extract the content and metadata from a PDF file.
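A minimal sketch of such a program is given here, assuming Tika and its PDF parser module are on the classpath. The class name TikaPdfExample, the helper extractText, and the default file name sample.pdf are placeholders, not names from the original tutorial. It uses BodyContentHandler from org.apache.tika.sax to collect the body text.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class; names and file path are placeholders.
public class TikaPdfExample {

    /** Parses a PDF stream, returns its body text, and fills the given Metadata object. */
    static String extractText(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        // BodyContentHandler collects the text of the XHTML body emitted by the parser.
        // Its default constructor caps the output size; pass -1 to lift the cap.
        BodyContentHandler handler = new BodyContentHandler();
        PDFParser parser = new PDFParser();
        parser.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        // Placeholder path: pass a real PDF as the first argument.
        String path = args.length > 0 ? args[0] : "sample.pdf";
        Metadata metadata = new Metadata();

        try (InputStream in = new FileInputStream(path)) {
            System.out.println("Document Content:");
            System.out.println(extractText(in, metadata));
        }

        System.out.println("Document Metadata:");
        for (String name : metadata.names()) {
            System.out.println(name + ":   " + metadata.get(name));
        }
    }
}
```

The metadata keys that are printed depend on the PDF and on the Tika version, so the exact names can differ from one run to another.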

Output:

Document Content:
Welcome to the tutoraspire.
tutoraspire is a Technical portal that contains latest computer science topics.

Document Metadata:
pdf:PDFVersion:   1.4
xmp:CreatorTool:   Online2PDF.com
access_permission:modify_annotations:   true
access_permission:can_print_degraded:   true
meta:creation-date:   2018-05-05T11:25:40Z
created:   Sat May 05 16:55:40 IST 2018
access_permission:extract_for_accessibility:   true
access_permission:assemble_document:   true
xmpTPg:NPages:   1
Creation-Date:   2018-05-05T11:25:40Z
dcterms:created:   2018-05-05T11:25:40Z
dc:format:   application/pdf; version=1.4
access_permission:extract_content:   true
access_permission:can_print:   true
pdf:docinfo:creator_tool:   Online2PDF.com
access_permission:fill_in_form:   true
pdf:encrypted:   false
producer:   Online2PDF.com
access_permission:can_modify:   true
pdf:docinfo:producer:   Online2PDF.com
pdf:docinfo:created:   2018-05-05T11:25:40Z
Content-Type:   application/pdf