
Tika Extracting HTML File

by Online Tutorials Library

Tika Html File Extraction

To extract the content of an HTML file, Tika uses HtmlParser, a class that extracts both the content and the metadata of an HTML document. The class is located in the org.apache.tika.parser.html package. Its constructors and methods are listed in the tables below.

Tika HtmlParser Constructor

| Constructor | Description |
| --- | --- |
| public HtmlParser() | Creates an instance of the class. |
| public HtmlParser(EncodingDetector encodingDetector) | Creates an instance of HtmlParser that uses the given EncodingDetector instance. |

Tika HtmlParser Methods

| Method | Description |
| --- | --- |
| public Set<MediaType> getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context. |
| public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | Parses a document stream into a sequence of XHTML SAX events. |
| protected String mapSafeElement(String name) | Maps safe HTML element names to their semantic XHTML equivalents. |
| protected boolean isDiscardElement(String name) | Checks whether all content within the given HTML element should be discarded instead of being included in the parse output. |
| public String mapSafeAttribute(String elementName, String attributeName) | Maps safe HTML attribute names to their XHTML equivalents. The Tika Javadoc recommends using the HtmlMapper mechanism to customize the HTML mapping instead. |
| @Field public void setExtractScripts(boolean extractScripts) | Sets whether or not to extract the contents of script elements. |
| public boolean getExtractScripts() | Returns whether extraction of script contents is enabled. |
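As a rough illustration of two of the methods listed above, the sketch below creates a parser, asks it which media types it supports, and switches on script extraction. It assumes a Tika release in which setExtractScripts is available and that the Tika parser classes are on the classpath. The class name HtmlParserSettingsSketch and its two helper methods are hypothetical names chosen for this illustration, not part of Tika.

```java
import java.util.Set;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;

// Hypothetical helper class, used only for this illustration.
class HtmlParserSettingsSketch {

    // Creates a parser with the default constructor and sets the script flag.
    static HtmlParser configuredParser(boolean extractScripts) {
        HtmlParser parser = new HtmlParser();
        parser.setExtractScripts(extractScripts);
        return parser;
    }

    // Returns the media types the parser reports for an empty parse context.
    static Set<MediaType> supportedTypes(HtmlParser parser) {
        return parser.getSupportedTypes(new ParseContext());
    }

    public static void main(String[] args) {
        HtmlParser parser = configuredParser(true);
        // Print each supported media type on its own line.
        for (MediaType type : supportedTypes(parser)) {
            System.out.println(type);
        }
    }
}
```

The configured parser can then be passed a stream through parse(...) in the usual way.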

Tika HTML File Extraction Example

In this example, we extract the content and the metadata of an HTML file.
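A minimal sketch of such an extraction might look like the following. It assumes the Tika parser classes are on the classpath. To stay self-contained it parses an inline HTML string standing in for the HTML file; the class name HtmlExtractionSketch, the extract helper, and the sample markup are assumptions made for this illustration. The metadata keys that Tika reports can differ between Tika versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
class HtmlExtractionSketch {

    // Parses the HTML bytes, fills `metadata`, and returns the body text.
    static String extract(byte[] html, Metadata metadata)
            throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = new ByteArrayInputStream(html)) {
            new HtmlParser().parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Inline stand-in for the HTML file; a FileInputStream would be
        // used in the same way when reading from disk.
        String html = "<html><head><title>Index Page</title></head>"
                + "<body><p>Hello, Welcome to Tutor Aspire.</p></body></html>";

        Metadata metadata = new Metadata();
        String content = extract(html.getBytes(StandardCharsets.ISO_8859_1), metadata);

        System.out.println("Document Content: " + content);
        System.out.println("Document Metadata:");
        // Print every metadata entry the parser recorded.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

BodyContentHandler collects only the text of the document body, which is why the title shows up in the metadata rather than in the content.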

Output:

Document Content:
Hello, Welcome to Tutor Aspire.

Document Metadata:
dc:title: Index Page
Content-Encoding: ISO-8859-1
title: Index Page
Content-Type: text/html; charset=ISO-8859-1
