Tika Extracting HTML File

by Online Tutorials Library July 14, 2022

Tika HTML File Extraction

To extract the content of an HTML file, Tika uses HtmlParser, a class in the org.apache.tika.parser.html package that extracts both the text content and the metadata of an HTML document. Its constructors and methods are listed in the tables below.

Tika HtmlParser Constructor

Constructor	Description
public HtmlParser()	Creates an instance of the class.
public HtmlParser(EncodingDetector encodingDetector)	Creates an HtmlParser instance that uses the given EncodingDetector instance.

Tika HtmlParser Methods

Method	Description
public Set<MediaType> getSupportedTypes(ParseContext context)	Returns the set of media types supported by this parser when used with the given parse context.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException	Parses a document stream into a sequence of XHTML SAX events, which are delivered to the given ContentHandler.
protected String mapSafeElement(String name)	Maps safe HTML element names to their semantic XHTML equivalents.
protected boolean isDiscardElement(String name)	Checks whether all content within the given HTML element should be discarded instead of being included in the parse output.
public String mapSafeAttribute(String elementName, String attributeName)	Maps a safe HTML attribute name for the given element to its XHTML equivalent. The mapping itself is customized through the HtmlMapper mechanism.
@Field public void setExtractScripts(boolean extractScripts)	Sets whether the contents of script elements are extracted.
public boolean getExtractScripts()	Returns whether extraction of script contents is enabled.
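
The parse method above reports the document as SAX events to a ContentHandler. As a rough illustration of what such a handler sees, the sketch below uses only the JDK's own SAX parser on a small, well-formed XHTML string; it does not use Tika. The class name SaxBodyTextSketch, the nested BodyTextHandler, and the helper extractBodyText are hypothetical names chosen for this example. The handler keeps only the text found inside the body element, which is similar in spirit to what BodyContentHandler does in the example further down.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a JDK SAX parser stands in for the event stream
// that a parser such as HtmlParser would send to a ContentHandler.
class SaxBodyTextSketch {

    // Collects character data that appears inside <body> and ignores <head>.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString().trim();
        }
    }

    // Hypothetical helper: parses well-formed XHTML and returns the body text.
    static String extractBodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Index Page</title></head>"
                + "<body><p>Hello, Welcome to Tutor Aspire.</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The title text is skipped because it arrives before the body start event. Real-world HTML is often not well-formed XML, which is one reason to rely on a dedicated parser such as HtmlParser rather than a plain XML SAX parser.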

Tika HTML File Extraction Example

The following example extracts the content and metadata of an HTML file named index.html, loaded as a classpath resource.

  package tikaexample;

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.tika.exception.TikaException;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.html.HtmlParser;
  import org.apache.tika.sax.BodyContentHandler;
  import org.xml.sax.SAXException;

  public class HtmlParse {
      public static void main(final String[] args) throws IOException, SAXException, TikaException {
          // Handler that collects the text of the document body
          BodyContentHandler handler = new BodyContentHandler();
          HtmlParser parser = new HtmlParser();
          Metadata metadata = new Metadata();
          ParseContext pcontext = new ParseContext();

          // index.html is resolved relative to this class on the classpath
          try (InputStream stream = HtmlParse.class.getResourceAsStream("index.html")) {
              parser.parse(stream, handler, metadata, pcontext);
          }

          System.out.println("Document Content:" + handler.toString());
          System.out.println("Document Metadata:");

          // Print every metadata name with its value
          String[] metadatas = metadata.names();
          for (String meta : metadatas) {
              System.out.println(meta + ":   " + metadata.get(meta));
          }
      }
  }

Output:

Document Content:
Hello, Welcome to Tutor Aspire.

Document Metadata:
dc:title:   Index Page
Content-Encoding:   ISO-8859-1
title:   Index Page
Content-Type:   text/html; charset=ISO-8859-1
