Tika Parsing Document to XHTML

by Online Tutorials Library

Tika provides the ToXMLContentHandler class to produce output in XHTML format. The handler serializes the SAX events emitted by a parser, and its toString() method returns the XHTML content of the whole document as a string.

This class contains the following constructors and methods.

Tika ToXMLContentHandler Constructors

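The following are the constructors of the ToXMLContentHandler class.

| Constructor | Description |
| --- | --- |
| public ToXMLContentHandler() | Creates a handler that collects the output in an internal buffer, which can be read with toString(). |
| public ToXMLContentHandler(String encoding) | Creates a handler that also records the given encoding name, which is written in the XML declaration. |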

Tika ToXMLContentHandler Methods

The following are the methods of the ToXMLContentHandler class.

| Method | Description |
| --- | --- |
| public void characters(char[] ch, int start, int length) throws SAXException | Writes the given characters to the output, escaping XML special characters. |
| protected void write(char ch) throws SAXException | Writes the given character as-is. |
| protected void write(String string) throws SAXException | Writes the given string of characters as-is. |
| public void startDocument() throws SAXException | Writes the XML prefix (the XML declaration) when an encoding was given. |
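
To make the role of these callbacks more concrete, here is a minimal sketch that uses only the JDK. The class name MiniXmlHandler, its fields, and its escaping rules are assumptions made for illustration. It is not Tika's implementation, only a simplified stand-in with the same method shapes, showing how startDocument(), characters() and write() can cooperate to build the output string.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for illustration only; not Tika's ToXMLContentHandler.
public class MiniXmlHandler extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    private final String encoding; // null means no XML declaration is written

    public MiniXmlHandler() {
        this(null);
    }

    public MiniXmlHandler(String encoding) {
        this.encoding = encoding;
    }

    // Writes the XML prefix (declaration) when an encoding name was supplied.
    @Override
    public void startDocument() throws SAXException {
        if (encoding != null) {
            write("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n");
        }
    }

    // Writes character data, escaping the characters that are special in XML.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (int i = start; i < start + length; i++) {
            switch (ch[i]) {
                case '<': write("&lt;"); break;
                case '>': write("&gt;"); break;
                case '&': write("&amp;"); break;
                default: write(ch[i]);
            }
        }
    }

    // Writes a single character as-is.
    protected void write(char ch) throws SAXException {
        out.append(ch);
    }

    // Writes a string as-is.
    protected void write(String string) throws SAXException {
        out.append(string);
    }

    // Returns everything written so far.
    @Override
    public String toString() {
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        MiniXmlHandler handler = new MiniXmlHandler("UTF-8");
        handler.startDocument();
        char[] text = "a < b & c".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endDocument();
        System.out.println(handler);
    }
}
```

In this sketch the caller drives the callbacks by hand. In a real Tika program the parser invokes them while it reads the document.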

Tika Parsing Document to XHTML Example

This example produces output in XHTML format from a plain-text input file.
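
A minimal sketch of such a program might look like the following. It assumes the Tika core and parser libraries are on the classpath and that a file named hello.txt exists in the working directory. The class name TikaXhtmlExample and the helper method toXhtml are illustrative names, and AutoDetectParser is one reasonable choice of parser rather than the only one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.SAXException;

// Illustrative class name; assumes Apache Tika is available on the classpath.
public class TikaXhtmlExample {

    // Parses the stream with an auto-detecting parser and returns the XHTML as a string.
    public static String toXhtml(InputStream input)
            throws IOException, SAXException, TikaException {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // The parser sends SAX events to the handler, which serializes them as XHTML.
        parser.parse(input, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Use the first argument as the input path, or fall back to hello.txt.
        Path path = Paths.get(args.length > 0 ? args[0] : "hello.txt");
        if (!Files.exists(path)) {
            System.err.println("Input file not found: " + path);
            return;
        }
        try (InputStream input = Files.newInputStream(path)) {
            System.out.println(toXhtml(input));
        }
    }
}
```

Keeping the parsing logic in a separate method lets the same code handle any InputStream, not only a file on disk.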

Input:

The following is the content of the hello.txt file.

Hello Welcome to Tutor Aspire

Output:

After extraction, Tika produces the following output in XHTML format.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
<meta name="Content-Encoding" content="ISO-8859-1" />
<meta name="Content-Type" content="text/plain; charset=ISO-8859-1" />
<title></title>
</head>
<body><p>Hello Welcome to Tutor Aspire</p>
</body></html>