Tika Text File Extraction

by Online Tutorials Library July 14, 2022

To extract text from a plain-text file, Tika provides the TXTParser class. This class extracts both the content and the metadata of a text file. It is located in the org.apache.tika.parser.txt package.

The class provides the constructors and methods listed in the tables below.

Tika TXTParser Constructors

Constructor	Description
public TXTParser()	Creates an instance of the class.
public TXTParser(EncodingDetector encodingDetector)	Creates an instance that uses the given encoding detector.

Tika TXTParser Methods

Method	Description
public Set<MediaType> getSupportedTypes(ParseContext context)	It returns the set of media types supported by this parser.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException	It parses a document stream into a sequence of XHTML SAX events.
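
The parse method does not return a string; it reports the document to a ContentHandler as SAX events. The following JDK-only sketch shows how a handler can accumulate character events into plain text, which is roughly the role a text-collecting handler plays. The class SaxTextCollector and its collect method are hypothetical names for illustration; this is not Tika's implementation, and the events are simulated by hand rather than produced by a real parser.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-collecting handler; not part of Tika.
class SaxTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called for every run of character data the parser reports.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Simulates the events a parser might emit for a single paragraph
    // and returns the text the handler collected.
    static String collect(String paragraph) {
        SaxTextCollector handler = new SaxTextCollector();
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] chars = paragraph.toCharArray();
        try {
            handler.startDocument();
            handler.startElement(xhtml, "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement(xhtml, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("Welcome to the tutoraspire."));
    }
}
```

Element events such as startElement are ignored here, so only the character data ends up in the collected text.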

Tika Text File Extraction Example

In this example, we extract the content and metadata from a text file.

package tikaexample;

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

public class TextFileExtraction {
    public static void main(String[] args) {
        BodyContentHandler handler = new BodyContentHandler();
        TXTParser parser = new TXTParser();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // tutoraspire.txt is loaded from the classpath, relative to this class
        try (InputStream stream = TextFileExtraction.class.getResourceAsStream("tutoraspire.txt")) {
            parser.parse(stream, handler, metadata, pcontext);
            System.out.println("Document Content:" + handler.toString());
            System.out.println("Document Metadata:");
            // Print every metadata name together with its value
            String[] metadatas = metadata.names();
            for (String data : metadatas) {
                System.out.println(data + ":   " + metadata.get(data));
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

//tutoraspire.txt

Our text file content.

Welcome to the tutoraspire.

tutoraspire is a Technical portal that contains latest computer science topics.

Output:

Document Content:Welcome to the tutoraspire.
tutoraspire is a Technical portal that contains latest computer science topics.

Document Metadata:
Content-Encoding:   ISO-8859-1
Content-Type:   text/plain; charset=ISO-8859-1
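
The Content-Type value in the output carries the detected charset as a parameter. If later code needs that charset, for example to re-read the file, it can be pulled out of the metadata string. The following is a minimal JDK-only sketch; the helper charsetOf is a hypothetical name, and the parsing is deliberately simple and assumes an unquoted charset value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the given fallback when no charset parameter is present.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                return Charset.forName(p.substring(8).trim());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/plain; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name()); // prints ISO-8859-1
    }
}
```

Charset.forName throws an unchecked exception for an unknown or malformed name, so callers handling untrusted values may want to catch it.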
