Class ExtractStructuredText

java.lang.Object
org.jpedal.examples.text.ExtractStructuredText

public class ExtractStructuredText extends Object

Extract Structured Content (if present) from PDF files


This class provides a simple Java API for extracting Structured Content (if present) from a PDF file. It also provides static convenience methods for dumping any structured outlines from a single PDF file or from a directory containing PDF files.
If no structure is present, a blank file is returned.

For non-structured files, see our Support Pages for more information on Text Extraction.
  • Constructor Details

    • ExtractStructuredText

      public ExtractStructuredText(String fileName)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray)
      Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[]
      Parameters:
      byteArray - byte array containing the PDF data
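
A minimal sketch of preparing input for the byte[] constructor, using only the standard library. The class name `PdfBytesLoader` and both helper methods are hypothetical and not part of the API documented here. The check relies on the fact that a PDF file begins with the `%PDF-` marker; it is deliberately strict and will reject files with leading junk bytes that some readers tolerate.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the documented API.
class PdfBytesLoader {

    // Reads the whole file into memory so it can be handed over as a byte[].
    static byte[] load(Path pdf) {
        try {
            return Files.readAllBytes(pdf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true when the data starts with the "%PDF-" header marker.
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = load(Paths.get(args[0]));
        if (looksLikePdf(data)) {
            // With the library on the classpath, the bytes could now be passed
            // to the documented constructor: new ExtractStructuredText(data)
            System.out.println("Read " + data.length + " bytes of PDF data");
        } else {
            System.out.println("Input does not start with a PDF header");
        }
    }
}
```

Validating the header before constructing the instance keeps obviously wrong input (for example an HTML error page stored in the database column) from reaching the PDF parser.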
  • Method Details

    • main

      public static void main(String[] args)
      Extracts any Structured Text data from a single PDF file or a directory of PDF files via the command line.
      The example expects two or three parameters:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write out the outline data
      Parameters:
      args - The expected arguments are described above.
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
      Convenience method to write out any Structured Text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out the structured text output
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
      Convenience method to write out any Structured Text found in a directory of PDF files, reporting problems to a custom error tracker
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out the structured text output
      errorTracker - a custom error tracker
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
      Convenience method to write out any Structured Text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out the structured text output
      Throws:
      PdfException - a PDF exception
    • getStructuredTextContent

      public Document getStructuredTextContent()
      Gets any Structured Text in the PDF file (if present) as a Document
      If the PDF file does not contain the metadata for Structured Content, an empty Document is returned
      Returns:
      the Document holding the Structured Text, or an empty Document if none is present
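
The sketch below shows one way the returned Document might be inspected and serialized. It assumes the return type is `org.w3c.dom.Document`, which the text above does not state, and it builds a small stand-in Document by hand instead of calling the library. The class `StructuredDocDemo`, its methods, and the `Document`/`P` element names are hypothetical. The emptiness test (no root element, or a root with no children) is also an assumption about what "empty" means here.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; the Documents here stand in for the value
// returned by getStructuredTextContent().
class StructuredDocDemo {

    // Creates a Document with no root element.
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates a tiny Document with one paragraph element (illustrative names).
    static Document sampleDocument() {
        Document doc = emptyDocument();
        Element root = doc.createElement("Document");
        Element para = doc.createElement("P");
        para.setTextContent("Hello");
        root.appendChild(para);
        doc.appendChild(root);
        return doc;
    }

    // Assumed emptiness test: no root element, or a root without children.
    static boolean hasStructure(Document doc) {
        return doc != null
                && doc.getDocumentElement() != null
                && doc.getDocumentElement().hasChildNodes();
    }

    // Serializes the Document to an XML string without the XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        if (hasStructure(doc)) {
            System.out.println(toXml(doc));
        } else {
            System.out.println("No structured content found");
        }
    }
}
```

Checking for structure before serializing avoids writing out blank files for PDFs that carry no Structured Content.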
    • setPassword

      public void setPassword(String password)
      Sets the password used to open the PDF file
      Parameters:
      password - the user or owner password for the PDF file
    • getPageCount

      public int getPageCount()
      Returns the number of pages in the PDF file (page numbering starts at 1)
      Returns:
      the page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      Opens the PDF file so that its content can be accessed
      Returns:
      true if successful
      Throws:
      PdfException - if there is a problem opening the PDF file
    • closePDFfile

      public void closePDFfile()
      Closes the PDF file and releases all resources. Call this once the file is no longer needed.
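
The open, use, close sequence described by the methods above can be sketched with try/finally so that the file is closed even when opening or reading fails. To keep the example runnable without the library, it uses a hypothetical `PdfSource` interface that mirrors three of the documented method names, plus an in-memory fake. Neither is part of the API; with the library on the classpath the same pattern would apply to an ExtractStructuredText instance.

```java
// Hypothetical sketch of the open / use / close lifecycle.
class LifecycleSketch {

    // Stand-in mirroring three documented method names (not the real class).
    interface PdfSource {
        boolean openPDFFile() throws Exception; // the documented method throws PdfException
        int getPageCount();
        void closePDFfile();
    }

    // Returns the page count, or -1 if the file could not be opened.
    static int pageCount(PdfSource source) {
        try {
            return source.openPDFFile() ? source.getPageCount() : -1;
        } catch (Exception e) {
            return -1;
        } finally {
            source.closePDFfile(); // always release resources
        }
    }

    // In-memory fake used only to exercise the pattern.
    static final class FakeSource implements PdfSource {
        private final boolean openable;
        boolean closed;

        FakeSource(boolean openable) {
            this.openable = openable;
        }

        @Override
        public boolean openPDFFile() {
            return openable;
        }

        @Override
        public int getPageCount() {
            return 3;
        }

        @Override
        public void closePDFfile() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource(true);
        System.out.println(pageCount(source) + " pages, closed=" + source.closed);
    }
}
```

Placing closePDFfile() in the finally block means callers do not need a separate cleanup path for the failure case.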