Class ExtractStructuredText

java.lang.Object
org.jpedal.examples.text.ExtractStructuredText

public class ExtractStructuredText extends Object

Extract Structured Content (if present) from PDF files


This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides static convenience methods if you just want to dump any structured outlines from a single PDF file or from a directory containing PDF files.
If no structure is present, a blank file is returned.

For non-structured files, see our Support Pages for more information on Text Extraction.
  • Constructor Summary

    Constructors
    Constructor
    Description
    ExtractStructuredText(byte[] byteArray)
    Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[] array
    ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
    Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[] array
    ExtractStructuredText(String fileName)
    Sets up an ExtractStructuredText instance to open a PDF file
    ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
    Sets up an ExtractStructuredText instance to open a PDF file
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    closePDFfile()
    Ensures the PDF file is closed once it is no longer needed and all resources are released
    int
    getPageCount()
    Number of pages in the PDF file (pages are numbered from 1)
    Document
    getStructuredTextContent()
    Gets any Structured text (if present) as a Document structure
    If the PDF file does not contain the metadata for Structured Content, an empty Document is returned
    static void
    main(String[] args)
    This class will allow you to extract any Structured Text data via command line from a single PDF file or a directory of PDF files.
    boolean
    openPDFFile()
    Opens the PDF file so that its content can be accessed
    void
    setPassword(String password)
    Sets the USER or OWNER password for the PDF file
    static void
    writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir)
    Convenience method to write any Structured text in a directory of PDF files
    static void
    writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir)
    Convenience method to write any Structured text in a directory of PDF files
    static void
    writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker)
    Convenience method to write any Structured text in a directory of PDF files
    static void
    writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
    Convenience method to write any Structured text in a directory of PDF files

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • ExtractStructuredText

      public ExtractStructuredText(String fileName)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray)
      Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[] array
      Parameters:
      byteArray - byte array holding the PDF data (BLOB)
    • ExtractStructuredText

      public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
      properties - ExtractStructuredTextProperties object for configuring extraction
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[] array
      Parameters:
      byteArray - byte array holding the PDF data (BLOB)
      properties - ExtractStructuredTextProperties object for configuring extraction
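The byte[] constructors expect the whole PDF to be in memory already. A minimal sketch of preparing such an array is shown here, assuming the data comes from a file rather than a database; the class name `PdfBytesSketch` and its helper methods are hypothetical and not part of JPedal. The call to the documented constructor appears only as a comment so that the sketch compiles without the library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfBytesSketch {

    // A well-formed PDF starts with the marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < PDF_MARKER.length; i++) {
            if (data[i] != PDF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads a whole file into memory, standing in for a BLOB fetched from elsewhere. */
    static byte[] load(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfBytesSketch <file.pdf>");
            return;
        }
        byte[] data = load(Paths.get(args[0]));
        if (looksLikePdf(data)) {
            // At this point the array could be handed to the documented constructor:
            // ExtractStructuredText extract = new ExtractStructuredText(data);
            System.out.println("Read " + data.length + " bytes of PDF data");
        } else {
            System.out.println("Not a PDF (missing %PDF- marker)");
        }
    }
}
```

The marker check is only a cheap sanity test before handing the data over; it does not validate the rest of the file.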
  • Method Details

    • main

      public static void main(String[] args)
      This class will allow you to extract any Structured Text data via command line from a single PDF file or a directory of PDF files.
      The example expects two or three parameters:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write the outline data to
      Parameters:
      args - The expected arguments are described above.
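To illustrate the "single file or directory" input convention described for main, here is a rough sketch of how a caller might resolve the first argument into a list of PDF files before processing. This is not JPedal's implementation; the class `InputResolverSketch`, its methods, and the choice to match on the `.pdf` extension are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

public class InputResolverSketch {

    /** Returns the file itself, or the .pdf files directly inside a directory (not recursive). */
    static List<Path> resolvePdfInputs(Path input) throws IOException {
        List<Path> result = new ArrayList<>();
        if (Files.isDirectory(input)) {
            try (Stream<Path> entries = Files.list(input)) {
                entries.filter(Files::isRegularFile)
                       .filter(InputResolverSketch::hasPdfExtension)
                       .forEach(result::add);
            }
            Collections.sort(result); // stable order for predictable processing
        } else if (Files.isRegularFile(input) && hasPdfExtension(input)) {
            result.add(input);
        }
        return result;
    }

    /** Case-insensitive check of the file name extension. */
    static boolean hasPdfExtension(Path p) {
        return p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: InputResolverSketch <pdf file or directory> <output directory>");
            return;
        }
        // Make sure the output directory exists before anything is written to it.
        Path outputDir = Files.createDirectories(Paths.get(args[1]));
        for (Path pdf : resolvePdfInputs(Paths.get(args[0]))) {
            System.out.println(pdf + " -> " + outputDir);
        }
    }
}
```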
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF file
      outputDir - directory for writing out the extracted structured text
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF file
      outputDir - directory for writing out the extracted structured text
      errorTracker - a custom error tracker
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF file
      outputDir - directory for writing out the extracted structured text
      errorTracker - a custom error tracker
      properties - an ExtractStructuredTextProperties object for configuration
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out the extracted structured text
      Throws:
      PdfException - a PDF exception
    • getStructuredTextContent

      public Document getStructuredTextContent()
      Gets any Structured text (if present) as a Document structure
      If the PDF file does not contain the metadata for Structured Content, an empty Document is returned
      Returns:
      the Document holding the structured text, or an empty Document if none is present
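Once a Document has been returned, a common next step is to check whether it is empty and to serialize it. The sketch below assumes the returned type is `org.w3c.dom.Document` and that "empty" means a Document with no root element; neither is stated on this page. The class `StructuredDocumentSketch` and the element names in `sampleDocument` are hypothetical, and a hand-built Document stands in for the JPedal result so the example runs with the JDK alone.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class StructuredDocumentSketch {

    /** Assumption: an "empty" Document is one without a root element. */
    static boolean isEmpty(Document doc) {
        return doc == null || doc.getDocumentElement() == null;
    }

    /** Serializes the Document to an XML string; returns "" for an empty Document. */
    static String toXml(Document doc) throws TransformerException {
        if (isEmpty(doc)) {
            return "";
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Builds a small stand-in Document; the element names are illustrative only. */
    static Document sampleDocument() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("Document");
        Element para = doc.createElement("P");
        para.setTextContent("Hello");
        root.appendChild(para);
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // In real use the Document would come from getStructuredTextContent().
        System.out.println(toXml(sampleDocument()));
    }
}
```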
    • setPassword

      public void setPassword(String password)
      Sets the USER or OWNER password for the PDF file
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Number of pages in the PDF file (pages are numbered from 1)
      Returns:
      page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      Opens the PDF file so that its content can be accessed
      Returns:
      true if successful
      Throws:
      PdfException - if problem with opening PDF file
    • closePDFfile

      public void closePDFfile()
      Ensures the PDF file is closed once it is no longer needed and all resources are released
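The open, use, close sequence implied by openPDFFile and closePDFfile can be sketched with a try/finally block so that the file is released even if extraction fails. To keep the example compilable without JPedal, it uses a hypothetical stand-in interface, `PdfSession`, that mirrors only the three method names documented above; `LifecycleSketch` and `withOpenPdf` are likewise illustrative names. It also assumes that calling closePDFfile after a failed open is harmless, which this page does not confirm.

```java
import java.util.function.Function;

public class LifecycleSketch {

    /** Hypothetical stand-in mirroring the lifecycle methods documented for ExtractStructuredText. */
    interface PdfSession {
        boolean openPDFFile() throws Exception;
        int getPageCount();
        void closePDFfile();
    }

    /**
     * Opens the session, applies the action and always closes the session afterwards.
     * Returns null if the file could not be opened.
     */
    static <T> T withOpenPdf(PdfSession session, Function<PdfSession, T> action) throws Exception {
        try {
            if (!session.openPDFFile()) {
                return null;
            }
            return action.apply(session);
        } finally {
            // Runs on success, on a failed open, and when the action throws.
            session.closePDFfile();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fake session used purely to demonstrate the control flow.
        PdfSession fake = new PdfSession() {
            @Override public boolean openPDFFile() { return true; }
            @Override public int getPageCount() { return 3; }
            @Override public void closePDFfile() { System.out.println("closed"); }
        };
        Integer pages = withOpenPdf(fake, PdfSession::getPageCount);
        System.out.println("pages = " + pages);
    }
}
```

With the real class, the same shape would apply directly: construct the instance, call openPDFFile, work with the result inside the try block, and call closePDFfile in the finally block.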