Class ExtractStructuredText

java.lang.Object
org.jpedal.examples.BaseExample
org.jpedal.examples.text.ExtractStructuredText

public class ExtractStructuredText extends org.jpedal.examples.BaseExample

Extract Structured Content (if present) from PDF files


This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides static convenience methods if you just want to dump any structured outlines from a PDF file or a directory containing PDF files.
If no structure is present, a blank file is returned.

For non-structured files, consider:

See our Support Pages for more information on Text Extraction
  • Constructor Details

    • ExtractStructuredText

      public ExtractStructuredText(String fileName)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray)
      Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[]
      Parameters:
      byteArray - byte array holding the PDF data (BLOB)
    • ExtractStructuredText

      public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
      properties - ExtractStructuredTextProperties object for configuring extraction
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[]
      Parameters:
      byteArray - byte array holding the PDF data (BLOB)
      properties - ExtractStructuredTextProperties object for configuring extraction
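The byte[] constructors above expect the whole PDF to be in memory already. The following is a minimal sketch of one way to prepare such an array using only the JDK; the class and method names (PdfBytes, looksLikePdf, readPdf) are hypothetical and are not part of JPedal. The header check is deliberately strict and is an assumption of this sketch, not a documented requirement of the library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of JPedal.
final class PdfBytes {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the "%PDF-" header marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads a whole file into memory, rejecting data that lacks the PDF header. */
    static byte[] readPdf(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        if (!looksLikePdf(data)) {
            throw new IOException("Not a PDF (missing %PDF- header): " + file);
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readPdf(Paths.get(args[0]));
        // The array could then be passed to the byte[] constructor documented above,
        // e.g. new ExtractStructuredText(data)
        System.out.println("Read " + data.length + " bytes");
    }
}
```

Some PDF readers tolerate a few bytes before the header, so a production check may want to be more lenient than this one.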
  • Method Details

    • decodeFile

      public void decodeFile(String file_name) throws PdfException
      Decodes the specified file
      Parameters:
      file_name - name of the file to decode
      Throws:
      PdfException
    • main

      public static void main(String[] args)
      Command-line entry point for extracting any Structured Text data from a single PDF file or a directory of PDF files.
      The example expects the following parameters:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write out the outline data
      • (Optional, but required if Value 4 is present) Value 3 is the outline data file format
      • (Optional) Value 4 is the directory to write out the figures data
      • (Optional) Value 5 is the figures output format
      Parameters:
      args - The expected arguments are described above.
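The positional rules above can be sketched as a small argument holder. This is only an illustration of the documented ordering, not JPedal's actual parsing code; the class name ExtractArgs and its field names are hypothetical. Because the values are positional, Value 4 can only be supplied when Value 3 is also supplied, which is why no separate check for that rule is needed.

```java
// Hypothetical holder for the command-line values described above; not part of JPedal.
final class ExtractArgs {
    final String input;          // Value 1: file name or directory of PDF files
    final String outlineDir;     // Value 2: directory for the outline data
    final String outlineFormat;  // Value 3: outline data file format, null if omitted
    final String figuresDir;     // Value 4: directory for the figures data, null if omitted
    final String figuresFormat;  // Value 5: figures output format, null if omitted

    private ExtractArgs(String[] a) {
        input = a[0];
        outlineDir = a[1];
        outlineFormat = valueAt(a, 2);
        figuresDir = valueAt(a, 3);
        figuresFormat = valueAt(a, 4);
    }

    /** Accepts 2 to 5 positional values and rejects anything else. */
    static ExtractArgs parse(String[] args) {
        int count = args == null ? 0 : args.length;
        if (count < 2 || count > 5) {
            throw new IllegalArgumentException("Expected 2 to 5 arguments, got " + count);
        }
        return new ExtractArgs(args);
    }

    /** Returns the value at the given position, or null if it was not supplied. */
    private static String valueAt(String[] a, int index) {
        return index < a.length ? a[index] : null;
    }

    public static void main(String[] args) {
        ExtractArgs parsed = parse(args);
        System.out.println("Input: " + parsed.input + ", outline directory: " + parsed.outlineDir);
    }
}
```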
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
      Convenience method to write out any Structured text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out structured text
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
      Convenience method to write out any Structured text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
      Convenience method to write out any Structured text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      properties - an ExtractStructuredTextProperties object for configuration
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesAndFiguresToDir

      public static void writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat) throws PdfException
      Convenience method to write out any Structured text and figures found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      properties - an ExtractStructuredTextProperties object for configuration
      figuresDir - directory for writing out figures
      figuresFormat - image file format for writing figures
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
      Convenience method to write out any Structured text found in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out structured text
      Throws:
      PdfException - a PDF exception
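The convenience methods above take a directory of PDF files, but this page does not say how files inside that directory are selected. As a rough pre-check before calling them, the sketch below lists the candidate inputs using only the JDK. It assumes a non-recursive scan and a case-insensitive ".pdf" extension match; both are assumptions of this example rather than documented JPedal behaviour, and the class name PdfDirectory is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of JPedal.
final class PdfDirectory {

    /** Lists regular files directly inside dir whose names end in ".pdf" (any case), sorted by path. */
    static List<Path> listPdfFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        for (Path pdf : listPdfFiles(Paths.get(args[0]))) {
            System.out.println(pdf);
        }
    }
}
```

An empty result is a cheap signal that the input directory is wrong before any extraction is attempted.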
    • getStructuredTextContent

      public Document getStructuredTextContent()
      Gets any Structured text (if present) as a Document structure
      If the PDF file does not contain the metadata for Structured Content, an empty Document is returned
      Returns:
      Document
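The return type is shown here only as Document. Assuming it is org.w3c.dom.Document (an assumption, since the package is not given on this page), the result can be checked for content and serialised with the standard JDK XML APIs. The sketch below builds a stand-in document so that it runs without JPedal; the element names "Document" and "P" and the class name StructuredTextDom are hypothetical. Treating a missing or childless root element as "empty" is also an assumption of this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of JPedal.
final class StructuredTextDom {

    /** Treats a null document, a missing root, or a root with no children as empty. */
    static boolean isEmpty(Document doc) {
        return doc == null
                || doc.getDocumentElement() == null
                || !doc.getDocumentElement().hasChildNodes();
    }

    /** Serialises the DOM tree to an XML string without the XML declaration. */
    static String toXml(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the value returned by getStructuredTextContent().
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("Document");
        Element paragraph = doc.createElement("P");
        paragraph.setTextContent("Hello");
        root.appendChild(paragraph);
        doc.appendChild(root);

        System.out.println(isEmpty(doc) ? "(no structured content)" : toXml(doc));
    }
}
```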
    • getStructuredTextContentPerPage

      public Document[] getStructuredTextContentPerPage()
    • getStructuredTextContentAndFigures

      public Document getStructuredTextContentAndFigures(String figureDir, String imageFormat) throws IOException
      Gets the marked content from the Document and also writes out the figures to a supplied directory
      Parameters:
      figureDir - The directory to write the figure images
      imageFormat - The format in which to write the figure images
      Returns:
      The marked content document
      Throws:
      IOException - If there is a problem with writing the images
    • getStructuredTextContentAndFiguresPerPage

      public Document[] getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat) throws IOException
      Throws:
      IOException
    • setPassword

      public void setPassword(String password)
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Gets the number of pages in the PDF file (page numbers start at 1)
      Returns:
      page count