Class ExtractStructuredText

java.lang.Object
org.jpedal.examples.BaseExample
org.jpedal.examples.text.ExtractStructuredText

public class ExtractStructuredText extends org.jpedal.examples.BaseExample

Extract Structured Content (if present) from PDF files

This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides a static convenience method for dumping any structured outlines from a single PDF file or a directory containing PDF files.

If no Structure is present, a blank file is returned.

For non-structured files, consider:

See our Support Pages for more information on Text Extraction
  • Constructor Details

    • ExtractStructuredText

      public ExtractStructuredText(String fileName)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray)
      Sets up an ExtractStructuredText instance to open a PDF file held in memory as a byte[]
      Parameters:
      byteArray - array holding the PDF file data
    • ExtractStructuredText

      public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
      properties - ExtractStructuredTextProperties object for configuring extraction
    • ExtractStructuredText

      public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
      Sets up an ExtractStructuredText instance to open a PDF file held in memory as a byte[]
      Parameters:
      byteArray - array holding the PDF file data
      properties - ExtractStructuredTextProperties object for configuring extraction
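
The byte[] constructors take the whole PDF in memory. The following is a minimal sketch of preparing that array with the standard library only. The class name PdfBytes, its helper methods, and the header check are illustrative assumptions and not part of JPedal; the constructor call itself is shown as a comment because it needs the JPedal jar on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of JPedal): loads a PDF into memory so the
// array can be handed to the byte[] constructor documented above.
final class PdfBytes {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytes() { }

    // Cheap sanity check: PDF data normally starts with the marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the whole file; I/O failures are rethrown unchecked for brevity.
    static byte[] read(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: PdfBytes <file.pdf>");
            return;
        }
        byte[] data = read(Paths.get(args[0]));
        if (!looksLikePdf(data)) {
            throw new IllegalArgumentException("Not a PDF: " + args[0]);
        }
        // With JPedal on the classpath the array would then be passed on:
        // ExtractStructuredText extract = new ExtractStructuredText(data);
        System.out.println("Loaded " + data.length + " bytes");
    }
}
```

Checking the header before constructing the extractor is optional; it only gives an earlier and clearer failure when the array does not hold PDF data.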
  • Method Details

    • decodeFile

      public void decodeFile(String file_name) throws PdfException
      routine to decode a file
      Throws:
      PdfException
    • main

      public static void main(String[] args)
      Command-line entry point for extracting any Structured Text data from a single PDF file or a directory of PDF files.
      The example expects the following parameters:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write out the outline data
      • (Optional, but required if Value 4 is present) Value 3 is the outline data file format
      • Value 4 is the directory to write out the figures data
      • (Optional) Value 5 is the figures output format
      Parameters:
      args - The expected arguments are described above.
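
The positional rules above can be sketched as a small argument holder. This is one reading of the list, not the validation that main actually performs: it assumes Values 1 and 2 are required and that between two and five values are accepted. The class ExtractArgs and the sample values ("xml", "png", the paths) are hypothetical placeholders.

```java
// Hypothetical holder for the positional arguments described above.
// It mirrors the documented order only; it does not call JPedal.
final class ExtractArgs {
    final String input;         // Value 1: PDF file or directory of PDF files
    final String outlineDir;    // Value 2: directory for the outline data
    final String outlineFormat; // Value 3: null if omitted
    final String figuresDir;    // Value 4: null if omitted
    final String figuresFormat; // Value 5: null if omitted

    private ExtractArgs(String[] a) {
        input = a[0];
        outlineDir = a[1];
        outlineFormat = a.length > 2 ? a[2] : null;
        figuresDir = a.length > 3 ? a[3] : null;
        figuresFormat = a.length > 4 ? a[4] : null;
    }

    // Because the values are positional, Value 4 can only be supplied
    // when Value 3 is also supplied, which matches the rule in the list.
    static ExtractArgs parse(String... args) {
        if (args == null || args.length < 2 || args.length > 5) {
            throw new IllegalArgumentException(
                "Expected: input outlineDir [outlineFormat [figuresDir [figuresFormat]]]");
        }
        return new ExtractArgs(args);
    }

    boolean wantsFigures() {
        return figuresDir != null;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            // Placeholder values for a dry run.
            args = new String[] {"in.pdf", "out", "xml", "figures"};
        }
        ExtractArgs parsed = parse(args);
        System.out.println("input=" + parsed.input + ", figures=" + parsed.wantsFigures());
    }
}
```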
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for pdf file
      outputDir - directory for writing out structured text
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for pdf file
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for pdf file
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      properties - an ExtractStructuredTextProperties object for configuration
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesAndFiguresToDir

      public static void writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat) throws PdfException
      Convenience method to write any Structured text and figures in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for pdf file
      outputDir - directory for writing out structured text
      errorTracker - a custom error tracker
      properties - an ExtractStructuredTextProperties object for configuration
      figuresDir - directory for writing out figures
      figuresFormat - image file format for writing figures
      Throws:
      PdfException - a PDF exception
    • writeAllStructuredTextOutlinesToDir

      public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
      Convenience method to write any Structured text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out structured text
      Throws:
      PdfException - a PDF exception
    • getStructuredTextContent

      public Document getStructuredTextContent()
      Gets any Structured text (if present) as a Document structure
      If the PDF does not contain the metadata for Structured Content, an empty Document is returned
      Returns:
      Document
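
A sketch of handling the returned Document follows. It assumes the unqualified return type is org.w3c.dom.Document, which this page does not state, and it treats "empty" as either no root element or a root without children. The class StructuredXml, the sample() stand-in, and the tag names "Document" and "P" are invented for illustration; in real use the Document would come from getStructuredTextContent().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper (not part of JPedal) for inspecting a DOM Document.
final class StructuredXml {

    private StructuredXml() { }

    // Assumption: "empty" means no root element, or a root with no children.
    static boolean isEmpty(Document doc) {
        if (doc == null) {
            return true;
        }
        Element root = doc.getDocumentElement();
        return root == null || !root.hasChildNodes();
    }

    // Serialises the DOM tree to a String without the XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    // Stand-in for the extractor's result; the tag names are invented.
    static Document sample(String... paragraphs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            doc.appendChild(root);
            for (String text : paragraphs) {
                Element p = doc.createElement("P");
                p.setTextContent(text);
                root.appendChild(p);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample("Hello");
        System.out.println(isEmpty(doc) ? "(no structured content)" : toXml(doc));
    }
}
```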
    • getStructuredTextContentPerPage

      public Document[] getStructuredTextContentPerPage()
      Gets any Structured text (if present) per page, as an array of Documents
      If the PDF does not contain the metadata for Structured Content, an empty Document is returned
      Returns:
      Document[], one element per page
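
The per-page array can be walked as in the sketch below. It assumes the elements are org.w3c.dom.Document instances, that index 0 corresponds to page 1, and that a page without structured content may appear as either null or an empty Document; none of this is confirmed by this page. The class PerPageSummary and the page() stand-in are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper (not part of JPedal) that summarises a per-page array.
final class PerPageSummary {

    private PerPageSummary() { }

    // Counts the elements in each page's Document; null or empty pages give 0.
    static int[] elementCounts(Document[] pages) {
        int[] counts = new int[pages.length];
        for (int i = 0; i < pages.length; i++) {
            Document page = pages[i];
            counts[i] = page == null ? 0 : page.getElementsByTagName("*").getLength();
        }
        return counts;
    }

    // Stand-in for one page of extractor output; tag names are invented.
    // With no tags the Document is left without a root element.
    static Document page(String... tags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            if (tags.length > 0) {
                Element root = doc.createElement("Document");
                doc.appendChild(root);
                for (String tag : tags) {
                    root.appendChild(doc.createElement(tag));
                }
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document[] pages = {page("H1", "P"), page()};
        int[] counts = elementCounts(pages);
        for (int i = 0; i < counts.length; i++) {
            // Assumption: array index 0 holds page 1.
            System.out.println("page " + (i + 1) + ": " + counts[i] + " elements");
        }
    }
}
```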
    • getStructuredTextContentAndFigures

      public Document getStructuredTextContentAndFigures(String figureDir, String imageFormat) throws IOException
      Gets the marked content from the Document and also writes out the figures to a supplied directory
      Parameters:
      figureDir - The directory to write the figure images
      imageFormat - The format in which to write the figure images
      Returns:
      The marked content document
      Throws:
      IOException - If there is a problem with writing the images
    • getStructuredTextContentAndFiguresPerPage

      public Document[] getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat) throws IOException
      Gets the marked content from the Document and also writes out the figures to a supplied directory
      Parameters:
      figureDir - The directory to write the figure images
      imageFormat - The format in which to write the figure images
      Returns:
      The marked content document as an array with each page per element
      Throws:
      IOException - If there is a problem with writing the images
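
Both figure-writing methods take a target directory as a String. This page does not say whether that directory must already exist, so the sketch below creates it up front and counts the files found in it afterwards. The class FigureOutput and its methods are hypothetical helpers that use only the standard library; the JPedal call is shown as a comment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of JPedal) for managing the figures directory.
final class FigureOutput {

    private FigureOutput() { }

    // Creates the directory (and any missing parents) and returns its absolute path.
    static String prepareDir(String dir) {
        try {
            return Files.createDirectories(Paths.get(dir)).toAbsolutePath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts regular files directly inside the directory.
    static long countFiles(String dir) {
        try (Stream<Path> entries = Files.list(Paths.get(dir))) {
            return entries.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String fallback = System.getProperty("java.io.tmpdir")
                + java.io.File.separator + "figures-sketch";
        String figureDir = prepareDir(args.length > 0 ? args[0] : fallback);
        // With JPedal on the classpath, the directory would be passed on here, e.g.
        // extract.getStructuredTextContentAndFigures(figureDir, imageFormat);
        System.out.println(countFiles(figureDir) + " file(s) in " + figureDir);
    }
}
```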
    • setPassword

      public void setPassword(String password)
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Number of pages in the PDF file (pages are numbered from 1)
      Returns:
      page count