Extract Structured Content (if present) from PDF files

This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides static convenience methods if you just want to dump any structured outlines from a PDF file or a directory containing PDF files.

If no Structure is present, a blank file is returned.

For non-structured files, consider:

See our Support Pages for more information on Text Extraction

Constructor Summary

Constructors

Constructor

Description

ExtractStructuredText(byte[] byteArray)

Sets up an ExtractStructuredText instance to open a PDF file held in memory as a BLOB in a byte[] array

ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)

Sets up an ExtractStructuredText instance to open a PDF file held in memory as a BLOB in a byte[] array

ExtractStructuredText(String fileName)

Sets up an ExtractStructuredText instance to open a PDF file

ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)

Sets up an ExtractStructuredText instance to open a PDF file

Method Summary

Modifier and Type

Method

Description

void

decodeFile(String file_name)

Routine to decode a file

int

getPageCount()

Returns the number of pages in the PDF file (page numbering starts at 1)

Document

getStructuredTextContent()

Gets any Structured text (if present) as a Document structure.
If the PDF file does not contain the metadata for Structured Content, an empty Document is returned.

Document

getStructuredTextContentAndFigures(String figureDir, String imageFormat)

Gets the marked content from the Document and also writes out the figures to a supplied directory

Document[]

getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat)

Gets the marked content from the Document per page and also writes out the figures to a supplied directory

Document[]

getStructuredTextContentPerPage()

Gets any Structured text (if present) per page, as an array of Documents.
If the PDF file does not contain the metadata for Structured Content, an empty Document is returned.

static void

main(String[] args)

This class will allow you to extract any Structured Text data via command line from a single PDF file or a directory of PDF files.

void

setPassword(String password)

static void

writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat)

Convenience method to write any Structured text and figures from a directory of PDF files

static void

writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir)

Convenience method to write any Structured text in a directory of PDF files

static void

writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir)

Convenience method to write any Structured text in a directory of PDF files

static void

writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker)

Convenience method to write any Structured text in a directory of PDF files

static void

writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)

Convenience method to write any Structured text in a directory of PDF files

Methods inherited from class org.jpedal.examples.BaseExample
closePDFfile, openPDFFile

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
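
The getters in the summary above are described as returning an empty Document when the PDF carries no Structured Content. The following is a minimal sketch of how a caller might test for that case. It assumes the returned type is org.w3c.dom.Document (the summary only says Document) and that "empty" means either no root element or a root element without children. The class StructuredContentCheck and its methods are hypothetical helpers, not part of the library.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; not part of the documented API.
class StructuredContentCheck {

    // Assumption: an "empty" Document has no root element, or a root with no child nodes.
    public static boolean hasStructuredContent(Document doc) {
        if (doc == null) {
            return false;
        }
        Element root = doc.getDocumentElement();
        return root != null && root.hasChildNodes();
    }

    // Counts the entries of a per-page array that carry any content.
    public static int countPagesWithContent(Document[] pages) {
        int count = 0;
        if (pages != null) {
            for (Document page : pages) {
                if (hasStructuredContent(page)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Builds a blank DOM Document so the sketch can run without a PDF.
    public static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In real use the Document would come from getStructuredTextContent().
        Document blank = newDocument();
        System.out.println("Has structured content: " + hasStructuredContent(blank));
    }
}
```

The same check can be applied to each element of the array returned by the per-page getters to find out which pages contributed content.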

Constructor Details
- ExtractStructuredText
  
  public ExtractStructuredText(String fileName)
  
  Sets up an ExtractStructuredText instance to open a PDF file
  
  Parameters:
  
  fileName - full path to a single PDF file
- ExtractStructuredText
  
  public ExtractStructuredText(byte[] byteArray)
  
  Sets up an ExtractStructuredText instance to open a PDF file held in memory as a BLOB in a byte[] array
  
  Parameters:
  
  byteArray - array holding the PDF file data (BLOB)
- ExtractStructuredText
  
  public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
  
  Sets up an ExtractStructuredText instance to open a PDF file
  
  Parameters:
  
  fileName - full path to a single PDF file
  
  properties - ExtractStructuredTextProperties object for configuring extraction
- ExtractStructuredText
  
  public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
  
  Sets up an ExtractStructuredText instance to open a PDF file held in memory as a BLOB in a byte[] array
  
  Parameters:
  
  byteArray - array holding the PDF file data (BLOB)
  
  properties - ExtractStructuredTextProperties object for configuring extraction
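
Two of the constructors above take the PDF as a byte[] rather than a file path. As a sketch of how such an array might be prepared with the standard library only, the hypothetical PdfBytesLoader below reads a file into memory and does a simple check that the data begins with the "%PDF-" header. The check is an assumption made for illustration and is stricter than some readers, which tolerate a few leading bytes. The constructor call is shown as a comment because it needs the library on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; not part of the documented API.
class PdfBytesLoader {

    // Reads the whole file into memory and rejects data without a PDF header.
    public static byte[] loadPdfBytes(Path pdf) throws IOException {
        byte[] data = Files.readAllBytes(pdf);
        if (!looksLikePdf(data)) {
            throw new IOException("File does not start with %PDF-: " + pdf);
        }
        return data;
    }

    // True when the array begins with the ASCII bytes "%PDF-".
    public static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: PdfBytesLoader <file.pdf>");
            return;
        }
        byte[] data = loadPdfBytes(Paths.get(args[0]));
        // With the library available, the array would then be handed to the
        // byte[] constructor documented above:
        // ExtractStructuredText extract = new ExtractStructuredText(data);
        System.out.println("Loaded " + data.length + " bytes");
    }
}
```
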
Method Details
- decodeFile
  
  public void decodeFile(String file_name) throws PdfException
  
  Routine to decode a file
  
  Throws:
  
  PdfException
- main
  
  public static void main(String[] args)
  
  This method allows you to extract any Structured Text data via the command line from a single PDF file or a directory of PDF files.
  
  The example expects the following parameters:
  
  Value 1 is the file name or directory of PDF files to process
  
  Value 2 is the directory to write out the outline data
  
  (Optional, but required if Value 4 is present) Value 3 is the outline data file format
  
  Value 4 is the directory to write out the figures data
  
  (Optional) Value 5 is the figures output format
  
  Parameters:
  
  args - The expected arguments are described above.
- writeAllStructuredTextOutlinesToDir
  
  public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
  
  Convenience method to write any Structured text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out structured text
  
  Throws:
  
  PdfException - a PDF exception
- writeAllStructuredTextOutlinesToDir
  
  public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
  
  Convenience method to write any Structured text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out structured text
  
  errorTracker - a custom error tracker
  
  Throws:
  
  PdfException - a PDF exception
- writeAllStructuredTextOutlinesToDir
  
  public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
  
  Convenience method to write any Structured text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out structured text
  
  errorTracker - a custom error tracker
  
  properties - an ExtractStructuredTextProperties object for configuration
  
  Throws:
  
  PdfException - a PDF exception
- writeAllStructuredTextOutlinesAndFiguresToDir
  
  public static void writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat) throws PdfException
  
  Convenience method to write any Structured text and figures from a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out structured text
  
  errorTracker - a custom error tracker
  
  properties - an ExtractStructuredTextProperties object for configuration
  
  figuresDir - directory for writing out figures
  
  figuresFormat - image file format for writing figures
  
  Throws:
  
  PdfException - a PDF exception
- writeAllStructuredTextOutlinesToDir
  
  public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
  
  Convenience method to write any Structured text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  outputDir - directory for writing out structured text
  
  Throws:
  
  PdfException - a PDF exception
- getStructuredTextContent
  
  public Document getStructuredTextContent()
  
  Gets any Structured text (if present) as a Document structure.
  If the PDF file does not contain the metadata for Structured Content, an empty Document is returned.
  
  Returns:
  
  Document
- getStructuredTextContentPerPage
  
  public Document[] getStructuredTextContentPerPage()
  
  Gets any Structured text (if present) per page, as an array of Documents.
  If the PDF file does not contain the metadata for Structured Content, an empty Document is returned.
  
  Returns:
  
  Array of Documents, one per page
- getStructuredTextContentAndFigures
  
  public Document getStructuredTextContentAndFigures(String figureDir, String imageFormat) throws IOException
  
  Gets the marked content from the Document and also writes out the figures to a supplied directory
  
  Parameters:
  
  figureDir - The directory to write the figure images
  
  imageFormat - The format in which to write the figure images
  
  Returns:
  
  The marked content document
  
  Throws:
  
  IOException - If there is a problem with writing the images
- getStructuredTextContentAndFiguresPerPage
  
  public Document[] getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat) throws IOException
  
  Gets the marked content from the Document per page and also writes out the figures to a supplied directory
  
  Parameters:
  
  figureDir - The directory to write the figure images
  
  imageFormat - The format in which to write the figure images
  
  Returns:
  
  The marked content as an array of Documents, one element per page
  
  Throws:
  
  IOException - If there is a problem with writing the images
- setPassword
  
  public void setPassword(String password)
  
  Parameters:
  
  password - the USER or OWNER password for the PDF file
- getPageCount
  
  public int getPageCount()
  
  Returns the number of pages in the PDF file (page numbering starts at 1)
  
  Returns:
  
  page count
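
The content getters above hand back a Document rather than text. The sketch below shows one way a caller might turn such an object into an XML string for logging or saving, assuming the type is org.w3c.dom.Document (the page only says Document). StructuredTextXml is a hypothetical helper, and the element names in sampleDocument() are made-up stand-ins so the example runs without a PDF. They are not the tag names the library produces.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; not part of the documented API.
class StructuredTextXml {

    // Serializes a DOM Document to a string without the XML declaration.
    public static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Stand-in for a Document returned by getStructuredTextContent();
    // the element names here are invented for the example.
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            Element paragraph = doc.createElement("P");
            paragraph.setTextContent("Hello");
            root.appendChild(paragraph);
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

For the per-page getters, the same method could be called on each element of the returned array.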
