Extract text from PDF files

This class provides a simple Java API for extracting text from a PDF file. It also provides static convenience methods for dumping all the text from a single PDF file or from a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
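
The input described above can be either one PDF file or a directory of them. As a minimal sketch of how such an input might be expanded into a list of files before processing, the following uses only java.nio.file. The class name PdfInputs, both helper methods, and the choice to select files by a ".pdf" extension are assumptions made for illustration; they are not part of the JPedal API and may not match how JPedal itself selects files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

public class PdfInputs {

    /** True if the name ends in ".pdf", ignoring case (assumed selection rule). */
    public static boolean isPdfName(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Expands a single file or a directory into a sorted list of PDF paths. */
    public static List<Path> collect(Path input) {
        List<Path> result = new ArrayList<>();
        if (Files.isRegularFile(input)) {
            if (isPdfName(input.getFileName().toString())) {
                result.add(input);
            }
            return result;
        }
        if (!Files.isDirectory(input)) {
            return result; // neither a file nor a directory: nothing to process
        }
        // Files.list is not recursive; only the top level of the directory is scanned.
        try (Stream<Path> entries = Files.list(input)) {
            entries.filter(Files::isRegularFile)
                   .filter(p -> isPdfName(p.getFileName().toString()))
                   .sorted()
                   .forEach(result::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Path pdf : collect(Paths.get(args[0]))) {
            System.out.println(pdf);
        }
    }
}
```

Each path returned by collect could then be passed as the fileName argument of the constructor documented below.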

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

ExtractTextInRectangle.OUTPUT_FORMAT

Constructor Summary

Constructors

Constructor

Description

ExtractTextInRectangle(byte[] byteArray)

Sets up an ExtractTextInRectangle instance to open a PDF file held in memory as a byte array

ExtractTextInRectangle(String fileName)

Sets up an ExtractTextInRectangle instance to open a PDF file

ExtractTextInRectangle(String fileName, boolean extractPlainText)

Sets up an ExtractTextInRectangle instance to open a PDF file, with a flag to extract plain text rather than XML

Method Summary

Modifier and Type

Method

Description

void

closePDFfile()

Closes the PDF file and releases all resources once it is no longer needed.

int

getPageCount()

Returns the number of pages in the PDF file (page numbers start at 1).

String

getTextOnPage(int page)

Extracts all text on the page as a String.

String

getTextOnPage(int page, int x1, int y1, int x2, int y2)

Extracts all text in a specified region of the page as a String. If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and the others are ignored.

String

getTextOnPage(int page, Rectangle rectangle)

Extracts all text in a specified region of the page as a String.

static void

main(String[] args)

Command-line entry point that extracts all text from a single PDF file or a directory of PDF files.

boolean

openPDFFile()

Opens the PDF file so that its content can be accessed.

void

setEstimateParagraphs(boolean estimateParagraphs)

void

setOutputFormat(ExtractTextInRectangle.OUTPUT_FORMAT format)

void

setPassword(String password)

static void

writeAllTextToDir(String inputDir, String outputDir, int maxPages)

Convenience method to write all the text in a directory of PDF files

static void

writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages)

Convenience method to write all the text in a directory of PDF files

static void

writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs)

Convenience method to write all the text in a directory of PDF files

static void

writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker)

Convenience method to write all the text in a directory of PDF files

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
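
getTextOnPage(int page, int x1, int y1, int x2, int y2) takes the top left corner first and the bottom right corner second. The sketch below shows one way to normalise two arbitrary corner points into that order. It assumes PDF-style coordinates in which y increases up the page, so the top edge has the larger y value; this page does not state the coordinate system, so check that assumption before relying on it. RegionCorners and toCorners are hypothetical names, not part of JPedal.

```java
import java.util.Arrays;

public class RegionCorners {

    /**
     * Orders two opposite corners as {x1, y1, x2, y2}:
     * top left first, then bottom right.
     * Assumption: y grows upward, so "top" is the larger y value.
     */
    public static int[] toCorners(int xa, int ya, int xb, int yb) {
        int left = Math.min(xa, xb);
        int right = Math.max(xa, xb);
        int top = Math.max(ya, yb);
        int bottom = Math.min(ya, yb);
        return new int[] {left, top, right, bottom};
    }

    public static void main(String[] args) {
        // Two corners given in arbitrary order.
        int[] c = toCorners(300, 100, 50, 700);
        System.out.println(Arrays.toString(c));
        // With JPedal available, the values would be passed on as
        // getTextOnPage(page, c[0], c[1], c[2], c[3]).
    }
}
```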

Constructor Details
- ExtractTextInRectangle
  
  public ExtractTextInRectangle(String fileName)
  
  Sets up an ExtractTextInRectangle instance to open a PDF File
  
  Parameters:
  
  fileName - full path to a single PDF file
- ExtractTextInRectangle
  
  public ExtractTextInRectangle(String fileName, boolean extractPlainText)
  
  Sets up an ExtractTextInRectangle instance to open a PDF File
  
  Parameters:
  
  fileName - full path to a single PDF file
  
  extractPlainText - flag to extract plain text rather than XML
- ExtractTextInRectangle
  
  public ExtractTextInRectangle(byte[] byteArray)
  
  Sets up an ExtractTextInRectangle instance to open a PDF file held in memory as a byte array
  
  Parameters:
  
  byteArray - PDF file data
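
The byte[] constructor is useful when the PDF is not a file on disk, for example when it comes from a database or a network response. As a minimal sketch, the code below reads a file into a byte array and runs a cheap sanity check on it before it would be handed to the constructor. PdfBytesLoader, load and looksLikePdf are hypothetical helpers, and the check relies only on the fact that PDF files normally begin with the ASCII marker "%PDF-".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfBytesLoader {

    /** Reads the whole file into memory. */
    public static byte[] load(Path pdf) throws IOException {
        return Files.readAllBytes(pdf);
    }

    /** Returns true if the data starts with the "%PDF-" header marker. */
    public static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = load(Paths.get(args[0]));
        if (!looksLikePdf(data)) {
            throw new IllegalArgumentException("Not a PDF: " + args[0]);
        }
        // With JPedal on the classpath, the bytes would now be passed to
        // the documented constructor: new ExtractTextInRectangle(data)
        System.out.println("Loaded " + data.length + " bytes");
    }
}
```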
Method Details
- setOutputFormat
  
  public void setOutputFormat(ExtractTextInRectangle.OUTPUT_FORMAT format)
  
  Sets the output format for the text content (TXT or XML).
- setEstimateParagraphs
  
  public void setEstimateParagraphs(boolean estimateParagraphs)
  
  Sets whether JPedal should estimate paragraph spacing in the output.
- getTextOnPage
  
  public String getTextOnPage(int page) throws PdfException
  
  Extracts all text on the page as a String.
  If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and the others are ignored.
  
  Parameters:
  
  page - page number (first page is 1)
  
  Returns:
  
  String with text
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- getTextOnPage
  
  public String getTextOnPage(int page, Rectangle rectangle) throws PdfException
  
  Extracts all text in a specified region of the page as a String. If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and the others are ignored.
  
  Parameters:
  
  page - page number (first page is 1)
  
  rectangle - the region of the page to extract text from
  
  Returns:
  
  String with text
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- getTextOnPage
  
  public String getTextOnPage(int page, int x1, int y1, int x2, int y2) throws PdfException
  
  Extracts all text in a specified region of the page as a String. If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and the others are ignored.
  
  Parameters:
  
  page - page number (first page is 1)
  
  x1 - top left corner x
  
  y1 - top left corner y
  
  x2 - bottom right corner x
  
  y2 - bottom right corner y
  
  Returns:
  
  String with text
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- main
  
  public static void main(String[] args)
  
  Command-line entry point that extracts all text from a single PDF file or a directory of PDF files.
  It expects two arguments:
  
  Value 1 is the file name or directory of PDF files to process
  
  Value 2 is the directory to write out the data
  
  Parameters:
  
  args - The expected arguments are described above.
- writeAllTextToDir
  
  public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) throws PdfException
  
  Convenience method to write all the text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for PDF files
  
  outputDir - directory for writing out the extracted text
  
  maxPages - limit to just the first maxPages of a document
  
  format - set the output format for the text content (TXT or XML)
  
  estimateParagraphs - set if JPedal should estimate paragraph spacing in output.
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- writeAllTextToDir
  
  public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) throws PdfException
  
  Convenience method to write all the text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for PDF files
  
  outputDir - directory for writing out the extracted text
  
  maxPages - limit to just the first maxPages of a document
  
  format - set the output format for the text content (TXT or XML)
  
  estimateParagraphs - set if JPedal should estimate paragraph spacing in output.
  
  errorTracker - a custom error tracker
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- writeAllTextToDir
  
  public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
  
  Convenience method to write all the text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for PDF files
  
  outputDir - directory for writing out the extracted text
  
  maxPages - limit to just the first maxPages of a document
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- writeAllTextToDir
  
  public static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) throws PdfException
  
  Convenience method to write all the text in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  outputDir - directory for writing out the extracted text
  
  maxPages - limit to just the first maxPages of a document
  
  Throws:
  
  PdfException - if problem with parsing and extracting text from PDF file
- setPassword
  
  public void setPassword(String password)
  
  Parameters:
  
  password - the USER or OWNER password for the PDF file
- getPageCount
  
  public int getPageCount()
  
  Returns the number of pages in the PDF file (page numbers start at 1).
  
  Returns:
  
  page count
- openPDFFile
  
  public boolean openPDFFile() throws PdfException
  
  Opens the PDF file so that its content can be accessed.
  
  Returns:
  
  true if successful
  
  Throws:
  
  PdfException - if problem with opening PDF file
- closePDFfile
  
  public void closePDFfile()
  
  Closes the PDF file and releases all resources once it is no longer needed.
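
Taken together, the methods above suggest a lifecycle of open, read each page from 1 to getPageCount(), then close. The following is a sketch of that loop, not JPedal's own implementation. So that it compiles without JPedal on the classpath, it works against a hypothetical TextSource interface that mirrors the documented method names, and it omits the PdfException that the real openPDFFile and getTextOnPage declare. FakeSource is an in-memory stand-in used only for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractAllPages {

    /**
     * Hypothetical stand-in mirroring the documented method names.
     * The real methods declare PdfException; it is left out here so
     * the sketch compiles without JPedal.
     */
    public interface TextSource {
        boolean openPDFFile();
        int getPageCount();
        String getTextOnPage(int page);
        void closePDFfile();
    }

    /** Opens the source, reads every page in order, and always closes it. */
    public static List<String> extractAll(TextSource source) {
        List<String> pages = new ArrayList<>();
        if (!source.openPDFFile()) {
            return pages; // open failed, so there is nothing to read
        }
        try {
            int pageCount = source.getPageCount();
            // Page numbers start at 1, so the loop runs 1..pageCount inclusive.
            for (int page = 1; page <= pageCount; page++) {
                pages.add(source.getTextOnPage(page));
            }
        } finally {
            source.closePDFfile(); // release resources even if a page fails
        }
        return pages;
    }

    /** In-memory fake used only to demonstrate the loop. */
    public static class FakeSource implements TextSource {
        private final String[] pageTexts;
        public boolean closed;

        public FakeSource(String... pageTexts) {
            this.pageTexts = pageTexts;
        }

        @Override public boolean openPDFFile() { return true; }
        @Override public int getPageCount() { return pageTexts.length; }
        @Override public String getTextOnPage(int page) { return pageTexts[page - 1]; }
        @Override public void closePDFfile() { closed = true; }
    }

    public static void main(String[] args) {
        for (String text : extractAll(new FakeSource("first page", "second page"))) {
            System.out.println(text);
        }
    }
}
```

With the real class, a thin adapter implementing TextSource could delegate to an ExtractTextInRectangle instance and wrap PdfException in an unchecked exception; setPassword would be called before openPDFFile for protected files.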
