Class ExtractTextInRectangle

java.lang.Object
org.jpedal.examples.text.ExtractTextInRectangle

public class ExtractTextInRectangle extends Object

Extract text from PDF files


This class provides a simple Java API to extract text from a PDF file. It also provides a static convenience method for dumping all the text from a single PDF file or from a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
  • Constructor Details

    • ExtractTextInRectangle

      public ExtractTextInRectangle(String fileName)
      Sets up an ExtractTextInRectangle instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractTextInRectangle

      public ExtractTextInRectangle(String fileName, boolean extractPlainText)
      Sets up an ExtractTextInRectangle instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
      extractPlainText - flag to extract plain text rather than XML
    • ExtractTextInRectangle

      public ExtractTextInRectangle(byte[] byteArray)
      Sets up an ExtractTextInRectangle instance to open a PDF file held in memory as a byte[]
      Parameters:
      byteArray - PDF file data
  • Method Details

    • setOutputFormat

      public void setOutputFormat(ExtractTextInRectangle.OUTPUT_FORMAT format)
    • setEstimateParagraphs

      public void setEstimateParagraphs(boolean estimateParagraphs)
    • getTextOnPage

      public String getTextOnPage(int page) throws PdfException
      Extracts all text on the page as a String.

      If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in the other orientations is ignored.

      Parameters:
      page - page number (first page is 1)
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • getTextOnPage

      public String getTextOnPage(int page, Rectangle rectangle) throws PdfException
      Extracts all text in a specified region of the page as a String. If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in the other orientations is ignored.
      Parameters:
      page - page number (first page is 1)
      rectangle - region of the page to extract text from
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • getTextOnPage

      public String getTextOnPage(int page, int x1, int y1, int x2, int y2) throws PdfException
      Extracts all text in a specified region of the page as a String. If the page contains text in multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in the other orientations is ignored.
      Parameters:
      page - page number (first page is 1)
      x1 - top left corner x
      y1 - top left corner y
      x2 - bottom right corner x
      y2 - bottom right corner y
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
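
      The corner-based overload takes a top left and a bottom right corner, so a caller holding two arbitrary opposite corners (from a mouse drag, say) has to order them first. The sketch below shows one way to do that. CornerOrder and normalize are hypothetical names, not part of JPedal, and the code assumes PDF-style coordinates with the origin at the bottom left of the page, so the top edge has the larger y value. The text above does not state the coordinate system, so check that assumption before relying on it.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of JPedal. */
final class CornerOrder {

    private CornerOrder() { }

    /**
     * Orders two opposite corners as {x1, y1, x2, y2}, where (x1, y1) is the
     * top left corner and (x2, y2) is the bottom right corner.
     *
     * ASSUMPTION: the origin is at the bottom left of the page, so "top"
     * means the larger y value.
     */
    static int[] normalize(int xa, int ya, int xb, int yb) {
        int left = Math.min(xa, xb);
        int right = Math.max(xa, xb);
        int top = Math.max(ya, yb);     // larger y is higher on the page
        int bottom = Math.min(ya, yb);
        return new int[] {left, top, right, bottom};
    }

    public static void main(String[] args) {
        // Corners supplied in the "wrong" order: bottom right first.
        int[] region = normalize(300, 600, 50, 700);
        System.out.println(Arrays.toString(region)); // prints [50, 700, 300, 600]
    }
}
```

      The four values could then be passed, in order, as x1, y1, x2 and y2 to getTextOnPage(int page, int x1, int y1, int x2, int y2).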
    • main

      public static void main(String[] args)
      Extracts all the text from a single PDF file, or from a directory of PDF files, via the command line.
      The example expects two arguments:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write the data to
      Parameters:
      args - The expected arguments are described above.
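
      A caller that wraps this entry point might want to check the two arguments before handing them over. The sketch below is one possible pre-check using only java.io.File. ArgsCheck and validate are hypothetical names, and the rules shown (input must exist, output must not be an existing regular file) are assumptions; how main itself reacts to bad arguments is not described above.

```java
import java.io.File;

/** Hypothetical pre-check for illustration only; not part of JPedal. */
final class ArgsCheck {

    private ArgsCheck() { }

    /**
     * Returns null when the arguments look usable, otherwise a short
     * message describing the first problem found.
     */
    static String validate(String[] args) {
        if (args == null || args.length != 2) {
            return "expected 2 arguments: <PDF file or directory> <output directory>";
        }
        File input = new File(args[0]);
        if (!input.exists()) {
            return "input not found: " + args[0];
        }
        File output = new File(args[1]);
        // ASSUMPTION: a missing output directory is acceptable here,
        // but an existing regular file in its place is not.
        if (output.exists() && !output.isDirectory()) {
            return "output is not a directory: " + args[1];
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        System.out.println(problem == null ? "arguments look usable" : problem);
    }
}
```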
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for PDF files
      outputDir - directory for writing out the extracted text
      maxPages - limit to just the first maxPages of a document
      format - set the output format for the text content (TXT or XML)
      estimateParagraphs - true if JPedal should estimate paragraph spacing in the output
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for PDF files
      outputDir - directory for writing out the extracted text
      maxPages - limit to just the first maxPages of a document
      format - set the output format for the text content (TXT or XML)
      estimateParagraphs - true if JPedal should estimate paragraph spacing in the output
      errorTracker - a custom error tracker
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for PDF files
      outputDir - directory for writing out the extracted text
      maxPages - limit to just the first maxPages of a document
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out the extracted text
      maxPages - limit to just the first maxPages of a document
      Throws:
      PdfException - if there is a problem parsing and extracting text from the PDF file
    • setPassword

      public void setPassword(String password)
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Returns the number of pages in the PDF file (page numbers start at 1)
      Returns:
      page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      Opens the PDF file so that its contents can be accessed
      Returns:
      true if successful
      Throws:
      PdfException - if there is a problem opening the PDF file
    • closePDFfile

      public void closePDFfile()
      Ensures the PDF file is closed once it is no longer needed and all resources are released
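
      Taken together, openPDFFile, getPageCount, getTextOnPage and closePDFfile suggest an open, loop, close pattern. The sketch below illustrates that pattern without depending on JPedal: PageTextSource is a hypothetical stand-in interface that copies four method names documented above (with a generic Exception in place of PdfException), and inMemory is a hypothetical fake used only to make the example runnable. With JPedal on the classpath, the same calls would presumably be made on an ExtractTextInRectangle instance instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; the types here are stand-ins, not JPedal classes. */
final class ExtractAllPagesSketch {

    private ExtractAllPagesSketch() { }

    /** Hypothetical stand-in mirroring four documented method names. */
    interface PageTextSource {
        boolean openPDFFile() throws Exception;
        int getPageCount();
        String getTextOnPage(int page) throws Exception;
        void closePDFfile();
    }

    /** Opens the source, reads every page (1-based), and always closes it. */
    static List<String> extractAll(PageTextSource source) throws Exception {
        List<String> pages = new ArrayList<>();
        if (!source.openPDFFile()) {
            return pages; // open reported failure, so there is nothing to read
        }
        try {
            int count = source.getPageCount();
            for (int page = 1; page <= count; page++) { // first page is 1
                pages.add(source.getTextOnPage(page));
            }
        } finally {
            source.closePDFfile(); // release resources even if a page fails
        }
        return pages;
    }

    /** Hypothetical in-memory fake so the sketch runs without a PDF. */
    static PageTextSource inMemory(String... pageTexts) {
        return new PageTextSource() {
            @Override public boolean openPDFFile() { return true; }
            @Override public int getPageCount() { return pageTexts.length; }
            @Override public String getTextOnPage(int page) { return pageTexts[page - 1]; }
            @Override public void closePDFfile() { }
        };
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractAll(inMemory("first page", "second page")));
    }
}
```

      Closing in a finally block is a design choice of this sketch: it keeps the close call on every path once the file has been opened, including the case where extraction throws partway through.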