Class ExtractTextInRectangle

java.lang.Object
org.jpedal.examples.text.ExtractTextInRectangle

public class ExtractTextInRectangle extends Object

Extract text from PDF files


This class provides a simple Java API for extracting text from a PDF file. It also provides a static convenience method for dumping all the text from a single PDF file or from a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static boolean
    isTest
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    ExtractTextInRectangle(byte[] byteArray)
    Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
    ExtractTextInRectangle(String fileName)
    Sets up an ExtractTextInRectangle instance to open a PDF File
    ExtractTextInRectangle(String fileName, boolean extractPlainText)
    Sets up an ExtractTextInRectangle instance to open a PDF File
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    closePDFfile()
    ensure PDF file is closed once no longer needed and all resources released
    int
    getPageCount()
    number of pages in PDF file (pages are numbered from 1)
    String
    getTextOnPage(int page)
    extract all text on page as a string value.
    String
    getTextOnPage(int page, int x1, int y1, int x2, int y2)
    extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored
    String
    getTextOnPage(int page, Rectangle rectangle)
    extract all text on page in a specified region as a string value.
    static void
    main(String[] args)
    Extracts all text from a single PDF file or a directory of PDF files via the command line.
    boolean
    openPDFFile()
    routine to open the PDF file so that it can be accessed
    void
    setPassword(String password)
     
    static void
    writeAllTextToDir(String inputDir, String outputDir, int maxPages)
    Convenience method to write all the text in a directory of PDF files
    static void
    writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages)
    Convenience method to write all the text in a directory of PDF files

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • isTest

      public static boolean isTest
  • Constructor Details

    • ExtractTextInRectangle

      public ExtractTextInRectangle(String fileName)
      Sets up an ExtractTextInRectangle instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractTextInRectangle

      public ExtractTextInRectangle(String fileName, boolean extractPlainText)
      Sets up an ExtractTextInRectangle instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
      extractPlainText - flag to extract plain text rather than XML
    • ExtractTextInRectangle

      public ExtractTextInRectangle(byte[] byteArray)
      Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
      Parameters:
      byteArray - PDF file data
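
The byte[] constructor expects the whole PDF file in memory. As a minimal sketch of preparing such an array with the standard library, the helper below reads a file and checks that it begins with the "%PDF-" header. The class name PdfBytes and the strict header check are assumptions made for illustration and are not part of JPedal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of JPedal: loads a PDF into memory
// so that it can be handed to the byte[] constructor.
final class PdfBytes {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the data begins with the "%PDF-" header marker.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the whole file and rejects data that does not start with the marker.
    static byte[] load(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        if (!looksLikePdf(data)) {
            throw new IOException("Not a PDF file: " + file);
        }
        return data;
    }
}
```

The array returned by load could then be passed to new ExtractTextInRectangle(byteArray).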
  • Method Details

    • getTextOnPage

      public String getTextOnPage(int page) throws PdfException
      extract all text on page as a string value.

      If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored

      Parameters:
      page - page number (first page is 1)
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
    • getTextOnPage

      public String getTextOnPage(int page, Rectangle rectangle) throws PdfException
      extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored
      Parameters:
      page - page number (first page is 1)
      rectangle - region of the page to extract text from
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
    • getTextOnPage

      public String getTextOnPage(int page, int x1, int y1, int x2, int y2) throws PdfException
      extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored
      Parameters:
      page - page number (first page is 1)
      x1 - top left corner x
      y1 - top left corner y
      x2 - bottom right corner x
      y2 - bottom right corner y
      Returns:
      String with text
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
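
The four coordinates describe the top left and bottom right corners of the region. The sketch below orders two arbitrary opposite corners into that form. It assumes PDF-style coordinates in which y increases upwards, so the top edge has the larger y value; that assumption is not stated in this reference, and the Region class is a hypothetical helper, not part of JPedal.

```java
// Hypothetical helper, not part of JPedal.
final class Region {
    final int x1; // top left corner x
    final int y1; // top left corner y
    final int x2; // bottom right corner x
    final int y2; // bottom right corner y

    private Region(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    // Builds a region from any two opposite corners.
    // Assumes y grows upwards, so "top" is the larger y value.
    static Region fromCorners(int ax, int ay, int bx, int by) {
        return new Region(
                Math.min(ax, bx), Math.max(ay, by),
                Math.max(ax, bx), Math.min(ay, by));
    }
}
```

The fields of the result could then be passed in order, as in getTextOnPage(page, r.x1, r.y1, r.x2, r.y2).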
    • main

      public static void main(String[] args)
      Extracts all text from a single PDF file or a directory of PDF files via the command line.
      The example expects two arguments:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is directory to write out the data
      Parameters:
      args - The expected arguments are described above.
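
A caller that forwards its own command-line values to main may want to check them first. The sketch below validates the two expected values. ExtractArgs is a hypothetical helper, and how main itself reacts to a wrong argument count is not documented here.

```java
// Hypothetical helper, not part of JPedal.
final class ExtractArgs {
    final String input;     // PDF file name or directory of PDF files
    final String outputDir; // directory to write the data to

    private ExtractArgs(String input, String outputDir) {
        this.input = input;
        this.outputDir = outputDir;
    }

    // Accepts exactly two values, in the order described above.
    static ExtractArgs parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                    "Expected 2 arguments: <PDF file or directory> <output directory>");
        }
        return new ExtractArgs(args[0], args[1]);
    }
}
```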
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for PDF files
      outputDir - directory for writing out wordlists
      maxPages - limit to just the first maxPages of a document
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
    • writeAllTextToDir

      public static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the text in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out wordlists
      maxPages - limit to just the first maxPages of a document
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
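
The maxPages parameter restricts processing to the first pages of each document. One way to read that limit is sketched below. PageLimit is a hypothetical helper, and this reference does not say how the library treats zero or negative values; the sketch simply selects no pages in that case.

```java
// Hypothetical helper, not part of JPedal.
final class PageLimit {

    // Page numbers (starting at 1) to process when only the first
    // maxPages pages of a document with pageCount pages are wanted.
    static int[] pagesToProcess(int pageCount, int maxPages) {
        int last = Math.max(0, Math.min(pageCount, maxPages));
        int[] pages = new int[last];
        for (int i = 0; i < last; i++) {
            pages[i] = i + 1;
        }
        return pages;
    }
}
```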
    • setPassword

      public void setPassword(String password)
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      number of pages in PDF file (pages are numbered from 1)
      Returns:
      page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      routine to open the PDF file so that it can be accessed
      Returns:
      true if successful
      Throws:
      PdfException - if problem with opening PDF file
    • closePDFfile

      public void closePDFfile()
      ensure PDF file is closed once no longer needed and all resources released
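
The methods above suggest an open, read, close sequence. The sketch below walks through it against a hypothetical TextSource interface that mirrors the documented method names, so that it compiles without the JPedal jar; it is not the library's own class. It also assumes that calling closePDFfile() is harmless when the open did not succeed.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractionSketch {

    // Hypothetical stand-in mirroring the methods documented above.
    interface TextSource {
        boolean openPDFFile() throws Exception;
        int getPageCount();
        String getTextOnPage(int page) throws Exception;
        void closePDFfile();
    }

    // Opens the source, collects the text of every page (numbered from 1)
    // and always closes the source afterwards.
    static List<String> extractAll(TextSource source) throws Exception {
        List<String> pages = new ArrayList<>();
        try {
            if (!source.openPDFFile()) {
                return pages; // could not be opened, nothing to read
            }
            for (int page = 1; page <= source.getPageCount(); page++) {
                pages.add(source.getTextOnPage(page));
            }
        } finally {
            // Assumption: closing is safe even if the open failed.
            source.closePDFfile();
        }
        return pages;
    }
}
```

With the real class, the same loop would call the methods on an ExtractTextInRectangle instance directly and handle PdfException.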