Class ExtractTextAsWordlist

java.lang.Object
org.jpedal.examples.text.ExtractTextAsWordlist

public class ExtractTextAsWordlist extends Object

Extract words and locations from PDF files


This class provides a simple Java API for extracting the text of a PDF file as words, together with each word's location on the page. It also provides a static convenience method for dumping all the word lists from a PDF file or from a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
  • Constructor Details

    • ExtractTextAsWordlist

      public ExtractTextAsWordlist(String fileName)
      Sets up an ExtractTextAsWordlist instance to open a PDF file
      Parameters:
      fileName - full path to a single PDF file
    • ExtractTextAsWordlist

      public ExtractTextAsWordlist(byte[] byteArray)
      Sets up an ExtractTextAsWordlist instance to open a PDF file held in memory as a BLOB in a byte[] array
      Parameters:
      byteArray - PDF file data
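For the byte[] constructor, the data has to be in memory first. The following is a minimal sketch of one way to obtain such an array using only the JDK. The class name PdfBytesSketch and the helpers loadBytes and looksLikePdf are hypothetical names introduced for illustration and are not part of JPedal. The header check relies on the fact that PDF files normally begin with the marker %PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfBytesSketch {

    // Reads a whole file into memory, e.g. standing in for a BLOB fetched from a database.
    public static byte[] loadBytes(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // True when the data begins with the "%PDF" marker that PDF files normally start with.
    public static boolean looksLikePdf(byte[] data) {
        byte[] marker = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny placeholder file so the sketch runs without a real PDF.
        Path temp = Files.createTempFile("sketch", ".pdf");
        Files.write(temp, "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII));
        byte[] data = loadBytes(temp);
        System.out.println(looksLikePdf(data)); // prints true
        Files.delete(temp);
    }
}
```

With JPedal on the classpath, an array obtained this way could be passed as the byteArray argument of the constructor above.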
  • Method Details

    • getWordsOnPage

      public List<String> getWordsOnPage(int page) throws PdfException
      Gets the individual words from the page's text content and returns them. Uses a default set of delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, String delimiters) throws PdfException
      Gets the individual words from the page's text content and returns them. Uses the provided delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
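To illustrate what "a String of characters to be used as delimiters" means, here is a minimal sketch using the JDK's StringTokenizer, in which every character of the delimiter string marks a word boundary. It is not JPedal's implementation, and JPedal's word detection may differ in detail. The class DelimiterSketch and the method splitIntoWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterSketch {

    // Splits text into words, treating every character in 'delimiters' as a word boundary.
    public static List<String> splitIntoWords(String text, String delimiters) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, delimiters);
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // Space, comma and full stop each end a word.
        System.out.println(splitIntoWords("Hello, PDF world. Done", " ,."));
        // prints [Hello, PDF, world, Done]
    }
}
```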
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters) throws PdfException
      Gets the individual words from a rectangular region of the page's text content, using the provided delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      x1 - The leftmost point to extract from.
      y1 - The topmost point to extract from.
      x2 - The rightmost point to extract from.
      y2 - The bottommost point to extract from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found in the given region of the page.
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, Rectangle rectangle, String delimiters) throws PdfException
      Gets the individual words from a rectangular region of the page's text content, using the provided delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      rectangle - Rectangle area on the page to extract words from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found in the given region of the page.
      Throws:
      PdfException - if there is a problem parsing or extracting text from the PDF file
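The two region overloads describe the same kind of area in different forms: four corner coordinates or a Rectangle. The sketch below shows one way to build a java.awt.Rectangle from two opposite corners. RegionSketch and fromCorners are hypothetical names, and this page does not state whether JPedal measures y from the top or the bottom of the page, so treat the mapping between the overloads as an assumption to verify against the Support Pages.

```java
import java.awt.Rectangle;

public class RegionSketch {

    // Builds a Rectangle from two opposite corners, whichever order they are given in.
    // Assumption: the Rectangle origin is the corner with the smaller x and y values.
    public static Rectangle fromCorners(int x1, int y1, int x2, int y2) {
        return new Rectangle(Math.min(x1, x2), Math.min(y1, y2),
                Math.abs(x2 - x1), Math.abs(y2 - y1));
    }

    public static void main(String[] args) {
        Rectangle region = fromCorners(50, 700, 300, 600);
        System.out.println(region.x + "," + region.y + " " + region.width + "x" + region.height);
        // prints 50,600 250x100
    }
}
```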
    • main

      public static void main(String[] args)
      Command-line entry point that extracts the words on each page as a list, from a single PDF file or from a directory of PDF files.
      The example expects two arguments:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write the word lists to
      Parameters:
      args - The expected arguments are described above.
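The two arguments described above could be checked along the following lines before any extraction starts. This is a sketch of argument handling only, not the code inside main. WordlistArgsSketch and describe are hypothetical names.

```java
import java.io.File;

public class WordlistArgsSketch {

    // Summarises a two-argument invocation: input (file or directory) and output directory.
    public static String describe(String[] args) {
        if (args.length != 2) {
            return "usage: <pdf file or directory> <output directory>";
        }
        File input = new File(args[0]);
        // Anything that is not an existing directory is treated as a single file here.
        String kind = input.isDirectory() ? "directory" : "single file";
        return kind + " -> " + args[1];
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```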
    • writeAllWordlistsToDir

      public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
      Convenience method to write out the word lists for all the PDF files in a directory
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out word lists
      maxPages - limit extraction to the first maxPages pages of each document
      Returns:
      total count of words extracted
      Throws:
      PdfException - if there is a problem parsing or extracting text from a PDF file
    • writeAllWordlistsToDir

      public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages, ErrorTracker errorTracker) throws PdfException
      Convenience method to write out the word lists for all the PDF files in a directory
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for the PDF files
      outputDir - directory for writing out word lists
      maxPages - limit extraction to the first maxPages pages of each document
      errorTracker - a custom error tracker
      Returns:
      total count of words extracted
      Throws:
      PdfException - if there is a problem parsing or extracting text from a PDF file
    • writeAllWordlistsToDir

      public static int writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages) throws PdfException
      Convenience method to write out the word lists for all the PDF files in a directory
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out word lists
      maxPages - limit extraction to the first maxPages pages of each document
      Returns:
      total count of words extracted
      Throws:
      PdfException - if there is a problem parsing or extracting text from a PDF file
    • setPassword

      public void setPassword(String password)
      Sets the password to use when opening a password-protected PDF file
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Returns the number of pages in the PDF file (page numbers start at 1)
      Returns:
      page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      Opens the PDF file so that its content can be accessed
      Returns:
      true if successful
      Throws:
      PdfException - if there is a problem opening the PDF file
    • closePDFfile

      public void closePDFfile()
      Closes the PDF file and releases all resources; call this once the file is no longer needed
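Taken together, the instance methods suggest an open, read, close lifecycle. The sketch below walks through it against a hypothetical WordSource interface that mirrors the documented method names (without PdfException), so that it compiles and runs without the JPedal jar. The interface, the in-memory fake and the method collectAllWords are all assumptions made for illustration; with JPedal on the classpath the same calls would be made on an ExtractTextAsWordlist instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordlistLifecycleSketch {

    // Hypothetical stand-in mirroring the method names documented above.
    public interface WordSource {
        boolean openPDFFile();
        int getPageCount();
        List<String> getWordsOnPage(int page);
        void closePDFfile();
    }

    // Opens the source, gathers the words of pages 1..pageCount, and always closes it.
    public static List<String> collectAllWords(WordSource source) {
        List<String> all = new ArrayList<>();
        try {
            if (source.openPDFFile()) {
                // Page numbers start at 1, so the loop runs from 1 to the page count inclusive.
                for (int page = 1; page <= source.getPageCount(); page++) {
                    all.addAll(source.getWordsOnPage(page));
                }
            }
        } finally {
            source.closePDFfile();
        }
        return all;
    }

    // In-memory fake with two "pages", used only to exercise the loop.
    public static WordSource fakeSource() {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("first", "page"),
                Arrays.asList("second"));
        return new WordSource() {
            public boolean openPDFFile() { return true; }
            public int getPageCount() { return pages.size(); }
            public List<String> getWordsOnPage(int page) { return pages.get(page - 1); }
            public void closePDFfile() { }
        };
    }

    public static void main(String[] args) {
        System.out.println(collectAllWords(fakeSource())); // prints [first, page, second]
    }
}
```

Placing closePDFfile in a finally block is a design choice of this sketch: it keeps the file from staying open if extraction fails part way through.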