Class ExtractTextAsWordlist

java.lang.Object
org.jpedal.examples.text.ExtractTextAsWordlist

public class ExtractTextAsWordlist extends Object

Extract words and locations from PDF files


This class provides a simple Java API to extract text from a PDF file as words, together with their locations on the page. It also provides a static convenience method for dumping all the wordlists from a PDF file or a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
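The methods listed on this page suggest an open, read, close lifecycle. The following is a minimal sketch of that pattern, not JPedal code: `WordlistSource` is a hypothetical stand-in interface that mirrors the documented method names so the example compiles without the library, and the 1-based page loop is an assumption based on the `getPageCount()` description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the method names documented on this page,
// so the sketch compiles without JPedal on the classpath.
interface WordlistSource {
    boolean openPDFFile() throws Exception;
    int getPageCount();
    List<String> getWordsOnPage(int page) throws Exception;
    void closePDFfile();
}

final class WordlistLifecycle {

    // Opens the source, collects the wordlist of every page, and closes the
    // source again once it has been opened successfully.
    static List<String> collectAllWords(WordlistSource source) throws Exception {
        List<String> words = new ArrayList<>();
        if (!source.openPDFFile()) {
            return words; // open failed, so nothing was read
        }
        try {
            // Assumption: pages are numbered from 1 up to getPageCount() inclusive.
            for (int page = 1; page <= source.getPageCount(); page++) {
                words.addAll(source.getWordsOnPage(page));
            }
        } finally {
            source.closePDFfile(); // release resources even if extraction fails
        }
        return words;
    }
}
```

With JPedal on the classpath, the same sequence of calls would presumably be made directly on an ExtractTextAsWordlist instance created with one of the constructors below.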
  • Constructor Summary

    Constructors
    Constructor
    Description
    ExtractTextAsWordlist(byte[] byteArray)
    Sets up an ExtractTextAsWordlist instance to open a PDF file contained as a BLOB within a byte[] stream
    ExtractTextAsWordlist(String fileName)
    Sets up an ExtractTextAsWordlist instance to open a PDF File
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    closePDFfile()
    Ensures the PDF file is closed once it is no longer needed and all resources are released.
    int
    getPageCount()
    Number of pages in the PDF file (pages are numbered from 1).
    List<String>
    getWordsOnPage(int page)
    Gets the individual words from the page's text content and returns them.
    List<String>
    getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters)
    Gets the individual words from the page's text content with a greater degree of control.
    List<String>
    getWordsOnPage(int page, Rectangle rectangle, String delimiters)
    Gets the individual words from the page's text content with a greater degree of control.
    List<String>
    getWordsOnPage(int page, String delimiters)
    Gets the individual words from the page's text content and returns them.
    static void
    main(String[] args)
    Allows you to extract the words from each page as a list, via the command line, from a single PDF file or a directory of PDF files.
    boolean
    openPDFFile()
    Opens the PDF file so that its content can be accessed.
    void
    setPassword(String password)
    Sets the user or owner password for the PDF file.
    static int
    writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages)
    Convenience method to write all the Wordlists in a directory of PDF files
    static int
    writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages)
    Convenience method to write all the Wordlists in a directory of PDF files

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • ExtractTextAsWordlist

      public ExtractTextAsWordlist(String fileName)
      Sets up an ExtractTextAsWordlist instance to open a PDF File
      Parameters:
      fileName - full path to a single PDF file
    • ExtractTextAsWordlist

      public ExtractTextAsWordlist(byte[] byteArray)
      Sets up an ExtractTextAsWordlist instance to open a PDF file contained as a BLOB within a byte[] stream
      Parameters:
      byteArray - pdf file data
  • Method Details

    • getWordsOnPage

      public List<String> getWordsOnPage(int page) throws PdfException
      Gets the individual words from the page's text content and returns them. Uses a default set of delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, String delimiters) throws PdfException
      Gets the individual words from the page's text content and returns them. Uses the provided delimiters to determine word bounds.
      Parameters:
      page - The page to get text content from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
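To illustrate what "a String of characters to be used as delimiters" means, the sketch below splits plain text using the standard `java.util.StringTokenizer`, where every character in the delimiter string acts as a separator. It only illustrates the idea of a delimiter character set. It is not JPedal's implementation, and `DelimiterDemo` and `splitWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class DelimiterDemo {

    // Splits text into words, treating each character of 'delimiters'
    // as a word boundary. Empty tokens are skipped by StringTokenizer.
    static List<String> splitWords(String text, String delimiters) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }
}
```

In this sketch, adding `-` to the delimiter string splits a hyphenated term such as `e-mail` into two words, while leaving it out keeps the term as one word.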
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters) throws PdfException
      Gets the individual words from the page's text content with a greater degree of control: extraction is limited to the given area and uses the provided delimiters.
      Parameters:
      page - The page to get text content from.
      x1 - The leftmost point to extract from.
      y1 - The topmost point to extract from.
      x2 - The rightmost point to extract from.
      y2 - The bottommost point to extract from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
    • getWordsOnPage

      public List<String> getWordsOnPage(int page, Rectangle rectangle, String delimiters) throws PdfException
      Gets the individual words from the page's text content with a greater degree of control: extraction is limited to the given rectangle and uses the provided delimiters.
      Parameters:
      page - The page to get text content from.
      rectangle - Rectangle area on the page to extract words from.
      delimiters - A String of characters to be used as delimiters for words.
      Returns:
      List object containing all words found on the page.
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
    • main

      public static void main(String[] args)
      Allows you to extract the words from each page as a list, via the command line, from a single PDF file or a directory of PDF files.
      The example expects two arguments:
      • Value 1 is the file name or directory of PDF files to process
      • Value 2 is the directory to write the wordlist data to
      Parameters:
      args - The expected arguments are described above.
    • writeAllWordlistsToDir

      public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the Wordlists in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      password - user or owner password for pdf file
      outputDir - directory for writing out wordlists
      maxPages - limit to just the first maxPages of a document
      Returns:
      count of words extracted in total
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
    • writeAllWordlistsToDir

      public static int writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages) throws PdfException
      Convenience method to write all the Wordlists in a directory of PDF files
      Parameters:
      inputDir - directory containing PDF files
      outputDir - directory for writing out wordlists
      maxPages - limit to just the first maxPages of a document
      Returns:
      count of words extracted in total
      Throws:
      PdfException - if problem with parsing and extracting text from PDF file
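If per-file control is needed instead of the static convenience method, the PDF files in a directory can be enumerated with the standard `java.nio.file` API and processed one at a time. The sketch below is an assumption-laden illustration and does not show how writeAllWordlistsToDir works internally. `PdfDirectoryListing`, `listPdfFiles`, and `lastPageToProcess` are hypothetical names, and how the library treats unusual maxPages values is not documented here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PdfDirectoryListing {

    // Returns the regular files directly inside inputDir whose names end in
    // ".pdf" (case-insensitive), sorted by path. Subdirectories are not visited.
    static List<Path> listPdfFiles(Path inputDir) throws IOException {
        try (Stream<Path> entries = Files.list(inputDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Last page to process when a document has pageCount pages and the caller
    // wants at most maxPages of them (assumes maxPages is positive).
    static int lastPageToProcess(int pageCount, int maxPages) {
        return Math.min(pageCount, maxPages);
    }
}
```

Each returned path could then be passed, as a String, to the ExtractTextAsWordlist(String fileName) constructor.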
    • setPassword

      public void setPassword(String password)
      Sets the password for the PDF file.
      Parameters:
      password - the USER or OWNER password for the PDF file
    • getPageCount

      public int getPageCount()
      Number of pages in the PDF file (pages are numbered from 1).
      Returns:
      page count
    • openPDFFile

      public boolean openPDFFile() throws PdfException
      Opens the PDF file so that its content can be accessed.
      Returns:
      true if successful
      Throws:
      PdfException - if problem with opening PDF file
    • closePDFfile

      public void closePDFfile()
      Ensures the PDF file is closed once it is no longer needed and all resources are released.