Extract words and locations from PDF files

This class provides a simple Java API to extract text from a PDF file as words, together with each word's location on the page. It also provides a static convenience method for dumping all the word lists from a single PDF file or from a directory containing PDF files.

See our Support Pages for more information on Text Extraction.
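
Because the words and their locations come back in a single List<String>, callers usually need to regroup the entries. The sketch below assumes each word occupies five consecutive entries (the word, then x1, y1, x2, y2 as numeric strings); this page does not state that layout, so verify it against real output first. `WordlistParser` and `WordBox` are hypothetical helper names, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

final class WordlistParser {

    /** One word with its bounding coordinates, in whatever units the list used. */
    static final class WordBox {
        final String text;
        final double x1, y1, x2, y2;

        WordBox(String text, double x1, double y1, double x2, double y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Regroups a flat list into WordBox objects.
     * ASSUMPTION: five entries per word (word, x1, y1, x2, y2).
     * Throws NumberFormatException if a coordinate entry is not numeric.
     */
    static List<WordBox> parse(List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException(
                    "Expected a multiple of 5 entries but got " + flat.size());
        }
        List<WordBox> boxes = new ArrayList<>(flat.size() / 5);
        for (int i = 0; i < flat.size(); i += 5) {
            boxes.add(new WordBox(
                    flat.get(i),
                    Double.parseDouble(flat.get(i + 1)),
                    Double.parseDouble(flat.get(i + 2)),
                    Double.parseDouble(flat.get(i + 3)),
                    Double.parseDouble(flat.get(i + 4))));
        }
        return boxes;
    }
}
```

The list returned by getWordsOnPage(page) would be passed straight to `WordlistParser.parse`; the size check fails fast if the five-entry assumption does not hold.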

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ExtractTextAsWordlist(byte[] byteArray) | Sets up an ExtractTextAsWordlist instance to open a PDF file supplied in memory as a byte[] (for example, a BLOB). |
| ExtractTextAsWordlist(String fileName) | Sets up an ExtractTextAsWordlist instance to open a PDF file. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | decodeFile(String file_name) | Routine to decode a file. |
| int | getPageCount() | Number of pages in the PDF file (pages are numbered from 1). |
| List<String> | getWordsOnPage(int page) | Gets the individual words from the page's text content and returns them. |
| List<String> | getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters) | Gets the individual words from the page's text content with a greater degree of control. |
| List<String> | getWordsOnPage(int page, Rectangle rectangle, String delimiters) | Gets the individual words from the page's text content with a greater degree of control. |
| List<String> | getWordsOnPage(int page, String delimiters) | Gets the individual words from the page's text content and returns them. |
| static void | main(String[] args) | Extracts the words from each page as a list, via the command line, from a single PDF file or a directory of PDF files. |
| void | setPassword(String password) | Sets the user or owner password for the PDF file. |
| static int | writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages) | Convenience method to write all the wordlists for a directory of PDF files. |
| static int | writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages) | Convenience method to write all the wordlists for a directory of PDF files. |
| static int | writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages, ErrorTracker errorTracker) | Convenience method to write all the wordlists for a directory of PDF files. |

Methods inherited from class org.jpedal.examples.BaseExample
closePDFfile, openPDFFile

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
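
A typical extraction visits every page between the inherited openPDFFile and closePDFfile calls. The sketch below shows only the page loop, written against a small hypothetical interface, `WordSource`, that mirrors getPageCount() and getWordsOnPage(int). The interface and `collectAllWords` are illustrative assumptions, not library types, and the loop assumes pages are numbered from 1 up to and including getPageCount().

```java
import java.util.ArrayList;
import java.util.List;

final class PageLoopSketch {

    /** Hypothetical stand-in mirroring two documented methods; not a library type. */
    interface WordSource {
        int getPageCount();

        List<String> getWordsOnPage(int page) throws Exception;
    }

    /** Collects the raw list entries of every page, assuming 1-based page numbers. */
    static List<String> collectAllWords(WordSource source) throws Exception {
        List<String> all = new ArrayList<>();
        int pageCount = source.getPageCount();
        for (int page = 1; page <= pageCount; page++) {
            List<String> words = source.getWordsOnPage(page);
            if (words != null) { // defensive: treat a missing list as "no words"
                all.addAll(words);
            }
        }
        return all;
    }
}
```

With the real class, the same loop would call getWordsOnPage(page) directly on an ExtractTextAsWordlist instance. This page lists openPDFFile and closePDFfile by name only, so check their exact signatures before relying on them.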

Constructor Details
- ExtractTextAsWordlist
  
  public ExtractTextAsWordlist(String fileName)
  
  Sets up an ExtractTextAsWordlist instance to open a PDF file
  
  Parameters:
  
  fileName - full path to a single PDF file
- ExtractTextAsWordlist
  
  public ExtractTextAsWordlist(byte[] byteArray)
  
  Sets up an ExtractTextAsWordlist instance to open a PDF file supplied in memory as a byte[] (for example, a BLOB)
  
  Parameters:
  
  byteArray - PDF file data
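
The byte[] constructor needs the whole PDF in memory. As a minimal sketch, the helper below reads a file into a byte array with the standard library and runs a cheap check that the data begins with the `%PDF-` header marker before it is handed over. `PdfBytesSketch`, `load` and `looksLikePdf` are hypothetical names, and the check is a convenience that says nothing about whether the library will accept the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfBytesSketch {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads the whole file into memory; suitable only for files that fit in the heap. */
    static byte[] load(Path pdfFile) throws IOException {
        return Files.readAllBytes(pdfFile);
    }

    /** Cheap sanity check: does the data start with the "%PDF-" header marker? */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An array that passes the check would then be passed to ExtractTextAsWordlist(byte[] byteArray).
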
Method Details
- decodeFile
  
  public void decodeFile(String file_name) throws PdfException
  
  Routine to decode a file.
  
  Throws:
  
  PdfException
- getWordsOnPage
  
  public List<String> getWordsOnPage(int page) throws PdfException
  
  Gets the individual words from the page's text content and returns them. Uses a default set of delimiters to determine word bounds.
  
  Parameters:
  
  page - The page to get text content from.
  
  Returns:
  
  List object containing all words found on the page.
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- getWordsOnPage
  
  public List<String> getWordsOnPage(int page, String delimiters) throws PdfException
  
  Gets the individual words from the page's text content and returns them. Uses the provided delimiters to determine word bounds.
  
  Parameters:
  
  page - The page to get text content from.
  
  delimiters - A String of characters to be used as delimiters for words.
  
  Returns:
  
  List object containing all words found on the page.
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- getWordsOnPage
  
  public List<String> getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters) throws PdfException
  
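  Gets the individual words from the page's text content with a greater degree of control: extraction is limited to the area given by the coordinates, and the provided delimiters determine word bounds.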
  
  Parameters:
  
  page - The page to get text content from.
  
  x1 - The leftmost point to extract from.
  
  y1 - The topmost point to extract from.
  
  x2 - The rightmost point to extract from.
  
  y2 - The bottommost point to extract from.
  
  delimiters - A String of characters to be used as delimiters for words.
  
  Returns:
  
  List object containing all words found on the page.
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- getWordsOnPage
  
  public List<String> getWordsOnPage(int page, Rectangle rectangle, String delimiters) throws PdfException
  
  Gets the individual words from the page's text content with a greater degree of control: extraction is limited to the given rectangle, and the provided delimiters determine word bounds.
  
  Parameters:
  
  page - The page to get text content from.
  
  rectangle - Rectangle area on the page to extract words from.
  
  delimiters - A String of characters to be used as delimiters for words.
  
  Returns:
  
  List object containing all words found on the page.
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- main
  
  public static void main(String[] args)
  
  Extracts the words from each page as a list, via the command line, from a single PDF file or a directory of PDF files.
  
  The example expects two arguments:
  
  Value 1 is the file name or directory of PDF files to process
  
  Value 2 is the directory to write the wordlist data to
  
  Parameters:
  
  args - The expected arguments are described above.
- writeAllWordlistsToDir
  
  public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
  
  Convenience method to write all the Wordlists in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out wordlists
  
  maxPages - limit extraction to the first maxPages pages of each document
  
  Returns:
  
  count of words extracted in total
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- writeAllWordlistsToDir
  
  public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages, ErrorTracker errorTracker) throws PdfException
  
  Convenience method to write all the Wordlists in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  password - user or owner password for the PDF file
  
  outputDir - directory for writing out wordlists
  
  maxPages - limit extraction to the first maxPages pages of each document
  
  errorTracker - a custom error tracker
  
  Returns:
  
  count of words extracted in total
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- writeAllWordlistsToDir
  
  public static int writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages) throws PdfException
  
  Convenience method to write all the Wordlists in a directory of PDF files
  
  Parameters:
  
  inputDir - directory containing PDF files
  
  outputDir - directory for writing out wordlists
  
  maxPages - limit extraction to the first maxPages pages of each document
  
  Returns:
  
  count of words extracted in total
  
  Throws:
  
  PdfException - if there is a problem parsing or extracting text from the PDF file
- setPassword
  
  public void setPassword(String password)
  
  Parameters:
  
  password - the USER or OWNER password for the PDF file
- getPageCount
  
  public int getPageCount()
  
  Number of pages in the PDF file (pages are numbered from 1)
  
  Returns:
  
  page count
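
For callers who want per-file control instead of the static writeAllWordlistsToDir helpers, the first step is to turn an input path that may be a single PDF file or a directory into a list of files to process. The sketch below does that with the standard library only. `PdfInputResolver` is a hypothetical helper; matching on the `.pdf` extension is an assumption and not necessarily how the library selects files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class PdfInputResolver {

    /**
     * Returns the PDF files named by the input path: the file itself if it is
     * a PDF, the PDFs directly inside it if it is a directory (no recursion),
     * or an empty list otherwise.
     */
    static List<Path> resolve(Path input) throws IOException {
        List<Path> result = new ArrayList<>();
        if (Files.isRegularFile(input)) {
            if (isPdf(input)) {
                result.add(input);
            }
        } else if (Files.isDirectory(input)) {
            // Files.list holds a directory handle, so close the stream promptly.
            try (Stream<Path> entries = Files.list(input)) {
                entries.filter(Files::isRegularFile)
                        .filter(PdfInputResolver::isPdf)
                        .sorted()
                        .forEach(result::add);
            }
        }
        return result;
    }

    /** ASSUMPTION: a file counts as a PDF if its name ends in ".pdf", ignoring case. */
    private static boolean isPdf(Path file) {
        return file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

Each resolved path could then be passed to ExtractTextAsWordlist(String fileName) as a full path string.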
