Package org.jpedal.examples.text
Class ExtractTextAsWordlist
java.lang.Object
org.jpedal.examples.text.ExtractTextAsWordlist
Extract words and locations from PDF files
This class provides a simple Java API to extract text from a PDF file as words, together with their location on the page. It also provides a static convenience method if you just want to dump all the word lists from a PDF file or a directory containing PDF files.
See our Support Pages for more information on Text Extraction.
-
Constructor Summary
ExtractTextAsWordlist(byte[] byteArray)
- Sets up an ExtractTextAsWordlist instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractTextAsWordlist(String fileName)
- Sets up an ExtractTextAsWordlist instance to open a PDF file
-
Method Summary
void closePDFfile()
- Ensures the PDF file is closed once it is no longer needed and all resources are released.
int getPageCount()
- Number of pages in the PDF file (pages are numbered starting at 1).
List<String> getWordsOnPage(int page)
- Gets the individual words from the page's text content and returns them.
List<String> getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters)
- Gets the individual words from the page's text content with a greater degree of control.
List<String> getWordsOnPage(int page, Rectangle rectangle, String delimiters)
- Gets the individual words from the page's text content with a greater degree of control.
List<String> getWordsOnPage(int page, String delimiters)
- Gets the individual words from the page's text content and returns them.
static void main(String[] args)
- Extracts the words on each page as a list, via the command line, from a single PDF file or a directory of PDF files.
boolean openPDFFile()
- Opens the PDF file so that its content can be accessed.
void setPassword(String password)
static int writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages)
- Convenience method to write all the wordlists for a directory of PDF files.
static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages)
- Convenience method to write all the wordlists for a directory of PDF files.
static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages, ErrorTracker errorTracker)
- Convenience method to write all the wordlists for a directory of PDF files.
-
Constructor Details
-
ExtractTextAsWordlist
public ExtractTextAsWordlist(String fileName)
Sets up an ExtractTextAsWordlist instance to open a PDF file
- Parameters:
fileName
- full path to a single PDF file
-
ExtractTextAsWordlist
public ExtractTextAsWordlist(byte[] byteArray)
Sets up an ExtractTextAsWordlist instance to open a PDF file contained as a BLOB within a byte[] stream
- Parameters:
byteArray
- PDF file data
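When the PDF comes from disk rather than a database BLOB, it first has to be read into a byte[] before it can be passed to this constructor. The following is a minimal sketch of that step, not part of JPedal: the class name PdfBytes and its methods load and looksLikePdf are hypothetical helpers. The header check relies only on the fact that a PDF file starts with the marker "%PDF-".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; it is not part of JPedal.
final class PdfBytes {

    // Every PDF file is expected to begin with this ASCII marker.
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytes() { }

    /** Reads the whole file into memory, so it only suits files that fit in the heap. */
    static byte[] load(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    /** Cheap sanity check: true if the data begins with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The array returned by PdfBytes.load(...) could then be handed to new ExtractTextAsWordlist(byteArray). The check is deliberately strict: files with stray bytes before the marker are rejected here even though some tolerant readers accept them.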
-
-
Method Details
-
getWordsOnPage
public List<String> getWordsOnPage(int page) throws PdfException
Gets the individual words from the page's text content and returns them. Uses a default set of delimiters to determine word bounds.
- Parameters:
page
- The page to get text content from.
- Returns:
- List object containing all words found on the page.
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
getWordsOnPage
public List<String> getWordsOnPage(int page, String delimiters) throws PdfException
Gets the individual words from the page's text content and returns them. Uses the provided delimiters to determine word bounds.
- Parameters:
page
- The page to get text content from.
delimiters
- A String of characters to be used as delimiters for words.
- Returns:
- List object containing all words found on the page.
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
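The delimiters argument is a single String in which each character acts as a word separator. The sketch below illustrates that convention with the standard java.util.StringTokenizer, which treats its delimiter string the same way. It is only an illustration of the idea: how JPedal applies the delimiters internally is not described here and may differ, and the class name DelimiterDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; it shows the "string of delimiter characters"
// convention using the JDK only and does not call JPedal.
final class DelimiterDemo {

    private DelimiterDemo() { }

    /** Splits text into words, treating every character of delimiters as a separator. */
    static List<String> split(String text, String delimiters) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, delimiters);
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }
}
```

With the delimiter string ",; " the text "alpha,beta;gamma delta" yields four words, whereas with " " alone it yields two, because the comma and semicolon then stay inside the first word.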
-
getWordsOnPage
public List<String> getWordsOnPage(int page, int x1, int y1, int x2, int y2, String delimiters) throws PdfException
Gets the individual words from the page's text content with a greater degree of control.
- Parameters:
page
- The page to get text content from.
x1
- The left-most point to extract from.
y1
- The top-most point to extract from.
x2
- The right-most point to extract from.
y2
- The bottom-most point to extract from.
delimiters
- A String of characters to be used as delimiters for words.
- Returns:
- List object containing all words found in the given area of the page.
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
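The four coordinates must arrive in a fixed order: left, top, right, bottom. When the corners come from user input, such as a mouse drag, they may be in any order, so a small normalisation step helps. The sketch below is an illustration under one stated assumption: coordinates are in default PDF user space, where y increases upward, so the top edge has the larger y value. The class ExtractionArea and its method fromCorners are hypothetical and not part of JPedal.

```java
// Hypothetical value class for illustration; it is not part of JPedal.
final class ExtractionArea {

    final int x1; // left-most x
    final int y1; // top-most y
    final int x2; // right-most x
    final int y2; // bottom-most y

    private ExtractionArea(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /**
     * Orders two opposite corners as left, top, right, bottom.
     * Assumption: y grows upward, so the top edge is the larger y value.
     */
    static ExtractionArea fromCorners(int ax, int ay, int bx, int by) {
        return new ExtractionArea(
                Math.min(ax, bx),   // left
                Math.max(ay, by),   // top
                Math.max(ax, bx),   // right
                Math.min(ay, by));  // bottom
    }
}
```

The resulting fields can be passed in order as x1, y1, x2, y2. If the coordinates in use have y increasing downward instead, the max and min for the two y values would need to be swapped.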
-
getWordsOnPage
public List<String> getWordsOnPage(int page, Rectangle rectangle, String delimiters) throws PdfException
Gets the individual words from the page's text content with a greater degree of control.
- Parameters:
page
- The page to get text content from.
rectangle
- Rectangle area on the page to extract words from.
delimiters
- A String of characters to be used as delimiters for words.
- Returns:
- List object containing all words found in the given area of the page.
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
main
public static void main(String[] args)
Extracts the words on each page as a list, via the command line, from a single PDF file or a directory of PDF files.
The example expects two arguments:
- Value 1 is the file name or directory of PDF files to process
- Value 2 is the directory to write the wordlist data to
- Parameters:
args
- The expected arguments are described above.
-
writeAllWordlistsToDir
public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
Convenience method to write all the wordlists for a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for the PDF file
outputDir
- directory for writing out wordlists
maxPages
- limit to just the first maxPages of a document
- Returns:
- count of words extracted in total
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
writeAllWordlistsToDir
public static int writeAllWordlistsToDir(String inputDir, String password, String outputDir, int maxPages, ErrorTracker errorTracker) throws PdfException
Convenience method to write all the wordlists for a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for the PDF file
outputDir
- directory for writing out wordlists
maxPages
- limit to just the first maxPages of a document
errorTracker
- a custom error tracker
- Returns:
- count of words extracted in total
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
writeAllWordlistsToDir
public static int writeAllWordlistsToDir(String inputDir, String outputDir, int maxPages) throws PdfException
Convenience method to write all the wordlists for a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
outputDir
- directory for writing out wordlists
maxPages
- limit to just the first maxPages of a document
- Returns:
- count of words extracted in total
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
setPassword
public void setPassword(String password)
- Parameters:
password
- the USER or OWNER password for the PDF file
-
getPageCount
public int getPageCount()
Number of pages in the PDF file (pages are numbered starting at 1)
- Returns:
- page count
-
openPDFFile
public boolean openPDFFile() throws PdfException
Opens the PDF file so that its content can be accessed
- Returns:
- true if successful
- Throws:
PdfException
- if there is a problem opening the PDF file
-
closePDFfile
public void closePDFfile()
Ensures the PDF file is closed once it is no longer needed and all resources are released
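Because the file should always be closed once it is no longer needed, the open, extract and close calls fit naturally into a try/finally block. The sketch below shows that lifecycle. So that it compiles without JPedal on the classpath, it is written against WordSource, a hypothetical stand-in interface that only mirrors the method names documented on this page. With JPedal available, an ExtractTextAsWordlist instance would be used directly and the checked exception would be PdfException. The sketch also assumes that closePDFfile() is safe to call when opening failed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the documented method names; not a JPedal type.
interface WordSource {
    boolean openPDFFile() throws Exception;
    int getPageCount();
    List<String> getWordsOnPage(int page) throws Exception;
    void closePDFfile();
}

// Hypothetical helper showing the open / extract / close lifecycle.
final class WordlistWorkflow {

    private WordlistWorkflow() { }

    /** Collects the words from the first maxPages pages, always closing the source. */
    static List<String> collect(WordSource source, int maxPages) throws Exception {
        List<String> all = new ArrayList<>();
        try {
            if (!source.openPDFFile()) {
                return all; // file could not be opened, nothing to extract
            }
            // Pages are numbered from 1, so the loop runs 1..last inclusive.
            int last = Math.min(source.getPageCount(), maxPages);
            for (int page = 1; page <= last; page++) {
                all.addAll(source.getWordsOnPage(page));
            }
        } finally {
            // Assumption: closing is harmless even if opening failed.
            source.closePDFfile();
        }
        return all;
    }
}
```

The finally block runs whether extraction succeeds, returns early, or throws, so the close call is never skipped.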
-