Package org.jpedal.examples.text
Class ExtractTextInRectangle
java.lang.Object
org.jpedal.examples.text.ExtractTextInRectangle
Extract text from PDF files
This class provides a simple Java API to extract text from a PDF file. It also provides a static convenience method for dumping all the text from a single PDF file or from a directory containing PDF files.
See our Support Pages for more information on Text Extraction.
-
Nested Class Summary
Modifier and Type | Class | Description
static enum | ExtractTextInRectangle.OUTPUT_FORMAT | The available formats that text can be output as
-
Constructor Summary
Constructor | Description
ExtractTextInRectangle(byte[] byteArray) | Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractTextInRectangle(String fileName) | Sets up an ExtractTextInRectangle instance to open a PDF File
ExtractTextInRectangle(String fileName, boolean extractPlainText) | Sets up an ExtractTextInRectangle instance to open a PDF File
-
Method Summary
Modifier and Type | Method | Description
void | closePDFfile() | ensure PDF file is closed once no longer needed and all resources released
int | getPageCount() | number of pages in PDF file (starting at 1)
String | getTextOnPage(int page) | extract all text on page as a string value.
String | getTextOnPage(int page, int x1, int y1, int x2, int y2) | extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored
String | getTextOnPage(int page, Rectangle rectangle) | extract all text on page in a specified region as a string value.
static void | main(String[] args) | This class will allow you to extract all text from page via command line from a single PDF file or a directory of PDF files.
boolean | openPDFFile() | routine to open the PDF File so we can access
void | setEstimateParagraphs(boolean estimateParagraphs) |
void | setOutputFormat |
void | setPassword(String password) |
static void | writeAllTextToDir(String inputDir, String outputDir, int maxPages) | Convenience method to write all the text in a directory of PDF files
static void | writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) | Convenience method to write all the text in a directory of PDF files
static void | writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) | Convenience method to write all the text in a directory of PDF files
static void | writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) | Convenience method to write all the text in a directory of PDF files
-
Constructor Details
-
ExtractTextInRectangle
public ExtractTextInRectangle(String fileName)
Sets up an ExtractTextInRectangle instance to open a PDF file
- Parameters:
fileName
- full path to a single PDF file
-
ExtractTextInRectangle
public ExtractTextInRectangle(String fileName, boolean extractPlainText)
Sets up an ExtractTextInRectangle instance to open a PDF file
- Parameters:
fileName
- full path to a single PDF file
extractPlainText
- if true, text is extracted as plain text rather than XML
-
ExtractTextInRectangle
public ExtractTextInRectangle(byte[] byteArray)
Sets up an ExtractTextInRectangle instance to open a PDF file held as a BLOB in a byte[] array
- Parameters:
byteArray
- PDF file data
-
-
Method Details
-
setOutputFormat
-
setEstimateParagraphs
public void setEstimateParagraphs(boolean estimateParagraphs)
Sets whether JPedal should estimate paragraph spacing in the output.
-
getTextOnPage
public String getTextOnPage(int page) throws PdfException
Extracts all text on the page as a String. If the page contains text with multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in other orientations is ignored.
- Parameters:
page
- page number (first page is 1)
- Returns:
- String with text
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
getTextOnPage
public String getTextOnPage(int page, Rectangle rectangle) throws PdfException
Extracts all text in a specified region of the page as a String. If the page contains text with multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in other orientations is ignored.
- Parameters:
page
- page number (first page is 1)
rectangle
- region of the page to extract text from
- Returns:
- String with text
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
-
getTextOnPage
public String getTextOnPage(int page, int x1, int y1, int x2, int y2) throws PdfException
Extracts all text in a specified region of the page as a String. If the page contains text with multiple orientations (for example left to right and bottom to top), only the most common orientation is extracted and text in other orientations is ignored.
- Parameters:
page
- page number (first page is 1)
x1
- top left corner x
y1
- top left corner y
x2
- bottom right corner x
y2
- bottom right corner y
- Returns:
- String with text
- Throws:
PdfException
- if there is a problem parsing and extracting text from the PDF file
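Because the region is given as a top left corner (x1, y1) and a bottom right corner (x2, y2), it can help to normalise two arbitrary opposite corners before calling the method. The sketch below is a hypothetical helper, not part of JPedal. It ASSUMES PDF-style coordinates in which y increases upwards, so the top edge has the larger y value; the source does not state the coordinate system, so check this against your own documents.

```java
/** Hypothetical helper, not part of JPedal: holds a region as top-left and bottom-right corners. */
final class RegionCorners {
    final int x1; // top left corner x
    final int y1; // top left corner y
    final int x2; // bottom right corner x
    final int y2; // bottom right corner y

    private RegionCorners(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /**
     * Orders any two opposite corners as top-left and bottom-right.
     * ASSUMPTION: y grows upwards, so "top" means the larger y value.
     */
    static RegionCorners fromOppositeCorners(int ax, int ay, int bx, int by) {
        return new RegionCorners(
                Math.min(ax, bx),  // left edge
                Math.max(ay, by),  // top edge (larger y under the stated assumption)
                Math.max(ax, bx),  // right edge
                Math.min(ay, by)); // bottom edge
    }
}
```

The resulting fields could then be passed in order as the x1, y1, x2, y2 arguments of getTextOnPage.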
-
main
public static void main(String[] args)
Extracts all text from a single PDF file, or from every PDF file in a directory, when run from the command line.
The example expects two arguments:
- Value 1 is the file name or directory of PDF files to process
- Value 2 is the directory to write the data out to
- Parameters:
args
- The expected arguments are described above.
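To illustrate the two-argument contract described above, here is a minimal sketch of how a caller might check the arguments and turn the first one into a list of PDF files. It is not JPedal's own implementation: the class name, the method name, and the rule of matching files by a ".pdf" extension are all assumptions made for this example.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of JPedal. */
class InputResolverSketch {

    /** Returns the PDF files named by args[0], which may be a single file or a directory. */
    static List<File> resolvePdfInputs(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                    "Expected 2 values: <PDF file or directory> <output directory>");
        }
        File input = new File(args[0]);
        List<File> pdfs = new ArrayList<>();
        if (input.isDirectory()) {
            File[] children = input.listFiles();
            if (children != null) {
                Arrays.sort(children); // keep the order stable between runs
                for (File child : children) {
                    if (child.isFile() && isPdf(child)) {
                        pdfs.add(child);
                    }
                }
            }
        } else if (input.isFile() && isPdf(input)) {
            pdfs.add(input);
        }
        return pdfs;
    }

    // ASSUMPTION: a file counts as a PDF if its name ends in ".pdf", ignoring case.
    private static boolean isPdf(File file) {
        return file.getName().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```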
-
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF files
outputDir
- directory to write the extracted text to
maxPages
- limit to just the first maxPages of a document
format
- set the output format for the text content (TXT or XML)
estimateParagraphs
- set if JPedal should estimate paragraph spacing in output.
- Throws:
PdfException
- if there is a problem parsing and extracting text from a PDF file
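The maxPages parameter limits processing to the first maxPages pages of each document. As a minimal sketch of that rule, the hypothetical helper below (not part of JPedal) computes the last page number to process for a document. It ASSUMES maxPages is positive; the source does not say how zero or negative values are treated, so the sketch rejects them.

```java
/** Hypothetical helper, not part of JPedal. */
class PageLimitSketch {

    /**
     * Returns the last page to process when only the first maxPages pages are wanted.
     * Pages are numbered from 1, so the result is also the number of pages processed.
     */
    static int lastPageToProcess(int pageCount, int maxPages) {
        if (maxPages <= 0) {
            // ASSUMPTION: non-positive limits are undefined here, so they are rejected.
            throw new IllegalArgumentException("maxPages must be positive");
        }
        return Math.min(pageCount, maxPages);
    }
}
```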
-
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF files
outputDir
- directory to write the extracted text to
maxPages
- limit to just the first maxPages of a document
format
- set the output format for the text content (TXT or XML)
estimateParagraphs
- set if JPedal should estimate paragraph spacing in output.
errorTracker
- a custom error tracker
- Throws:
PdfException
- if there is a problem parsing and extracting text from a PDF file
-
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF files
outputDir
- directory to write the extracted text to
maxPages
- limit to just the first maxPages of a document
- Throws:
PdfException
- if there is a problem parsing and extracting text from a PDF file
-
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
outputDir
- directory to write the extracted text to
maxPages
- limit to just the first maxPages of a document
- Throws:
PdfException
- if there is a problem parsing and extracting text from a PDF file
-
setPassword
public void setPassword(String password)
- Parameters:
password
- the USER or OWNER password for the PDF file
-
getPageCount
public int getPageCount()
Returns the number of pages in the PDF file. Page numbers start at 1, so valid page numbers run from 1 to this value.
- Returns:
- page count
-
openPDFFile
public boolean openPDFFile() throws PdfException
Opens the PDF file so that its content can be accessed.
- Returns:
- true if successful
- Throws:
PdfException
- if there is a problem opening the PDF file
-
closePDFfile
public void closePDFfile()
Closes the PDF file once it is no longer needed and releases all resources.
-
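Taken together, the instance methods suggest a simple lifecycle: open the file, read the page count, extract each page starting from page 1, and close the file when finished. The sketch below shows that order of calls. So that it compiles without the JPedal jar, it is written against PageTextSource, a hypothetical stand-in interface that declares only the four method names used here; with the real class the same calls would be made on an ExtractTextInRectangle instance, and PdfException would take the place of Exception. Calling closePDFfile() even when opening fails is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the four ExtractTextInRectangle methods used below; not part of JPedal. */
interface PageTextSource {
    boolean openPDFFile() throws Exception;
    int getPageCount();
    String getTextOnPage(int page) throws Exception;
    void closePDFfile();
}

class ExtractAllPagesSketch {

    /** Opens the document, collects the text of every page in order, and always closes it. */
    static List<String> readAllPages(PageTextSource source) throws Exception {
        List<String> pages = new ArrayList<>();
        try {
            if (source.openPDFFile()) {
                int pageCount = source.getPageCount();
                // Pages are numbered from 1, so the loop runs from 1 to pageCount inclusive.
                for (int page = 1; page <= pageCount; page++) {
                    pages.add(source.getTextOnPage(page));
                }
            }
        } finally {
            // ASSUMPTION: closing is safe even if the file could not be opened.
            source.closePDFfile();
        }
        return pages;
    }
}
```

The try/finally block is a design choice of the sketch: it keeps the close call on every path, including when extraction throws part-way through a document.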