Package org.jpedal.examples.text
Class ExtractStructuredText
java.lang.Object
org.jpedal.examples.text.ExtractStructuredText
Extract Structured Content (if present) from PDF files
This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides a static convenience method for dumping any structured outlines from a single PDF file or from a directory containing PDF files.
If no structure is present, a blank file is returned.
For non-structured files, consider:
See our Support Pages for more information on Text Extraction.
Constructor Summary
ExtractStructuredText(byte[] byteArray)
- Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractStructuredText(String fileName)
- Sets up an ExtractStructuredText instance to open a PDF file
ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Sets up an ExtractStructuredText instance to open a PDF file
Method Summary
void closePDFfile()
- Ensures the PDF file is closed once no longer needed and all resources are released
int getPageCount()
- Number of pages in the PDF file (starting at 1)
Document getStructuredTextContent()
- Gets any Structured text (if present) as a Document structure. If the PDF does not contain the metadata for Structured Content, an empty Document is returned
static void main(String[] args)
- Extracts any Structured Text data via the command line from a single PDF file or a directory of PDF files
boolean openPDFFile()
- Opens the PDF file so that its content can be accessed
void setPassword(String password)
static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Convenience method to write any Structured text in a directory of PDF files
Constructor Details
ExtractStructuredText
public ExtractStructuredText(String fileName)
Sets up an ExtractStructuredText instance to open a PDF file
Parameters:
fileName
- full path to a single PDF file
ExtractStructuredText
public ExtractStructuredText(byte[] byteArray)
Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
Parameters:
byteArray
- byte array holding the PDF file as a BLOB
ExtractStructuredText
public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
Sets up an ExtractStructuredText instance to open a PDF file
Parameters:
fileName
- full path to a single PDF file
properties
- ExtractStructuredTextProperties object for configuring extraction
ExtractStructuredText
public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
Parameters:
byteArray
- byte array holding the PDF file as a BLOB
properties
- ExtractStructuredTextProperties object for configuring extraction
Method Details
main
public static void main(String[] args)
Extracts any Structured Text data via the command line from a single PDF file or a directory of PDF files.
The example expects two or three parameters:
- Value 1 is the file name or directory of PDF files to process
- Value 2 is the directory to write out the outline data
Parameters:
args
- The expected arguments are described above.
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
Convenience method to write any Structured text in a directory of PDF files
Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for the PDF files
outputDir
- directory to write out the outline data
Throws:
PdfException
- a PDF exception
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
Convenience method to write any Structured text in a directory of PDF files
Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for the PDF files
outputDir
- directory to write out the outline data
errorTracker
- a custom error tracker
Throws:
PdfException
- a PDF exception
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
Convenience method to write any Structured text in a directory of PDF files
Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for the PDF files
outputDir
- directory to write out the outline data
errorTracker
- a custom error tracker
properties
- an ExtractStructuredTextProperties object for configuration
Throws:
PdfException
- a PDF exception
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
Convenience method to write any Structured text in a directory of PDF files
Parameters:
inputDir
- directory containing PDF files
outputDir
- directory to write out the outline data
Throws:
PdfException
- a PDF exception
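Before calling one of the writeAllStructuredTextOutlinesToDir overloads, it can help to check the two directory arguments yourself. The following is a minimal sketch that uses only the JDK; the class name BatchInputCheck and both helper methods are hypothetical and are not part of JPedal, and the sketch does not claim to reproduce how the library scans the input directory internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of JPedal.
class BatchInputCheck {

    /** Lists the regular files directly inside inputDir whose names end in ".pdf" (case-insensitive). */
    static List<Path> listPdfFiles(Path inputDir) {
        if (!Files.isDirectory(inputDir)) {
            throw new IllegalArgumentException("Not a directory: " + inputDir);
        }
        try (Stream<Path> entries = Files.list(inputDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates outputDir (and any missing parents) if it does not exist yet. */
    static Path ensureOutputDir(Path outputDir) {
        try {
            return Files.createDirectories(outputDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: BatchInputCheck <inputDir> <outputDir>");
            return;
        }
        Path inputDir = Paths.get(args[0]);
        Path outputDir = ensureOutputDir(Paths.get(args[1]));
        List<Path> pdfs = listPdfFiles(inputDir);
        System.out.println(pdfs.size() + " PDF file(s) in " + inputDir + ", output to " + outputDir);
        // With JPedal on the classpath, the documented batch call would follow, e.g.:
        // ExtractStructuredText.writeAllStructuredTextOutlinesToDir(inputDir.toString(), outputDir.toString());
    }
}
```

Checking the inputs first means an empty or mistyped directory is reported before any extraction work starts.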
getStructuredTextContent
Gets any Structured text (if present) as a Document structure.
If the PDF does not contain the metadata for Structured Content, an empty Document is returned.
Returns:
- Document
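Because an untagged PDF yields an empty Document rather than an error, callers usually need to test the result before using it. The sketch below assumes the returned type is org.w3c.dom.Document, which this page does not state explicitly, and it treats a document with no root element, or a root with no children, as empty; that definition is also an assumption. The class StructuredDocumentCheck and the sample element names are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of JPedal.
class StructuredDocumentCheck {

    /** Assumed definition of "empty": no root element, or a root element without child nodes. */
    static boolean hasNoContent(Document doc) {
        Element root = doc.getDocumentElement();
        return root == null || !root.hasChildNodes();
    }

    /** Serialises a DOM Document to an XML string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates a fresh DOM Document; used here only to build sample data. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample data standing in for the result of getStructuredTextContent().
        Document empty = newDocument();
        Document tagged = newDocument();
        Element root = tagged.createElement("Document");
        Element paragraph = tagged.createElement("P");
        paragraph.setTextContent("Hello");
        root.appendChild(paragraph);
        tagged.appendChild(root);

        System.out.println("empty has no content: " + hasNoContent(empty));
        System.out.println("tagged has no content: " + hasNoContent(tagged));
        System.out.println(toXml(tagged));
    }
}
```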
setPassword
public void setPassword(String password)
Parameters:
password
- the USER or OWNER password for the PDF file
getPageCount
public int getPageCount()
Number of pages in the PDF file (starting at 1)
Returns:
- page count
openPDFFile
public boolean openPDFFile() throws PdfException
Opens the PDF file so that its content can be accessed
Returns:
- true if successful
Throws:
PdfException
- if there is a problem opening the PDF file
closePDFfile
public void closePDFfile()
Ensures the PDF file is closed once it is no longer needed and all resources are released
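The open, use, close sequence implied by openPDFFile, getPageCount and closePDFfile can be sketched as a small wrapper that always closes the file, even when opening fails or the work throws. To keep the example runnable without the library, it is written against a hypothetical PdfSource interface that mirrors those three documented methods; neither the interface nor the withOpenPdf helper is part of JPedal, and a real program would delegate to an ExtractStructuredText instance instead.

```java
import java.util.function.Function;

// Hypothetical helper class, not part of JPedal.
class StructuredTextLifecycle {

    /** Hypothetical stand-in mirroring three instance methods documented on this page. */
    interface PdfSource {
        boolean openPDFFile() throws Exception;
        int getPageCount();
        void closePDFfile();
    }

    /** Opens the source, applies the given work, and always calls closePDFfile() afterwards. */
    static <T> T withOpenPdf(PdfSource source, Function<PdfSource, T> work) {
        try {
            if (!source.openPDFFile()) {
                throw new IllegalStateException("PDF could not be opened");
            }
            return work.apply(source);
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to open PDF", e);
        } finally {
            source.closePDFfile();
        }
    }

    public static void main(String[] args) {
        // In-memory fake so the sketch runs without the library on the classpath.
        // A real adapter would wrap: new ExtractStructuredText("/path/to/file.pdf")
        PdfSource fake = new PdfSource() {
            @Override public boolean openPDFFile() { return true; }
            @Override public int getPageCount() { return 3; }
            @Override public void closePDFfile() { System.out.println("closed"); }
        };
        int pages = withOpenPdf(fake, PdfSource::getPageCount);
        System.out.println("pages=" + pages);
    }
}
```

Putting closePDFfile() in a finally block keeps resource release in one place, so callers only write the extraction logic.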