Package org.jpedal.examples.text
Class ExtractStructuredText
java.lang.Object
org.jpedal.examples.BaseExample
org.jpedal.examples.text.ExtractStructuredText
public class ExtractStructuredText
extends org.jpedal.examples.BaseExample
Extract Structured Content (if present) from PDF files
This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides a static convenience method if you just want to dump any structured outlines from a PDF file or a directory containing PDF files.
If no Structure is present, a blank file is returned.
For non-structured files, consider:
See our Support Pages for more information on Text Extraction.
Constructor Summary
Constructor / Description
ExtractStructuredText(byte[] byteArray)
- Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
ExtractStructuredText(String fileName)
- Sets up an ExtractStructuredText instance to open a PDF File
ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Sets up an ExtractStructuredText instance to open a PDF File
Method Summary
Modifier and Type / Method / Description
void decodeFile(String file_name)
- Routine to decode a file
int getPageCount()
- Number of pages in PDF file (starting at 1)
Document getStructuredTextContent()
- Gets the Document containing any Structured text (if present) as a Document structure. If the Document does not contain the metadata for Structured Content, an empty Document is returned
Document getStructuredTextContentAndFigures(String figureDir, String imageFormat)
- Gets the marked content from the Document and also writes out the figures to a supplied directory
Document[] getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat)
Document[] getStructuredTextContentPerPage()
static void main(String[] args)
- This class will allow you to extract any Structured Text data via command line from a single PDF file or a directory of PDF files.
void setPassword(String password)
static void writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker)
- Convenience method to write any Structured text in a directory of PDF files
static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
- Convenience method to write any Structured text in a directory of PDF files
Methods inherited from class org.jpedal.examples.BaseExample
closePDFfile, openPDFFile
-
Constructor Details
-
ExtractStructuredText
public ExtractStructuredText(String fileName)
Sets up an ExtractStructuredText instance to open a PDF file
- Parameters:
fileName
- full path to a single PDF file
-
ExtractStructuredText
public ExtractStructuredText(byte[] byteArray)
Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
- Parameters:
byteArray
- Array which holds the BLOB
-
ExtractStructuredText
public ExtractStructuredText(String fileName, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
Sets up an ExtractStructuredText instance to open a PDF file
- Parameters:
fileName
- full path to a single PDF file
properties
- ExtractStructuredTextProperties object for configuring extraction
-
ExtractStructuredText
public ExtractStructuredText(byte[] byteArray, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties)
Sets up an ExtractStructuredText instance to open a PDF file contained as a BLOB within a byte[] stream
- Parameters:
byteArray
- Array which holds the BLOB
properties
- ExtractStructuredTextProperties object for configuring extraction
-
-
Method Details
-
decodeFile
public void decodeFile(String file_name) throws PdfException
Routine to decode a file
- Throws:
PdfException
-
main
public static void main(String[] args)
This class allows you to extract any Structured Text data via the command line from a single PDF file or a directory of PDF files.
The example expects the following parameters:
- Value 1 is the file name or directory of PDF files to process
- Value 2 is the directory to write out the outline data
- (Optional, but required if Value 4 is present) Value 3 is the outline data file format
- Value 4 is the directory to write out the figures data
- (Optional) Value 5 is the figures output format
- Parameters:
args
- The expected arguments are described above.
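Because the values are positional, the figures directory can only be supplied as the fourth value, which is why Value 3 must be present whenever Value 4 is. A minimal sketch of that mapping follows. The class ExtractArgs, its field names, and the sample values (including "xml" as a format) are hypothetical and are not part of the JPedal API; this is not how main itself is implemented.

```java
/** Hypothetical holder showing how the five positional values line up. */
final class ExtractArgs {
    final String input;          // Value 1: PDF file or directory of PDF files
    final String outlineDir;     // Value 2: directory for the outline data
    final String outlineFormat;  // Value 3: optional, null if not supplied
    final String figuresDir;     // Value 4: optional, null if not supplied
    final String figuresFormat;  // Value 5: optional, null if not supplied

    private ExtractArgs(String[] a) {
        input = a[0];
        outlineDir = a[1];
        outlineFormat = a.length > 2 ? a[2] : null;
        figuresDir = a.length > 3 ? a[3] : null;
        figuresFormat = a.length > 4 ? a[4] : null;
    }

    /** Accepts between two and five values, in the documented order. */
    static ExtractArgs parse(String... args) {
        if (args.length < 2 || args.length > 5) {
            throw new IllegalArgumentException(
                "Expected 2 to 5 values but got " + args.length);
        }
        return new ExtractArgs(args);
    }

    public static void main(String[] args) {
        // Placeholder values only; valid format names are not listed on this page.
        ExtractArgs parsed = parse("/in/pdfs", "/out/outlines", "xml", "/out/figures");
        System.out.println("figures go to " + parsed.figuresDir);
    }
}
```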
-
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir) throws PdfException
Convenience method to write out any Structured text found in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF file
outputDir
- directory for writing out structured text
- Throws:
PdfException
- a PDF exception
-
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker) throws PdfException
Convenience method to write out any Structured text found in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF file
outputDir
- directory for writing out structured text
errorTracker
- a custom error tracker
- Throws:
PdfException
- a PDF exception
-
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties) throws PdfException
Convenience method to write out any Structured text found in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF file
outputDir
- directory for writing out structured text
errorTracker
- a custom error tracker
properties
- an ExtractStructuredTextProperties object for configuration
- Throws:
PdfException
- a PDF exception
-
writeAllStructuredTextOutlinesAndFiguresToDir
public static void writeAllStructuredTextOutlinesAndFiguresToDir(String inputDir, String password, String outputDir, ErrorTracker errorTracker, org.jpedal.examples.text.configuration.ExtractStructuredTextProperties properties, String figuresDir, String figuresFormat) throws PdfException
Convenience method to write out any Structured text, together with its figures, found in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
password
- user or owner password for PDF file
outputDir
- directory for writing out structured text
errorTracker
- a custom error tracker
properties
- an ExtractStructuredTextProperties object for configuration
figuresDir
- directory for writing out figures
figuresFormat
- image file format for writing figures
- Throws:
PdfException
- a PDF exception
-
writeAllStructuredTextOutlinesToDir
public static void writeAllStructuredTextOutlinesToDir(String inputDir, String outputDir) throws PdfException
Convenience method to write out any Structured text found in a directory of PDF files
- Parameters:
inputDir
- directory containing PDF files
outputDir
- directory for writing out structured text
- Throws:
PdfException
- a PDF exception
-
getStructuredTextContent
public Document getStructuredTextContent()
Gets the Document containing any Structured text (if present) as a Document structure.
If the Document does not contain the metadata for Structured Content, an empty Document is returned.
- Returns:
- Document
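Since an empty Document is returned when no structure is present, callers may want to test for that before writing anything out. The following is a minimal sketch under two assumptions that this page does not confirm: that the returned type is org.w3c.dom.Document, and that "empty" means no root element or a root with no children. The class StructuredTextCheck, its methods, and the sample element names are hypothetical. The stand-in documents are built with the JDK parser; in real use the Document would come from getStructuredTextContent().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class StructuredTextCheck {

    /** True if the document has no root element or the root has no child nodes. */
    static boolean isEmpty(Document doc) {
        Element root = doc.getDocumentElement();
        return root == null || !root.hasChildNodes();
    }

    /** Serializes a DOM Document to an XML string without the XML declaration. */
    static String toXml(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Stand-in for a Document returned by the library (illustrative element names). */
    static Document sample(boolean withContent) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        if (withContent) {
            Element root = doc.createElement("Document");
            Element paragraph = doc.createElement("P");
            paragraph.setTextContent("Hello");
            root.appendChild(paragraph);
            doc.appendChild(root);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = sample(true);
        if (isEmpty(doc)) {
            System.out.println("No structured content found");
        } else {
            System.out.println(toXml(doc));
        }
    }
}
```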
-
getStructuredTextContentPerPage
-
getStructuredTextContentAndFigures
public Document getStructuredTextContentAndFigures(String figureDir, String imageFormat) throws IOException
Gets the marked content from the Document and also writes out the figures to a supplied directory
- Parameters:
figureDir
- The directory to write the figure images
imageFormat
- The format in which to write the figure images
- Returns:
- The marked content document
- Throws:
IOException
- If there is a problem with writing the images
-
getStructuredTextContentAndFiguresPerPage
public Document[] getStructuredTextContentAndFiguresPerPage(String figureDir, String imageFormat) throws IOException
- Throws:
IOException
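The per-page methods return a Document[] rather than a single Document. A hedged sketch of one way to consume such an array is to write each non-empty entry to its own file. It assumes the type is org.w3c.dom.Document and that array index i corresponds to page i + 1; neither is stated on this page. The class PerPageWriter, the page-N.xml naming, and the sample documents are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class PerPageWriter {

    /** Writes each document that has a root element to outDir/page-N.xml. */
    static List<Path> writePages(Document[] pages, Path outDir)
            throws IOException, TransformerException {
        Files.createDirectories(outDir);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            Document page = pages[i];
            if (page == null || page.getDocumentElement() == null) {
                continue; // nothing to write for this page
            }
            Path target = outDir.resolve("page-" + (i + 1) + ".xml"); // assumed 1-based page number
            try (OutputStream out = Files.newOutputStream(target)) {
                transformer.transform(new DOMSource(page), new StreamResult(out));
            }
            written.add(target);
        }
        return written;
    }

    /** Stand-in page document; a null text yields a document with no root element. */
    static Document sample(String text) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        if (text != null) {
            Element root = doc.createElement("P");
            root.setTextContent(text);
            doc.appendChild(root);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document[] pages = { sample(null), sample("Second page text") };
        Path outDir = Files.createTempDirectory("structured-pages");
        for (Path p : writePages(pages, outDir)) {
            System.out.println("wrote " + p);
        }
    }
}
```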
-
setPassword
public void setPassword(String password)
- Parameters:
password
- the USER or OWNER password for the PDF file
-
getPageCount
public int getPageCount()
Number of pages in PDF file (starting at 1)
- Returns:
- page count
-