Class ExtractStructuredText


  • public class ExtractStructuredText
    extends Object

    Extract Structured Content (if present) from PDF files


    This class provides a simple Java API to extract Structured Content (if present) from a PDF file. It also provides a static convenience method for dumping any structured outlines from a single PDF file or a directory containing PDF files.
    If no Structure is present, a blank file is returned.

    Example 1 - access API methods

    ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf");
    //extract.setPassword("password");
    if (extract.openPDFFile()) {
        Document anyStructuredText = extract.getStructuredTextContent();
    }

    extract.closePDFfile();

    Example 2 - convenience static method

    Extract any Structured text from a file or set of files and write out the results as XML in a txt file (separate directory for each PDF file).

    ExtractStructuredText.writeAllStructuredTextOutlinesToDir("pdfs", "output");

    Example 3 - Access directly from the Jar

    ExtractStructuredText can be run directly from the jar on the command line. It extracts any data from a PDF file or directory to a defined output directory:
    java -cp libraries_needed org/jpedal/examples/text/ExtractStructuredText input_pdf output_dir
    • input_pdf: The PDF filename (including the path if needed) or a directory containing PDF files. If it contains spaces, it must be enclosed in double quotes (e.g. "C:/Path with spaces/").
    • output_dir: The directory to write out outline data extracted from the PDF file or files. If it contains spaces, it must be enclosed in double quotes (e.g. "C:/Path with spaces/").
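
    When the command is launched from Java rather than typed into a shell, the quoting rule above can be sidestepped by passing each argument as its own list element. The following is a minimal sketch under stated assumptions: the class name ExtractCommand and the classpath value "jpedal.jar" are placeholders invented for illustration (substitute whatever libraries_needed resolves to in your setup).

```java
import java.util.List;

final class ExtractCommand {

    /** Builds the argument list; each element reaches the JVM as one argument, spaces included. */
    static List<String> build(String classpath, String inputPdf, String outputDir) {
        return List.of(
                "java", "-cp", classpath,
                "org/jpedal/examples/text/ExtractStructuredText", // main class as written in the command above
                inputPdf, outputDir);
    }

    public static void main(String[] args) throws Exception {
        // "jpedal.jar" is a placeholder for libraries_needed, not a confirmed file name
        List<String> command = build("jpedal.jar", "C:/Path with spaces/", "output");

        // inheritIO() forwards the child process output to this console
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.out.println("Exit code: " + process.waitFor());
    }
}
```

    Because ProcessBuilder does not pass the list through a shell, a path containing spaces stays a single argument without extra double quotes.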

    For non-structured files, consider:
    • http://files.idrsolutions.com/samplecode/org/jpedal/examples/text/ExtractTextAsWordlist.java.html
    • http://files.idrsolutions.com/samplecode/org/jpedal/examples/text/ExtractTextInRectangle.java.html

    See our Support Pages for more information on Text Extraction
    • Constructor Detail

      • ExtractStructuredText

        public ExtractStructuredText(String fileName)
        Sets up an ExtractStructuredText instance to open a PDF File
        Parameters:
        fileName - full path to a single PDF file
      • ExtractStructuredText

        public ExtractStructuredText(byte[] byteArray)
        Sets up an ExtractStructuredText instance to open a PDF file held as a BLOB in a byte[] array
        Parameters:
        byteArray - array holding the PDF file data (BLOB)
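
        A common way to obtain such a byte[] is to read the PDF from disk (or from a database BLOB column) before constructing the instance. The sketch below is illustrative only: the helper class PdfBytes and its methods are hypothetical, and the header check is a simple sanity test based on the fact that PDF files begin with "%PDF-"; it is not part of this API. The constructor call is left as a comment so the block compiles without the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfBytes {

    /** Reads the whole file into memory; suitable only for files that fit in the heap. */
    static byte[] read(Path pdf) {
        try {
            return Files.readAllBytes(pdf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Strict sanity check: true if the data starts with the "%PDF-" header. */
    static boolean looksLikePdf(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = read(Path.of("C:/pdfs/mypdf.pdf"));
        if (looksLikePdf(data)) {
            // With the library on the classpath, the bytes would be handed over like this:
            // ExtractStructuredText extract = new ExtractStructuredText(data);
            System.out.println("Read " + data.length + " bytes of PDF data");
        }
    }
}
```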
    • Method Detail

      • main

        public static void main(String[] args)
        This method allows you to extract any Structured Text data via the command line from a single PDF file or a directory of PDF files.
        The example expects two parameters:
        • Value 1 is the file name or directory of PDF files to process
        • Value 2 is the directory to write out the outline data
        Parameters:
        args - The expected arguments are described above.
      • writeAllStructuredTextOutlinesToDir

        public static void writeAllStructuredTextOutlinesToDir(String inputDir,
                                                               String outputDir)
                                                        throws PdfException
        Convenience method to write out any Structured text found in a directory of PDF files
        Parameters:
        inputDir - directory containing PDF files
        outputDir - directory for writing out the extracted structured text
        Throws:
        PdfException - a PDF exception
      • getStructuredTextContent

        public Document getStructuredTextContent()
        Gets any Structured text (if present) as a Document structure.
        If the PDF file does not contain the metadata for Structured Content, an empty Document is returned.
        Returns:
        Document
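
        Once a Document has been returned, it can be inspected or serialised with standard JDK classes. The sketch below assumes the return type is org.w3c.dom.Document (the page does not state the package) and treats "empty" as "has no root element", which is also an assumption. The helper class StructuredTextXml and the sample element "P" are made up for illustration and do not come from a real PDF.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class StructuredTextXml {

    /** Assumption: an "empty" Document is one without a root element. */
    static boolean isEmpty(Document doc) {
        return doc == null || doc.getDocumentElement() == null;
    }

    /** Serialises the DOM tree to an XML string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialise Document", e);
        }
    }

    /** Creates a Document with no content, standing in for a PDF without structure. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    /** Builds a made-up one-element Document standing in for extracted content. */
    static Document sample() {
        Document doc = newDocument();
        Element paragraph = doc.createElement("P");
        paragraph.setTextContent("Hello");
        doc.appendChild(paragraph);
        return doc;
    }

    public static void main(String[] args) {
        // In real use this would be: Document doc = extract.getStructuredTextContent();
        Document doc = sample();
        if (!isEmpty(doc)) {
            System.out.println(toXml(doc));
        }
    }
}
```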
      • setPassword

        public void setPassword(String password)
        Parameters:
        password - the USER or OWNER password for the PDF file
      • getPageCount

        public int getPageCount()
        Returns the number of pages in the PDF file (pages are numbered starting at 1)
        Returns:
        page count
      • openPDFFile

        public boolean openPDFFile()
                            throws PdfException
        Opens the PDF file so that its content can be accessed
        Returns:
        true if successful
        Throws:
        PdfException
      • closePDFfile

        public void closePDFfile()
        Ensures the PDF file is closed once it is no longer needed and that all resources are released