Class ExtractTextAsWordlist


  • public class ExtractTextAsWordlist
    extends Object

    Extract words and locations from PDF files


    This class provides a simple Java API for extracting text from a PDF file as words, together with each word's location on the page. It also provides a static convenience method for dumping all the word lists from a PDF file or a directory of PDF files.

    See our Support Pages for more information on Text Extraction.
    • Constructor Detail

      • ExtractTextAsWordlist

        public ExtractTextAsWordlist​(String fileName)
        Sets up an ExtractTextAsWordlist instance to open a PDF File
        Parameters:
        fileName - full path to a single PDF file
      • ExtractTextAsWordlist

        public ExtractTextAsWordlist​(byte[] byteArray)
        Sets up an ExtractTextAsWordlist instance to open a PDF file held in memory as a byte array
        Parameters:
        byteArray - the contents of the PDF file as a byte[]
    • Method Detail

      • getWordsOnPage

        public List<String> getWordsOnPage​(int page)
                                    throws PdfException
        Gets the individual words from the page's text content and returns them. Uses a default set of delimiters to determine word boundaries.
        Parameters:
        page - The page to get text content from.
        Returns:
        List object containing all words found on the page.
        Throws:
        PdfException
      • getWordsOnPage

        public List<String> getWordsOnPage​(int page,
                                           String delimiters)
                                    throws PdfException
        Gets the individual words from the page's text content and returns them. Uses the provided delimiters to determine word boundaries.
        Parameters:
        page - The page to get text content from.
        delimiters - A String of characters to be used as delimiters for words.
        Returns:
        List object containing all words found on the page.
        Throws:
        PdfException
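The effect of the delimiters argument can be pictured with a small standalone sketch. It uses the JDK's StringTokenizer to show how the same text yields different words under different delimiter sets. This is an illustration of the general idea only and does not claim to reproduce how getWordsOnPage tokenises a page; DelimiterSketch and splitWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustration only: NOT the library's implementation of getWordsOnPage.
class DelimiterSketch {

    // Splits text into words, treating every character in 'delimiters' as a word boundary.
    static List<String> splitWords(String text, String delimiters) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, delimiters);
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "state-of-the-art, low-cost";
        // Space and comma only: hyphenated terms stay whole.
        System.out.println(splitWords(line, " ,"));  // [state-of-the-art, low-cost]
        // Adding the hyphen splits them into their parts.
        System.out.println(splitWords(line, " ,-")); // [state, of, the, art, low, cost]
    }
}
```

Choosing the delimiter string is therefore a trade-off: a larger set gives finer-grained words, a smaller set keeps compound terms together.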
      • main

        public static void main​(String[] args)
        Extracts the words from each page as a list via the command line, from a single PDF file or a directory of PDF files.
        The example expects two values:
        • Value 1 is the file name or directory of PDF files to process
        • Value 2 is the directory to write out the wordlist data
        Parameters:
        args - The expected arguments are described above.
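Since the first value may name either a single PDF file or a directory, a caller building a similar tool has to turn it into a list of files to process. The sketch below shows one way to do that with java.nio.file. It is an assumption about how such input could be handled, not a description of what main does internally; InputResolverSketch, resolvePdfInputs and isPdfName are hypothetical names, and selecting files by a ".pdf" suffix is a choice made for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: one possible way to resolve a "file or directory" argument.
class InputResolverSketch {

    // Returns the PDF files named by 'input': the file itself, or the PDF entries of a directory.
    static List<Path> resolvePdfInputs(Path input) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        if (Files.isDirectory(input)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(input)) {
                for (Path entry : entries) {
                    if (Files.isRegularFile(entry) && isPdfName(entry)) {
                        pdfs.add(entry);
                    }
                }
            }
            Collections.sort(pdfs); // directory order is unspecified, so sort for stable output
        } else if (Files.isRegularFile(input) && isPdfName(input)) {
            pdfs.add(input);
        }
        return pdfs;
    }

    // Assumption for this sketch: a file counts as a PDF if its name ends in ".pdf" (any case).
    static boolean isPdfName(Path path) {
        return path.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Expected two values: <PDF file or directory> <output directory>");
            return;
        }
        for (Path pdf : resolvePdfInputs(Paths.get(args[0]))) {
            System.out.println("Would process " + pdf + " into " + args[1]);
        }
    }
}
```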
      • writeAllWordlistsToDir

        public static int writeAllWordlistsToDir​(String inputDir,
                                                 String outputDir,
                                                 int maxPages)
                                          throws PdfException
        Convenience method to write out the wordlists for all the PDF files in a directory
        Parameters:
        inputDir - directory containing PDF files
        outputDir - directory for writing out wordlists
        Throws:
        PdfException
      • setPassword

        public void setPassword​(String password)
        Sets the password to use when opening a password-protected PDF file
        Parameters:
        password - the USER or OWNER password for the PDF file
      • getPageCount

        public int getPageCount()
        Returns the number of pages in the PDF file (page numbers start at 1)
        Returns:
        page count
      • openPDFFile

        public boolean openPDFFile()
                            throws PdfException
        Opens the PDF file so that its content can be accessed
        Returns:
        true if successful
        Throws:
        PdfException
      • closePDFfile

        public void closePDFfile()
        Closes the PDF file and releases all its resources; call this once the file is no longer needed
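
Taken together, the instance methods suggest an open, read, close sequence. The sketch below walks through that sequence: open the file, loop over pages 1 to getPageCount(), collect each page's words, and close the file in a finally block. So that it compiles without the library on the classpath, it is written against a hypothetical stand-in interface (WordlistExtractor) that copies the documented method names, and it declares Exception where the real methods declare PdfException. LifecycleSketch, readAllPages and FakeExtractor are likewise names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: the interface below is a stand-in, not the real ExtractTextAsWordlist class.
class LifecycleSketch {

    // Hypothetical stand-in mirroring the documented method names.
    interface WordlistExtractor {
        boolean openPDFFile() throws Exception;   // the documented method throws PdfException
        int getPageCount();
        List<String> getWordsOnPage(int page) throws Exception;
        void closePDFfile();
    }

    // Opens the document, collects the words of every page, and closes it even if a page fails.
    static List<List<String>> readAllPages(WordlistExtractor extractor) throws Exception {
        List<List<String>> pages = new ArrayList<>();
        if (!extractor.openPDFFile()) {
            return pages; // could not open: nothing to read
        }
        try {
            int pageCount = extractor.getPageCount();
            for (int page = 1; page <= pageCount; page++) { // pages are numbered from 1
                pages.add(extractor.getWordsOnPage(page));
            }
        } finally {
            extractor.closePDFfile();
        }
        return pages;
    }

    // In-memory fake used only to exercise the sketch; it never touches a real PDF.
    static class FakeExtractor implements WordlistExtractor {
        private final List<List<String>> pages;
        boolean closed;

        FakeExtractor(List<List<String>> pages) {
            this.pages = pages;
        }

        public boolean openPDFFile() { return true; }
        public int getPageCount() { return pages.size(); }
        public List<String> getWordsOnPage(int page) { return pages.get(page - 1); }
        public void closePDFfile() { closed = true; }
    }

    public static void main(String[] args) throws Exception {
        FakeExtractor fake = new FakeExtractor(Arrays.asList(
                Arrays.asList("Hello", "world"),
                Arrays.asList("Second", "page")));
        List<List<String>> pages = readAllPages(fake);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With the real class, the same loop would be driven by an ExtractTextAsWordlist instance created from a file name or byte array, with setPassword called before opening if the file is protected.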