Class ExtractTextInRectangle

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static boolean isTest  
    • Constructor Summary

      Constructors 
      Constructor Description
      ExtractTextInRectangle​(byte[] byteArray)
      Sets up an ExtractTextInRectangle instance to open a PDF file held in memory as a byte[]
      ExtractTextInRectangle​(String fileName)
      Sets up an ExtractTextInRectangle instance to open a PDF file
      ExtractTextInRectangle​(String fileName, boolean extractPlainText)
      Sets up an ExtractTextInRectangle instance to open a PDF file, optionally extracting plain text rather than XML
    • Field Detail

      • isTest

        public static boolean isTest
    • Constructor Detail

      • ExtractTextInRectangle

        public ExtractTextInRectangle​(String fileName)
        Sets up an ExtractTextInRectangle instance to open a PDF file
        Parameters:
        fileName - full path to a single PDF file
      • ExtractTextInRectangle

        public ExtractTextInRectangle​(String fileName,
                                      boolean extractPlainText)
        Sets up an ExtractTextInRectangle instance to open a PDF file, optionally extracting plain text rather than XML
        Parameters:
        fileName - full path to a single PDF file
        extractPlainText - flag to extract plain text rather than XML
      • ExtractTextInRectangle

        public ExtractTextInRectangle​(byte[] byteArray)
        Sets up an ExtractTextInRectangle instance to open a PDF file held in memory as a byte[]
        Parameters:
        byteArray - the contents of the PDF file as a byte array
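
The byte[] constructor suits cases where the PDF does not live at a known file path, for example when it comes from a database BLOB or a network response. The following is a minimal sketch of preparing such an array with the standard library. The class PdfBytes and its methods are hypothetical helpers, not part of this API, and the constructor call appears only in a comment because it needs the library on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the documented API.
class PdfBytes {

    // Reads the whole file into memory so it can be handed to the byte[] constructor.
    static byte[] load(Path pdfPath) throws IOException {
        return Files.readAllBytes(pdfPath);
    }

    // Cheap sanity check: PDF files normally start with the ASCII marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: PdfBytes <file.pdf>");
            return;
        }
        byte[] data = load(Path.of(args[0]));
        if (!looksLikePdf(data)) {
            System.err.println("Not a PDF: " + args[0]);
            return;
        }
        // With the library on the classpath, the array would then be passed on:
        // ExtractTextInRectangle extract = new ExtractTextInRectangle(data);
        System.out.println("Loaded " + data.length + " bytes");
    }
}
```

The marker check is optional; it only catches obviously wrong input before the array reaches the extractor.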
    • Method Detail

      • getTextOnPage

        public String getTextOnPage​(int page)
                             throws PdfException
        Extracts all text on the page as a String.

        If the page contains text in more than one orientation (for example, left to right and bottom to top), only the most common orientation is extracted; text in other orientations is ignored.

        Parameters:
        page - page number (first page is 1)
        Returns:
        String with text
        Throws:
        PdfException
      • getTextOnPage

        public String getTextOnPage​(int page,
                                    Rectangle rectangle)
                             throws PdfException
        Extracts all text in a specified region of the page as a String. If the page contains text in more than one orientation (for example, left to right and bottom to top), only the most common orientation is extracted; text in other orientations is ignored.
        Parameters:
        page - page number (first page is 1)
        rectangle - the region of the page from which to extract text
        Returns:
        String with text
        Throws:
        PdfException
      • getTextOnPage

        public String getTextOnPage​(int page,
                                    int x1,
                                    int y1,
                                    int x2,
                                    int y2)
                             throws PdfException
        Extracts all text in a specified region of the page as a String. If the page contains text in more than one orientation (for example, left to right and bottom to top), only the most common orientation is extracted; text in other orientations is ignored.
        Parameters:
        page - page number (first page is 1)
        x1 - x coordinate of the top left corner
        y1 - y coordinate of the top left corner
        x2 - x coordinate of the bottom right corner
        y2 - y coordinate of the bottom right corner
        Returns:
        String with text
        Throws:
        PdfException
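
The region can be given either as a Rectangle or as two corner points. As a rough sketch, the helper below builds a rectangle from two opposite corners, assuming Rectangle in the signature above is java.awt.Rectangle (the signature does not spell this out). RegionHelper and fromCorners are hypothetical names. How the library maps the rectangle's y and height onto the page is not stated here, so it is worth checking the result against a page with known content.

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of the documented API.
class RegionHelper {

    // Builds a java.awt.Rectangle covering the area between two opposite corners.
    // The corners may be given in either order; x and y become the smaller coordinates.
    static Rectangle fromCorners(int x1, int y1, int x2, int y2) {
        int x = Math.min(x1, x2);
        int y = Math.min(y1, y2);
        int width = Math.abs(x2 - x1);
        int height = Math.abs(y2 - y1);
        return new Rectangle(x, y, width, height);
    }

    public static void main(String[] args) {
        // Top left corner (50, 750) and bottom right corner (300, 600).
        Rectangle region = fromCorners(50, 750, 300, 600);
        System.out.println(region);
        // With the library on the classpath, either form could then be used:
        // String a = extract.getTextOnPage(1, region);
        // String b = extract.getTextOnPage(1, 50, 750, 300, 600);
    }
}
```

Normalising with min and abs keeps width and height non-negative whichever corner is passed first.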
      • main

        public static void main​(String[] args)
        Extracts all text from a single PDF file or a directory of PDF files via the command line.
        The example expects two arguments:
        • Value 1 is the file name or directory of PDF files to process
        • Value 2 is the directory to write the data to
        Parameters:
        args - The expected arguments are described above.
      • writeAllTextToDir

        public static void writeAllTextToDir​(String inputDir,
                                             String outputDir,
                                             int maxPages)
                                      throws PdfException
        Convenience method to extract the text from every PDF file in a directory and write it out
        Parameters:
        inputDir - directory containing PDF files
        outputDir - directory to write the extracted text to
        Throws:
        PdfException
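
For callers who want per-file control instead of the convenience method, the directory walk can be sketched with the standard library alone. Everything below is hypothetical: PdfDirectory and its methods are not part of this API, and the ".txt" naming scheme is an assumption, since how writeAllTextToDir names its output files is not documented here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the documented API.
class PdfDirectory {

    // Lists the regular files directly inside inputDir whose names end in ".pdf"
    // (case-insensitive), sorted by path.
    static List<Path> listPdfFiles(Path inputDir) throws IOException {
        try (Stream<Path> entries = Files.list(inputDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Maps "report.pdf" to "<outputDir>/report.txt". The naming scheme is an assumption.
    static Path outputFileFor(Path pdfFile, Path outputDir) {
        String name = pdfFile.getFileName().toString();
        String base = name.substring(0, name.length() - ".pdf".length());
        return outputDir.resolve(base + ".txt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: PdfDirectory <inputDir> <outputDir>");
            return;
        }
        Path outputDir = Path.of(args[1]);
        for (Path pdf : listPdfFiles(Path.of(args[0]))) {
            // Each file could be opened with new ExtractTextInRectangle(pdf.toString()) here.
            System.out.println(pdf + " -> " + outputFileFor(pdf, outputDir));
        }
    }
}
```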
      • setPassword

        public void setPassword​(String password)
        Parameters:
        password - the USER or OWNER password for the PDF file
      • getPageCount

        public int getPageCount()
        Returns the number of pages in the PDF file. Page numbers start at 1, so valid page numbers run from 1 to this value.
        Returns:
        page count
      • openPDFFile

        public boolean openPDFFile()
                            throws PdfException
        Opens the PDF file so that its content can be accessed
        Returns:
        true if successful
        Throws:
        PdfException
      • closePDFfile

        public void closePDFfile()
        Closes the PDF file and releases all resources. Call this once the file is no longer needed.
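
Taken together, the methods above suggest an open, read, close sequence. The sketch below illustrates that sequence against a hypothetical stand-in interface, PageTextSource, which mirrors the documented method names so the code compiles without the library. The documented methods throw PdfException; plain Exception is used here instead. The sketch also assumes that calling closePDFfile() is harmless when opening failed, which this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the documented method names.
// It is not a type provided by the library.
interface PageTextSource {
    boolean openPDFFile() throws Exception;
    int getPageCount();
    String getTextOnPage(int page) throws Exception;
    void closePDFfile();
}

class ExtractAllPages {

    // Opens the source, collects the text of every page (numbered from 1),
    // and always closes the source afterwards, even if extraction throws.
    static List<String> readAllPages(PageTextSource source) throws Exception {
        List<String> pages = new ArrayList<>();
        try {
            if (source.openPDFFile()) {
                int pageCount = source.getPageCount();
                for (int page = 1; page <= pageCount; page++) {
                    pages.add(source.getTextOnPage(page));
                }
            }
        } finally {
            // Assumption: closing is safe even when the file never opened.
            source.closePDFfile();
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // In-memory stub standing in for a real, already constructed extractor.
        final String[] fakePages = {"first page", "second page"};
        PageTextSource stub = new PageTextSource() {
            @Override public boolean openPDFFile() { return true; }
            @Override public int getPageCount() { return fakePages.length; }
            @Override public String getTextOnPage(int page) { return fakePages[page - 1]; }
            @Override public void closePDFfile() { }
        };
        for (String text : readAllPages(stub)) {
            System.out.println(text);
        }
    }
}
```

With the real class, the same loop would call the methods directly on an ExtractTextInRectangle instance, with setPassword called first for protected files if needed.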