Class ExtractImages


  • public class ExtractImages
    extends Object

    Image Extraction from PDF files


    This class provides a simple Java API for extracting images from a PDF file. It also offers a static convenience method for dumping all the images from a PDF file or from a directory containing PDF files.

    Example 1 - access API methods

    ExtractImages extract = new ExtractImages("C:/pdfs/mypdf.pdf");
    //extract.setPassword("password");
    if (extract.openPDFFile()) {
        int pageCount = extract.getPageCount();
        // pages are numbered from 1
        for (int page = 1; page <= pageCount; page++) {

            int imagesOnPageCount = extract.getImageCount(page);
            // images on a page are numbered from 0
            for (int imageNumber = 0; imageNumber < imagesOnPageCount; imageNumber++) {
                BufferedImage image = extract.getImage(page, imageNumber, true);
            }
        }
    }

    extract.closePDFfile();

    Example 2 - convenience static method

    Extract all images as PNG into a directory, with no metadata XML file and no per-page sub-directories

    ExtractImages.writeAllImagesToDir("pdfs", "images", "png", false, false);

    Example 3 - access directly from the JAR

    ExtractImages can be run directly from the JAR with the command below. It extracts all images from a PDF file, or from every PDF file in a directory, into a defined output directory:
    java -cp libraries_needed org/jpedal/examples/images/ExtractImages inputValues
    where inputValues consists of three values:
    • First value: The PDF filename (including the path if needed) or a directory containing PDF files. If it contains spaces, it must be enclosed in double quotes (e.g. "C:/Path with spaces/").
    • Second value: The location to write out images extracted from the PDF file or files. If it contains spaces, it must be enclosed in double quotes (e.g. "C:/Path with spaces/").
    • Third value: Required output image type (default is png if nothing specified). Options are tiff, png, jpg.

    See our Support Pages for more info on Image Extraction.
    • Constructor Summary

      Constructors 
      Constructor Description
      ExtractImages​(byte[] byteArray)
      Sets up an ExtractImages instance to open a PDF file contained as a BLOB within a byte[] stream
      ExtractImages​(String fileName)
      Sets up an ExtractImages instance to open a PDF File
    • Method Summary

      Modifier and Type Method Description
      void closePDFfile()
      Closes the PDF file once it is no longer needed and releases all resources
      BufferedImage getImage​(int page, int imageNumber, boolean imageAsDisplayed)
      Extracts any image from any page. Processing the images on each page in turn is recommended because it is quicker
      int getImageCount​(int page)
      returns an image count for the selected page
      int getPageCount()
      Returns the number of pages in the PDF file (pages are numbered from 1)
      static void main​(String[] args)
      This class will allow you to extract Images via command line from a single PDF file or a directory of PDF files.
      boolean openPDFFile()
      Opens the PDF file so it can be accessed. Check the return value, which is false if the file cannot be opened for any reason
      void setPassword​(String password)
      Sets the owner or user password to use when opening an encrypted PDF file
      static void writeAllImagesToDir​(String inputDir, String outputDir, String imageType, boolean generateMetaData, boolean outputPagesInSepDirs)
      Convenience method to Extract all the images in a directory of PDF files
    • Constructor Detail

      • ExtractImages

        public ExtractImages​(String fileName)
        Sets up an ExtractImages instance to open a PDF File
        Parameters:
        fileName - full path to a single PDF file
      • ExtractImages

        public ExtractImages​(byte[] byteArray)
        Sets up an ExtractImages instance to open a PDF file contained as a BLOB within a byte[] stream
        Parameters:
        byteArray - the contents of a single PDF file as a byte array
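The byte[] constructor needs the whole PDF in memory. The following is a minimal sketch of one way to obtain such an array from a file using only the standard library. The class name PdfBytes and its methods are hypothetical helpers for illustration and are not part of the library; the header check relies only on the fact that PDF files normally begin with the marker "%PDF-".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the library described on this page.
final class PdfBytes {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytes() {}

    /** Reads the whole file into memory and rejects data without a PDF header. */
    static byte[] load(Path pdf) throws IOException {
        byte[] data = Files.readAllBytes(pdf);
        if (!looksLikePdf(data)) {
            throw new IOException("Not a PDF file (missing %PDF- header): " + pdf);
        }
        return data;
    }

    /** Returns true if the data starts with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data.length < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The resulting array could then be passed to the constructor, for example new ExtractImages(PdfBytes.load(Paths.get("C:/pdfs/mypdf.pdf"))). The same approach applies when the bytes come from a database BLOB rather than a file.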
    • Method Detail

      • writeAllImagesToDir

        public static void writeAllImagesToDir​(String inputDir,
                                               String outputDir,
                                               String imageType,
                                               boolean generateMetaData,
                                               boolean outputPagesInSepDirs)
                                        throws PdfException
        Convenience method to Extract all the images in a directory of PDF files
        Parameters:
        inputDir - directory containing PDF files
        outputDir - directory for writing out images
        imageType - output image type (for example png or tiff)
        generateMetaData - if true, include an additional XML file with metadata on each image
        outputPagesInSepDirs - if true, place images from each page in a separate sub-directory
        Throws:
        PdfException
      • main

        public static void main​(String[] args)
        This class will allow you to extract Images via command line from a single PDF file or a directory of PDF files.
        The example expects three parameters:
        • Value 1 is the file name or directory of PDF files to process
        • Value 2 is the directory to write out the images
        • Value 3 is the image type (jpeg, tiff, png). The default is png
        Parameters:
        args - The expected arguments are described above.
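If the command-line entry point is launched from another Java program, the argument list described above could be assembled as in the sketch below. The class name ExtractImagesCommand and its build method are hypothetical; the fully qualified class name is taken from the command shown in Example 3, and omitting the third value assumes the documented png default applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library described on this page.
final class ExtractImagesCommand {
    private ExtractImagesCommand() {}

    /**
     * Builds the java command line for the ExtractImages entry point.
     * Pass null as imageType to leave the third value out.
     */
    static List<String> build(String classpath, String input, String outputDir, String imageType) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(classpath);
        command.add("org.jpedal.examples.images.ExtractImages");
        command.add(input);      // value 1: PDF file or directory of PDF files
        command.add(outputDir);  // value 2: directory to write the images to
        if (imageType != null) {
            command.add(imageType); // value 3: image type
        }
        return command;
    }
}
```

The list can be handed to new ProcessBuilder(command).inheritIO().start(). Because each element is passed as a separate argument, paths containing spaces do not need the double quotes that a shell would require.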
      • getImage

        public BufferedImage getImage​(int page,
                                      int imageNumber,
                                      boolean imageAsDisplayed)
                               throws PdfException
        Extracts any image from any page. Processing the images on each page in turn is recommended because it is quicker.
        Parameters:
        page - logical page number (1 is first page)
        imageNumber - image on page (0 is first image)
        imageAsDisplayed - if true, return the image as displayed (with scaling/rotation); otherwise return the raw stored image (often, but not always, the same). Neither is clipped.
        Returns:
        BufferedImage
        Throws:
        PdfException
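getImage returns a standard java.awt.image.BufferedImage, so the result can be written to disk with the JDK's ImageIO class. The sketch below shows one possible way to do that. The class name ImageSaver and the file naming scheme (page number plus image number) are assumptions made for illustration and are not part of the library.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the library described on this page.
final class ImageSaver {
    private ImageSaver() {}

    /**
     * Writes the image as a PNG named after its page (1-based) and image
     * number (0-based), creating the output directory if needed.
     */
    static File save(BufferedImage image, File outputDir, int page, int imageNumber)
            throws IOException {
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Cannot create directory: " + outputDir);
        }
        File target = new File(outputDir, "page" + page + "_image" + imageNumber + ".png");
        // ImageIO.write returns false when no suitable writer is available
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available for: " + target);
        }
        return target;
    }
}
```

Inside the loop from Example 1, a call such as ImageSaver.save(extract.getImage(page, imageNumber, true), new File("images"), page, imageNumber) would store each image as it is extracted.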
      • getImageCount

        public int getImageCount​(int page)
                          throws PdfException
        returns an image count for the selected page
        Parameters:
        page - logical page number
        Returns:
        number of images on the page (0 if there are none)
        Throws:
        PdfException
      • setPassword

        public void setPassword​(String password)
        Sets the owner or user password to use when opening an encrypted PDF file
        Parameters:
        password - the USER or OWNER password for the PDF file
      • getPageCount

        public int getPageCount()
        Returns the number of pages in the PDF file (pages are numbered from 1)
        Returns:
        page count
      • openPDFFile

        public boolean openPDFFile()
                            throws PdfException
        Opens the PDF file so it can be accessed. Check the return value, which is false if the file cannot be opened for any reason.
        Returns:
        true if successful
        Throws:
        PdfException - if there is a problem opening the file
      • closePDFfile

        public void closePDFfile()
        Closes the PDF file once it is no longer needed and releases all resources