Extract Images from PDF

The PDF Images Extractor component of the library allows you to extract images from PDF documents in PNG format. This component is distributed as part of the HiQPdf.Next.PdfProcessor.Windows NuGet package when targeting Windows and as part of the HiQPdf.Next.PdfProcessor.Linux package when targeting Linux. The package for Windows is referenced by the HiQPdf.Next.Windows meta package and the package for Linux is referenced by the HiQPdf.Next.Linux meta package.

You can specify the page range in the PDF document from which to extract images. The extracted images preserve the transparency information from the PDF document. Extraction from password-protected PDF documents is also supported, and you can provide both the user and the owner passwords.

Overview

The HiQPdf.Next.PdfImagesExtractor class allows you to load a PDF file and extract images from its pages in PNG format.

Create the PDF Images Extractor

The HiQPdf.Next.PdfImagesExtractor class is used to extract images from PDF documents. You can create an instance using the default constructor, which initializes the extractor with standard settings.

Create a PDF Images Extractor Instance
// Create the PDF Images Extractor instance with default options
PdfImagesExtractor pdfImagesExtractor = new PdfImagesExtractor();

Note that PdfImagesExtractor instances are not reusable. You must create a new instance for each extraction. Reusing an instance after a completed extraction will result in an exception.
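
Because an instance cannot be reused, processing several documents means creating one extractor per document. The following is a minimal sketch of that pattern, assuming the HiQPdf.Next package is referenced and that the file path overload of ExtractImages takes a string. The BatchImageExtraction class and the ExtractFromFiles method are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using HiQPdf.Next;

public static class BatchImageExtraction
{
    // Extracts the images from each PDF file, keyed by file path.
    // A fresh extractor is created per document because an instance
    // cannot be reused after an extraction completes.
    public static Dictionary<string, ExtractedImage[][]> ExtractFromFiles(IEnumerable<string> pdfFilePaths)
    {
        var results = new Dictionary<string, ExtractedImage[][]>();
        foreach (string pdfFilePath in pdfFilePaths)
        {
            // New instance for every extraction
            var pdfImagesExtractor = new PdfImagesExtractor();
            results[pdfFilePath] = pdfImagesExtractor.ExtractImages(pdfFilePath);
        }
        return results;
    }
}
```

Any settings such as passwords or the page limit would have to be applied to each new instance inside the loop.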

Open Password Protected PDFs

If the PDF document from which you extract images is password-protected, you have to specify the user or owner password to be used to decrypt the PDF document before extraction. You can set the user password in the PdfImagesExtractor.UserPassword property and the owner password in the PdfImagesExtractor.OwnerPassword property.

Set User and Owner Passwords
pdfImagesExtractor.UserPassword = userPasswordString;
pdfImagesExtractor.OwnerPassword = ownerPasswordString;

Extracted Page Range Limit

The PdfImagesExtractor.MaxPageCount property controls the upper limit for the number of PDF pages to process. The default value is 0, which means there is no upper limit. The PDF page range from which to extract images is set separately, in the extraction methods.

Unlimited Extracted PDF Page Range
pdfImagesExtractor.MaxPageCount = 0;
Limit the Extracted PDF Page Range to 10 Pages
pdfImagesExtractor.MaxPageCount = 10;

Extract Images from PDF

To extract images from all pages in a PDF document from a memory buffer, use the PdfImagesExtractor.ExtractImages(Byte[]) method. The parameter is the PDF document read into a memory buffer.

The method returns an array of arrays of ExtractedImage objects. Each element in the outer array represents a page in the PDF document, and each inner array contains the images extracted from that page. For example, images[1] returns all images extracted from the second page, and images[1][2] is the third image from the second page.

The ExtractedImage object contains information such as the page number in the ExtractedImage.PageNumber property and the extracted image data in PNG format in the ExtractedImage.ImageData property.

Extract Images from All Pages in a PDF from Memory
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfBytes);
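
The returned array of arrays can be walked page by page to save each image. The sketch below assumes the HiQPdf.Next package is referenced and uses only the PageNumber and ImageData properties described above. The ExtractedImageWriter class, its methods and the "page-NNNN-image-NNNN.png" naming scheme are hypothetical and chosen only for this example.

```csharp
using System.IO;
using HiQPdf.Next;

public static class ExtractedImageWriter
{
    // Builds a file name from the page number and the 1-based image number
    // (hypothetical naming scheme, not the one used by the library)
    public static string BuildFileName(int pageNumber, int imageNumber)
    {
        return $"page-{pageNumber:0000}-image-{imageNumber:0000}.png";
    }

    // Writes every extracted image to outputDirectory and returns
    // the number of files written
    public static int SaveAll(ExtractedImage[][] extractedImages, string outputDirectory)
    {
        Directory.CreateDirectory(outputDirectory);
        int written = 0;
        foreach (ExtractedImage[] pageImages in extractedImages)
        {
            for (int imgIdx = 0; imgIdx < pageImages.Length; imgIdx++)
            {
                ExtractedImage image = pageImages[imgIdx];
                // Use the page number reported by the image itself
                string path = Path.Combine(outputDirectory, BuildFileName(image.PageNumber, imgIdx + 1));
                File.WriteAllBytes(path, image.ImageData);
                written++;
            }
        }
        return written;
    }
}
```

The file name is built from ExtractedImage.PageNumber rather than from the outer array index, so it stays meaningful when only part of the document is extracted.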

To extract images from a PDF document from a memory buffer starting at the given page number through the end of the document, use the PdfImagesExtractor.ExtractImages(Byte[], Int32) method. The first parameter is the PDF document read into a memory buffer, and the second parameter is the 1-based start page number.

Extract Images from a PDF from Memory Starting at a Given Page Number
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfBytes, startPageNumber);

To extract images from a PDF document from a memory buffer starting at the given page number up to the end page number inclusive, use the PdfImagesExtractor.ExtractImages(Byte[], Int32, Int32) method. The first parameter is the PDF document read into a memory buffer, and the second and third parameters are the 1-based start and end page numbers. If the end page number is 0, the extraction continues to the end of the document.

Extract Images from a Range of Pages in a PDF from Memory
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfBytes, startPageNumber, endPageNumber);
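
When the page numbers come from user input, it may help to check them against the rules above before calling the extractor. This is a sketch under stated assumptions: the PageRangeExtraction class and its methods are hypothetical helpers, the HiQPdf.Next package is assumed to be referenced, and the library may well perform its own validation with different behavior.

```csharp
using System;
using HiQPdf.Next;

public static class PageRangeExtraction
{
    // A range is accepted when the start page is 1-based and the end page is
    // either 0 (meaning "to the end of the document") or not before the start page
    public static bool IsValidPageRange(int startPageNumber, int endPageNumber)
    {
        if (startPageNumber < 1)
            return false;
        return endPageNumber == 0 || endPageNumber >= startPageNumber;
    }

    // Validates the range, then extracts the images from it
    public static ExtractedImage[][] ExtractRange(byte[] inputPdfBytes, int startPageNumber, int endPageNumber = 0)
    {
        if (!IsValidPageRange(startPageNumber, endPageNumber))
            throw new ArgumentOutOfRangeException(nameof(startPageNumber), "Invalid PDF page range");

        var pdfImagesExtractor = new PdfImagesExtractor();
        return pdfImagesExtractor.ExtractImages(inputPdfBytes, startPageNumber, endPageNumber);
    }
}
```

The check does not know the page count of the document, so a start page beyond the last page is still left for the library to handle.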

There are also similar methods to extract images from PDF pages that accept a PDF stream or a PDF file path.

Extract Images from All Pages in a PDF from a Stream or File
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfStream);
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfFile);
Extract Images from Pages in a PDF from a Stream or File Starting at a Given Page Number
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfStream, startPageNumber);
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfFile, startPageNumber);
Extract Images from a Range of Pages in a PDF from a Stream or File
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfStream, startPageNumber, endPageNumber);
ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfFile, startPageNumber, endPageNumber);

The extraction methods above create the images in memory. There are similar methods to extract images from PDF pages to image files on disk in a specified folder. The parameters of these methods are the same as above with two additional parameters specifying the output folder path and the output file name without extension, which will be used as a base name for the generated image files. The final image file names will be formed by appending the page number and the image number in the page to the base name. The output folder is created if it does not exist.

Extract Images from All Pages in a PDF from Memory, Stream or File to Image Files
pdfImagesExtractor.ExtractImagesToFile(inputPdfBytes, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfStream, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfFile, outputDirectory, imageFileName);
Extract Images from Pages in a PDF from Memory, Stream or File to Image Files Starting at a Given Page Number
pdfImagesExtractor.ExtractImagesToFile(inputPdfBytes, startPageNumber, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfStream, startPageNumber, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfFile, startPageNumber, outputDirectory, imageFileName);
Extract Images from a Range of Pages in a PDF from Memory, Stream or File to Image Files
pdfImagesExtractor.ExtractImagesToFile(inputPdfBytes, startPageNumber, endPageNumber, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfStream, startPageNumber, endPageNumber, outputDirectory, imageFileName);
pdfImagesExtractor.ExtractImagesToFile(inputPdfFile, startPageNumber, endPageNumber, outputDirectory, imageFileName);
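
Since the final file names are derived from the base name, the generated files can be located afterwards by that prefix. The sketch below assumes the HiQPdf.Next package is referenced, that the file path overload takes a string, and that the generated files carry a .png extension. The ImageFileExtraction class and its methods are hypothetical, and the exact file name pattern produced by the library is not assumed beyond the base name prefix.

```csharp
using System;
using System.IO;
using HiQPdf.Next;

public static class ImageFileExtraction
{
    // Extracts all images of a PDF file to disk and returns the generated file paths
    public static string[] ExtractAndListFiles(string inputPdfFile, string outputDirectory, string imageFileName)
    {
        var pdfImagesExtractor = new PdfImagesExtractor();
        pdfImagesExtractor.ExtractImagesToFile(inputPdfFile, outputDirectory, imageFileName);
        return ListGeneratedFiles(outputDirectory, imageFileName);
    }

    // Lists the PNG files in outputDirectory whose names start with the base name
    public static string[] ListGeneratedFiles(string outputDirectory, string imageFileName)
    {
        if (!Directory.Exists(outputDirectory))
            return Array.Empty<string>();

        string[] files = Directory.GetFiles(outputDirectory, imageFileName + "*.png");
        Array.Sort(files, StringComparer.Ordinal);
        return files;
    }
}
```

Using a dedicated output folder or a distinctive base name per document keeps the prefix match from picking up unrelated files.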

Asynchronous Methods to Extract Images from PDF

There are also asynchronous variants of these methods that follow the Task-based Asynchronous Pattern (TAP) in .NET, allowing image extraction from PDF to run asynchronously using async and await. These methods have the same names as their synchronous counterparts with the "Async" suffix appended. They also accept an optional System.Threading.CancellationToken parameter that can be used to cancel the extraction operation where applicable.

To asynchronously extract images from all pages in a PDF document from a memory buffer, use the PdfImagesExtractor.ExtractImagesAsync(Byte[], CancellationToken) method. The first parameter is the PDF document read into a memory buffer, and the second is the optional cancellation token.

Asynchronously Extract Images from All Pages in a PDF from Memory
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfBytes);
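
The optional cancellation token can be used, for example, to stop an extraction that takes too long. The following is a minimal sketch, assuming the HiQPdf.Next package is referenced. The CancellableImageExtraction class and the ExtractWithTimeoutAsync method are hypothetical, and it is an assumption that a cancelled extraction surfaces as an OperationCanceledException, as is usual for TAP methods.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using HiQPdf.Next;

public static class CancellableImageExtraction
{
    // Extracts images from all pages, giving up after the specified timeout.
    // Returns null when the extraction was cancelled.
    public static async Task<ExtractedImage[][]> ExtractWithTimeoutAsync(byte[] inputPdfBytes, TimeSpan timeout)
    {
        if (inputPdfBytes == null)
            throw new ArgumentNullException(nameof(inputPdfBytes));

        // The token is cancelled automatically when the timeout elapses
        using var cts = new CancellationTokenSource(timeout);
        var pdfImagesExtractor = new PdfImagesExtractor();
        try
        {
            return await pdfImagesExtractor.ExtractImagesAsync(inputPdfBytes, cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Assumption: cancellation is reported with this exception type
            return null;
        }
    }
}
```

In an ASP.NET Core action, the request's HttpContext.RequestAborted token could be passed instead, so that the extraction stops when the client disconnects.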

To asynchronously extract images from a PDF document from a memory buffer starting at the given page number through the end of the document, use the PdfImagesExtractor.ExtractImagesAsync(Byte[], Int32, CancellationToken) method. The first parameter is the PDF document read into a memory buffer, and the second parameter is the 1-based start page number.

Asynchronously Extract Images from a PDF from Memory Starting at a Given Page Number
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfBytes, startPageNumber);

To asynchronously extract images from a PDF document from a memory buffer starting at the given page number up to the end page number inclusive, use the PdfImagesExtractor.ExtractImagesAsync(Byte[], Int32, Int32, CancellationToken) method. The first parameter is the PDF document read into a memory buffer, and the second and third parameters are the 1-based start and end page numbers. If the end page number is 0, the extraction continues to the end of the document.

Asynchronously Extract Images from a Range of Pages in a PDF from Memory
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfBytes, startPageNumber, endPageNumber);

There are also similar asynchronous methods to extract images from PDF pages that accept a PDF stream or a PDF file path.

Asynchronously Extract Images from All Pages in a PDF from a Stream or File
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfStream);
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfFile);
Asynchronously Extract Images from Pages in a PDF from a Stream or File Starting at a Given Page Number
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfStream, startPageNumber);
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfFile, startPageNumber);
Asynchronously Extract Images from a Range of Pages in a PDF from a Stream or File
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfStream, startPageNumber, endPageNumber);
ExtractedImage[][] extractedImages = await pdfImagesExtractor.ExtractImagesAsync(inputPdfFile, startPageNumber, endPageNumber);

The asynchronous extraction methods above create the images in memory. There are similar asynchronous methods to extract images from PDF pages to image files on disk in a specified folder. The parameters of these methods are the same as above with two additional parameters specifying the output folder path and the output file name without extension, which will be used as a base name for the generated image files. The final image file names will be formed by appending the page number and the image number in the page to the base name. The output folder is created if it does not exist.

Asynchronously Extract Images from All Pages in a PDF from Memory, Stream or File to Image Files
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfBytes, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfStream, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfFile, outputDirectory, imageFileName);
Asynchronously Extract Images from Pages in a PDF from Memory, Stream or File to Image Files Starting at a Given Page Number
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfBytes, startPageNumber, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfStream, startPageNumber, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfFile, startPageNumber, outputDirectory, imageFileName);
Asynchronously Extract Images from a Range of Pages in a PDF from Memory, Stream or File to Image Files
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfBytes, startPageNumber, endPageNumber, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfStream, startPageNumber, endPageNumber, outputDirectory, imageFileName);
await pdfImagesExtractor.ExtractImagesToFileAsync(inputPdfFile, startPageNumber, endPageNumber, outputDirectory, imageFileName);
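
The asynchronous methods make it straightforward to process several documents concurrently. The sketch below starts one extraction per PDF file and waits for all of them with Task.WhenAll. It assumes the HiQPdf.Next package is referenced and that the file path overload takes a string. The ConcurrentImageExtraction class and its methods are hypothetical, as is the choice of using each PDF file name as the base name for its images.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using HiQPdf.Next;

public static class ConcurrentImageExtraction
{
    // Extracts images from several PDF files concurrently, writing the images
    // of each document under its own base name in outputDirectory
    public static Task ExtractAllToFilesAsync(IEnumerable<string> pdfFilePaths, string outputDirectory)
    {
        IEnumerable<Task> extractions = pdfFilePaths.Select(path => ExtractOneAsync(path, outputDirectory));
        return Task.WhenAll(extractions);
    }

    private static async Task ExtractOneAsync(string pdfFilePath, string outputDirectory)
    {
        // Each extraction gets its own instance because instances are not reusable
        var pdfImagesExtractor = new PdfImagesExtractor();
        string imageFileName = Path.GetFileNameWithoutExtension(pdfFilePath);
        await pdfImagesExtractor.ExtractImagesToFileAsync(pdfFilePath, outputDirectory, imageFileName);
    }
}
```

Two input files with the same name in different folders would share a base name here, so a real implementation would need a scheme that keeps base names unique.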

Extraction Info

The PdfImagesExtractor.ExtractionInfo property exposes an object of type HiQPdf.Next.PdfImagesExtractionInfo. This object is populated after the extraction completes successfully with information about the extraction process, such as the number of PDF pages within the specified range from which images were extracted.

Get the Number of PDF Pages from Which Images Were Extracted
int numberOfPagesExtracted = pdfImagesExtractor.ExtractionInfo.PageCount;
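
The page count can be combined with the returned images to produce a short summary of an extraction. This is a sketch, assuming the HiQPdf.Next package is referenced. The ExtractionSummary class and its methods are hypothetical, and only the ExtractionInfo.PageCount property shown above is used.

```csharp
using System.Linq;
using HiQPdf.Next;

public static class ExtractionSummary
{
    // Counts the images across all pages of an extraction result
    public static int CountImages(ExtractedImage[][] extractedImages)
    {
        return extractedImages.Sum(pageImages => pageImages.Length);
    }

    // Extracts the images from all pages and describes the result
    public static string Describe(byte[] inputPdfBytes)
    {
        var pdfImagesExtractor = new PdfImagesExtractor();
        ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfBytes);

        // ExtractionInfo is read only after the extraction has completed
        int pageCount = pdfImagesExtractor.ExtractionInfo.PageCount;
        return $"{CountImages(extractedImages)} image(s) extracted from {pageCount} page(s)";
    }
}
```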

Code Sample - Extract Images from PDF Pages

Extract Images from PDF Pages in ASP.NET Core
using System;
using System.IO;
using System.Threading.Tasks;
using System.ComponentModel.DataAnnotations;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using HiQPdf_Next_AspNetDemo.Models;

using HiQPdf.Next;

namespace HiQPdf_Next_AspNetDemo.Controllers
{
    public class ExtractPdfImagesController : Controller
    {
        private readonly IWebHostEnvironment m_hostingEnvironment;
        public ExtractPdfImagesController(IWebHostEnvironment hostingEnvironment)
        {
            m_hostingEnvironment = hostingEnvironment;
        }

        public IActionResult Index()
        {
            var model = SetViewModel();

            return View(model);
        }

        [HttpPost]
        public async Task<IActionResult> ExtractPdfImages(ExtractPdfImagesViewModel model)
        {
            if (!ModelState.IsValid)
            {
                var errorMessage = ModelStateHelper.GetModelErrors(ModelState);
                throw new ValidationException(errorMessage);
            }

            // Replace the demo serial number with the serial number received upon purchase
            // to run the extractor in licensed mode
            Licensing.SerialNumber = "YCgJMTAE-BiwJAhIB-EhlWTlBA-UEBRQFBA-U1FOUVJO-WVlZWQ==";

            // Create the PDF Images Extractor instance with default options
            PdfImagesExtractor pdfImagesExtractor = new PdfImagesExtractor();

            // Optionally set the user password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.UserPassword))
                pdfImagesExtractor.UserPassword = model.UserPassword;

            // Optionally set the owner password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.OwnerPassword))
                pdfImagesExtractor.OwnerPassword = model.OwnerPassword;

            // PDF page number to start extraction from
            int startPageNumber = model.StartPageNumber;

            // PDF page number to end extraction at
            // If 0, extraction continues to the end of the document
            int endPageNumber = 0;
            if (model.EndPageNumber.HasValue)
                endPageNumber = model.EndPageNumber.Value;

            byte[] inputPdfBytes = null;
            string outputFileName = null;

            // If an uploaded file exists, use it with priority
            if (model.PdfFile != null && model.PdfFile.Length > 0)
            {
                try
                {
                    using var ms = new MemoryStream();
                    await model.PdfFile.CopyToAsync(ms);
                    inputPdfBytes = ms.ToArray();
                }
                catch (Exception ex)
                {
                    throw new Exception("Failed to read the uploaded PDF file", ex);
                }

                outputFileName = Path.GetFileNameWithoutExtension(model.PdfFile.FileName);
            }
            else
            {
                // Otherwise, fall back to the URL
                string pdfUrl = model.PdfFileUrl?.Trim();
                if (string.IsNullOrWhiteSpace(pdfUrl))
                    throw new Exception("No PDF file provided: upload a file or specify a URL");

                try
                {
                    if (pdfUrl.StartsWith("file://", StringComparison.OrdinalIgnoreCase))
                    {
                        string localPath = new Uri(pdfUrl).LocalPath;
                        inputPdfBytes = await System.IO.File.ReadAllBytesAsync(localPath);
                    }
                    else
                    {
                        using var httpClient = new System.Net.Http.HttpClient();
                        inputPdfBytes = await httpClient.GetByteArrayAsync(pdfUrl);
                    }
                }
                catch (Exception ex)
                {
                    throw new Exception("Could not download the PDF file from URL", ex);
                }

                outputFileName = Path.GetFileNameWithoutExtension(model.PdfFileUrl);
            }

            // Extract the images from the specified PDF page range, grouped by page
            ExtractedImage[][] extractedImages = pdfImagesExtractor.ExtractImages(inputPdfBytes, startPageNumber, endPageNumber);

            int nPdfPages = extractedImages.Length;
            if (nPdfPages == 1 && extractedImages[0].Length > 0 && model.ExtractLargest)
            {
                // If only one page was processed and only the largest image is requested,
                // return that image directly as a downloadable file
                outputFileName += "-largest.png";
                ExtractedImage largestImage = GetLargestImage(extractedImages[0]);
                return File(largestImage.ImageData, "image/png", outputFileName);
            }
            else
            {
                // Build an in-memory ZIP with all page images and return it
                using var zipMs = new MemoryStream();
                using (var zip = new System.IO.Compression.ZipArchive(zipMs, System.IO.Compression.ZipArchiveMode.Create, leaveOpen: true))
                {
                    for (int pageIdx = 0; pageIdx < extractedImages.Length; pageIdx++)
                    {
                        var pageImages = extractedImages[pageIdx];
                        if (model.ExtractLargest)
                        {
                            // Add only the largest image from the page to the ZIP
                            ExtractedImage largestImage = GetLargestImage(pageImages);
                            if (largestImage != null)
                            {
                                var entry = zip.CreateEntry($"page-{largestImage.PageNumber:000000}-largest.png", System.IO.Compression.CompressionLevel.Fastest);
                                // Write the image bytes into the ZIP entry
                                using var entryStream = entry.Open();
                                entryStream.Write(largestImage.ImageData, 0, largestImage.ImageData.Length);
                            }
                        }
                        else
                        {
                            // Add all images from the PDF page to the ZIP
                            for (int imgIdx = 0; imgIdx < pageImages.Length; imgIdx++)
                            {
                                ExtractedImage extractedImage = pageImages[imgIdx];
                                var entry = zip.CreateEntry($"page-{extractedImage.PageNumber:000000}-{imgIdx:000000}.png", System.IO.Compression.CompressionLevel.Fastest);

                                // Write the image bytes into the ZIP entry
                                using var entryStream = entry.Open();
                                entryStream.Write(extractedImage.ImageData, 0, extractedImage.ImageData.Length);
                            }
                        }
                    }
                }

                outputFileName += ".zip";

                // Copy ZIP memory stream to a byte array
                byte[] outputZipBytes = zipMs.ToArray();

                // Return the ZIP as a downloadable file
                return File(outputZipBytes, "application/zip", outputFileName);
            }
        }

        private ExtractedImage GetLargestImage(ExtractedImage[] extractedImages)
        {
            ExtractedImage largestImage = null;
            int largestSize = 0;
            foreach (var image in extractedImages)
            {
                if (image.ImageData.Length > largestSize)
                {
                    largestImage = image;
                    largestSize = image.ImageData.Length;
                }
            }
            return largestImage;
        }

        private ExtractPdfImagesViewModel SetViewModel()
        {
            var model = new ExtractPdfImagesViewModel();

            HttpRequest request = ControllerContext.HttpContext.Request;
            UriBuilder uriBuilder = new UriBuilder();
            uriBuilder.Scheme = request.Scheme;
            uriBuilder.Host = request.Host.Host;
            if (request.Host.Port != null)
                uriBuilder.Port = (int)request.Host.Port;
            uriBuilder.Path = request.PathBase.ToString() + request.Path.ToString();
            uriBuilder.Query = request.QueryString.ToString();

            string currentPageUrl = uriBuilder.Uri.AbsoluteUri;
            string rootUrl = currentPageUrl.Substring(0, currentPageUrl.Length - "ExtractPdfImages".Length);

            model.PdfFileUrl = rootUrl + "/DemoFiles/PdfProcessor/PDF_Document.pdf";

            return model;
        }
    }
}
