Convert PDF to Text

The PDF to Text Converter component allows you to extract the text from PDF documents. This component is distributed as part of the HiQPdf.Next.PdfProcessor.Windows NuGet package when targeting Windows and as part of the HiQPdf.Next.PdfProcessor.Linux package when targeting Linux. The package for Windows is referenced by the HiQPdf.Next.Windows meta package and the package for Linux is referenced by the HiQPdf.Next.Linux meta package.

You can set the text layout, mark the page breaks in the extracted text and select the page range of the PDF document to convert to text. Conversion of password-protected PDF documents is also supported, and you can provide both user and owner passwords.

Overview

The HiQPdf.Next.PdfToTextConverter class allows you to load a PDF file and extract the text from it with the specified layout.

Create the PDF to Text Converter

The HiQPdf.Next.PdfToTextConverter class is used to convert PDF documents to text. You can create an instance using the default constructor, which initializes the converter with standard settings. These settings can later be customized through properties such as PdfToTextConverter.TextLayout, PdfToTextConverter.MarkPageBreaks, PdfToTextConverter.UserPassword, PdfToTextConverter.OwnerPassword and others, which control the text extraction process.

Create a PDF to Text Converter Instance
// Create a new PDF to Text converter instance
PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

Note that PdfToTextConverter instances are not reusable. You must create a new instance for each conversion. Reusing an instance after a completed conversion will result in an exception.
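
When several PDF documents are converted in a loop, the rule above means each document needs its own converter. The following is a minimal sketch of that pattern, not part of the library: BatchConversion and ConvertEach are hypothetical names, and the converter is supplied through delegates so the helper itself does not depend on HiQPdf.Next.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of HiQPdf.Next): runs one conversion per document,
// asking the factory for a new converter every time.
public static class BatchConversion
{
    public static List<string> ConvertEach<TConverter>(
        IEnumerable<byte[]> documents,
        Func<TConverter> createConverter,
        Func<TConverter, byte[], string> convert)
    {
        var results = new List<string>();
        foreach (byte[] document in documents)
        {
            // A fresh instance per conversion; instances are never reused
            TConverter converter = createConverter();
            results.Add(convert(converter, document));
        }
        return results;
    }
}
```

With the converter described in this topic, a call could look like BatchConversion.ConvertEach(pdfBuffers, () => new PdfToTextConverter(), (c, pdf) => c.ConvertToText(pdf)), where pdfBuffers is an assumed collection of PDF documents already read into memory.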

Extracted Text Layout

The PdfToTextConverter.TextLayout property controls how the extracted text is arranged. When its value is set to Original, the extracted text preserves the visual layout of the text in the PDF document. When its value is set to Reading, the extracted text preserves the reading order of the text in the PDF document. The default value is Original, which means that the converter will try to preserve the original text layout as much as possible.

Set Extracted Text Layout
pdfToTextConverter.TextLayout = PdfToTextLayout.Original;

Mark Page Breaks

The PdfToTextConverter.MarkPageBreaks property controls whether page breaks are marked by the PdfToTextConverter.PAGE_BREAK_MARK character in the resulting text document. The default is false, which means that page breaks are not marked in the extracted text.

Mark Page Breaks in Extracted Text
pdfToTextConverter.MarkPageBreaks = true;
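
Once page breaks are marked, the extracted text can be split back into per-page strings. The sketch below shows one way to do that. PageSplitter and SplitPages are hypothetical names, and the mark is passed in as a parameter because its actual value is not stated here; it is assumed to be a single char such as PdfToTextConverter.PAGE_BREAK_MARK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of HiQPdf.Next): splits extracted text into pages.
public static class PageSplitter
{
    // pageBreakMark is assumed to be the character the converter wrote at each page break.
    // The mark itself is not kept in the returned page strings.
    public static IReadOnlyList<string> SplitPages(string extractedText, char pageBreakMark)
    {
        if (extractedText == null)
            throw new ArgumentNullException(nameof(extractedText));

        return extractedText.Split(pageBreakMark);
    }
}
```

A call such as PageSplitter.SplitPages(extractedText, PdfToTextConverter.PAGE_BREAK_MARK) would then return one entry per page. If the converter also writes a mark after the last page, the final entry is an empty string and can be dropped.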

Open Password Protected PDFs

If the PDF document you convert to text is password protected, you have to specify the user or owner password used to decrypt the PDF document before extracting the text. You can set the user password in the PdfToTextConverter.UserPassword property and the owner password in the PdfToTextConverter.OwnerPassword property.

Set User and Owner Passwords
pdfToTextConverter.UserPassword = userPasswordString;
pdfToTextConverter.OwnerPassword = ownerPasswordString;

Extracted Page Range Limit

The PdfToTextConverter.MaxPageCount property controls the upper limit for the number of PDF pages to process. The PDF page range to convert to text can be set in the conversion methods. The default is 0, which means there is no upper limit.

Unlimited Extracted PDF Page Range
pdfToTextConverter.MaxPageCount = 0;

Limit the Extracted PDF Page Range to 10 Pages
pdfToTextConverter.MaxPageCount = 10;

Convert PDF to Text

To convert all pages in a PDF document from a memory buffer to a string, use the PdfToTextConverter.ConvertToText(Byte[]) method. The parameter is the PDF document read into a memory buffer.

Convert All Pages in a PDF from Memory to a String
string extractedText = pdfToTextConverter.ConvertToText(inputPdfBytes);

To convert a PDF document from a memory buffer to a string starting at the given page number through the end of the document, use the PdfToTextConverter.ConvertToText(Byte[], Int32) method. The first parameter is the PDF document read into a memory buffer and the second parameter is the 1-based start page number.

Convert Pages in a PDF from Memory to a String Starting at a Given Page Number
string extractedText = pdfToTextConverter.ConvertToText(inputPdfBytes, startPageNumber);

To convert a PDF document from a memory buffer to a string starting at the given page number up to the end page number inclusive, use the PdfToTextConverter.ConvertToText(Byte[], Int32, Int32) method. The first parameter is the PDF document read into a memory buffer and the second and third parameters are the 1-based start and end page numbers. If the end page number is 0, the conversion continues to the end of the document.

Convert a Range of Pages in a PDF from Memory to a String
string extractedText = pdfToTextConverter.ConvertToText(inputPdfBytes, startPageNumber, endPageNumber);
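
The page-range convention described above (1-based start page, inclusive end page, 0 meaning "through the last page") can be illustrated with a small helper. This is only a sketch of the convention. PageRange and Resolve are hypothetical names, the total page count is assumed to be known by the caller, and rejecting out-of-range values is a choice made here, since the converter's own handling of such values is not described.

```csharp
using System;

// Hypothetical helper (not part of HiQPdf.Next): resolves a requested page range
// into concrete, inclusive, 1-based first and last page numbers.
public static class PageRange
{
    public static (int First, int Last) Resolve(int startPageNumber, int endPageNumber, int totalPages)
    {
        if (totalPages < 1)
            throw new ArgumentOutOfRangeException(nameof(totalPages));
        if (startPageNumber < 1 || startPageNumber > totalPages)
            throw new ArgumentOutOfRangeException(nameof(startPageNumber));

        // An end page number of 0 means "continue to the end of the document"
        int last = endPageNumber == 0 ? totalPages : endPageNumber;

        // Assumption of this sketch: ranges outside the document are rejected
        if (last < startPageNumber || last > totalPages)
            throw new ArgumentOutOfRangeException(nameof(endPageNumber));

        return (startPageNumber, last);
    }
}
```

For a 10-page document, Resolve(3, 0, 10) yields pages 3 through 10 and Resolve(3, 5, 10) yields pages 3 through 5.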

There are also similar methods to convert a PDF to text that accept a PDF stream or a PDF file path.

Convert All Pages in a PDF from a Stream or File to a String
// From a stream
string extractedText = pdfToTextConverter.ConvertToText(inputPdfStream);
// Alternatively, from a file path
string extractedText = pdfToTextConverter.ConvertToText(inputPdfFile);

Convert Pages in a PDF from a Stream or File to a String Starting at a Given Page Number
// From a stream
string extractedText = pdfToTextConverter.ConvertToText(inputPdfStream, startPageNumber);
// Alternatively, from a file path
string extractedText = pdfToTextConverter.ConvertToText(inputPdfFile, startPageNumber);

Convert a Range of Pages in a PDF from a Stream or File to a String
// From a stream
string extractedText = pdfToTextConverter.ConvertToText(inputPdfStream, startPageNumber, endPageNumber);
// Alternatively, from a file path
string extractedText = pdfToTextConverter.ConvertToText(inputPdfFile, startPageNumber, endPageNumber);

Asynchronous PDF to Text Methods

There are also asynchronous variants of these methods that follow the Task-based Asynchronous Pattern (TAP) in .NET, allowing PDF to Text conversion to run asynchronously using async and await. These methods share the same names as their synchronous counterparts with the "Async" suffix added. They also accept an optional System.Threading.CancellationToken parameter that can be used to cancel the conversion operation where applicable.

To asynchronously convert all pages in a PDF document from a memory buffer to a string, use the PdfToTextConverter.ConvertToTextAsync(Byte[], CancellationToken) method. The first parameter is the PDF document read into a memory buffer and the second is the optional cancellation token.

Asynchronously Convert All Pages in a PDF from Memory to a String
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfBytes);

To asynchronously convert a PDF document from a memory buffer to a string starting at a given page number through the end of the document, use the PdfToTextConverter.ConvertToTextAsync(Byte[], Int32, CancellationToken) method. The first parameter is the PDF document read into a memory buffer and the second parameter is the 1-based start page number.

Asynchronously Convert Pages in a PDF from Memory to a String Starting at a Given Page Number
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfBytes, startPageNumber);

To asynchronously convert a PDF document from a memory buffer to a string starting at the given page number up to the end page number inclusive, use the PdfToTextConverter.ConvertToTextAsync(Byte[], Int32, Int32, CancellationToken) method. The first parameter is the PDF document read into a memory buffer and the second and third parameters are the 1-based start and end page numbers. If the end page number is 0, the conversion continues to the end of the document.

Asynchronously Convert a Range of Pages in a PDF from Memory to a String
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfBytes, startPageNumber, endPageNumber);

There are also similar methods to convert a PDF to text that accept a PDF stream or a PDF file path.

Asynchronously Convert All Pages in a PDF from a Stream or File to a String
// From a stream
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfStream);
// Alternatively, from a file path
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfFile);

Asynchronously Convert Pages in a PDF from a Stream or File to a String Starting at a Given Page Number
// From a stream
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfStream, startPageNumber);
// Alternatively, from a file path
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfFile, startPageNumber);

Asynchronously Convert a Range of Pages in a PDF from a Stream or File to a String
// From a stream
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfStream, startPageNumber, endPageNumber);
// Alternatively, from a file path
string extractedText = await pdfToTextConverter.ConvertToTextAsync(inputPdfFile, startPageNumber, endPageNumber);
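
The optional cancellation token is typically supplied by a System.Threading.CancellationTokenSource. The sketch below shows one possible way to bound a conversion with a timeout. ConversionTimeout and RunAsync are hypothetical names, and the conversion is passed in as a delegate, assumed to return Task<string>, so the helper itself does not depend on HiQPdf.Next.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of HiQPdf.Next): runs an async conversion
// with a token that is cancelled automatically after the given timeout.
public static class ConversionTimeout
{
    public static async Task<string> RunAsync(
        Func<CancellationToken, Task<string>> convert,
        TimeSpan timeout)
    {
        // The token source requests cancellation once the timeout elapses
        using var cts = new CancellationTokenSource(timeout);

        // An OperationCanceledException propagates to the caller
        // if the conversion observes the cancellation request
        return await convert(cts.Token);
    }
}
```

With the converter described in this topic, a call could look like await ConversionTimeout.RunAsync(token => pdfToTextConverter.ConvertToTextAsync(inputPdfBytes, token), TimeSpan.FromSeconds(30)). Whether and how quickly a running conversion stops depends on where the converter checks the token.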

Conversion Info

The PdfToTextConverter.ConversionInfo property exposes an object of the HiQPdf.Next.PdfToTextConversionInfo type, which is populated after the conversion completes successfully with information about the conversion process, such as the number of pages converted.

Get the Number of PDF Pages Converted to Text
int numberOfPagesConverted = pdfToTextConverter.ConversionInfo.PageCount;

Code Sample - Convert PDF to Text

Convert PDF to Text in ASP.NET Core
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using System.ComponentModel.DataAnnotations;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using HiQPdf_Next_AspNetDemo.Models;

using HiQPdf.Next;

namespace HiQPdf_Next_AspNetDemo.Controllers
{
    public class PdfToTextController : Controller
    {
        private readonly IWebHostEnvironment m_hostingEnvironment;
        public PdfToTextController(IWebHostEnvironment hostingEnvironment)
        {
            m_hostingEnvironment = hostingEnvironment;
        }

        public IActionResult Index()
        {
            var model = SetViewModel();

            return View(model);
        }

        [HttpPost]
        public async Task<IActionResult> ConvertPdfToText(PdfToTextViewModel model)
        {
            if (!ModelState.IsValid)
            {
                var errorMessage = ModelStateHelper.GetModelErrors(ModelState);
                throw new ValidationException(errorMessage);
            }

            // Replace the demo serial number with the serial number received upon purchase
            // to run the converter in licensed mode
            Licensing.SerialNumber = "YCgJMTAE-BiwJAhIB-EhlWTlBA-UEBRQFBA-U1FOUVJO-WVlZWQ==";

            // Create the PDF to Text converter instance with default options
            PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

            // Optionally set the user password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.UserPassword))
                pdfToTextConverter.UserPassword = model.UserPassword;

            // Optionally set the owner password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.OwnerPassword))
                pdfToTextConverter.OwnerPassword = model.OwnerPassword;

            // Configure the output text layout
            pdfToTextConverter.TextLayout = model.TextLayout == "Original" ? PdfToTextLayout.Original : PdfToTextLayout.Reading;

            // Mark PDF page breaks with the PdfToTextConverter.PAGE_BREAK_MARK special character
            pdfToTextConverter.MarkPageBreaks = model.MarkPageBreaks;

            // PDF page number to start text extraction from
            int startPageNumber = model.StartPageNumber;

            // PDF page number to end text extraction at
            // If 0, extraction continues to the end of the document
            int endPageNumber = 0;
            if (model.EndPageNumber.HasValue)
                endPageNumber = model.EndPageNumber.Value;

            byte[] inputPdfBytes = null;
            string outputFileName = null;

            // If an uploaded file exists, use it with priority
            if (model.PdfFile != null && model.PdfFile.Length > 0)
            {
                try
                {
                    using var ms = new MemoryStream();
                    await model.PdfFile.CopyToAsync(ms);
                    inputPdfBytes = ms.ToArray();
                }
                catch (Exception ex)
                {
                    throw new Exception("Failed to read the uploaded PDF file", ex);
                }

                outputFileName = Path.GetFileNameWithoutExtension(model.PdfFile.FileName) + ".txt";
            }
            else
            {
                // Otherwise, fall back to the URL
                string pdfUrl = model.PdfFileUrl?.Trim();
                if (string.IsNullOrWhiteSpace(pdfUrl))
                    throw new Exception("No PDF file provided: upload a file or specify a URL");

                try
                {
                    if (pdfUrl.StartsWith("file://", StringComparison.OrdinalIgnoreCase))
                    {
                        string localPath = new Uri(pdfUrl).LocalPath;
                        inputPdfBytes = await System.IO.File.ReadAllBytesAsync(localPath);
                    }
                    else
                    {
                        using var httpClient = new System.Net.Http.HttpClient();
                        inputPdfBytes = await httpClient.GetByteArrayAsync(pdfUrl);
                    }
                }
                catch (Exception ex)
                {
                    throw new Exception("Could not download the PDF file from URL", ex);
                }

                outputFileName = Path.GetFileNameWithoutExtension(model.PdfFileUrl) + ".txt";
            }

            // Extract text from the specified PDF page range
            string extractedText = pdfToTextConverter.ConvertToText(inputPdfBytes, startPageNumber, endPageNumber);

            // Encode the extracted text as UTF-8 bytes
            byte[] outputTextBytes = Encoding.UTF8.GetBytes(extractedText);

            // Return the text as a downloadable file
            return File(outputTextBytes, "text/plain; charset=utf-8", outputFileName);
        }

        private PdfToTextViewModel SetViewModel()
        {
            var model = new PdfToTextViewModel();

            HttpRequest request = ControllerContext.HttpContext.Request;
            UriBuilder uriBuilder = new UriBuilder();
            uriBuilder.Scheme = request.Scheme;
            uriBuilder.Host = request.Host.Host;
            if (request.Host.Port != null)
                uriBuilder.Port = (int)request.Host.Port;
            uriBuilder.Path = request.PathBase.ToString() + request.Path.ToString();
            uriBuilder.Query = request.QueryString.ToString();

            string currentPageUrl = uriBuilder.Uri.AbsoluteUri;
            string rootUrl = currentPageUrl.Substring(0, currentPageUrl.Length - "PdfToText".Length);

            model.PdfFileUrl = rootUrl + "/DemoFiles/PdfProcessor/PDF_Document.pdf";

            return model;
        }
    }
}
