C# - Find whether a PDF contains images or text

Table of Contents

Getting the Packages
Creating the App
Source Code
Syncfusion License Key
Conclusion

If you often work with PDF files, you've probably already heard of Syncfusion Essential Studio, a .NET-based software product that provides solutions to most of the complex problems faced during application development. Among its many components is a neat PDF framework: a feature-rich .NET PDF class library, written in 100% managed C# code, that can be used to create, read, and write PDF files in Windows Forms, WPF, ASP.NET Web Forms, ASP.NET MVC, ASP.NET Core, Blazor, UWP, Xamarin, Flutter, and Unity applications without any external dependency (no Adobe Acrobat required).

I recently used Syncfusion's PDF framework to solve a task I was given a few days ago: determine whether a batch of PDF files contained images, text, or both. Needless to say, there were a lot of files (more than 500K), so this couldn't be done manually. In this post, I'll share the source code I used to deal with this issue.

Getting the Packages

Let's start with the NuGet packages I have used to fulfill the job.

Syncfusion.Pdf.Net.Core
Syncfusion.Pdf.Imaging.Net.Core

The first package contains the base classes for handling PDF files, while the second contains the modules and extension methods that deal with PDF-embedded images.

For both packages I used version 20.4.0.44, which was the latest at the time of writing (and fully compatible with .NET 6 and .NET 7): feel free to update it to a newer version!

Creating the App

The next thing I did was create a simple .NET 6 console application using Visual Studio 2022's standard C# Console App template.

I chose a console app since I didn't need a user interface to do the job. However, the core part of the source code in this post can also be used within any ASP.NET Core web app, WPF app, Web Forms app, and so on.

Source Code

Without further ado, here's the source code:

using Syncfusion.Pdf;
using Syncfusion.Pdf.Exporting;
using Syncfusion.Pdf.Parsing;

Syncfusion.Licensing.SyncfusionLicenseProvider
    .RegisterLicense("<LICENSE-KEY>"); // TODO: Insert your license key here

// TODO: implement this method to retrieve a list of paths for the PDF files.
List<string> filePathList = GetFilePathList();

var allowedPdfExtensions = new[] { "pdf" };

foreach (var path in filePathList)
{
    var pdfType = "undefined";

    if (string.IsNullOrEmpty(path) || !File.Exists(path))
        continue;

    try
    {
        var ext = Path.GetExtension(path);
        if (!string.IsNullOrEmpty(ext)
            && allowedPdfExtensions.Contains(ext.TrimStart('.'),
                StringComparer.InvariantCultureIgnoreCase))
        {
            using var stream = new FileStream(path, FileMode.Open, FileAccess.Read);
            var loadedDocument = new PdfLoadedDocument(stream);
            int charCount = 0;
            int imageCount = 0;
            try
            {
                foreach (PdfLoadedPage loadedPage in loadedDocument.Pages)
                {
                    // Count the characters of the text extracted from the page.
                    var txt = loadedPage.ExtractText();
                    if (!string.IsNullOrEmpty(txt))
                        charCount += txt.Length;

                    // Count the images embedded in the page.
                    var imagesInfo = loadedPage.GetImagesInfo();
                    if (imagesInfo != null)
                        imageCount += imagesInfo.Length;
                }
            }
            finally
            {
                // Release the document's resources even if a page fails to parse.
                loadedDocument.Close(true);
            }

            if (charCount > 0 && imageCount == 0)
                pdfType = "text only";
            else if (charCount == 0 && imageCount > 0)
                pdfType = "images only";
            else if (charCount > 0 && imageCount > 0)
                pdfType = "text and images";
            else
                pdfType = "no text and no images";
        }
        else
            pdfType = "non-PDF file";
    }
    catch (Exception e)
    {
        pdfType = $"error: {e.Message}";
    }

    DoSomething(pdfType); // TODO: do something with the [pdfType] variable
}

static List<string> GetFilePathList()
    => throw new NotImplementedException("TODO");
static void DoSomething(string pdfType)
    => throw new NotImplementedException("TODO");

IMPORTANT: for the latest version of the source code, check out the PDFInspector project page on GitHub.

As we can see, the code is quite simple to understand. Here's what we are doing in a nutshell:

Retrieve a list of the PDF file paths.
Cycle through each one of them.
Use Syncfusion PDF to count the extracted characters and the embedded images on each PDF page.
Use the above counters to determine if the PDF contains text, images, both, or neither.

The overall outcome is stored in the pdfType local variable. In the example above I used a string, but you could replace it with an enum, a const value, or anything else you might want to use instead.
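As an illustration of the enum alternative, here is a minimal sketch that maps the two counters to an enum value. The names PdfContentType, PdfContentClassifier, and Classify are hypothetical: they are not part of the sample above or of the Syncfusion API, and the enum members simply mirror the four string outcomes used in the loop.

```csharp
using System;

// Hypothetical enum mirroring the four string outcomes of the sample.
public enum PdfContentType
{
    Empty,          // no text and no images
    TextOnly,
    ImagesOnly,
    TextAndImages
}

// Hypothetical helper: not part of the original sample or of Syncfusion.
public static class PdfContentClassifier
{
    // Maps the per-document character and image counters to an enum value.
    public static PdfContentType Classify(int charCount, int imageCount)
    {
        if (charCount < 0) throw new ArgumentOutOfRangeException(nameof(charCount));
        if (imageCount < 0) throw new ArgumentOutOfRangeException(nameof(imageCount));

        return (charCount > 0, imageCount > 0) switch
        {
            (true, false) => PdfContentType.TextOnly,
            (false, true) => PdfContentType.ImagesOnly,
            (true, true)  => PdfContentType.TextAndImages,
            _             => PdfContentType.Empty
        };
    }
}
```

With something like this in place, the if/else chain in the loop could shrink to a single Classify(charCount, imageCount) call, while the "non-PDF file" and error cases would still need to be handled separately (for example with extra enum members).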

As you can see, I have placed some TODO comments to highlight the source code lines where you need to add your own stuff: adding the Syncfusion license key, retrieving the list of PDF file paths, doing something once the pdfType has been determined, and so on.
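For the file list TODO, one possible approach is sketched below. It assumes that all the files live under a single root folder, which the original sample does not state, so the rootFolder parameter and the PdfFileFinder class name are hypothetical additions; only standard System.IO calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: one way to fill in the GetFilePathList placeholder.
public static class PdfFileFinder
{
    // Returns the paths of all files with a .pdf extension found under
    // rootFolder, subfolders included. Returns an empty list if the
    // folder does not exist.
    public static List<string> GetFilePathList(string rootFolder)
    {
        if (!Directory.Exists(rootFolder))
            return new List<string>();

        return Directory
            .EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories)
            // Compare the extension explicitly so that "report.PDF" is
            // matched too, whatever the file system's case rules are.
            .Where(p => string.Equals(Path.GetExtension(p), ".pdf",
                StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

A call such as PdfFileFinder.GetFilePathList(@"C:\Docs") (the path is just an example) would then replace the parameterless GetFilePathList() in the sample. Note that a recursive enumeration like this may throw if a subfolder is not accessible, so a real job over hundreds of thousands of files would probably want some error handling around it.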

Syncfusion License Key

The Syncfusion Essential Studio license key can be purchased from the official Syncfusion website. The product is quite expensive, but there's some great news for you: the company offers a FREE community license to all companies and individuals with less than $1 million USD in annual gross revenue and 5 or fewer developers. That's precisely what I applied for (since I am poor enough to be eligible!), thus getting the entire product line (worth more than $12K!) at no cost. If you are eligible as well, I strongly suggest you do the same!

Conclusion

That's it, at least for now: I hope that my source code sample will help other .NET developers looking for a way to determine whether a PDF file contains images and/or text!
