If you often work with PDF files, you've probably already heard of Syncfusion Essential Studio, a .NET-based software product providing solutions to most of the complex problems faced during application development. Among its many components, the product includes a neat PDF framework: a feature-rich .NET PDF class library written in 100% managed C# code that can be used to create, read, and write PDF files in Windows Forms, WPF, ASP.NET Web Forms, ASP.NET MVC, ASP.NET Core, Blazor, UWP, Xamarin, and Flutter applications, as well as on the Unity platform, without any external dependency (no Adobe Acrobat required).
I recently used Syncfusion's PDF framework to solve a task I was given a few days ago: determine whether a bunch of PDF files contain images, text, or both. Needless to say, there were a lot of files (more than 500K), so this couldn't be done manually. In this post, I'll share the source code I used to deal with this issue.
Getting the Packages
Let's start with the NuGet packages I have used to fulfill the job.
The first package contains the base classes to handle PDF files, while the second contains specific modules and extension methods to deal with PDF-embedded images.
Creating the App
The next thing I did was create a simple .NET 6 console application using Visual Studio 2022's standard C# Console App template, as shown in the screenshot below.
I chose a console app since I didn't need a user interface to do the job; however, the core part of the source code you will find in this post can also be used within any ASP.NET Core web app, as well as a WPF app, a Web Forms app, and so on.
Source Code
Without further ado, here's the source code:
    using Syncfusion.Pdf;
    using Syncfusion.Pdf.Exporting;
    using Syncfusion.Pdf.Parsing;

    // TODO: insert your license key here
    Syncfusion.Licensing.SyncfusionLicenseProvider.RegisterLicense("<LICENSE-KEY>");

    // TODO: implement GetFilePathList() to retrieve the list of paths for the PDF files
    List<string> filePathList = GetFilePathList();

    var allowedPdfExtensions = new[] { "pdf" };

    foreach (var path in filePathList)
    {
        var pdfType = "undefined";

        // Skip empty paths and files that no longer exist
        if (string.IsNullOrEmpty(path) || !File.Exists(path))
            continue;

        try
        {
            var ext = Path.GetExtension(path);
            if (!string.IsNullOrEmpty(ext)
                && allowedPdfExtensions.Contains(ext.TrimStart('.'), StringComparer.InvariantCultureIgnoreCase))
            {
                using var stream = new FileStream(path, FileMode.Open, FileAccess.Read);
                var loadedDocument = new PdfLoadedDocument(stream);
                int charCount = 0;
                int imageCount = 0;

                try
                {
                    var loadedPages = loadedDocument.Pages;
                    foreach (PdfLoadedPage loadedPage in loadedPages)
                    {
                        // Count the characters of the text found on the page
                        var txt = loadedPage.ExtractText();
                        if (!string.IsNullOrEmpty(txt))
                            charCount += txt.Length;

                        // Count the images embedded in the page
                        var imagesInfo = loadedPage.GetImagesInfo();
                        if (imagesInfo != null)
                            imageCount += imagesInfo.Length;
                    }
                }
                finally
                {
                    // Close the document even if a page fails to parse
                    loadedDocument.Close(true);
                }

                if (charCount > 0 && imageCount == 0)
                    pdfType = "text only";
                else if (charCount == 0 && imageCount > 0)
                    pdfType = "images only";
                else if (charCount > 0 && imageCount > 0)
                    pdfType = "text and images";
                else
                    pdfType = "no text and no images";
            }
            else
                pdfType = "non-PDF file";
        }
        catch (Exception e)
        {
            pdfType = $"error: {e.Message}";
        }

        // TODO: do something with the [pdfType] variable
        DoSomething(pdfType);
    }

    // Placeholders: replace these with your own implementations
    static List<string> GetFilePathList() => throw new NotImplementedException("TODO");
    static void DoSomething(string pdfType) => throw new NotImplementedException("TODO");
As we can see, the code is quite simple to understand. Here's what we are doing in a nutshell:
- Retrieve a list of the PDF file paths.
- Cycle through each one of them.
- Use Syncfusion's PDF framework to count the characters and the images found on each PDF page.
- Use the above counters to determine if the PDF contains text and/or images (or neither).
The overall outcome is stored in the pdfType local variable: in the example above I used a string, but you could replace it with an enum, a const value, or anything else you might want to use instead.
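As a minimal sketch of the enum option, the classification step could be pulled into a small helper like the one below. The PdfContentType enum and the PdfContentClassifier class are hypothetical names introduced here for illustration (they are not part of Syncfusion's API), and the logic simply mirrors the if/else chain of the sample above.

```csharp
// Hypothetical enum replacing the string values used in the sample above.
public enum PdfContentType
{
    NoTextAndNoImages,
    TextOnly,
    ImagesOnly,
    TextAndImages
}

// Hypothetical helper, not part of Syncfusion's API.
public static class PdfContentClassifier
{
    // Maps the two per-document counters to a PdfContentType value.
    public static PdfContentType Classify(int charCount, int imageCount)
    {
        bool hasText = charCount > 0;
        bool hasImages = imageCount > 0;

        return (hasText, hasImages) switch
        {
            (true, false) => PdfContentType.TextOnly,
            (false, true) => PdfContentType.ImagesOnly,
            (true, true) => PdfContentType.TextAndImages,
            _ => PdfContentType.NoTextAndNoImages
        };
    }
}
```

With something like this in place, the if/else chain in the loop would reduce to a single call such as PdfContentClassifier.Classify(charCount, imageCount). The "non-PDF file" and error outcomes are not covered by this sketch: they would need their own enum members or separate handling.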
As you can see, I have placed some TODO comments to highlight the source code lines where you need to add your own stuff, such as: adding the Syncfusion license key, retrieving the list of PDF file paths, doing something with pdfType once it has been determined, and so on.
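To give an idea of how two of those placeholders might be filled in, here is a rough sketch that assumes the PDF files live under a single root folder and that a simple per-type tally is enough as a result. The PdfFileLocator and PdfTypeTally classes are hypothetical names of my own choosing: they only rely on standard .NET 6 APIs, and your own requirements may well call for something different (a database query, a CSV export, and so on).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: one possible way to implement GetFilePathList().
public static class PdfFileLocator
{
    // Returns the paths of all *.pdf files under rootFolder, subfolders included.
    public static List<string> GetFilePathList(string rootFolder)
    {
        var options = new EnumerationOptions
        {
            RecurseSubdirectories = true,              // walk the whole folder tree
            IgnoreInaccessible = true,                 // skip folders we cannot read
            MatchCasing = MatchCasing.CaseInsensitive  // match .pdf and .PDF alike
        };
        return Directory.EnumerateFiles(rootFolder, "*.pdf", options).ToList();
    }
}

// Hypothetical helper: one possible way to implement DoSomething().
public sealed class PdfTypeTally
{
    private readonly Dictionary<string, int> _counts = new();

    // Increments the counter for the given pdfType value.
    public void Add(string pdfType)
    {
        _counts.TryGetValue(pdfType, out var current);
        _counts[pdfType] = current + 1;
    }

    // Read-only view of how many files fell into each type.
    public IReadOnlyDictionary<string, int> Counts => _counts;
}
```

In the main loop, DoSomething(pdfType) would then become a call to Add(pdfType) on a PdfTypeTally instance created before the loop, and the Counts property could be printed once all the files have been processed. Keep in mind that each distinct error message ends up as its own key in the tally.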
Syncfusion License Key
The Syncfusion Essential Studio license key can be purchased from the official Syncfusion website. The product is quite expensive, but there's some great news for you: the company offers a FREE community license for all companies and individuals with less than $1 million USD in annual gross revenue and 5 or fewer developers. That's precisely what I did (since I am poor enough to be eligible!), thus getting the entire product line (worth more than $12K!) at no cost. If you are eligible as well, I strongly suggest you do the same!
Conclusion
That's it, at least for now: I hope that my source code sample will help other .NET developers looking for a way to determine whether a PDF file contains images and/or text!