Today I had to find a quick way to programmatically convert a bunch of PDF files into plain-text (.txt) format within an ASP.NET web application. Unfortunately, there aren't many open-source libraries that can do that.
After some time struggling with Google, I stumbled upon an old friend of mine: iTextSharp, a great PDF management library for .NET that I used a while ago to fulfill a rather different task involving PDF parsing. Reading the updated SourceForge page, I learned that the (once) open-source code has evolved into a commercial product called iText, available for Java and, through a port of the Java code that is still called iTextSharp, for .NET. Luckily enough, iText also offers a Community Edition released under the AGPL license.
Long story short, I installed iTextSharp 5.5.13 from NuGet and used it to pull off this simple helper class that extracts the text from any PDF file:
```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDF
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public static class PDFParser
    {
        /// <summary>
        /// Extracts the text from a PDF file.
        /// </summary>
        /// <param name="filePath">The full path to the PDF file.</param>
        /// <returns>The extracted text.</returns>
        public static string GetText(string filePath)
        {
            var sb = new StringBuilder();

            // The using block disposes the reader, so no explicit Close() call is needed.
            // Exceptions are left to propagate to the caller with their original stack trace.
            using (PdfReader reader = new PdfReader(filePath))
            {
                string prevPage = "";
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                    var s = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    // Skip a page whose text is identical to the previous page's text.
                    if (prevPage != s) sb.Append(s);
                    prevPage = s;
                }
            }

            return sb.ToString();
        }
    }
}
```
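Since the original goal was to convert a bunch of PDF files, the helper can be wrapped in a small batch routine. The following is only a minimal sketch: `PdfBatchConverter` and `ConvertFolder` are hypothetical names that are not part of iTextSharp, and the text extractor is passed in as a delegate (for example `PDFParser.GetText`) so the routine itself has no dependency on the library.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfBatchConverter
{
    // Hypothetical helper (not part of iTextSharp): writes one .txt file
    // into targetDir for every .pdf file found directly inside sourceDir.
    // extractText is any function that maps a PDF path to its text,
    // e.g. PDFParser.GetText. Returns the number of files converted.
    public static int ConvertFolder(string sourceDir, string targetDir, Func<string, string> extractText)
    {
        // Creates the output folder if it does not exist yet.
        Directory.CreateDirectory(targetDir);

        int converted = 0;
        foreach (string pdfPath in Directory.EnumerateFiles(sourceDir, "*.pdf"))
        {
            string txtName = Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
            string txtPath = Path.Combine(targetDir, txtName);

            File.WriteAllText(txtPath, extractText(pdfPath), Encoding.UTF8);
            converted++;
        }

        return converted;
    }
}
```

A call such as `PdfBatchConverter.ConvertFolder(sourceFolder, targetFolder, PDFParser.GetText)` would then process a whole folder in one go. Any exception thrown while reading a file stops the loop, so a real application may want to catch and log errors per file instead.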
Needless to say, once we have extracted the plain text, we can easily wrap it in some HTML markup, for example by turning each line into a paragraph:
```csharp
public static string GetHTMLText(string sourceFilePath)
{
    var txt = PDFParser.GetText(sourceFilePath);
    var sb = new StringBuilder();
    foreach (string s in txt.Split('\n'))
    {
        // Encode characters such as < and & so the extracted text cannot break the markup.
        sb.AppendFormat("<p>{0}</p>", System.Net.WebUtility.HtmlEncode(s));
    }
    return sb.ToString();
}
```
That's about it: I sincerely hope that this simple class will help those who're looking for an easy way to convert PDF into plain-text or HTML.