ASP.NET - Convert PDF to TXT (Plain-Text) or HTML in C# with iTextSharp

Ryan

7 years ago

Today I had to find a quick way to programmatically convert a bunch of PDF files into txt / text / plain-text format within an ASP.NET web application. Unfortunately, there aren't many open-source libraries that can do that.

After some time struggling with Google, I stumbled upon an old friend of mine - iTextSharp, a great PDF management library for ASP.NET that I used a while ago to fulfill a rather different task involving PDF parsing. By reading the updated SourceForge page I learned that the (once) open-source code has evolved into a commercial product called iText, available for Java and, through a port of the Java code which is still called iTextSharp, for .NET. Luckily enough, iText also offers a Community Edition released under the AGPL licence.

Long story short, I installed iTextSharp 5.5.13 from NuGet and used it to pull off this simple helper class that extracts the text from any PDF file:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDF
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public static class PDFParser
    {
        /// <summary>
        /// Extracts the text from a PDF file.
        /// </summary>
        /// <param name="filePath">The full path to the PDF file.</param>
        /// <returns>The extracted text.</returns>
        public static string GetText(string filePath)
        {
            var sb = new StringBuilder();
            // Disposing the reader also closes it, so no explicit Close() call is needed.
            // Exceptions are left to propagate: catching and re-throwing them with
            // "throw e" would only reset their stack trace.
            using (PdfReader reader = new PdfReader(filePath))
            {
                string prevPage = "";
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Use a fresh strategy for each page, since a strategy accumulates the text it receives.
                    ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                    string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    // Skip a page whose text is identical to the previous page's text.
                    if (prevPage != s) sb.Append(s);
                    prevPage = s;
                }
            }
            return sb.ToString();
        }
    }
}
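
Since the original goal was to convert a whole bunch of PDF files, here is a minimal sketch of how the helper might be called in a loop. The PdfBatchConverter class, the ConvertFolder method and the extractText parameter are hypothetical names introduced only for this example; the idea is to pass PDFParser.GetText as the delegate, which also keeps the loop independent of iTextSharp.

```csharp
using System;
using System.IO;

public static class PdfBatchConverter
{
    // Hypothetical helper: writes a .txt file next to each .pdf file found in sourceFolder.
    // extractText is any function that turns a PDF path into text (for example PDFParser.GetText).
    // Returns the number of files converted.
    public static int ConvertFolder(string sourceFolder, Func<string, string> extractText)
    {
        int converted = 0;
        foreach (string pdfPath in Directory.GetFiles(sourceFolder, "*.pdf"))
        {
            // Same folder and file name, with the extension swapped to .txt.
            string txtPath = Path.ChangeExtension(pdfPath, ".txt");
            File.WriteAllText(txtPath, extractText(pdfPath));
            converted++;
        }
        return converted;
    }
}
```

Assuming the PDF files live in a folder such as C:\Temp\Pdfs (a made-up path), the call would look like PdfBatchConverter.ConvertFolder(@"C:\Temp\Pdfs", PDFParser.GetText). An exception thrown for one file stops the whole batch; whether to catch it per file and carry on is a choice that depends on the application.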

Needless to say, once we have extracted the plain text we can easily wrap it in some HTML markup, for example by turning each line into a paragraph:

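public static string GetHTMLText(string sourceFilePath)
{
    var txt = PDFParser.GetText(sourceFilePath);
    var sb = new StringBuilder();
    foreach (string s in txt.Split('\n')) {
        // Encode characters such as < and & so that the extracted text cannot break the markup.
        sb.AppendFormat("<p>{0}</p>", System.Net.WebUtility.HtmlEncode(s));
    }
    return sb.ToString();
}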
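
The method above returns a run of paragraphs rather than a complete page. If a standalone HTML file is needed, the fragment could be wrapped along these lines. This is only a sketch: HtmlPageBuilder and BuildPage are hypothetical names, and the body markup is inserted as-is, so the caller remains responsible for how it was produced and encoded.

```csharp
using System.Net;
using System.Text;

public static class HtmlPageBuilder
{
    // Hypothetical helper: wraps a fragment of body markup in a minimal HTML5 page.
    // The title is HTML-encoded here; bodyHtml is inserted unchanged.
    public static string BuildPage(string title, string bodyHtml)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!DOCTYPE html>");
        sb.AppendLine("<html>");
        sb.AppendLine("<head>");
        sb.AppendLine("<meta charset=\"utf-8\">");
        sb.AppendLine("<title>" + WebUtility.HtmlEncode(title) + "</title>");
        sb.AppendLine("</head>");
        sb.AppendLine("<body>");
        sb.AppendLine(bodyHtml);
        sb.AppendLine("</body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }
}
```

With made-up file names, saving the result could be as short as File.WriteAllText("report.html", HtmlPageBuilder.BuildPage("Report", GetHTMLText("report.pdf"))).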

That's about it: I sincerely hope that this simple class will help those who are looking for an easy way to convert PDF into plain-text or HTML.
