How to strip invalid characters from a UTF-8 XML file or string in PHP

Yesterday I wrote about stripping the P7M data out of an XML P7M file or string, as long as it was encoded using the CAdES format. The code was quite ugly, yet it does the job for the most part, which is stripping the header & footer signature info.

Today I will raise the ugly-but-working bar even further by publishing the method I wrote as a follow-up, which strips all the invalid characters from the resulting XML string so that it can be loaded into a SimpleXML object:

/**
 * Removes invalid characters from a UTF-8 XML string
 *
 * @access public
 * @param string $string an XML string potentially containing invalid characters
 * @return string
 */
function sanitizeXML($string)
{
    if (!empty($string))
    {
        // remove EOT+NOREP+EOX|EOT+<char> sequence (FatturaPA)
        // with the /u modifier preg_replace() returns null when $string is not valid UTF-8:
        // in that case keep the string as it is and let the next steps clean it up
        $stripped = preg_replace('/(\x{0004}(?:\x{201A}|\x{FFFD})(?:\x{0003}|\x{0004}).)/u', '', $string);
        if ($stripped !== null)
        {
            $string = $stripped;
        }

        // remove the byte sequences that are not valid UTF-8 (byte mode: no /u modifier)
        $regex = '/(
            [\xC0-\xC1] # Invalid UTF-8 Bytes
            | [\xF5-\xFF] # Invalid UTF-8 Bytes
            | \xE0[\x80-\x9F] # Overlong encoding of prior code point
            | \xF0[\x80-\x8F] # Overlong encoding of prior code point
            | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
            | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
            | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
            | (?<=[\x0-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
            | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
            | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
            | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
            | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
        )/x';
        $string = preg_replace($regex, '', $string);

        // remove the characters that are not allowed in XML 1.0
        // note: ord() reads one byte at a time, so $current never exceeds 0xFF and this loop
        // actually filters out only the control characters below 0x20 (tab, LF and CR are kept)
        $result = "";
        $length = strlen($string);
        for ($i = 0; $i < $length; $i++)
        {
            $current = ord($string[$i]);    // $string{$i} is no longer valid syntax in PHP 8
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF)))
            {
                $result .= chr($current);
            }
            else
            {
                // nothing to do: the invalid character is stripped
                // $result .= " ";    // use this to replace it with a space instead
            }
        }
        $string = $result;
    }
    return $string;
}
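
To see why this cleanup matters for SimpleXML, here is a minimal sketch. It assumes the SimpleXML and libxml extensions are enabled; the sample invoice string is made up, and the `str_replace()` call is only a stand-in for `sanitizeXML()` so that the snippet runs on its own.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings
libxml_use_internal_errors(true);

// Made-up sample: a stray EOT byte (0x04) inside an otherwise fine document
$dirty = "<invoice><total>10\x04.00</total></invoice>";

$before = simplexml_load_string($dirty);
var_dump($before);   // bool(false): the parser rejects the control character

// Stand-in for sanitizeXML(): here only the EOT byte is dropped by hand
$clean = str_replace("\x04", '', $dirty);

$after = simplexml_load_string($clean);
echo (string) $after->total, PHP_EOL;   // 10.00
```

In a real script the `str_replace()` line would be replaced by a call to `sanitizeXML($dirty)`.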

This is nothing more than a mix of two methods I found here and here on StackOverflow, so the credit goes to the respective authors (whom I thank). I needed both because I had to deal with invalid UTF-8 characters as well as invalid XML characters: as you can see, the method uses a regular expression, shortly followed by an iterative, char-by-char approach.
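
The difference between the two kinds of problem can be shown with a small sketch. The helper names `isValidUtf8()` and `hasInvalidXmlChar()` are hypothetical and not part of the method above; both rely on the fact that `preg_match()` with the `/u` modifier fails on malformed UTF-8, and the character class is the XML 1.0 range already used in the loop.

```php
<?php
// Hypothetical helpers, for illustration only

/** True when $s is well-formed UTF-8 (with /u, preg_match() returns false on malformed input). */
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

/** True when a valid UTF-8 string contains a code point that XML 1.0 does not allow. */
function hasInvalidXmlChar(string $s): bool
{
    return preg_match('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

$badUtf8 = "caf\xE9";     // ISO-8859-1 "é": a lone 0xE9 byte is not valid UTF-8
$badXml  = "total\x04";   // EOT is valid UTF-8, but it is not a valid XML character

var_dump(isValidUtf8($badUtf8));        // bool(false)
var_dump(isValidUtf8($badXml));         // bool(true)
var_dump(hasInvalidXmlChar($badXml));   // bool(true)
```

The first string is the kind of input the regular expression deals with, the second is what the char-by-char loop removes.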

As I said before, it's rather ugly and highly inefficient, possibly even more so than the previous one... however, it gets the job done, and since I had to complete the task in a ridiculously short amount of time, that's the best I've come up with. If someone wants to come up with something better, they're VERY welcome... I'll gladly accept their suggestions. Until then, I hope this will help other PHP "double-clawed" developers achieve decent results as well.
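
As a possible starting point for "something better", the char-by-char pass could be swapped for a single code-point-aware regular expression. This is only a sketch under a few assumptions: `stripInvalidXmlChars()` is a hypothetical name, the input must already be valid UTF-8 (otherwise `preg_replace()` returns null with the `/u` modifier), and I have not measured whether it is actually faster.

```php
<?php
/**
 * Hypothetical replacement for the char-by-char loop: keeps only the code points
 * allowed by XML 1.0. Unlike the byte-based loop, it also sees code points above 0xFF.
 */
function stripInvalidXmlChars(string $utf8): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );

    // With /u, preg_replace() returns null when the subject is not valid UTF-8
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

echo stripInvalidXmlChars("a\x04b\x00c\tè"), PHP_EOL;   // control characters removed, tab and "è" kept
```

Throwing on malformed input is a design choice for the sketch: it makes clear that the invalid UTF-8 bytes have to be removed first, for example with the regular expression used in `sanitizeXML()`.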

... It definitely seems like the PHP hammer has scored yet another hit!

I swear it won't happen again anytime soon... :)


3 Comments on “How to strip invalid characters from a UTF-8 XML file or string in PHP”

Pingback: PHP - How to strip P7M data from a XML.P7M file or string (CAdES)
Christopher Dawkins says:

August 29, 2021 at 19:55

Many thanks for this code.

It saved my life today (Sunday 29th August 2021) when I was transferring three gigabytes of old and potentially dodgy mails (e.g. lots of spam) from mbox format into a MySQL database.

All I had to do was cut’n’paste it into my script and it worked immediately and at the first time of asking.

Pingback: Remove non-utf8 characters from string
