Yesterday I wrote something about stripping out P7M data from an XML P7M file or string, as long as it was encoded using the CAdES format. It was quite ugly, yet it does the job for the most part, which is stripping the header and footer signature info.
Today I will raise the ugly-but-working bar even further by publishing the method I wrote as a follow-up, which basically strips/skips all the invalid characters from the resulting XML string, so it can be cast into a SimpleXML PHP class:
```php
/**
 * Removes invalid characters from a UTF-8 XML string
 *
 * @access public
 * @param string $string an XML string potentially containing invalid characters
 * @return string
 */
function sanitizeXML($string)
{
    if (!empty($string)) {
        // remove EOT+NOREP+EOX|EOT+<char> sequence (FatturaPA)
        // with the /u modifier preg_replace() returns null if the subject is
        // not valid UTF-8, so keep the original string in that case
        $cleaned = preg_replace('/(\x{0004}(?:\x{201A}|\x{FFFD})(?:\x{0003}|\x{0004}).)/u', '', $string);
        if ($cleaned !== null) {
            $string = $cleaned;
        }

        // byte-level pattern (no /u): removes bytes that cannot be part of valid UTF-8
        $regex = '/(
            [\xC0-\xC1] # Invalid UTF-8 Bytes
            | [\xF5-\xFF] # Invalid UTF-8 Bytes
            | \xE0[\x80-\x9F] # Overlong encoding of prior code point
            | \xF0[\x80-\x8F] # Overlong encoding of prior code point
            | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
            | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
            | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
            | (?<=[\x0-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
            | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
            | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
            | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
            | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
        )/x';
        $string = preg_replace($regex, '', $string);

        $result = "";
        $length = strlen($string);
        for ($i = 0; $i < $length; $i++) {
            // ord() works on single bytes (0-255), so in practice this loop only
            // drops control bytes below 0x20 other than tab, LF and CR
            $current = ord($string[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF))) {
                $result .= chr($current);
            } else {
                // do nothing to strip the invalid character(s)
                // $result .= " "; // use this to replace them with spaces
            }
        }
        $string = $result;
    }
    return $string;
}
```
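To show where the sanitized string ends up, here is a minimal sketch of the SimpleXML step. The helper name `loadSanitizedXml` and the demo sanitizer are made up for this illustration (they are not part of the method above); the helper takes the sanitizer as a callable so the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of the original method): sanitizes a raw XML
// string with the given callable, then tries to parse it with SimpleXML.
function loadSanitizedXml(string $raw, callable $sanitizer): ?SimpleXMLElement
{
    $clean = $sanitizer($raw);

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false when parsing fails.
    return $xml === false ? null : $xml;
}

// Stand-in sanitizer for the demo only: drops ASCII control bytes other than
// tab, LF and CR.
$demoSanitizer = function (string $s): string {
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $s);
};

// The EOT byte (\x04) inside <id> would normally make the parser choke.
$doc = loadSanitizedXml("<invoice><id>\x0442</id></invoice>", $demoSanitizer);
```

In a real script you would pass `'sanitizeXML'` as the second argument instead of the stand-in closure, and treat a `null` result as "still not parseable".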
This is nothing less than a mixup of two methods I found here and here on StackOverflow, so the credits go to the respective authors (whom I thank). I needed them both because I had to deal with invalid UTF-8 characters and invalid XML characters: as you can see, the method makes use of a regular expression which is shortly followed by an iterative, char-by-char approach.
As I said before, it's rather ugly and highly inefficient, possibly even more than the previous one... however it gets the job done, and since I had to complete the task in a ridiculously short amount of time that's the best I've come up with. In case someone wants to come up with something better, they're VERY welcome... I'll gladly accept their suggestions. Until then, I hope that this will actually help other PHP "double-clawed" developers to achieve decent results as well.
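As a rough sketch of the second step only: assuming the string is already valid UTF-8 (for instance, after a byte-level cleanup has run), the characters that XML 1.0 does not allow could also be dropped with a single code-point-aware regular expression instead of a loop. The function name `stripXmlInvalidChars` is invented for this example, and this is an alternative I'm suggesting, not what the StackOverflow answers do.

```php
<?php
// Hypothetical helper, not from the original method: removes every code point
// outside the XML 1.0 "Char" production from a string that is valid UTF-8.
function stripXmlInvalidChars(string $utf8): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $utf8);

    // With the /u modifier, preg_replace() returns null when the subject is
    // not valid UTF-8; fall back to the untouched input in that case.
    return $clean === null ? $utf8 : $clean;
}

// NUL (\x00) and vertical tab (\x0B) are not allowed in XML 1.0;
// the accented letter and the newline are.
$sample  = "caff\u{E8}\x00 \x0Bok\n";
$cleaned = stripXmlInvalidChars($sample);
```

Because the pattern works on code points rather than bytes, the range checks apply to whole characters; the trade-off is that the input has to be valid UTF-8 first, otherwise nothing gets removed.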
... It definitely seems like the PHP hammer has scored yet another hit!
I swear it won't happen again anytime soon... :)
Many thanks for this code.
It saved my life today (Sunday 29th August 2021) when I was transferring three gigabytes of old and potentially dodgy mails (e.g. lots of spam) from mbox format into a MySQL database.
All I had to do was cut’n’paste it into my script and it worked immediately and at the first time of asking.