A few days ago I had to write some PHP code to extract the contents of some XML files (Italian electronic invoices for Public Administrations, known in Italy as FatturaPA). The work was pretty simple, yet I had to quickly solve two main problems: extracting the XML data from a digitally signed .xml.p7m file, and stripping away some invalid UTF-8 characters from the XML content itself.
Since I had to get the job done quickly, I dealt with both tasks using quick'n'dirty workarounds, taking full advantage of the famous PHP "double clawed hammer" features. This post covers the first problem; the second one is addressed in a separate, dedicated post.
Regarding the P7M part, I was lucky: all the invoices were digitally signed using the CAdES format which, as you might already know, wraps the original file in a PKCS#7 envelope, so the raw bytes are the original content preceded by a PKCS#7 header and followed by a signature info footer. This means we can easily get rid of both, as long as we don't need to check the signature. It's worth noting that skipping the check was perfectly fine for my specific scenario, since everything had already been verified, but it won't be acceptable in most situations, where you do want to check the signature before reading or using the file.
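For those situations where the signature does have to be checked, here is a minimal sketch of what that could look like, assuming PHP 8.0+ with the OpenSSL extension and a DER-encoded CAdES file. The `extractVerifiedContent` helper and its `$caInfo` parameter (the list of trusted CA files or directories) are hypothetical names introduced for illustration only; this is not the approach used in the rest of this post.

```php
<?php
/**
 * Hypothetical helper (not part of the original workaround): verifies a
 * DER-encoded CAdES/CMS file and returns the signed content, or null on
 * any failure.
 *
 * @param string   $p7mPath path to the .xml.p7m file.
 * @param string[] $caInfo  trusted CA files/directories (assumed to be supplied by the caller).
 * @return string|null the signed content if the signature verifies, null otherwise.
 */
function extractVerifiedContent(string $p7mPath, array $caInfo): ?string
{
    // openssl_cms_verify() is only available from PHP 8.0 with ext-openssl
    if (!function_exists('openssl_cms_verify') || !is_readable($p7mPath)) {
        return null;
    }

    // the verified content is written to a temporary file
    $outPath = tempnam(sys_get_temp_dir(), 'p7m');
    if ($outPath === false) {
        return null;
    }

    try {
        $ok = openssl_cms_verify(
            $p7mPath,            // input file
            OPENSSL_CMS_BINARY,  // do not translate line endings
            null,                // do not save the signer certificates
            $caInfo,             // trusted CA certificates
            null,                // no extra untrusted certificates
            $outPath,            // where the signed content gets written
            null,
            null,
            OPENSSL_ENCODING_DER // .p7m files are assumed to be DER-encoded here
        );
        if ($ok !== true) {
            return null;
        }
        $content = file_get_contents($outPath);
        return $content === false ? null : $content;
    } finally {
        @unlink($outPath);
    }
}
```

A call such as `extractVerifiedContent($path, ['/path/to/trusted-ca.pem'])` would return the XML only when the signature chain validates against the given CA list; the CA path shown is a placeholder.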
That said, here's the code that I came up with:
```php
/**
 * stripP7MData
 *
 * Removes the PKCS#7 header and the signature info footer from a
 * digitally signed .xml.p7m file in CAdES format.
 *
 * @param string $string the CAdES .xml.p7m file content (as a string).
 * @return string an arguably valid XML string with the .p7m header and footer stripped away.
 */
function stripP7MData($string) {

    // skip everything before the XML content (if there is an XML declaration at all)
    $start = strpos($string, '<?xml ');
    if ($start !== false) {
        $string = substr($string, $start);
    }

    // skip everything after the last XML closing tag;
    // if no closing tag is found, return what we have instead of failing
    if (!preg_match_all('/<\/.+?>/', $string, $matches, PREG_OFFSET_CAPTURE)) {
        return $string;
    }
    $lastMatch = end($matches[0]);
    return substr($string, 0, $lastMatch[1] + strlen($lastMatch[0]));
}
```
There's no need to explain it, as the underlying logic is pretty simple: we just strip everything positioned before the XML declaration (the PKCS#7 header) and after the last XML closing tag (the signature info footer). That's quite ugly, I admit, yet it gets the job done. I would be happy to replace it with some better code as soon as I have the time.
The common use case for such a function would be within the server-side script that receives a multipart POST request containing the .xml.p7m file as a parameter, just like in the following example:
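To make the two steps more concrete, here is a small walk-through on a made-up input. The bytes around the XML are invented stand-ins for the PKCS#7 header and the signature footer, not real CAdES data, and the `Invoice` element is purely illustrative.

```php
<?php
// Synthetic stand-in for a .xml.p7m payload: the surrounding bytes are made up.
$p7m = "\x30\x80\x06\x09HEADER"
     . '<?xml version="1.0"?><Invoice><Id>42</Id></Invoice>'
     . "\xA0\x80FOOTER\x00\x00";

// Step 1: drop everything before the XML declaration.
$start   = strpos($p7m, '<?xml ');
$fromXml = substr($p7m, $start);

// Step 2: find the last closing tag and cut right after it.
preg_match_all('/<\/.+?>/', $fromXml, $matches, PREG_OFFSET_CAPTURE);
$last = end($matches[0]); // [matched text, byte offset] of the last closing tag
$xml  = substr($fromXml, 0, $last[1] + strlen($last[0]));

echo $xml, PHP_EOL;
// prints: <?xml version="1.0"?><Invoice><Id>42</Id></Invoice>
```

Note that the approach relies on the footer not containing anything that looks like a closing tag, which is one more reason to treat it as a workaround.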
```php
if (strtoupper($_SERVER['REQUEST_METHOD']) == 'POST') {
    // files uploaded in a multipart request are exposed through $_FILES, not $_POST;
    // "file" is the name of the form field
    $p7mFile = $_FILES['file']['tmp_name'];
    $p7mContent = file_get_contents($p7mFile);
    $xmlContent = stripP7MData($p7mContent);

    // instantiate a PHP SimpleXML object
    $xf = simplexml_load_string($xmlContent);

    // ...to be continued
}
```
... and so on.
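Since the stripped string is only "arguably valid" XML, it may be worth checking the result of `simplexml_load_string` before going any further. The following is a minimal sketch of one way to do that; `parseXmlOrNull` is a hypothetical helper name, and logging the errors with `error_log` is just one possible choice.

```php
<?php
/**
 * Hypothetical helper: parses an XML string and returns null instead of
 * emitting warnings when the content is not well-formed.
 */
function parseXmlOrNull(string $xmlContent): ?SimpleXMLElement
{
    // collect libxml errors instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);

    $xml = simplexml_load_string($xmlContent);
    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            error_log(sprintf('XML error at line %d: %s', $error->line, trim($error->message)));
        }
    }

    // restore the previous libxml error handling state
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}
```

With this in place, a `null` result signals that the header or footer was not stripped cleanly, or that the file contained something other than XML.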
Use it with caution, and... happy parsing!
Useful References
- Italian Government Standards for Electronic Signatures.
- Electronic Invoice for the Italian Public Administration.