=^.^=

DOMDocument::loadHTML(): Tags Invalid Parsing HTML5 and More!

You may have noticed your log files (or - god forbid - your browser window) filling up with complaints from PHP's built-in DOM parser, DOMDocument, that the loadHTML() or loadHTMLFile() method is being fed all manner of invalid tags. In the case of poorly formed (or, both technically and awkwardly, not well-formed) markup or inadequately couched snippets of otherwise-valid HTML this might make some sense. But why is it whining about all-American good ol' boys like <article>, <nav>, <picture>... wait, there's a pattern here. It seems to have developed a food allergy to HTML5! Make haste, we must get to the bottom of this! (And somebody get an EpiPen...)

I had been detailing my debugging efforts and defending my use of a jQuery-for-PHP-like library some decade or so old (phpQuery @ http://code.google.com/p/phpquery/) in the production of this very site (foxpa.ws), which was generating most of these DOMDocument errors in my logs, and mentioned that I might set out to write this: a "jump-the-shark article" - or a jump-the-sharticle, if you will. I had always believed that the concept of jumping the shark carried with it an implicit self-referential element. It turns out this is not so. But as I have already uploaded The Fonz and set up the CSS float it is no longer possible to avoid this shame and conceal my folly... well, I was forced to learn something this morning. So you should have to as well...

From https://en.wikipedia.org/wiki/Jumping_the_shark:

The idiom "jumping the shark" was coined in 1985 by Jon Hein in response to a 1977 episode from the fifth season of the American sitcom Happy Days, in which Fonzie (Henry Winkler) jumps over a shark while on water-skis. The phrase is pejorative and is used to argue that a creative outlet or work appears to be making a misguided attempt at generating new attention or publicity for something that is perceived once to have been widely popular, but is no longer.

The problem, it seems, is that PHP employs libxml2 to do the heavy lifting behind the DOMDocument API. Since it always has, perhaps "developed" wasn't the right way to put it - it's more likely that we are only now noticing an issue that once went unseen because the version of PHP we are using has been updated and the error level assigned to this warning has entered the reporting window at hand. But why does it take offense at common HTML5? It's a simple matter of sloppy and outdated naming; libxml2 is an XML parser (the HTML mode it offers was written against HTML 4 and has never heard of the newer elements). HTML5 is not XML. Except it can be. But it's not. Huh?

NOTE: When I started to try explaining that nuance it soon became apparent that the topic was better suited to the confines of its own article. As such, please continue reading A Brief (and Delightfully Snarky) History of HTML5 if the details interest you... and you happen to be blessed with more time than good sense...
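
For anyone who wants to see the complaint up close, here is a minimal sketch that reproduces it without touching the logs. The sample markup and variable names are invented for illustration; the calls are stock PHP (libxml_use_internal_errors(), libxml_get_errors(), libxml_clear_errors()). What it prints depends on the libxml2 build underneath your PHP, so treat the messages as indicative rather than guaranteed.

```php
<?php
// Minimal reproduction sketch. The markup and variable names are made up
// for illustration; the functions are the standard libxml/DOM ones.
$html = '<!DOCTYPE html><html><body><article><nav>Menu</nav><p>Hi</p></article></body></html>';

// Collect libxml's complaints in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError carrying level, code, message and line.
    $messages[] = sprintf(
        'line %d, code %d: %s',
        $error->line,
        $error->code,
        trim($error->message)
    );
}

// Empty the queue and put the error-handling mode back as we found it.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still built; the parser just grumbles along the way.
$articleCount = $dom->getElementsByTagName('article')->length;

echo implode(PHP_EOL, $messages), PHP_EOL;
echo "article elements found: {$articleCount}", PHP_EOL;
```

Note that the <article> element lands in the DOM tree regardless: the parser complains about the tag but keeps going.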

[attachment-yhTyOw]
HEY GUYS! GUYS! WHAT DO YOU CALL A BUNDLE OF LOGS? :D
(I'm allowed, it's our word :/)

...Right. So, the HTML5 you're trying to feed the XML parser isn't XML and the XML parser is just doing what it's supposed to by barfing in its mouth a little and we only have this problem because of the prevailing trends (see: XHTML, as above) at the time the API was drafted in PHP and the methods were named and that's all fine and dandy mister but what about my damn code?

  1. We can swap out HTML5's descriptive layout tags for plain-jane <div>s, like this enterprising fellow on Stack Overflow...
    $html = file_get_contents($url);

    $search = array(
        "<header",  "</header>",
        "<nav",     "</nav>",
        "<section", "</section>",
        "<article", "</article>",
        "<footer",  "</footer>",
        "<aside",   "</aside>",
        "<noindex", "</noindex>",
    );
    $replace = array(
        "<div", "</div>",
        "<div", "</div>",
        "<div", "</div>",
        "<div", "</div>",
        "<div", "</div>",
        "<div", "</div>",
        "<div", "</div>",
    );

    $html = str_replace($search, $replace, $html);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    Notice how the potentiality of attributes has been thoughtfully taken into account. Unfortunately what hasn't been taken into account is the roughly five times as many other defined tags, plus the whole make-your-own-damn-tags-and-name-them-whatever-the-hell-you-want thing that tickles the free spirits among us absolutely pink to this very day. Perhaps that's why this suggestion is, at the time of writing, ranked negative six (-6).

    Well, that and the fact that if we had been relying on being able to identify elements in the resulting object by those tags' proper names we have just accomplished an excellent unravelling of the entire undertaking. Bravo.

  2. We could suppress the errors with the STFU Operator (@)...
    $dom = new DOMDocument();
    @$dom->loadHTML($html_data);

    Sure! And in the immortal words of my sensei: "I could make you S the F up - permanently - if you ever suggest that again within earshot!" :@

    Repeat after me: The STFU operator is not how real programmers program real programs. PHP is finally awesome, but all the programmers of all the other programming languages are watching you, and me, and frankly: I'm sick of being beat up and laughed at. q.q

    Also, it should be noted:

    From https://php.watch/versions/8.0/fatal-error-suppression:

    In PHP 8.0, the @ operator does not suppress certain types of errors that were silenced prior to PHP 8.0. This includes the following types of errors:

    E_ERROR, E_CORE_ERROR, E_COMPILE_ERROR, E_USER_ERROR, E_RECOVERABLE_ERROR, E_PARSE

    All of these errors, if raised, halts [sic] the rest of the application from being run. The difference in PHP 8.0 is that the error message is not silenced, which would have otherwise resulted in a silent error.

    (To be fair, DOMDocument's parsing complaints are ordinary warnings rather than fatal errors, so @ still silences them under PHP 8.0.)

  3. ...a-Alright, suppress the errors with a flag? *wince*:
    $dom = new DOMDocument();
    $dom->loadHTML($html_data, LIBXML_NOERROR);

    Now that has an air of slight sophistication. Or at least it doesn't smell like outright crap. I'd accept it if we're on a deadline but this precludes our ability to access the errors in the event we have legitimate use for them in debugging. Particularly say, where we are parsing user-submitted or directed information or interfacing with a source that we do not control. Frankly that sounds like virtually every case in which I've employed this capability...
  4. What if we intercepted the errors with a non-libxml2, PHP-based set_error_handler()... error handler?
    From https://stackoverflow.com/questions/1148928/disable-warnings-when-loading-non-well-formed-html-by-domdocument-php:

    class ErrorTrap {
        protected $callback;
        protected $errors = array();

        function __construct($callback) {
            $this->callback = $callback;
        }

        function call() {
            $result = null;
            set_error_handler(array($this, 'onError'));
            try {
                $result = call_user_func_array($this->callback, func_get_args());
            } catch (Exception $ex) {
                restore_error_handler();
                throw $ex;
            }
            restore_error_handler();
            return $result;
        }

        function onError($errno, $errstr, $errfile, $errline) {
            $this->errors[] = array($errno, $errstr, $errfile, $errline);
        }

        function ok() {
            return count($this->errors) === 0;
        }

        function errors() {
            return $this->errors;
        }
    }

    // create a DOM document and load the HTML data
    $xmlDoc = new DOMDocument();
    $caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));

    // this doesn't dump out any warnings
    $caller->call($fetchResult);

    if (!$caller->ok()) {
        var_dump($caller->errors());
    }

    While I will give full points for isolation, doesn't your app already have a thoughtful error and exception handling subsystem? Yes, this will keep PHP/the HTTP or fCGI daemon logs and our own custom error handling system free of all these extraneous parsing errors and that is what we set out to do... but now anything that isn't a parsing error that crops up is going to be hidden from our error handling. It just doesn't feel ideologically tidy and speaking of tidy, since the next option is available - and so much more elegant - all this fancy (or convoluted, you be the judge) footwork is rather moot....
  5. OK, let's leverage libxml2's built-in error management, as exposed to PHP:
    // create DOM
    $dom = new DOMDocument();

    // modify state
    $libxml_previous_state = libxml_use_internal_errors(true);

    // parse
    $dom->loadHTML($html);

    // fetch the errors
    $errors = libxml_get_errors();
    foreach ($errors as $error) {
        // handle the errors as you wish
    }

    // flush the libxml error queue
    libxml_clear_errors();

    // restore state
    libxml_use_internal_errors($libxml_previous_state);

    Nicely done. In a one-off situation you could get away with ignoring the prior state of libxml error handling, but saving and restoring it is a classy way of making whatever code widget you're building here maximally portable and respectful of its surroundings, so you can feel confident dropping it in wherever down the line.
  6. Sadly, none of these approaches solves the real issue: libxml2 is an XML parser. HTML5, in the common serialization - likely the only one you're even aware of unless you're a documentation addict (or read my bit about the history of HTML5) - just isn't XML.

    And the right answer is always to use the right tool for the job at hand: there are a number of third-party HTML5 parsing libraries out there.

    • HTML5DOMDocument bills itself as a drop-in extension and correction to the DOMDocument API which should make it ideal for those just looking for a quick fix - but only as quick as correct permits. It also sports CSS selector functionality - which is what I was using that rusty old phpquery library for. Two birds with one stone? That's my kinda library!
      $dom = new IvoPetkov\HTML5DOMDocument();
      $dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div class="content">This is some text</div></body></html>');

      echo $dom->querySelector('h1')->innerHTML;
      // Hello

      echo $dom->querySelector('.content')->outerHTML;
      // <div class="content">This is some text</div>

      Hot dog!
    • Masterminds HTML5-PHP is easily the most referenced PHP-based server-side HTML5 parsing library and with good reason: it boasts over 5 million downloads and a pedigree that extends well beyond the last decade. Surely if you are in the market for a complete, tried, tested and true solution it must not be ignored. GitHub activity shows that despite its age it is still being maintained, while the majority of its codebase has survived the test of time and settled into a rock-hard foundation.
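
If pulling in a third-party parser isn't on the cards just yet, approach 5 can at least be packaged so that only the HTML5 tag grumbling is dropped and every other parse error still reaches you. What follows is a rough sketch under a few assumptions, not a definitive implementation: loadHtmlQuietly() and HTML_UNKNOWN_TAG_CODE are names invented for this example, the value 801 is taken from libxml2's own error list (XML_HTML_UNKNOWN_TAG, the code behind "Tag ... invalid") and is worth checking against your build, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Sketch only: loadHtmlQuietly() and HTML_UNKNOWN_TAG_CODE are hypothetical
// names for this example, not part of PHP or its libxml bindings.

// Assumption: libxml2's numeric code for "Tag ... invalid" (XML_HTML_UNKNOWN_TAG).
const HTML_UNKNOWN_TAG_CODE = 801;

/**
 * Parse HTML with DOMDocument and return the document together with any
 * libxml errors that are not plain "unknown tag" complaints.
 *
 * @return array{0: DOMDocument, 1: LibXMLError[]}
 */
function loadHtmlQuietly(string $html): array
{
    // Remember the caller's setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    try {
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        // Keep everything except the unknown-tag noise.
        $errors = array_values(array_filter(
            libxml_get_errors(),
            static fn (LibXMLError $e): bool => $e->code !== HTML_UNKNOWN_TAG_CODE
        ));
    } finally {
        // Flush the queue and restore state even if parsing throws.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    return [$dom, $errors];
}

// Usage: the sample markup is invented for illustration.
[$dom, $errors] = loadHtmlQuietly(
    '<!DOCTYPE html><html><body><article><p>Hello</p></article></body></html>'
);

echo $dom->getElementsByTagName('article')->length, PHP_EOL;

foreach ($errors as $error) {
    // Whatever is left is worth a look: hand it to your own error handling.
    error_log(trim($error->message));
}
```

Filtering by error code rather than by message text keeps the check independent of wording, and the try/finally means the caller's libxml settings are put back even when something goes wrong mid-parse.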
