So Many HTML Parsers Suck

Published at 11:42 on 24 December 2019

Why? They ram a document tree down your throat, that’s why. So you’re stuck writing code that:

  • Consumes more memory, since you must load the entire document into memory at once,
  • Makes modifying the content tricky, since traversing a document tree while you are modifying it is a potential minefield (the alternative is to build an entirely new document tree from the old one, which doubles the already sometimes obscene memory footprint), and
  • Consumes more processor time, because multiple tree traversals are typically necessary.

Slow, bloated, error-prone: in a word, document trees just plain suck. Yes, sometimes they are necessary. That just means they should be available as an alternative. They should never be the only way you can parse HTML.
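
For contrast, here is a rough sketch of the non-tree way, using the event-based HTMLParser class from Python’s standard library. It rewrites text content in a single pass as the markup streams through, and it never builds a tree. The names TextRewriter, transform and write are made up for this illustration, and the pass-through is deliberately simplistic: end tags come out lowercased, and processing instructions are not handled.

```python
from html.parser import HTMLParser


class TextRewriter(HTMLParser):
    """Hypothetical single-pass rewriter: tags pass through, text is transformed."""

    def __init__(self, transform, write):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back out untouched.
        super().__init__(convert_charrefs=False)
        self.transform = transform  # callable applied to each run of text
        self.write = write          # callable that receives each output piece

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; the raw text already ends in "/>".
        self.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.write(f"</{tag}>")

    def handle_data(self, data):
        # Text may arrive in several pieces, so the transform has to give
        # the same result whether it sees the text whole or in parts.
        self.write(self.transform(data))

    def handle_entityref(self, name):
        self.write(f"&{name};")

    def handle_charref(self, name):
        self.write(f"&#{name};")

    def handle_comment(self, data):
        self.write(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.write(f"<!{decl}>")


def rewrite(chunks, transform):
    """Feed the document chunk by chunk and return the rewritten markup."""
    out = []
    parser = TextRewriter(transform, out.append)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return "".join(out)
```

Calling rewrite() with a few chunks of markup and str.upper as the transform upper-cases the text and leaves the tags alone. In real use, write could be a file’s write method instead of a list’s append, so that at any moment only the current chunk is held in memory.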

Yet, with all too many HTML parsers, they are the only way. And that’s why so many HTML parsers suck.
