So Many HTML Parsers Suck

Published at 11:42 on 24 December 2019

Why? They ram a document tree down your throat, that’s why. So you’re stuck writing code that:

Consumes more memory, since you must load the entire document into memory at once;
Makes modifying the content tricky, since traversing a document tree while you are modifying it is a potential minefield (the alternative is to create an entirely new document tree from the old one, which doubles the already sometimes obscene memory footprint); and
Consumes more processor time, because multiple tree traversals are typically necessary.

Slow, bloated, error-prone: In a word, document trees just plain suck. Yes, sometimes they are necessary. That just means they should be an option for when you need one. They should never be the only way you can parse HTML.

Yet, with all too many HTML parsers, they are the only way. And that’s why so many HTML parsers suck.
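
For contrast, here is a minimal sketch of what tree-free parsing can look like, using the event-driven `html.parser` module from Python's standard library. The post names no particular parser or language, so this choice is an assumption made for illustration, and the `LinkCollector` class, the `collect_links` helper, and the sample input are all hypothetical. The parser calls a handler for each tag as it streams past, so the document can be fed in pieces and no tree is ever built.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(chunks):
    """Feed the document piece by piece; no document tree is built."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# Hypothetical input, split mid-tag to show that chunk boundaries do not matter.
chunks = ['<p>See <a hre', 'f="/one">one</a> and ', '<a href="/two">two</a>.</p>']
print(collect_links(chunks))
```

Memory use here is bounded by the size of a chunk plus whatever the handlers choose to keep, and the work is done in a single pass. The trade-off is that the handlers have to track any context they need themselves, which is exactly when a tree becomes the better option.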

Blackcap Blog
