That is fine for throwaway scripts. But such "perl duct tape" is not 100% accurate and will break for no reason. There is no place for such solutions in reliable and maintainable software.
Why would it "break for no reason"?! For all I know, a regexp matching one small piece of the page is far less prone to breaking than a parser that has to analyze the whole page. The designer changes one <div> or an id/class somewhere near the top of the DOM tree and you can't reach the node you are looking for anymore. The same goes for a regexp of course, but it looks at a smaller portion of the HTML, so it's less likely to be affected by small changes in some unrelated part of the page. And any major redesign will break any dedicated scraper, no matter which parser it uses...
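To make that concrete, here is a minimal sketch (the page structure, the price field and the class name are all invented for illustration): a selector tied to the full DOM path breaks as soon as a wrapper <div> is added, while a regex anchored on a small local fragment keeps matching.

    import re

    # Hypothetical page, before and after a redesign that adds one wrapper <div>.
    before = '<html><body><div id="main"><span class="price">9.99</span></div></body></html>'
    after = '<html><body><div id="wrap"><div id="main"><span class="price">9.99</span></div></div></body></html>'

    # A selector tied to the full DOM path, e.g. the XPath
    #   /html/body/div/span[@class="price"]
    # matches "before" but not "after": the extra wrapper changes the path.

    # A regex anchored only on the small local fragment keeps matching both.
    price_re = re.compile(r'<span class="price">([^<]+)</span>')
    print(price_re.search(before).group(1))  # 9.99
    print(price_re.search(after).group(1))   # 9.99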
And the XML parser will fail if the XML is not well-formed, while the regex will just keep sailing along. I had an example of this with BlogPoster.py, which uses Python's xmlrpc. There are WordPress hosts which return invalid XML and this causes an exception, so I reimplemented what I needed with Bash and cURL using regexes, and it works fine.
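For what it's worth, a rough Python equivalent of that workaround (not the actual Bash/cURL script; the endpoint URL is a placeholder) looks something like this:

    import re
    import urllib.request
    import xml.parsers.expat
    import xmlrpc.client

    # Placeholder endpoint; WordPress exposes XML-RPC at xmlrpc.php under the blog root.
    URL = "https://blog.example.com/xmlrpc.php"

    try:
        proxy = xmlrpc.client.ServerProxy(URL)
        # Raises xml.parsers.expat.ExpatError when the host returns XML that
        # is not well-formed (e.g. PHP warnings printed before the response).
        methods = proxy.system.listMethods()
    except xml.parsers.expat.ExpatError:
        # Regex fallback: re-send the same call, read the raw body and pull out
        # only the <string> values, ignoring whether the whole document parses.
        request = xmlrpc.client.dumps((), "system.listMethods").encode()
        body = urllib.request.urlopen(URL, data=request).read().decode("utf-8", "replace")
        methods = re.findall(r"<string>([^<]+)</string>", body)

    print(methods)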
I'm not sure whether you mean that as an argument for or against regular expressions. That is a good example for #5. Regular expressions are great for quickly patching together something that kinda works, but:
1. Those hosts are still broken. The next person will have to jump through the same hoops to support them.
2. Your parser is very permissive. It will encourage people to create even more broken implementations.
3. The specification of this protocol is now worthless. There is no way to safely add new functionality. Any new element or attribute can break those regexes. Everyone has to take every implementation into account.
4. You are probably missing some corner cases, like CDATA sections or escaped characters (see the sketch below).
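To illustrate point 4, a small sketch (the <post>/<title> document is invented): a naive regex hands back raw markup, while a real parser decodes entities and unwraps CDATA.

    import re
    import xml.etree.ElementTree as ET

    # Two equivalent documents: one uses an entity reference, one uses CDATA.
    escaped = "<post><title>Cats &amp; dogs</title></post>"
    cdata = "<post><title><![CDATA[Cats & dogs]]></title></post>"

    # The naive regex hands back whatever raw text sits between the tags...
    pattern = re.compile(r"<title>(.*?)</title>", re.S)
    print(pattern.search(escaped).group(1))  # 'Cats &amp; dogs' (entity not decoded)
    print(pattern.search(cdata).group(1))    # '<![CDATA[Cats & dogs]]>' (markers leak through)

    # ...while the parser decodes the entity and unwraps the CDATA section,
    # returning the same text in both cases.
    print(ET.fromstring(escaped).findtext("title"))  # 'Cats & dogs'
    print(ET.fromstring(cdata).findtext("title"))    # 'Cats & dogs'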
From your point of view it probably makes sense to support even broken sites. But you are helping to create the next HTML, where every implementation works differently and you have to test everything in every browser.