Parsing JSON is indeed several orders of magnitude less complex than parsing HTML. In both cases there are excellent parsing libraries, but it would be very unwise to create your own HTML parser for production use, or to use any parser that hasn't seen some serious scrutiny.
JSON at least has the concept of "invalid JSON". That's a big step forward. A JSON parser, like an XML parser, can say "Syntax error - rejected." There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
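For instance, in Python (any conforming parser behaves similarly):

    import json

    json.loads('{"a": 1}')   # -> {'a': 1}
    json.loads('{"a": 1,}')  # raises json.JSONDecodeError: a trailing comma is invalid JSON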
As someone who has a web crawler, I'm painfully aware of how much syntactically incorrect HTML is out there. HTML5 has a whole section which standardizes how to parse bad HTML. That's just the syntax needed to parse it into a tree, without considering the semantics at all.
> There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
Actually, as someone who has written an HTML parser by following the HTML5 spec, I see it as the opposite: because every string of bytes essentially corresponds to some HTML tree, there are no special "invalid" edge cases to consider and everything is fully specified. That's the best situation, since bugs tend to arise at the edge cases, the boundaries between valid and invalid. But the HTML5 spec has no edge cases: the effect of any byte in any state of the parser has been specified.
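You can see this for yourself with html5lib, a third-party Python implementation of the spec's parsing algorithm (assuming you have it installed):

    import html5lib  # third-party: pip install html5lib

    # Any character soup yields a tree; there is no "syntax error" to raise.
    tree = html5lib.parse('<p><b>never closed <i oops @#$ <table>')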
> HTML5 has a whole section which standardizes how to parse bad HTML.
In some ways, I think it's a bit of a moot point what counts as "bad HTML" if all parsers that conform to the standard parse it in the same way. The spec does flag certain points as parse errors, but are they really errors, if the behaviour is fully specified (and isn't "give up")? In fact, disregarding whether some states are parse errors actually simplifies the spec greatly, because many of the states can be coalesced, and for something like a browser or crawler it's completely irrelevant whether any of these "parse errors" occurred during parsing. One example that comes to mind is a missing space after a quoted attribute value: a="b" c="d" and a="b"c="d" are parsed identically, except the latter is supposedly a parse error. Yet both are unambiguous.
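To make that concrete, here's a rough Python sketch of the spec's "after attribute value (quoted)" tokenizer state. It's my own simplification from memory, with hypothetical tokenizer methods, not spec text:

    def after_attribute_value_quoted(c, tokenizer):
        if c in "\t\n\f ":
            tokenizer.state = "before attribute name"
        elif c == "/":
            tokenizer.state = "self-closing start tag"
        elif c == ">":
            tokenizer.emit_current_tag()
            tokenizer.state = "data"
        else:
            # The "error" case: note it, then proceed exactly as if
            # a space had been seen. a="b"c="d" ends up identical.
            tokenizer.note_parse_error("missing-whitespace-between-attributes")
            tokenizer.reconsume(c, "before attribute name")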
I've written implementations of the HTML5 color algorithms. There are some sequences of bytes which, when given as a color value in HTML5, don't correspond to an RGB color, which makes things interesting.
(for the record, they are the empty string, and any string that is an ASCII case-insensitive match for the string "transparent")
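For the curious, here's a Python sketch of the rest of those rules as I remember them (the named-colour and #rgb-shorthand steps are omitted; double-check against the spec before relying on this):

    def parse_legacy_color(s: str):
        if s == "":
            return None                      # error: the empty string
        s = s.strip(" \t\n\f\r")
        if s.lower() == "transparent":
            return None                      # error: "transparent"
        # (the spec checks named colours and the #rgb shorthand here)
        s = "".join("00" if ord(c) > 0xFFFF else c for c in s)
        s = s[:128]
        if s.startswith("#"):
            s = s[1:]
        s = "".join(c if c in "0123456789abcdefABCDEF" else "0" for c in s)
        while len(s) == 0 or len(s) % 3 != 0:
            s += "0"
        n = len(s) // 3
        r, g, b = s[:n], s[n:2*n], s[2*n:]
        if n > 8:
            r, g, b = r[-8:], g[-8:], b[-8:]   # keep the rightmost 8 digits
        while len(r) > 2 and r[0] == g[0] == b[0] == "0":
            r, g, b = r[1:], g[1:], b[1:]      # strip shared leading zeros
        if len(r) > 2:
            r, g, b = r[:2], g[:2], b[:2]
        return int(r, 16), int(g, 16), int(b, 16)

Every other string of bytes falls through to some RGB triple, however nonsensical.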
It's worth pointing out that HTML parsers are allowed to abort the first time they hit a parse error, if they so choose. As such, not all implementations are guaranteed to parse content that contains parse errors, which is why it still matters for authoring purposes.
> It is pretty rare to need to parse JSON yourself but it isn't that difficult.
In theory, it's not supposed to be "that difficult". But in practice, according to the linked article, due to all the rot and general clusterfuckery in the various competing specifications, apparently it is.
Or do you really think you could wrap your head around all those banana peels and put together a robust, production-ready parser in a weekend?
I wouldn't want my own JSON parser out on the web, but if I needed to get JSON from $known_environment to $my_service, I'd feel safe enough with a parser I wrote.
Well, that's the difference: it's the discrepancy between handling the data you're handed and handling all possible data. JSON has a lot of edge cases that are very infrequently exercised, so a "robust, production-ready parser" is not usually what's desired by the pragmatist with the deadline. That can lead to security holes, but it doesn't necessarily. For example, sometimes the inputs are config files curated by your coworkers, or outputs from a server under your control that will always be in UTF-8 and will never use floats or integers larger than 2^50.
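A couple of those rarely-exercised edge cases, in Python for concreteness:

    import json

    # Duplicate keys: grammatically valid JSON, but parsers disagree on meaning.
    json.loads('{"a": 1, "a": 2}')   # Python keeps the last one -> {'a': 2}

    # Large integers: exact here, but JavaScript's JSON.parse would round this
    # to 9007199254740992, since its numbers are 64-bit floats.
    json.loads('9007199254740993')   # -> 9007199254740993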
Taking it the other way, we can also ask "how can you make the parser as simple as possible, so that everything is well-specified and nobody can eff it up, while still preserving the structure that JSON gives you?" I tried to experiment with that about five years ago and came up with [1], but it shows a nasty cost differential between "human-readable" and "easy to parse." For example, the easiest string type to parse is a netstring, which gives you automatic, consistent handling of embedded nulls and what-have-you... but when those unreadable characters aren't escaped, you inherently have trouble reading/writing the file with a text editor. Similarly, the easiest way to handle a float is to take its 64 bits and dump them directly or as hex... but either way you lose the ability to properly edit the value with a text editor. Etc.
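A minimal sketch of those two encodings in Python (the helper names are my own):

    import struct

    def encode_netstring(data: bytes) -> bytes:
        # Length prefix means no escaping: embedded NULs, quotes, anything goes.
        return str(len(data)).encode() + b":" + data + b","

    def parse_netstring(buf: bytes, i: int = 0) -> tuple[bytes, int]:
        colon = buf.index(b":", i)
        length = int(buf[i:colon])
        start, end = colon + 1, colon + 1 + length
        assert buf[end:end + 1] == b",", "malformed netstring"
        return buf[start:end], end + 1

    def float_as_hex(x: float) -> str:
        # Dump the raw 64 bits; trivially parseable, but not human-editable.
        return struct.pack(">d", x).hex()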
But I am finding that the central problem I'm having with JSON and XML is that it's harder to find (and harder to control!) streaming parsers, so one thing I'm thinking about for the future is that formats that I use will probably need to be streaming from the top-level.[2] So if anyone's reading this and designing stuff, probably even more important than making the parser obviously correct is making it obviously streaming.
[1] https://github.com/drostie/bsencode is based on having an easy-to-parse "outer language" of s-expressions, symbols, and netstrings, followed by an interpretation step where e.g. (float 8:01234567) is evaluated to be the corresponding float.
[2] More recently I've had a lot of success getting streamability out of more-parallel things; for example, if you remove whitespace from JSON, then [date][tab][process-id][tab][json][newline] is a nice sort of TSV that gets really useful for a workflow of "append what you're about to do to the journal, then do it, then append back that it's done" and so forth. When a human technician needs to go back through the logs, they have what they need to narrow down (a) when something went wrong, (b) what else was on that process when it was going wrong, and (c) what did it do and what was it trying to do? You can of course do all this in JSON, but then you need a streaming JSON parser, whereas everyone can do the line-buffering of "buffer a line, split the next chunk by newlines, prepend the first line with the buffer, save the last line to the buffer, then emit the buffered lines and wait on the next chunk."
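That line-buffering dance is short enough to sketch in Python (chunk_source is a stand-in for however you read your socket or file):

    import json

    def iter_lines(chunks):
        # Reassemble complete lines from arbitrarily-split chunks of text.
        buf = ""
        for chunk in chunks:
            lines = (buf + chunk).split("\n")
            buf = lines.pop()          # the last piece may be a partial line
            yield from lines
        if buf:
            yield buf

    for line in iter_lines(chunk_source):
        date, pid, payload = line.split("\t", 2)
        record = json.loads(payload)   # one-line JSON: no streaming parser needed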
Sure, but then you're just adding additional layers to what was supposed to be a fairly straightforward script. It's much easier to include everything in a single program, so you can just give the path to the input JSON file, give the name of the output file, and run.
> It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.
That, coupled with the fact that it is still so easy to get it wrong and to introduce security issues, is exactly what should pique your attention to the seriousness of the subject. Building any parser is fraught with risk; it is super easy to get it subtly and horribly wrong.
Writing any code is fraught with risk, but writing a parser in a modern and reasonably safe language is not something to be greatly feared. It's more likely that you'll introduce a security issue in what you do with the JSON immediately after you parse it.
> writing a parser in a modern and reasonably safe language is not something to be greatly feared
It ought to be feared, if interoperability is involved. The problem isn't that you might introduce security issues. The problem is usually that you introduce very subtle deviations to the spec that everyone else implemented correctly, and as a result, sometimes your input and/or output do not work with other stuff out there.
Writing a parser for a badly-specified format which is widely used is a terrifying prospect in any language.
Okay, so it's more terrifying in C than in most other languages, but still, it's terrifying. Runaway memory consumption, weird Unicode behaviour, etc. etc. etc. It's easy to think you don't have to worry about Unicode because your language's string types will handle it for you - but what do they do if the input contains invalid codepoints? You're writing a parser, you need to know - and possibly override it if that behaviour conflicts with the spec.
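Python, for example, makes you choose, and the strict and lenient behaviours differ sharply:

    bad = b"\xed\xa0\x80"            # an encoded UTF-16 surrogate: not legal UTF-8
    bad.decode("utf-8")              # raises UnicodeDecodeError
    bad.decode("utf-8", "replace")   # quietly substitutes U+FFFD replacement chars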
Horrible business. Definitely not my favourite job.
A moment of silence for those of us using niche languages to meet production requirements in environments that do not allow third-party code and do not have JSON parsing in the standard library...
If you don't have a solution, or you're not happy with your current solution, take a look at parsec-style parsing. You can make a lot of progress with just a few combinators, and parsers in that style are pretty easy to read.
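A minimal sketch of what I mean, in Python (my own toy version, not any particular library); a parser is a function (text, pos) -> (value, newpos), or None on failure:

    def char(c):
        def p(s, i):
            return (c, i + 1) if i < len(s) and s[i] == c else None
        return p

    def alt(*parsers):                 # ordered choice
        def p(s, i):
            for q in parsers:
                r = q(s, i)
                if r is not None:
                    return r
            return None
        return p

    def many(parser):                  # zero or more repetitions
        def p(s, i):
            out = []
            while (r := parser(s, i)) is not None:
                v, i = r
                out.append(v)
            return (out, i)
        return p

    def seq(*parsers):                 # all of them, in order
        def p(s, i):
            out = []
            for q in parsers:
                r = q(s, i)
                if r is None:
                    return None
                v, i = r
                out.append(v)
            return (out, i)
        return p

    digit = alt(*[char(d) for d in "0123456789"])
    number = many(digit)               # number("42!", 0) -> (['4', '2'], 2)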
You can get an implementation working with a fairly high level of confidence that it's right.
If it's not fast enough, make a pretty printer for your AST. Then do a CPS transform (by hand) on your library and parser, so you can make the stack explicit. Make sure the transformed version pretty prints exactly the same way.
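The CPS step looks roughly like this (again a toy sketch of mine): each parser takes explicit success/failure continuations instead of returning a result, which turns the control stack into an explicit data structure.

    def char_cps(c):
        def p(s, i, ok, fail):
            if i < len(s) and s[i] == c:
                return ok(c, i + 1)
            return fail(i)
        return p

    def seq_cps(p, q):
        def r(s, i, ok, fail):
            # On p's success, run q; only if both succeed do we call ok.
            return p(s, i,
                     lambda v1, j: q(s, j,
                                     lambda v2, k: ok((v1, v2), k),
                                     fail),
                     fail)
        return r

    ab = seq_cps(char_cps("a"), char_cps("b"))
    ab("abc", 0, lambda v, i: (v, i), lambda i: None)   # -> (('a', 'b'), 2)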
Then make a third version that prints out the code that should run when parsing a document, rather than doing the parsing directly. You'll get a big case switch for each grammar you want to parse. Your pretty printer will help you find many bugs.
It's a pretty achievable path to get your grammar correct, and then get a specialized parser for it.