Parsing JSON is indeed several orders of magnitude less complex than parsing HTML. In both cases there are excellent parsing libraries, but it would be very unwise to create your own HTML parser for production use, or to use any parser that hasn't seen some serious scrutiny.
JSON at least has the concept of "invalid JSON". That's a big step forward. A JSON parser, like an XML parser, can say "Syntax error - rejected." There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
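For instance, in Python (any conforming parser behaves similarly):

    import json

    json.loads('{"a": 1}')   # -> {'a': 1}
    json.loads('{"a": 1,}')  # raises json.JSONDecodeError: a trailing comma is invalid JSON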
As someone who has a web crawler, I'm painfully aware of how much syntactically incorrect HTML is out there. HTML5 has a whole section which standardizes how to parse bad HTML. That's just the syntax needed to parse it into a tree, without considering the semantics at all.
> There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
Actually, as someone who has written an HTML parser by following the HTML5 spec, I see it as the opposite: because every string of bytes essentially corresponds to some HTML tree, there are no special "invalid" edge cases to consider and everything is fully specified. That's the best situation, since bugs tend to arise at the edge cases, the boundaries between valid and invalid. But the HTML5 spec has no edge cases: the effect of any byte in any state of the parser has been specified.
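You can see this for yourself with html5lib, a third-party Python implementation of the spec's parsing algorithm (assuming you have it installed):

    import html5lib  # third-party: pip install html5lib

    # Any character soup yields a tree; there is no "syntax error" to raise.
    tree = html5lib.parse('<p><b>never closed <i oops @#$ <table>')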
> HTML5 has a whole section which standardizes how to parse bad HTML.
In some ways, I think it's a bit of a moot point what counts as "bad HTML" if all parsers that conform to the standard parse it in the same way. The spec does flag certain points as parse errors, but are they really errors, if the behaviour is fully specified (and isn't "give up")? In fact, disregarding whether some states are parse errors actually simplifies the spec greatly, because many of the states can be coalesced, and for something like a browser or crawler it's completely irrelevant whether any of these "parse errors" occurred during parsing. One example that comes to mind is a missing space after a quoted attribute value: a="b" c="d" and a="b"c="d" are parsed identically, except the latter is supposedly a parse error. Yet both are unambiguous.
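To make that concrete, here's a rough Python sketch of the spec's "after attribute value (quoted)" tokenizer state. It's my own simplification from memory, with hypothetical tokenizer methods, not spec text:

    def after_attribute_value_quoted(c, tokenizer):
        if c in "\t\n\f ":
            tokenizer.state = "before attribute name"
        elif c == "/":
            tokenizer.state = "self-closing start tag"
        elif c == ">":
            tokenizer.emit_current_tag()
            tokenizer.state = "data"
        else:
            # The "error" case: note it, then proceed exactly as if
            # a space had been seen. a="b"c="d" ends up identical.
            tokenizer.note_parse_error("missing-whitespace-between-attributes")
            tokenizer.reconsume(c, "before attribute name")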
I've written implementations of the HTML5 color algorithms. There are some sequences of bytes which, when given as a color value in HTML5, don't correspond to an RGB color, which makes things interesting.
(for the record, they are the empty string, and any string that is an ASCII case-insensitive match for the string "transparent")
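For the curious, here's a Python sketch of the rest of those rules as I remember them (the named-colour and #rgb-shorthand steps are omitted; double-check against the spec before relying on this):

    def parse_legacy_color(s: str):
        if s == "":
            return None                      # error: the empty string
        s = s.strip(" \t\n\f\r")
        if s.lower() == "transparent":
            return None                      # error: "transparent"
        # (the spec checks named colours and the #rgb shorthand here)
        s = "".join("00" if ord(c) > 0xFFFF else c for c in s)
        s = s[:128]
        if s.startswith("#"):
            s = s[1:]
        s = "".join(c if c in "0123456789abcdefABCDEF" else "0" for c in s)
        while len(s) == 0 or len(s) % 3 != 0:
            s += "0"
        n = len(s) // 3
        r, g, b = s[:n], s[n:2*n], s[2*n:]
        if n > 8:
            r, g, b = r[-8:], g[-8:], b[-8:]   # keep the rightmost 8 digits
        while len(r) > 2 and r[0] == g[0] == b[0] == "0":
            r, g, b = r[1:], g[1:], b[1:]      # strip shared leading zeros
        if len(r) > 2:
            r, g, b = r[:2], g[:2], b[:2]
        return int(r, 16), int(g, 16), int(b, 16)

Every other string of bytes falls through to some RGB triple, however nonsensical.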
It's worth pointing out that HTML parsers are allowed to abort the first time they hit a parse error, if they so choose. As such, not all implementations are guaranteed to parse content that contains parse errors, which is why it still matters for authoring purposes.
> It is pretty rare to need to parse JSON yourself but it isn't that difficult.
In theory, it's not supposed to be "that difficult". But in practice, according to the linked article, due to all the rot and general clusterfuckery in the various competing specifications, apparently it is.
Or do you really think you could wrap your head around all those banana peels and put together a robust, production-ready parser in a weekend?
I wouldn't want my own JSON parser out on the web, but if I needed to get JSON from $known_environment to $my_service, I'd feel safe enough with a parser I wrote.
Well, that's the difference: it's the discrepancy between handling the data you're handed and handling all possible data. JSON has a lot of edge cases that are very infrequently exercised, so a "robust, production-ready parser" is not usually what's desired by the pragmatist with the deadline. That can lead to security holes, but it doesn't necessarily. For example, sometimes the inputs are config files curated by your coworkers, or outputs from a server under your control that will always be in UTF-8 and will never use floats or integers larger than 2^50.
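A couple of those rarely-exercised edge cases, in Python for concreteness:

    import json

    # Duplicate keys: grammatically valid JSON, but parsers disagree on meaning.
    json.loads('{"a": 1, "a": 2}')   # Python keeps the last one -> {'a': 2}

    # Large integers: exact here, but JavaScript's JSON.parse would round this
    # to 9007199254740992, since its numbers are 64-bit floats.
    json.loads('9007199254740993')   # -> 9007199254740993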
Taking it the other way, we can also ask "how can you make the parser as simple as possible, so that everything is well-specified and nobody can eff it up, while still preserving the structure that JSON gives you?" I tried to experiment with that about five years ago and came up with [1], but it shows a nasty cost differential between "human-readable" and "easy to parse." For example, the easiest string type to parse is a netstring, which gives you automatic, consistent handling of embedded nulls and what-have-you... but when those unreadable characters aren't escaped, you inherently have trouble reading/writing the file with a text editor. Similarly, the easiest way to handle a float is to take its 64 bits and dump them directly or as hex... but either way you lose the ability to properly edit the value with a text editor. Etc.
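A minimal sketch of those two encodings in Python (the helper names are my own):

    import struct

    def encode_netstring(data: bytes) -> bytes:
        # Length prefix means no escaping: embedded NULs, quotes, anything goes.
        return str(len(data)).encode() + b":" + data + b","

    def parse_netstring(buf: bytes, i: int = 0) -> tuple[bytes, int]:
        colon = buf.index(b":", i)
        length = int(buf[i:colon])
        start, end = colon + 1, colon + 1 + length
        assert buf[end:end + 1] == b",", "malformed netstring"
        return buf[start:end], end + 1

    def float_as_hex(x: float) -> str:
        # Dump the raw 64 bits; trivially parseable, but not human-editable.
        return struct.pack(">d", x).hex()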
But I am finding that the central problem I'm having with JSON and XML is that it's harder to find (and harder to control!) streaming parsers, so one thing I'm thinking about for the future is that formats that I use will probably need to be streaming from the top-level.[2] So if anyone's reading this and designing stuff, probably even more important than making the parser obviously correct is making it obviously streaming.
[1] https://github.com/drostie/bsencode is based on having an easy-to-parse "outer language" of s-expressions, symbols, and netstrings, followed by an interpretation step where e.g. (float 8:01234567) is evaluated to be the corresponding float.
[2] More recently I've had a lot of success getting streamability out of more-parallel things; for example, if you remove whitespace from JSON, then [date][tab][process-id][tab][json][newline] is a nice sort of TSV that gets really useful for a workflow of "append what you're about to do to the journal, then do it, then append back that it's done" and so forth. When a human technician needs to go back through the logs, they have what they need to narrow down (a) when something went wrong, (b) what else was on that process when it was going wrong, and (c) what did it do and what was it trying to do? You can of course do all this in JSON, but then you need a streaming JSON parser, whereas everyone can do the line-buffering of "buffer a line, split the next chunk by newlines, prepend the first line with the buffer, save the last line to the buffer, then emit the buffered lines and wait on the next chunk."
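That line-buffering dance is short enough to sketch in Python (chunk_source is a stand-in for however you read your socket or file):

    import json

    def iter_lines(chunks):
        # Reassemble complete lines from arbitrarily-split chunks of text.
        buf = ""
        for chunk in chunks:
            lines = (buf + chunk).split("\n")
            buf = lines.pop()          # the last piece may be a partial line
            yield from lines
        if buf:
            yield buf

    for line in iter_lines(chunk_source):
        date, pid, payload = line.split("\t", 2)
        record = json.loads(payload)   # one-line JSON: no streaming parser needed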
Sure, but then you're just adding additional layers to what was supposed to be a fairly straightforward script. It's much easier to include everything in a single program, so you can just give the path to the input JSON file, give the name of the output file, and run.
> It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.
That, coupled with the fact that it is still so easy to get it wrong and to introduce security issues, is exactly what should pique your attention to the seriousness of the subject. Building any parser is fraught with risk; it is super easy to get it subtly and horribly wrong.
Writing any code is fraught with risk, but writing a parser in a modern and reasonably safe language is not something to be greatly feared. It's more likely that you'll introduce a security issue in what you do with the JSON immediately after you parse it.
> writing a parser in a modern and reasonably safe language is not something to be greatly feared
It ought to be feared, if interoperability is involved. The problem isn't that you might introduce security issues. The problem is usually that you introduce very subtle deviations to the spec that everyone else implemented correctly, and as a result, sometimes your input and/or output do not work with other stuff out there.
Writing a parser for a badly-specified format which is widely used is a terrifying prospect in any language.
Okay, so it's more terrifying in C than in most other languages, but still, it's terrifying. Runaway memory consumption, weird Unicode behaviour, etc. etc. etc. It's easy to think you don't have to worry about Unicode because your language's string types will handle it for you - but what do they do if the input contains invalid codepoints? You're writing a parser, you need to know - and possibly override it if that behaviour conflicts with the spec.
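Python, for example, makes you choose, and the strict and lenient behaviours differ sharply:

    bad = b"\xed\xa0\x80"            # an encoded UTF-16 surrogate: not legal UTF-8
    bad.decode("utf-8")              # raises UnicodeDecodeError
    bad.decode("utf-8", "replace")   # quietly substitutes U+FFFD replacement chars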
Horrible business. Definitely not my favourite job.
A moment of silence for those of us using niche languages to meet production requirements in environments that do not allow third-party code and do not have JSON parsing in the standard library...
If you don't have a solution, or you're not happy with your current solution, take a look at parsec-style parsing. You can make a lot of progress with just a few combinators, and parsers in that style are pretty easy to read.
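A minimal sketch of what I mean, in Python (my own toy version, not any particular library); a parser is a function (text, pos) -> (value, newpos), or None on failure:

    def char(c):
        def p(s, i):
            return (c, i + 1) if i < len(s) and s[i] == c else None
        return p

    def alt(*parsers):                 # ordered choice
        def p(s, i):
            for q in parsers:
                r = q(s, i)
                if r is not None:
                    return r
            return None
        return p

    def many(parser):                  # zero or more repetitions
        def p(s, i):
            out = []
            while (r := parser(s, i)) is not None:
                v, i = r
                out.append(v)
            return (out, i)
        return p

    def seq(*parsers):                 # all of them, in order
        def p(s, i):
            out = []
            for q in parsers:
                r = q(s, i)
                if r is None:
                    return None
                v, i = r
                out.append(v)
            return (out, i)
        return p

    digit = alt(*[char(d) for d in "0123456789"])
    number = many(digit)               # number("42!", 0) -> (['4', '2'], 2)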
You can get an implementation working with a fairly high level of confidence that it's right.
If it's not fast enough, make a pretty printer for your AST. Then do a CPS transform (by hand) on your library and parser, so you can make the stack explicit. Make sure the transformed version pretty prints exactly the same way.
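The CPS step looks roughly like this (again a toy sketch of mine): each parser takes explicit success/failure continuations instead of returning a result, which turns the control stack into an explicit data structure.

    def char_cps(c):
        def p(s, i, ok, fail):
            if i < len(s) and s[i] == c:
                return ok(c, i + 1)
            return fail(i)
        return p

    def seq_cps(p, q):
        def r(s, i, ok, fail):
            # On p's success, run q; only if both succeed do we call ok.
            return p(s, i,
                     lambda v1, j: q(s, j,
                                     lambda v2, k: ok((v1, v2), k),
                                     fail),
                     fail)
        return r

    ab = seq_cps(char_cps("a"), char_cps("b"))
    ab("abc", 0, lambda v, i: (v, i), lambda i: None)   # -> (('a', 'b'), 2)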
Then make a third version that prints out the code that should run when parsing a document, rather than doing the parsing directly. You'll get a big case switch for each grammar you want to parse. Your pretty printer will help you find many bugs.
It's a pretty achievable path to get your grammar correct, and then get a specialized parser for it.