The utf-8 tricks make me very nervous since I have seen too many attacks with pa...

dwattttt · on Aug 24, 2024

Luckily, UTF-8's structure is _very_ simple compared to what the average parser has to handle. That's not to say there can't be bugs, but a decoder's internal state space is small enough that it can be exhaustively tested.
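
To give a rough idea of what "exhaustively tested" could look like, here is a minimal Python sketch. The names `is_valid_utf8`, `reference_valid`, and `find_mismatches` are made up for illustration, and the hand-written validator is only one possible way to encode the byte ranges from RFC 3629. It compares that validator against Python's built-in strict decoder for every byte string up to a given length.

```python
from itertools import product


def is_valid_utf8(data: bytes) -> bool:
    """Hand-written validator (illustrative only, not from any library)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # ASCII: a single byte
            i += 1
            continue
        # Choose the number of continuation bytes and the allowed range
        # for the FIRST continuation byte. The narrowed ranges are what
        # exclude overlong forms, surrogates, and values above U+10FFFF.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes 3-byte overlongs
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes 4-byte overlongs
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # caps at U+10FFFF
        else:
            return False  # 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for c in data[i + 2:i + 1 + need]:
            if not 0x80 <= c <= 0xBF:
                return False
        i += need + 1
    return True


def reference_valid(data: bytes) -> bool:
    """Use Python's strict built-in decoder as the oracle."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_mismatches(max_len: int) -> list[bytes]:
    """Return every input of length 1..max_len where the two disagree."""
    bad = []
    for length in range(1, max_len + 1):
        for combo in product(range(256), repeat=length):
            data = bytes(combo)
            if is_valid_utf8(data) != reference_valid(data):
                bad.append(data)
    return bad
```

`find_mismatches(2)` covers all 65,792 one- and two-byte inputs quickly. Going to length 3 or 4 is the same loop, just with about 16.8 million and 4.3 billion inputs respectively, which is probably more comfortable in a compiled language.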

hinkley · on Aug 24, 2024

This is the sort of space where I’d like to see a fuzzer.

hsbauauvhabzb · on Aug 24, 2024

Are there any bugs of this class that come to mind that you can point to?

eesmith · on Aug 24, 2024

https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_er...

> Many of the first UTF-8 decoders would decode these, ignoring incorrect bits and accepting overlong results. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server[26] and Apache's Tomcat servlet container.[27] RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."
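
To make the overlong case concrete, here is a small Python sketch. `naive_decode` is a hypothetical lenient decoder written only for this illustration (it is not taken from IIS, Tomcat, or any real product); it trusts the lead byte and masks off bits without checking for overlong forms. `strict_rejects` simply asks Python's built-in strict decoder whether it refuses the input.

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: masks bits, never checks for overlongs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif b >> 5 == 0b110:      # treated as a 2-byte lead, even 0xC0/0xC1
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:     # treated as a 3-byte lead
            cp, n = b & 0x0F, 3
        else:                      # everything else assumed to be a 4-byte lead
            cp, n = b & 0x07, 4
        # Continuation bytes are masked without verifying their top bits
        for c in data[i + 1:i + n]:
            cp = (cp << 6) | (c & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
payload = b"..\xc0\xaf"
print(repr(naive_decode(payload)))   # the lenient decoder produces a slash
print(strict_rejects(payload))       # the strict decoder refuses it
```

The danger is ordering: if a path check runs on the raw bytes (which contain no `/`) and a lenient decoder runs afterwards, the slash only appears after validation has already passed.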