Oh, that's interesting! It sounds like it's not literally being fed UTF-8 bytes, but instead something more like this: for rarely seen characters, it's two tokens, namely a block token first ("Tag" token in this case), followed by a token like "1st character in this block" or "2nd character in this block" and so on. And since many rare blocks are Latin-like (tags, circled letters, mathematical Fraktur variables, etc.), the LLM picks up that "some block token" + "1st character in the block" is kind of like "A"? Is that how it works?
Had to read it again as well, but yeah, that's how I'd understand it too. So the "offset in block" tokens are still not the same tokens as for the "real" ASCII letters, but they are the same tokens across all the "weird ASCII-like Unicode blocks". So the model can aggregate the training data from all those blocks and automatically "generalize" to similar characters in other blocks (by learning to ignore the "block identifier" tokens), even ones that have very little or no training data of their own.
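For illustration, the block/offset split maps quite directly onto UTF-8 itself: all characters in a Unicode block share their leading bytes, so a byte-level BPE could plausibly learn exactly this factorization. A small sketch (the character choices are just examples):

```python
# In UTF-8, lookalike characters from the same Unicode block share their
# leading bytes, so a byte-level BPE can merge those into one "block prefix"
# token and leave the final byte as the "offset in block".
lookalikes = {
    "A (plain ASCII)   U+0041":  "A",
    "A (fullwidth)     U+FF21":  "\uFF21",
    "A (circled)       U+24B6":  "\u24B6",
    "A (math Fraktur)  U+1D504": "\U0001D504",
    "B (math Fraktur)  U+1D505": "\U0001D505",
    "A (tag character) U+E0041": "\U000E0041",
}

for name, ch in lookalikes.items():
    print(f"{name}  ->  {ch.encode('utf-8').hex(' ')}")

# The two Fraktur letters share the 3-byte prefix f0 9d 94 and differ only
# in the final byte (84 vs 85) -- a natural "block + offset" factorization.
```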
Edit: So this means that if you want to sanitize text before passing it to an LLM, you can't only consider the standard Unicode BMP characters; you also have to consider everything that mirrors those characters in a different block. And because models can do Caesar ciphers with small offsets, possibly even blocks where the characters don't line up exactly but are shifted by a small number.
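One standard building block for such a sanitizer is Unicode compatibility normalization (NFKC), which folds many of those mirrored blocks (fullwidth forms, circled letters, mathematical alphanumerics) back to plain characters. A minimal sketch; the tag-stripping step is a manual addition, since NFKC doesn't touch that block:

```python
import unicodedata

def fold_lookalikes(text: str) -> str:
    # NFKC folds many lookalike blocks (fullwidth, circled, mathematical
    # alphanumerics, ...) back to plain characters.
    folded = unicodedata.normalize("NFKC", text)
    # Tag characters (U+E0000..U+E007F) have no compatibility
    # decomposition, so NFKC leaves them alone; strip them explicitly.
    return "".join(c for c in folded if not 0xE0000 <= ord(c) <= 0xE007F)

# Fullwidth "Ignore" plus a mathematical Fraktur A:
print(fold_lookalikes("\uFF29\uFF47\uFF4E\uFF4F\uFF52\uFF45 \U0001D504"))  # -> "Ignore A"
```

This only covers the cases NFKC knows about, though: it won't catch the shifted-by-a-small-offset blocks, nor lookalikes without a compatibility mapping (Cyrillic "а" vs. Latin "a", for instance), so it's a partial defense at best.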
Maybe it would be better to run the sanitizer on the tokens or even the embedding vectors instead of the "raw" text.
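That could look something like this toy sketch: flag any token whose embedding sits suspiciously close to that of a plain ASCII letter. Everything here is made up for illustration; the four-token vocab, the random 8-dimensional embeddings, and the 0.9 threshold stand in for a real tokenizer and embedding matrix:

```python
import numpy as np

# Toy setup: a tiny vocab with a lookalike token whose embedding was
# placed deliberately close to "A" to simulate what a trained model does.
rng = np.random.default_rng(0)
vocab = ["A", "B", "<frakturA>", "<unrelated>"]
emb = rng.normal(size=(4, 8))
emb[2] = emb[0] + 0.05 * rng.normal(size=8)  # lookalike sits near "A"

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ascii_ids = [0, 1]

def suspicious(token_id, threshold=0.9):
    # A non-ASCII token is suspicious if its embedding is near any
    # ASCII letter's embedding.
    return token_id not in ascii_ids and any(
        cosine(emb[token_id], emb[a]) > threshold for a in ascii_ids
    )

print([t for i, t in enumerate(vocab) if suspicious(i)])
```

The appeal is that this catches lookalikes by where the model actually places them, rather than by enumerating Unicode blocks by hand; the obvious cost is that it needs access to the model's embedding matrix and a calibrated threshold.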