What's the benefit of OpenAI charging per-token instead of per-character or per-word?
Since token algorithms change model-to-model and version-to-version, it seems like they've added a lot of complication for no actual benefit to the user except for a little peek under the hood.
Is there a benefit to this scheme that I'm not seeing? Is there some way to game the system otherwise?
It's not just that they're charging per token -- the models themselves operate at the token level. The model sees everything in terms of tokens, and in OpenAI's case these tokens are subwords (pieces of words), not whole words and not characters.
So the real question is, what is the benefit of modeling your tokens as subwords, rather than as characters or words?
I think there is a lot of nuance here, and I don't understand it all. But, some benefits:
* Words, at least in English, are composed of different pieces, like roots, prefixes, and suffixes. Modeling at the subword level more naturally aligns your model with this aspect of language. If I tokenize "warmest", I get "warm" and "est" (see the short sketch after this list). So, the meaning of the token "est" can be learned by the model -- whereas if you modeled by words, the model would have to relearn that information individually for every word ending in "est".
* Modeling at the subword level makes your sequences a lot shorter than modeling at the character level, which helps with efficiency, since the cost of processing a sequence grows with its length.
* Modeling at the subword level makes your vocabulary a lot bigger than modeling at the character level, which I suspect helps the model, as it can assign meaning to the subwords themselves. E.g., it can learn the meaning of the token "warm" on its own, rather than having to learn that meaning only through the relationships between the tokens "w", "a", "r", and "m".
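You can see the subword boundaries for yourself with the open-source `tiktoken` package (a minimal sketch, assuming it's installed via `pip install tiktoken`; I'm using the GPT-2 encoding here since that's the tokenizer under discussion):

```python
# Minimal illustrative sketch: inspect how a word splits into subword tokens.
# Assumes the open-source `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair encoding

word = "warmest"
token_ids = enc.encode(word)

# Decode each id individually to see where the subword boundaries fall.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)           # expected: ['warm', 'est']

# Compare sequence lengths at the three granularities:
print(len(word))        # 7 characters
print(len(token_ids))   # 2 subword tokens
# Word-level modeling would use 1 token here, but then 'warmest', 'coldest',
# 'fastest', ... would each need their own vocabulary entry, and the shared
# '-est' suffix couldn't be reused across them.
```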
Hope this helps! Would love for anyone else to chime in/add on/correct me.
The tokenizer doesn't actually change model to model; by the looks of it, this is still the GPT-2 tokenizer. Also, the per-token cost makes sense because predicting a token is a single forward pass through the model, while for other cost measures they would need to do some science to make it work out on average.
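In other words, the token count is the thing that directly tracks compute, so billing is just a count times a price. A rough sketch of the arithmetic (the price below is a made-up placeholder, not an actual OpenAI rate):

```python
# Rough sketch of per-token billing: cost scales with the number of tokens,
# i.e. the number of forward passes the model performs.
# PRICE_PER_1K_TOKENS is a hypothetical placeholder, not a real OpenAI rate.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # hypothetical dollars per 1,000 tokens

def estimate_cost(text: str, encoding_name: str = "gpt2") -> float:
    """Count tokens in `text` and convert to a dollar estimate."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    return n_tokens / 1000 * PRICE_PER_1K_TOKENS

print(estimate_cost("Tokenization determines what you actually pay for."))
```

Charging per character or per word would require averaging over how many tokens a character or word tends to correspond to, which is what the "do some science" above refers to.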