What's the benefit of OpenAI charging per-token instead of per-character or per-word?
Since token algorithms change model-to-model and version-to-version, it seems like they've added a lot of complication for no actual benefit to the user except for a little peek under the hood.
Is there a benefit to this scheme that I'm not seeing? Is there some way to game the system otherwise?
It's not just that they're charging per token -- the models themselves operate at the token level. The model sees everything in terms of tokens, and in OpenAI's case these tokens are subwords (pieces of words), not whole words and not characters.
So the real question is, what is the benefit of modeling your tokens as subwords, rather than as characters or words?
I think there is a lot of nuance here, and I don't understand it all. But, some benefits:
* Words, at least in English, are composed of different pieces, like roots, prefixes, and suffixes. Modeling at the subword level more naturally aligns your model with this aspect of language. If I tokenize "warmest", I get "warm" and "est" (see the short sketch after this list). So, the meaning of the token "est" can be learned by the model -- whereas if you modeled by words, the model would have to relearn that information individually for every word ending in "est".
* Modeling at the subword level makes your sequences a lot shorter than modeling at the character level, which helps with efficiency, since the cost of processing a sequence grows with its length.
* Modeling at the subword level makes your vocabulary a lot bigger than modeling at the character level, which I suspect helps the model, as it can assign meaning to the subwords themselves. E.g., it can learn the meaning of the token "warm" on its own, rather than having to learn that meaning only through the relationships between the tokens "w", "a", "r", and "m".
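You can see the subword boundaries for yourself with the open-source `tiktoken` package (a minimal sketch, assuming it's installed via `pip install tiktoken`; I'm using the GPT-2 encoding here since that's the tokenizer under discussion):

```python
# Minimal illustrative sketch: inspect how a word splits into subword tokens.
# Assumes the open-source `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair encoding

word = "warmest"
token_ids = enc.encode(word)

# Decode each id individually to see where the subword boundaries fall.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)           # expected: ['warm', 'est']

# Compare sequence lengths at the three granularities:
print(len(word))        # 7 characters
print(len(token_ids))   # 2 subword tokens
# Word-level modeling would use 1 token here, but then 'warmest', 'coldest',
# 'fastest', ... would each need their own vocabulary entry, and the shared
# '-est' suffix couldn't be reused across them.
```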
Hope this helps! Would love for anyone else to chime in/add on/correct me.
The tokenizer doesn't actually change model to model; by the looks of it, this is still the GPT-2 tokenizer. Also, the per-token cost makes sense because predicting a token is a single forward pass through the model, while for other cost measures they would need to do some science to make it work out on average.
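In other words, the token count is the thing that directly tracks compute, so billing is just a count times a price. A rough sketch of the arithmetic (the price below is a made-up placeholder, not an actual OpenAI rate):

```python
# Rough sketch of per-token billing: cost scales with the number of tokens,
# i.e. the number of forward passes the model performs.
# PRICE_PER_1K_TOKENS is a hypothetical placeholder, not a real OpenAI rate.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # hypothetical dollars per 1,000 tokens

def estimate_cost(text: str, encoding_name: str = "gpt2") -> float:
    """Count tokens in `text` and convert to a dollar estimate."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    return n_tokens / 1000 * PRICE_PER_1K_TOKENS

print(estimate_cost("Tokenization determines what you actually pay for."))
```

Charging per character or per word would require averaging over how many tokens a character or word tends to correspond to, which is what the "do some science" above refers to.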