
It's even more interesting, I think. The tokens are a byte pair encoding [1] of the input string. So a short, frequent word might be represented as a single token, but an infrequent string (such as "bbbbbbb") might be split into several tokens, each of which may or may not correspond to a single letter.
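
To make that concrete, here is a toy sketch of byte pair encoding in Python. It is not the tokenizer any particular model uses: real tokenizers work on bytes with a far larger learned vocabulary, and the corpus, the merge count, and the function names (`learn_merges`, `merge_pair`, `tokenize`) are all made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts out as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into tokens by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus: "the" is frequent, "bb" is rare.
corpus = ["the"] * 10 + ["then"] * 3 + ["bb"] * 2
merges = learn_merges(corpus, num_merges=4)

print(tokenize("the", merges))      # ['the']
print(tokenize("bbbbbbb", merges))  # ['bb', 'bb', 'bb', 'b']
```

In this toy setup the frequent word comes out as one token, while "bbbbbbb" comes out as four tokens, only one of which is a single letter. Counting the b's from the token sequence alone is therefore not straightforward.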

This might also explain the weird "off-by-one" errors with the ROT13 task: ROT13 works letter by letter, but the model sees tokens rather than individual letters.

[1] https://en.m.wikipedia.org/wiki/Byte_pair_encoding


