
It's a next-word predictor in the same sense a Markov chain is, but a Markov chain couldn't do all the things ChatGPT does. ChatGPT has learned a huge number of syntax-level patterns remarkably well.
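For contrast, a Markov chain next-word predictor is just a lookup table from each word to the words observed to follow it — no context beyond the last word. A toy sketch (function names like `build_markov` are mine, purely illustrative):

```python
import random
from collections import defaultdict, Counter

def build_markov(tokens):
    """Map each word to a frequency table of the words seen to follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, n, rng=None):
    """Walk the chain: sample one next word at a time from the follower
    counts of the most recent word only."""
    rng = rng or random.Random(0)
    out = [start]
    for _ in range(n):
        followers = table.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor
        words, counts = zip(*followers.items())
        out.append(rng.choices(words, weights=counts)[0])
    return out
```

Because the next word depends only on the current word, such a model can't track long-range structure — which is the gap the comment is pointing at.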


Is it actually a next-word predictor? I thought the training loss was computed against a distribution over a set of words, not just one.


I'm not sure what distinction you're getting at, but transformers are trained with a "predict the missing/next word" objective, and text generation chooses the next word (token, actually) one at a time. Once it has chosen a word, it doesn't go back.
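Both comments can be reconciled in a small sketch, assuming the standard causal-LM setup: the model emits logits over the whole vocabulary (so the loss does involve every word), but cross-entropy compares that distribution to a single true next token, and decoding appends one token at a time with no backtracking. The function names here are hypothetical, not any library's API:

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_loss(logits, target_idx):
    """Cross-entropy: the distribution covers the whole vocabulary,
    but the training target is the one true next token."""
    return -math.log(softmax(logits)[target_idx])

def greedy_decode(step_fn, prompt, n):
    """Autoregressive generation: pick one token per step and never
    revise earlier choices. step_fn maps a token list to logits."""
    tokens = list(prompt)
    for _ in range(n):
        logits = step_fn(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```

So "against a set of words" and "one word at a time" are both true: the loss is computed over the full vocabulary distribution, while the supervision signal and each decoding step concern exactly one token.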




