I get a lot of Latin and Spanish in mine, but I think that's because they actually are represented in the poetry corpus. Not too surprising that the regular GPT-2s are also exposed to a lot of foreign language, as Reddit is not a strictly anglophone website, and that they'll retain it despite some finetuning (there are so many parameters in them, after all).
I do look at the training samples, but I've never noticed a worsening of 'coherence' in the samples, so to speak. I wonder if that's what overfitting looks like? My PG corpus is so large that the GPT-2s struggle to converge, much less overfit, so I don't know what overfitting would look like. You could try using the new pseudo-validation loss checking feature nshepperd added to see if there's any connection between the validation loss and your perception of coherence.
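The usual signature you'd look for in such a validation-loss check is the point where validation loss starts climbing while training loss keeps falling. A minimal sketch of that divergence test (the helper name and the loss numbers are made up for illustration, not from any actual run or from nshepperd's code):

```python
def overfitting_started(train_losses, val_losses, patience=3):
    """Return the index at which validation loss began rising for
    `patience` consecutive checkpoints while training loss kept
    falling, or None if no such divergence is observed."""
    run = 0
    for i in range(1, len(val_losses)):
        diverging = (val_losses[i] > val_losses[i - 1]
                     and train_losses[i] < train_losses[i - 1])
        if diverging:
            run += 1
            if run >= patience:
                # Report the first checkpoint of the divergent streak.
                return i - patience + 1
        else:
            run = 0
    return None

# Illustrative loss curves: training loss keeps dropping,
# validation loss bottoms out and then rises.
train = [3.2, 2.9, 2.7, 2.5, 2.4, 2.3, 2.2, 2.1]
val   = [3.3, 3.0, 2.8, 2.7, 2.75, 2.8, 2.9, 3.0]
print(overfitting_started(train, val))  # → 4
```

If coherence really does degrade with overfitting, you'd expect the samples to start sounding worse somewhere around the checkpoint this flags.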