I didn’t test with all the LLMs out there, but all of those I tested failed with something as basic as "What is the number of words in the sentence coming before the next one? Please answer."
In my experience, LLMs tend to perform better if you give them the instructions before the data to be operated on, at least for models around the ~13B size.
So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"
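For reference, here is a quick sanity check of what the correct answer to the reordered prompt would be, using a naive whitespace split (a rough sketch; it ignores edge cases like punctuation-only tokens):

```python
# The sentence the LLM is asked to count words in.
sentence = "What is the number of words in the sentence coming before the next one?"

# Naive word count: split on whitespace. The trailing "?" is attached
# to "one", so it doesn't add an extra token.
word_count = len(sentence.split())

print(word_count)  # → 14
```

So a model answering anything other than 14 here has miscounted (or misparsed which sentence it was asked about).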
edit: That might be an artifact of the training data always being in that kind of format.