Many clues indicate that the animation is a lie. For example, the image is clearly upscaled by an external tool after the first image renders. As another example, if you ask the model about the tokens in its own context, it says it can't see any pixel tokens.
A model may not know many facts about itself, but it can definitely see what is in its own context, and what it sees is a call to an image generation tool.
Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.