
This is an awesome paper, and the somewhat negative sentiment in the discussion here is surprising.

The ablation studies are well done, comprehensive, and expensive to run. People will be using these conclusions for years, which is far more impactful than whether an upcoming Siri product outperforms the GPT model at that same point in time.

A few really interesting points:

Synthetic datasets substantially (1%+) increase performance for Image Encoder Pre-training

The architecture of the visual<->language connector doesn't seem to matter.

Interleaving text and image data improves few shot performance, but image captioning data improves zero-shot numbers.

The ideal mix of data types is 5:5:1 for Interleaved:Captions:Plain Text (!)

Synthetic captioning data helps substantially at this point too (up to 4% gain)
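That 5:5:1 mix is straightforward to apply in a data loader. Here's a minimal sketch of weighted sampling across the three source types; the dataset contents and function names are hypothetical illustrations, not from the paper:

```python
import random

# Hypothetical stand-ins for the three data sources.
interleaved = [f"interleaved_doc_{i}" for i in range(100)]
captions = [f"caption_pair_{i}" for i in range(100)]
plain_text = [f"text_doc_{i}" for i in range(100)]

# 5:5:1 mix for interleaved : captions : plain text.
sources = [interleaved, captions, plain_text]
weights = [5, 5, 1]

def sample_batch(batch_size, seed=0):
    """Draw a batch whose composition matches the 5:5:1 ratio in expectation."""
    rng = random.Random(seed)
    chosen_sources = rng.choices(sources, weights=weights, k=batch_size)
    return [rng.choice(src) for src in chosen_sources]

batch = sample_batch(11)
```

In expectation, an 11-example batch drawn this way contains 5 interleaved documents, 5 caption pairs, and 1 plain-text document.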

The appendices are amazing: lots of detail on the learning rates and batch sizes tried.

The "explain these figures" examples are really, really good (see page 37).
