Ideally raw images and audio with a few categories. This is because 1) categoriz...

Ideally raw images and audio with a few categories. This is because 1) categorization/labeling like you are suggesting does not scale and 2) feature building and detection is the most important part really, building in features like you are suggesting kind of defeats the point when you are testing your algorithm IMO.

You could stratify and charge for categories, pre built features etc.

The dataset for useful training of video, images and audio is going to easily range into tens of gigabytes or more.

This service would probably be best modeled on the mechanical-turk approach.

Worth noting is that Youtube is kind of a source already.