If this holds true, it would support the idea that much smaller, human-curated datasets will be of much higher value than synthetic datasets generated by LLMs.
Whichever has the most information wins. When the information has structure, you can exploit it heavily to generate synthetic data. For that I'd point you to Apple Sim, a repository of 3D models of interiors: you can generate many layers of information by controlling the renderer, then use what you've learned on real photos. That approach is used all over the image domain, so vector spaces are a pretty natural choice of embedding; you don't need to add much structure, algebraically speaking.
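To make the renderer point concrete, here's a minimal sketch. The `renderer` object and its methods are hypothetical placeholders (not a real API); the point is just that a single render call can emit several supervision layers per view, none of which need manual annotation.

```python
# Hypothetical sketch only: `renderer` and its methods are placeholders,
# not a real library API. One render call yields several label layers
# (RGB, depth, instance ids) essentially for free.
def generate_synthetic_views(scenes, renderer, views_per_scene=10):
    samples = []
    for scene in scenes:
        for _ in range(views_per_scene):
            camera = renderer.sample_camera_pose(scene)   # hypothetical method
            buffers = renderer.render(scene, camera)      # hypothetical method
            samples.append({
                "rgb": buffers["rgb"],                    # photorealistic image
                "depth": buffers["depth"],                # per-pixel depth, free
                "segmentation": buffers["instance_ids"],  # per-pixel labels, free
            })
    return samples
```

A model trained against those extra layers is then applied (or fine-tuned) on real photos.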
If your domain is heavily algebraic, you might even be able to generate correct examples arbitrarily, which is a situation I'd recommend to anyone.
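A toy illustration of "correct by construction" (assuming sympy; symbolic differentiation is just an example task): sample a random polynomial and pair it with its derivative. Every pair is correct because the label is computed symbolically, so you can generate as many as you like.

```python
# Minimal sketch: synthetic (input, label) pairs that are correct by
# construction, because sympy computes the label symbolically.
import random
import sympy as sp

x = sp.Symbol("x")

def random_polynomial(max_degree=4, coeff_range=(-9, 9)):
    degree = random.randint(1, max_degree)
    coeffs = [random.randint(*coeff_range) for _ in range(degree + 1)]
    return sum(c * x**i for i, c in enumerate(coeffs))

def make_example():
    f = random_polynomial()
    return str(f), str(sp.diff(f, x))  # guaranteed-correct pair

if __name__ == "__main__":
    for _ in range(3):
        expr, deriv = make_example()
        print(f"d/dx [{expr}] = {deriv}")
```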
I assume there is a value metric that balances quantity with quality, which may be exploitable in our mid-gains period of understanding the tech's behavior -- meaning there are potential gains from synthetic data. That said, I also expect no-free-lunch to kick in at some point, and synthetic data doesn't always respect the data-generating process for outliers.
You will find active learning interesting. It starts by attributing a value to each point in your domain, which it learns to match to the expected gain in some performance metric.
This metric can be learned, so it's okay if it's really hard to specify.
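A minimal sketch of that loop, assuming scikit-learn and numpy, and using predictive entropy as the per-point value (the simplest stand-in for the learned expected-gain function described above):

```python
# Active-learning sketch: score unlabeled points by predictive entropy,
# query the top ones, retrain, repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(20))              # tiny initial labeled pool
unlabeled = list(range(20, len(X)))

model = LogisticRegression(max_iter=1000)
for round_ in range(10):
    model.fit(X[labeled], y[labeled])
    # Value of each unlabeled point: predictive entropy (higher = more informative).
    proba = model.predict_proba(X[unlabeled])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    pick = [unlabeled[i] for i in np.argsort(entropy)[-10:]]  # query 10 points
    labeled.extend(pick)               # "ask the oracle" for these labels
    unlabeled = [i for i in unlabeled if i not in pick]
    print(f"round {round_}: labeled={len(labeled)}, acc={model.score(X, y):.3f}")
```

Swapping the entropy heuristic for a regressor that predicts the drop in validation loss gets you closer to the learned-value version.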
I doubt it. If anything, ULMFiT-era AI has finally killed the need for human-curated data. ChatGPT 4 is already being used as an oracle model that everyday AI models are trained on. A truly gargantuan oracle model will obviate all but the smallest amount of human input.
GPT-4 relies heavily on human-curated data, both for specific domains and for instruction following. Any new model that tries to go beyond it will likely rely on such data as well.
Yeah, it's well known that OpenAI hires domain experts. If anything, they augment that high-quality data rather than starting from bare-bones synthetic data.