
We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.

> https://opensource.org/osd

If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Yes, that means there are almost no open source models, and yes, it's awesome that they released this and made the weights available. Just don't call it open source.



The Debian deep learning team's machine learning policy would call this a "toxic candy" model:

https://salsa.debian.org/deeplearning-team/ml-policy

BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?


Yes. It's just like calling the release of compiled, closed binary blobs 'open source' even when the source needed to reproduce the compiled output is unavailable.

> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Precisely. These 'users' downloading the model can't do it themselves. You will still be contacting OpenAI for support, or to add support for another language, and they will be the only ones able to modify the model.

> Just don't call it open source.

That is true: it is still closed source, and we are already seeing the hype squad apologising for OpenAI because they 'open sourced' a closed model that you can't modify yourself.

OpenAI is still business as usual and nothing has changed.


>You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.

This isn't quite correct. The model weights are all you need to fine tune the model on your own with your own audio.

Without the original training set this still isn't open source. But you aren't powerless to modify the model without the original training set.
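To make the point concrete, here's a minimal sketch of why the original training set isn't needed to adapt released weights. It's pure Python with a toy one-parameter-pair linear model and made-up numbers (nothing to do with any real released model): "fine-tuning" is just continuing gradient descent from the published weights on your own data.

```python
# Pretend these are the released pretrained weights of a 1-D linear model.
w, b = 2.0, 0.5

# Your own small dataset (stand-in for, say, your own transcribed audio).
new_data = [(1.0, 3.1), (2.0, 5.2), (3.0, 7.0)]

lr = 0.01
for _ in range(1000):                 # continue training from the old weights
    for x, y in new_data:
        err = (w * x + b) - y
        w -= lr * err * x             # gradient of squared error w.r.t. w
        b -= lr * err                 # gradient of squared error w.r.t. b

# Mean squared error on the new data after fine-tuning.
loss = sum((w * x + b - y) ** 2 for x, y in new_data) / len(new_data)
```

The starting point came from someone else's compute; the adaptation only needed your data. That's exactly the position weight releases put you in: you can move the model, but only from wherever the publisher left it.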


This isn't really true.

You can do a lot with weights and no training data - for example, you can pull the final layer off and use the rest as a feature extractor.

And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. More generally, you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements, and fine-tuning.
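The layer-surgery idea above can be sketched in a few lines. This is pure Python with a made-up two-layer net (the weights are arbitrary illustration, not from any real model): cut off the final layer, keep everything before it as a feature extractor, and train a fresh head on your own labels.

```python
import math

# Pretend this tiny 2-layer net was released by someone else.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.4, 0.9]]   # hidden layer: 3 units, 2 inputs
W2 = [1.0, -1.0, 0.5]                          # original output head (discarded)

def features(x):
    """Everything except the final layer: the reusable feature extractor."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

# Replace W2 with a new head trained on your own labels.
head = [0.0, 0.0, 0.0]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
for _ in range(100):
    for x, y in data:
        f = features(x)
        err = sum(h * fi for h, fi in zip(head, f)) - y
        head = [h - 0.1 * err * fi for h, fi in zip(head, f)]   # LMS update
```

No original training data touched: the pretrained layers stay frozen and only the small new head is learned, which is why this is cheap even when retraining from scratch would be impossible.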

I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.

> The source code must be the preferred form in which a programmer would modify the program.

As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.




