
Beginner here, so just musing:

I like the idea. You would need your own mutable copy of the model, which is usually huge, and you would need to run backprop, which adds computation. It might be doable for a local model smaller than GPT-3.5/4.

You also need to decide what is worth memorizing long term vs short term.



Coming back to this. LoRA training only touches the attention layers, and per the article that was sufficient for memorization. So we wouldn't be updating all of the model's weights in some kind of constant-context one-shot learning scheme.
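
To make that concrete, here is a minimal PyTorch-style sketch of the idea (the wrapper class and hyperparameters are my own illustration, not from the article): the base weights stay frozen, and only two small low-rank matrices per wrapped attention projection are trainable.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base layer plus a trainable low-rank update:
        # y = W x + (alpha / r) * B A x
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # base model stays read-only
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    # Hypothetical usage: wrap only the attention projections, e.g.
    # block.attn.q_proj = LoRALinear(block.attn.q_proj)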


> own mutable copy of the model, which is usually huge

It could just be the diff against the main model or similar.
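
A rough sketch of that, assuming a LoRA-style wrapper like the one above (the model variable and filename are hypothetical): the per-user "diff" is just the small set of trainable adapter tensors, saved and reloaded on top of the shared frozen base.

    import torch

    # Save only the trainable adapter parameters as the per-user diff.
    adapter_state = {name: p.detach().cpu()
                     for name, p in model.named_parameters()
                     if p.requires_grad}
    torch.save(adapter_state, "user_memory_adapter.pt")

    # Later: apply the diff on top of the shared base model.
    # strict=False because the frozen base weights aren't in the file.
    model.load_state_dict(torch.load("user_memory_adapter.pt"), strict=False)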


But if you have, say, 50bn weights and you run backprop, you are going to update most of them (except the dropped-out ones, though which ones drop out changes on every token, I think). That means you need 50bn deltas. They might compress, but then you need extra compute to do the compression.
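
Back-of-envelope arithmetic (my numbers, assuming fp16 storage and an illustrative model shape) for why a full-weight delta is heavy but a low-rank diff is not:

    # Full-weight delta: 50bn params * 2 bytes (fp16) ~= 100 GB per user.
    full_delta_bytes = 50e9 * 2

    # Rank-8 LoRA diff on two attention projections per layer, for a
    # hypothetical 80-layer model with d_model = 8192:
    r, d, layers, projs = 8, 8192, 80, 2
    lora_params = layers * projs * 2 * r * d   # an A and a B per projection
    lora_bytes = lora_params * 2               # ~42 MB in fp16

    print(full_delta_bytes / lora_bytes)       # roughly 2,400x smaller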


You would do dropout on every epoch of training, not on every token.


I didn't know that; I'll take a closer look at the NanoGPT code and the torch.dropout docs then. Thanks!
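
For what it's worth, a quick check of what torch.nn.Dropout actually does: in training mode it samples a fresh mask on every forward call (i.e. every batch), so two calls on the same input drop different elements:

    import torch

    drop = torch.nn.Dropout(p=0.5)
    drop.train()       # dropout is only active in training mode
    x = torch.ones(8)
    print(drop(x))     # surviving elements are scaled by 1/(1-p)
    print(drop(x))     # a different random mask on each call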



