
Beginner here, so just musing:

I like the idea. You would need your own mutable copy of the model, which is usually huge, and you would need to run backprop, which adds computation. It might be doable for a local model smaller than GPT-3.5/4.

You also need to decide what is worth memorizing long term vs short term.



Coming back to this. LoRA training only touches the attention layers, and per the article that was sufficient for memorization. So we wouldn't be updating all of the model's weights in some kind of constant-context one-shot learning scheme.
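
To make that concrete, here is a minimal PyTorch-style sketch of the idea (the wrapper class and hyperparameters are my own illustration, not from the article): the base weights stay frozen, and only two small low-rank matrices per wrapped attention projection are trainable.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base layer plus a trainable low-rank update:
        # y = W x + (alpha / r) * B A x
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # base model stays read-only
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    # Hypothetical usage: wrap only the attention projections, e.g.
    # block.attn.q_proj = LoRALinear(block.attn.q_proj)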


> own mutable copy of the model, which is usually huge

It could just be the diff against the main model or similar.
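
A rough sketch of that, assuming a LoRA-style wrapper like the one above (the model variable and filename are hypothetical): the per-user "diff" is just the small set of trainable adapter tensors, saved and reloaded on top of the shared frozen base.

    import torch

    # Save only the trainable adapter parameters as the per-user diff.
    adapter_state = {name: p.detach().cpu()
                     for name, p in model.named_parameters()
                     if p.requires_grad}
    torch.save(adapter_state, "user_memory_adapter.pt")

    # Later: apply the diff on top of the shared base model.
    # strict=False because the frozen base weights aren't in the file.
    model.load_state_dict(torch.load("user_memory_adapter.pt"), strict=False)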


But if you have, say, 50bn weights and you run backprop, you are going to update most of them (except the dropped-out ones, though which ones drop out changes on every token, I think). That means you need 50bn deltas. They might compress, but then you need extra compute to do the compression.
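
Back-of-envelope arithmetic (my numbers, assuming fp16 storage and an illustrative model shape) for why a full-weight delta is heavy but a low-rank diff is not:

    # Full-weight delta: 50bn params * 2 bytes (fp16) ~= 100 GB per user.
    full_delta_bytes = 50e9 * 2

    # Rank-8 LoRA diff on two attention projections per layer, for a
    # hypothetical 80-layer model with d_model = 8192:
    r, d, layers, projs = 8, 8192, 80, 2
    lora_params = layers * projs * 2 * r * d   # an A and a B per projection
    lora_bytes = lora_params * 2               # ~42 MB in fp16

    print(full_delta_bytes / lora_bytes)       # roughly 2,400x smaller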


You would do dropout on every epoch of training, not on every token.


I didn't know that; I'll take a closer look at the NanoGPT code and the torch.dropout docs then. Thanks!
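
For what it's worth, a quick check of what torch.nn.Dropout actually does: in training mode it samples a fresh mask on every forward call (i.e. every batch), so two calls on the same input drop different elements:

    import torch

    drop = torch.nn.Dropout(p=0.5)
    drop.train()       # dropout is only active in training mode
    x = torch.ones(8)
    print(drop(x))     # surviving elements are scaled by 1/(1-p)
    print(drop(x))     # a different random mask on each call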



