I like the idea. You would need your own mutable copy of the model, which is usually huge. And you need to run backprop, so there is a bit more computation. It might be doable for a local model that is smaller than GPT-3.5/4.
You also need to decide what is worth memorizing long term vs short term.
Coming back to this. LoRA training updates only the attention layers, and per the article that was sufficient for memorization. So we wouldn't update all the model's weights in some kind of constant-context one-shot learning scheme; something like the sketch below would do.
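To make that concrete, here's a minimal sketch using Hugging Face's peft library. The model name and the "q_proj"/"v_proj" module names are assumptions: they match LLaMA-style models, and other architectures name their attention projections differently.

```python
# Minimal LoRA-on-attention-only sketch (assumes the `peft` and
# `transformers` libraries and a LLaMA-style model; module names vary).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
)
model = get_peft_model(model, config)

# Only the small adapter matrices are trainable; the base weights stay frozen.
model.print_trainable_parameters()
```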
But if you have, say, 50B weights and you run full backprop, you are going to update most of them (except the dropped-out ones, though which units drop out changes on every step, I think). That means you need 50B deltas. They might compress, but then you need extra compute to do the compression.
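Rough numbers, assuming fp16 deltas, a hidden size of 8192, 80 layers, and rank-8 LoRA adapters on the q/v projections (all illustrative):

```python
# Back-of-envelope: full per-weight deltas vs. LoRA adapter size.
full_params = 50e9      # 50B-parameter model
bytes_each = 2          # fp16
print(f"full delta snapshot: {full_params * bytes_each / 1e9:.0f} GB")  # ~100 GB

# LoRA adds two rank-r matrices per adapted d x d projection: 2*d*r params.
d, r = 8192, 8          # hidden size, LoRA rank
layers, mats = 80, 2    # layers, adapted matrices per layer (q and v)
lora_params = layers * mats * 2 * d * r
print(f"LoRA adapters: {lora_params * bytes_each / 1e6:.0f} MB")         # ~42 MB
```

So the LoRA route sidesteps the 50B-delta problem entirely: the adapter is about three orders of magnitude smaller than a full delta snapshot.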