In the PaLM-E paper (https://palm-e.github.io/), when they unfreeze the LLM and train it on new image data only, there is, as expected, a lot of catastrophic forgetting (CF) on NLP tasks, but very interestingly, the effect diminishes greatly with the scale of the pre-trained LLM.
From an average -87.3% performance drop on the 12B model to -61.6% on the 84B model, then just -3.9% on the 562B model. Felt like we were just shy of an insight breakthrough here.
Is avoiding CF potentially just a matter of sheer scale?
I think our experiments actually don't show catastrophic forgetting! The accuracy does not decrease as the loss gets worse -- the model is simply getting over-confident.
So I'm not even sure we're showing any problem to solve here -- it might be more of an opportunity, in fact!
I have been training a natural intelligence model for 3 years now and she still doesn’t get nuance. Things are either good or bad in her book: nothing in between. My plan is to let her train with binary good/bad labels till the age of 5 and then start smoothing the labels after that. Wonder if that works for your AI.
Related trick: I found that training two Natural Intelligence (NI) models in parallel, and having them train each other for most of the time, leads to significant leaps in capabilities. Notably, when one NI picks up a skill, it often results in spontaneous transfer learning - the other NI picks that skill up very quickly, much faster than it would through direct training.
This scales well, too. There are facilities that provide services of co-hosting and cross-training up to ~two dozen NI models in a shared environment - in my experience, this provides similar training benefits to running multiple NIs on your own, at a fraction of the cost.
(The facilities are exploiting some neat economies of scale. Talking to some employees, I learned that the transfer learning and co-activation are embarrassingly scalable: if you get two-three NIs to pick up a thing, all the rest immediately follow.)
This took a couple reads, but it’s funny. The bad news is that I’ve been training mine for 17 years and nuance is still something that needs more training.
In my mind I've built an 'emotional engine' to add nuance to a model's understanding: take something like Plutchik's wheel of emotions and create a high-quality multi-modal dataset based on that structure. Given that our current technology takes inspiration from the brain, it seems like having discrete models specialising in particular aspects of 'intelligence' that are then organised into a mixture of experts is an interesting area to explore, and perhaps more accessible, since smaller models require fewer resources.
I have code stubbed out for this in mitta.us. It has 9 states, based on the Plutchik wheel, with emojis for the states. The current state drives temperature and a few other things, and gets dropped into prompts.
The accounts aren't wired up by default to the AI and I am refactoring the templating system right now, but you can definitely start storing and searching things.
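A minimal sketch of that kind of state table in Python, with everything hypothetical: the nine states below are a guess at the eight Plutchik primaries plus a neutral state, and the emojis and temperature values are made up, not the real mitta.us code.

    # Hypothetical emotional-state table: each state carries an emoji and a sampling
    # temperature, and the current state gets folded into the prompt.
    STATES = {
        "joy":          {"emoji": "😊", "temperature": 0.9},
        "trust":        {"emoji": "🤝", "temperature": 0.7},
        "fear":         {"emoji": "😨", "temperature": 0.5},
        "surprise":     {"emoji": "😲", "temperature": 1.0},
        "sadness":      {"emoji": "😢", "temperature": 0.6},
        "disgust":      {"emoji": "🤢", "temperature": 0.6},
        "anger":        {"emoji": "😠", "temperature": 0.8},
        "anticipation": {"emoji": "🤔", "temperature": 0.8},
        "neutral":      {"emoji": "😐", "temperature": 0.7},
    }

    def build_request(state: str, user_message: str) -> dict:
        # Fold the current emotional state into the system prompt and sampling settings.
        s = STATES[state]
        system = f"Current emotional state: {state} {s['emoji']}. Respond in that register."
        return {
            "system_prompt": system,
            "temperature": s["temperature"],
            "user_message": user_message,
        }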
Cross-entropy loss can start getting worse due to the model becoming less calibrated, even as the classification accuracy continues to improve. I first heard that here: https://arxiv.org/abs/1706.04599
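A toy illustration of that effect with made-up numbers: two models that pick the same class on every example (so identical accuracy), but the over-confident one pays a much larger cross-entropy penalty on the one example it gets wrong.

    import numpy as np

    # Three examples; the true class is 0 for each.
    labels = np.array([0, 0, 0])

    # Model A is well calibrated. Model B makes the same argmax prediction on every
    # example (so the same accuracy), but is far more confident, including on the
    # one example both models get wrong.
    probs_a = np.array([[0.70, 0.30],
                        [0.70, 0.30],
                        [0.40, 0.60]])   # wrong on the last example, but hedged
    probs_b = np.array([[0.99, 0.01],
                        [0.99, 0.01],
                        [0.01, 0.99]])   # wrong on the last example, and sure of it

    def accuracy(probs):
        return (probs.argmax(axis=1) == labels).mean()

    def cross_entropy(probs):
        return -np.log(probs[np.arange(len(labels)), labels]).mean()

    for name, p in [("calibrated", probs_a), ("over-confident", probs_b)]:
        print(name, "accuracy:", accuracy(p), "cross-entropy:", round(cross_entropy(p), 3))
    # Both score 2/3 accuracy, but the over-confident model's loss is roughly 3x worse.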
Is this 'overconfidence' the leading explanation as to why LLMs continue to show qualitative improvement even after their test loss levels off?
I assume this means losing all the energy and compute invested for a model to know, perform, and infer on inputs already indexed(?) (What is the proper term here?)
But is this the premise: you lose all prior investment of resources into a... (I don't know the term for an AI's archetype of knowledge)? (BTW, I love the embedded etymology of "knowledge".)
Suppose we have trained a model to perform a certain set of tasks, and later we want to teach it a new task. Catastrophic forgetting means that teaching it the new task makes it unlearn some or all of its earlier tasks.
It occurs because training changes the weights of the model. The earlier set of weights was good for the previous tasks; the new set of weights is only good for the new task. Usually, special care must be taken to overcome catastrophic forgetting.
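You can watch this happen with a toy model in a few lines. A sketch in PyTorch (synthetic 2D data and a tiny MLP, nothing specific to LLMs): train on task A, then train only on task B, and task A accuracy collapses back to roughly chance.

    import torch
    from torch import nn

    torch.manual_seed(0)

    # Two toy binary classification tasks over 2D points with unrelated decision rules.
    def make_task(rule, n=2000):
        x = torch.randn(n, 2)
        return x, rule(x).long()

    task_a = make_task(lambda x: x[:, 0] > 0)   # task A: sign of the first coordinate
    task_b = make_task(lambda x: x[:, 1] > 0)   # task B: sign of the second coordinate

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    def train(x, y, steps=300):
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    def accuracy(x, y):
        with torch.no_grad():
            return (model(x).argmax(dim=1) == y).float().mean().item()

    train(*task_a)
    print("task A accuracy after training on A:", accuracy(*task_a))  # close to 1.0

    train(*task_b)  # train on task B only; no task A examples are shown again
    print("task A accuracy after training on B:", accuracy(*task_a))  # drops toward ~0.5
    print("task B accuracy:", accuracy(*task_b))                      # close to 1.0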
Yeah, this is essentially how fine-tuned models work. If you fine-tune Stable Diffusion to produce anime images, it might forget how to produce images in any other style, but it will become much better at anime images than the base model. If anime images are the art style you're after, this is a good trade. Same with fine-tuning LLMs for SQL or whatever.
Can it be taught "contextual matrices", whereby it builds a new layer of constructs but preserves the others, then cross-learns between parameters or something? (Sorry for my poor lexicon, I'm wet-learning :-)
But imagine all LLMs, in a macro view, like a sponge entity.
We wouldn't know how to construct those matrices because we don't know where in the layers what knowledge is represented. One thing that helps a little bit is freezing the lower layers, so at least the model won't forget its most fundamental knowledge.
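In PyTorch that amounts to flipping requires_grad off for the parameters you want to protect. A rough sketch, assuming a Hugging Face GPT-2 model (the parameter name prefixes follow GPT-2's module naming; other architectures name their layers differently):

    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Freeze the token/position embeddings and the first 6 of GPT-2's 12 transformer
    # blocks, so fine-tuning only updates the upper half of the network.
    frozen_prefixes = ["transformer.wte", "transformer.wpe"] + [
        f"transformer.h.{i}." for i in range(6)
    ]

    for name, param in model.named_parameters():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,}")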
Note that the only reason things are catastrophically forgotten is that the original examples are not shown again. If the model learns in a single shot, there might simply be no time to show both the old and the new examples. I don't think it would have a significant effect, or else we'd have noticed it a lot sooner (i.e. the training of these LLMs would get less effective from a certain point).
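Showing the old examples again is in fact the classic mitigation, usually called rehearsal or replay: keep a buffer of old-task examples and mix a few into every new-task batch. A framework-agnostic sketch of just the batch-mixing step (old_examples and new_examples are placeholder lists of whatever your training examples look like):

    import random

    def mixed_batches(new_examples, old_examples, batch_size=32, replay_fraction=0.25):
        """Yield new-task batches with a slice of old-task examples replayed in each,
        so the old behaviour keeps being reinforced during fine-tuning."""
        n_replay = int(batch_size * replay_fraction)   # e.g. 8 old examples per batch
        n_new = batch_size - n_replay
        new_examples = list(new_examples)
        random.shuffle(new_examples)
        for i in range(0, len(new_examples), n_new):
            batch = new_examples[i:i + n_new] + random.sample(old_examples, n_replay)
            random.shuffle(batch)
            yield batch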
You could simulate this by selectively locking and unlocking 'banks' of weights from a larger model to keep the influence there during training and to avoid losing them. Sort of a selective write-protect.
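A rough sketch of that selective write-protect idea in PyTorch: keep a binary mask per parameter tensor and zero out the gradients of the locked entries before each optimizer step. Which weights to lock is the hard open question, so the random choice below is purely illustrative; also note that optimizers with momentum or weight decay can still nudge "locked" weights slightly, unlike plain SGD.

    import torch

    def make_write_protect_masks(model, frac_locked=0.5):
        """One mask per parameter tensor: 0.0 = locked (write-protected), 1.0 = trainable.
        The locked entries here are chosen at random, purely for illustration."""
        return {
            name: (torch.rand_like(p) > frac_locked).float()
            for name, p in model.named_parameters()
        }

    def masked_step(model, optimizer, loss, masks):
        optimizer.zero_grad()
        loss.backward()
        # Zero the gradient on write-protected weights so this step can't change them.
        for name, p in model.named_parameters():
            if p.grad is not None:
                p.grad *= masks[name]
        optimizer.step()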
<"where in the layers what knowledge is represented."
This seems like a ripe angle for evolving our understanding of AI's use in LLMs... Can we throw AIs at AIs (is AI synonymous with LLM)? Can we throw LLMs at LLMs and have them recursively learn from themselves? Or is it a Rat King. AI recognize AI in the GangPlane.
That's not how LLMs work. LLMs complete documents; they don't make statements about LLMs unless you explain to them how they should do it and give them all the information they need. If you could extract the information from an LLM well enough to supply that to an LLM, along with an explanation of how to summarize the behaviour of the LLM to a human, we would have already given that to a PhD student instead. A PhD student is a little bit slower than an LLM, but they require a lot less explanation.
In any case, looking at and understanding how a neural network encodes information is like gene editing. Perhaps you could isolate a gene in the human genome that achieves something interesting, like giving a child blue eyes. But even if you could, there's a chance you break something else when you modify that gene and give the child health risks. Since all neurons in a deep neural network are interconnected, there is a butterfly effect that makes them inherently somewhat of a black box.