In the PaLM-E paper (https://palm-e.github.io/), when they unfreeze the LLM and train it on new image data only, there is, as expected, a lot of catastrophic forgetting (CF) on NLP tasks, but very interestingly, the effect diminishes greatly with the scale of the pre-trained LLM.
From an average -87.3% performance drop on the 12B model to -61.6% on the 84B model, then just -3.9% on the 562B model. Felt like we were just shy of an insight breakthrough here.
Is avoiding CF potentially just a matter of sheer scale?
I think our experiments actually don't show catastrophic forgetting! The accuracy does not decrease as the loss gets worse -- the model is simply getting over-confident.
So I'm not even sure we're showing any problem to solve here -- it might be more of an opportunity, in fact!
I have been training a natural intelligence model for 3 years now and she still doesn’t get nuance. Things are either good or bad in her book: nothing in between. My plan is to let her train with binary good/bad labels till the age of 5 and then start smoothing the labels after that. Wonder if that works for your AI.
Related trick: I found that training two Natural Intelligence (NI) models in parallel, and having them train each other for most of the time, leads to significant leaps in capabilities. Notably, when one NI picks up a skill, it often results in spontaneous transfer learning - the other NI picks that skill up very quickly, much faster than it would through direct training.
This scales well, too. There are facilities that provide services of co-hosting and cross-training up to ~two dozen NI models in a shared environment - in my experience, this provides similar training benefits to running multiple NIs on your own, at a fraction of the cost.
(The facilities are exploiting some neat economies of scale. Talking to some employees, I learned that the transfer learning and co-activation are embarrassingly scalable: if you get two-three NIs to pick up a thing, all the rest immediately follow.)
This took a couple reads, but it’s funny. The bad news is that I’ve been training mine for 17 years and nuance is still something that needs more training.
In my mind I've built an 'emotional engine' to add nuance to a model's understanding: take something like Plutchik's wheel of emotions and create a high-quality multi-modal dataset based on that structure. Given that our current technology takes inspiration from the brain, it seems like having discrete models specialising in particular aspects of 'intelligence' that are then organised into a mixture of experts is an interesting area to explore, and perhaps more accessible, since smaller models require fewer resources.
I have code stubbed out for this in mitta.us. It has 9 states, based on the Plutchik wheel, with emojis for the states. The current state drives temperature and a few other things, and gets dropped into prompts.
The accounts aren't wired up by default to the AI and I am refactoring the templating system right now, but you can definitely start storing and searching things.
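A minimal sketch of that kind of state table in Python, with everything hypothetical: the nine states below are a guess at the eight Plutchik primaries plus a neutral state, and the emojis and temperature values are made up, not the real mitta.us code.

    # Hypothetical emotional-state table: each state carries an emoji and a sampling
    # temperature, and the current state gets folded into the prompt.
    STATES = {
        "joy":          {"emoji": "😊", "temperature": 0.9},
        "trust":        {"emoji": "🤝", "temperature": 0.7},
        "fear":         {"emoji": "😨", "temperature": 0.5},
        "surprise":     {"emoji": "😲", "temperature": 1.0},
        "sadness":      {"emoji": "😢", "temperature": 0.6},
        "disgust":      {"emoji": "🤢", "temperature": 0.6},
        "anger":        {"emoji": "😠", "temperature": 0.8},
        "anticipation": {"emoji": "🤔", "temperature": 0.8},
        "neutral":      {"emoji": "😐", "temperature": 0.7},
    }

    def build_request(state: str, user_message: str) -> dict:
        # Fold the current emotional state into the system prompt and sampling settings.
        s = STATES[state]
        system = f"Current emotional state: {state} {s['emoji']}. Respond in that register."
        return {
            "system_prompt": system,
            "temperature": s["temperature"],
            "user_message": user_message,
        }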
Cross-entropy loss can start getting worse due to the model becoming less calibrated, even as the classification accuracy continues to improve. I first heard that here: https://arxiv.org/abs/1706.04599
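A toy illustration of that effect with made-up numbers: two models that pick the same class on every example (so identical accuracy), but the over-confident one pays a much larger cross-entropy penalty on the one example it gets wrong.

    import numpy as np

    # Three examples; the true class is 0 for each.
    labels = np.array([0, 0, 0])

    # Model A is well calibrated. Model B makes the same argmax prediction on every
    # example (so the same accuracy), but is far more confident, including on the
    # one example both models get wrong.
    probs_a = np.array([[0.70, 0.30],
                        [0.70, 0.30],
                        [0.40, 0.60]])   # wrong on the last example, but hedged
    probs_b = np.array([[0.99, 0.01],
                        [0.99, 0.01],
                        [0.01, 0.99]])   # wrong on the last example, and sure of it

    def accuracy(probs):
        return (probs.argmax(axis=1) == labels).mean()

    def cross_entropy(probs):
        return -np.log(probs[np.arange(len(labels)), labels]).mean()

    for name, p in [("calibrated", probs_a), ("over-confident", probs_b)]:
        print(name, "accuracy:", accuracy(p), "cross-entropy:", round(cross_entropy(p), 3))
    # Both score 2/3 accuracy, but the over-confident model's loss is roughly 3x worse.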
Is this 'overconfidence' the leading explanation as to why LLMs continue to show qualitative improvement even after their test loss levels off?
I assume this means losing all the energy and compute invested for a model to know, perform, and infer on inputs already indexed(?) (What is the proper term here?)
But is this the premise: you lose all prior investment of resources into a... (I don't know the term for an AI's archetype of knowledge)? (BTW, I love the embedded etymology of "knowledge".)
Suppose we have trained a model to perform a certain set of tasks, and later we want to teach it a new task. Catastrophic forgetting means that teaching it the new task makes it unlearn some or all of its earlier tasks.
It occurs because training changes the weights of the model. The earlier set of weights was good for the previous tasks; the new set of weights is only good for the new task. Usually, special care must be taken to overcome catastrophic forgetting.
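You can watch this happen with a toy model in a few lines. A sketch in PyTorch (synthetic 2D data and a tiny MLP, nothing specific to LLMs): train on task A, then train only on task B, and task A accuracy collapses back to roughly chance.

    import torch
    from torch import nn

    torch.manual_seed(0)

    # Two toy binary classification tasks over 2D points with unrelated decision rules.
    def make_task(rule, n=2000):
        x = torch.randn(n, 2)
        return x, rule(x).long()

    task_a = make_task(lambda x: x[:, 0] > 0)   # task A: sign of the first coordinate
    task_b = make_task(lambda x: x[:, 1] > 0)   # task B: sign of the second coordinate

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    def train(x, y, steps=300):
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    def accuracy(x, y):
        with torch.no_grad():
            return (model(x).argmax(dim=1) == y).float().mean().item()

    train(*task_a)
    print("task A accuracy after training on A:", accuracy(*task_a))  # close to 1.0

    train(*task_b)  # train on task B only; no task A examples are shown again
    print("task A accuracy after training on B:", accuracy(*task_a))  # drops toward ~0.5
    print("task B accuracy:", accuracy(*task_b))                      # close to 1.0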
Yeah, this is essentially how fine-tuned models work. If you fine-tune Stable Diffusion to produce anime images, it might forget how to produce images in any other style, but it will become much better at anime images than the base model. If anime images are the art style you're after, this is a good trade. Same with fine-tuning LLMs for SQL or whatever.
Can it be taught "contextual matrices", whereby it builds a new layer of constructs but preserves the others, then cross-learns between parameters or something? (Sorry for my poor lexicon, I'm wet-learning :-)
But imagine all LLMs, in a macro view, like a sponge entity.
We wouldn't know how to construct those matrices because we don't know where in the layers what knowledge is represented. One thing that helps a little bit is freezing the lower layers, so at least the model won't forget its most fundamental knowledge.
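In PyTorch that amounts to flipping requires_grad off for the parameters you want to protect. A rough sketch, assuming a Hugging Face GPT-2 model (the parameter name prefixes follow GPT-2's module naming; other architectures name their layers differently):

    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Freeze the token/position embeddings and the first 6 of GPT-2's 12 transformer
    # blocks, so fine-tuning only updates the upper half of the network.
    frozen_prefixes = ["transformer.wte", "transformer.wpe"] + [
        f"transformer.h.{i}." for i in range(6)
    ]

    for name, param in model.named_parameters():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,}")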
Note that the only reason things are catastrophically forgotten is that the original examples are not shown again. If the model learns in a single shot, there might simply be no time to show both the old and the new examples. I don't think it would have a significant effect, or else we'd have noticed it a lot sooner (i.e. the training of these LLMs would get less effective from a certain point).
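Showing the old examples again is in fact the classic mitigation, usually called rehearsal or replay: keep a buffer of old-task examples and mix a few into every new-task batch. A framework-agnostic sketch of just the batch-mixing step (old_examples and new_examples are placeholder lists of whatever your training examples look like):

    import random

    def mixed_batches(new_examples, old_examples, batch_size=32, replay_fraction=0.25):
        """Yield new-task batches with a slice of old-task examples replayed in each,
        so the old behaviour keeps being reinforced during fine-tuning."""
        n_replay = int(batch_size * replay_fraction)   # e.g. 8 old examples per batch
        n_new = batch_size - n_replay
        new_examples = list(new_examples)
        random.shuffle(new_examples)
        for i in range(0, len(new_examples), n_new):
            batch = new_examples[i:i + n_new] + random.sample(old_examples, n_replay)
            random.shuffle(batch)
            yield batch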
You could simulate this by selectively locking and unlocking 'banks' of weights from a larger model to keep the influence there during training and to avoid losing them. Sort of a selective write-protect.
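A rough sketch of that selective write-protect idea in PyTorch: keep a binary mask per parameter tensor and zero out the gradients of the locked entries before each optimizer step. Which weights to lock is the hard open question, so the random choice below is purely illustrative; also note that optimizers with momentum or weight decay can still nudge "locked" weights slightly, unlike plain SGD.

    import torch

    def make_write_protect_masks(model, frac_locked=0.5):
        """One mask per parameter tensor: 0.0 = locked (write-protected), 1.0 = trainable.
        The locked entries here are chosen at random, purely for illustration."""
        return {
            name: (torch.rand_like(p) > frac_locked).float()
            for name, p in model.named_parameters()
        }

    def masked_step(model, optimizer, loss, masks):
        optimizer.zero_grad()
        loss.backward()
        # Zero the gradient on write-protected weights so this step can't change them.
        for name, p in model.named_parameters():
            if p.grad is not None:
                p.grad *= masks[name]
        optimizer.step()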
<"where in the layers what knowledge is represented."
This seems like a ripe angle for evolving our understanding of AI's use in LLMs... Can we throw AIs at AIs (is AI synonymous with LLM)? Can we throw LLMs at LLMs and have them recursively learn from themselves? Or is it a Rat King. AI recognize AI in the GangPlane.
That's not how LLMs work. LLMs complete documents; they don't make statements about LLMs unless you explain to them how they should do it and give them all the information they need. If you could extract the information from an LLM well enough to supply that to an LLM, along with an explanation of how to summarize the behaviour of the LLM to a human, we would have already given that to a PhD student instead. A PhD student is a little bit slower than an LLM, but they require a lot less explanation.
In any case, looking at and understanding how a neural network encodes information is like gene editing. Perhaps you could isolate a gene in the human genome that achieves something interesting, like giving a child blue eyes. But even if you could, there's a chance you break something else when you modify that gene and give the child health risks. Since all neurons in a deep neural network are interconnected, there is a butterfly effect that makes them inherently somewhat of a black box.