Just an FYI, Mixtral is a sparse Mixture of Experts model: it has ~47B total parameters (which is what determines its memory footprint) but only ~13B active parameters per token. For those interested in reading more about how it works: https://arxiv.org/pdf/2401.04088.pdf
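For intuition on why total and active counts differ, here's a toy sketch in PyTorch (not Mixtral's actual code; expert count, top-2 routing, and dimensions are just illustrative): every expert's weights sit in memory, but each token only passes through the couple of experts the router picks.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoEFFN(nn.Module):
        # Toy sparse-MoE feed-forward layer: 8 experts, top-2 routing per token.
        # All 8 experts live in memory (total params), but each token only
        # executes 2 of them (active params).
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                      # x: (tokens, d_model)
            logits = self.router(x)                # (tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):         # only the selected experts run
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out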
For those interested in recent MoE work, some groups have been doing their own MoE adaptations, like Sparsetral. This one is pretty exciting because it's basically an MoE LoRA implementation: a 16x7B model that comes in at only 9.4B total parameters (the original paper introduced Camelidae-8x34B, which ran at 38B total parameters, 35B activated). Best place to start for discussion and links: https://www.reddit.com/r/LocalLLaMA/comments/1ajwijf/model_r...
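The general MoE-over-LoRA shape (again, just a rough sketch, not Sparsetral's actual implementation; names and sizes are made up) is that the "experts" share one frozen base weight and each expert is only a small low-rank delta, which is how 16 experts can add so few total parameters on top of the base model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LoRAExpertLinear(nn.Module):
        # One frozen base Linear shared by all experts; each expert is just a
        # small low-rank (LoRA) delta, so 16 experts cost far fewer parameters
        # than 16 full copies of the layer.
        def __init__(self, d_in, d_out, n_experts=16, rank=8, top_k=4):
            super().__init__()
            self.base = nn.Linear(d_in, d_out, bias=False)
            self.base.weight.requires_grad_(False)       # frozen pretrained weight
            self.router = nn.Linear(d_in, n_experts, bias=False)
            self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
            self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
            self.top_k = top_k

        def forward(self, x):                            # x: (tokens, d_in)
            w, idx = F.softmax(self.router(x), -1).topk(self.top_k, -1)
            y = self.base(x)                             # shared dense path
            for slot in range(self.top_k):               # add the selected LoRA deltas
                A = self.A[idx[:, slot]]                 # (tokens, d_in, rank)
                B = self.B[idx[:, slot]]                 # (tokens, rank, d_out)
                delta = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)
                y = y + w[:, slot, None] * delta
            return y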