Is there a easy way to run a large language model and/or speech synthesis model ...

takantri · on Jan 2, 2023

For LLMs, the closest thing that comes to mind is KoboldAI[1]. The community isn't as big as Stable Diffusion's, but the Discord server is pretty active. I'm an active member of the community who likes to inform others on it (you can see my previous Hacker News comment was about the same thing, haha).

Like Stable Diffusion, it's a web UI (vaguely reminiscent of NovelAI's) that uses a backend (in this case, Huggingface Transformers). You can use different model architectures, as early as GPT-2 to the newer ones like BigScience's BLOOM, Meta's OPT, and EleutherAI's GPT-Neo and Pythia models, just as long as it was implemented in Huggingface.

They have official support for Google Colab[2][3]; most of the models shown are finetunes on novels (Janeway), choose-your-own-adventures (Nerys / Skein / Adventure), or erotic literature (Erebus / Shinen). You can use the models listed or provide a Huggingface URL.

[1] - https://github.com/koboldai/koboldai-client (source code)

[2] - https://colab.research.google.com/github/koboldai/KoboldAI-C... (TPU colab; 13B and 20B models)

[3] - https://colab.research.google.com/github/koboldai/KoboldAI-C... (GPU colab; 6B models and lower)

nmfisher · on Jan 3, 2023

I'm squarely in Kobold's userbase but hadn't come across it until your post, so thanks for your efforts to spread awareness.

Roark66 · on Jan 2, 2023

There is (for many, but not all large models). Specifically there is huggingface's accelerate library that let's you run the model partially on your gpu, partially on cpu/ram and what doesn't fit in ram is cached in nvme storage (a mirror of two fast drives recommended).

I didn't have much luck with stock accelerate, but once gpu is disabled (so it runs only on cpu offloading to nvme storage where ram is insufficient) worked pretty well with me. (there is a small code change that has to be done as the stock software refuses to run without gpu-it is a simple change described in its github issues). My gpu is 8gb vram, but this way I managed to run 7b parameter models. In principle I could run a lot larger ones, but of course it takes a lot more time. The 7b bloom takes 90s for one inference and additional 60s to load the model (from a spinning disc array) initially.

borzunov · on Jan 2, 2023

Really large (GPT-3-sized) language models have much more parameters than diffusion models, so it's difficult to load them locally unless you have a server with 8x 3090/3x A100 GPUs. Petals is the only way to fine-tune and inference 100B+ parameter models from Colab, as far as I know.

thot_experiment · on Jan 2, 2023

Interesting, how does that work with the multiple GPUs? I'm not familiar with the internal workings of these models, is there anywhere where I can get a brief rundown of how the processing is split. I imagine there can't me much swapping between GPUs as that seems prohibitively slow? How is the model split such that it can be worked on in parallel by multiple GPUs w/o being bottlenecked by IO?

borzunov · on Jan 2, 2023

I think this is a relevant link for you: https://huggingface.co/transformers/v4.9.0/parallelism.html

For large LMs, people usually use tensor-parallelism (TP) or pipeline-parallelism (PP). TP involves lots of communication, but uses all GPUs 100% of the time and works faster. PP requires much less communication, but may keep some GPUs idle while they are waiting for data from others.

Usually, TP is used when you have good communication channels between GPUs (e.g., they are in one data center and connected with NVLink), while PP is used when communication is a bottleneck (like in Petals, where the data is sent over the Internet, which is much slower than NVLink).

nmitchko · on Jan 2, 2023

You can split the model across devices with huggingface accelerate library.

Check out the infer_auto_memory_map metho which will optimize the model for your configuration (multi gpu, ram, nvme) and then run dispatch model on with that memory map.

zone411 · on Jan 2, 2023

You can read all the gory details here: https://arxiv.org/pdf/2207.00032.pdf

borzunov · on Jan 2, 2023

clarification: You can also use offloading on Colab, but inference with offloading is at least 10x slower (see other comment threads). So it can't really be used for interactive inference, but may be used for fine-tuning with large batches/sequence lengths.

Shindi · on Jan 3, 2023

Surprised mentioned GPT-J! Here is a colab link: https://colab.research.google.com/github/NielsRogge/Transfor...

Although you need a premium GPU. I admit it's not as good at zero shot or 1-shot as GPT-3 but if you provide examples, you can get as good of output. I feel like the team behind it needs better marketing.

thot_experiment · on Jan 3, 2023

Nice, this looks pretty good. I have Google Colab Pro+ so I can use the 40GB GPUs there. Am I correct that I could also run this locally on 2X 11gb 1080Ti?

FireInsight · on Jan 2, 2023

Not sure about TTS, but I've trained GPT-2 (a pytorch implementation I think) on my own data and it worked pretty well, also tried eleutherai's 6B model but, couldn't figure out how to run it.. About an "easy way", I don't think such user interface like what Stable Diffusion has got exists as of now.