Just to make sure I've got this right: running a llamafile in a shell script to do something like rename files in a directory, it has to open and load that executable every time a new filename is passed to it, right? So all that memory is loaded and unloaded each time? Or is there some fancy caching happening that I don't understand? (The first time I ran the image caption example it took 13s on my M1 Pro; the second time it only took 8s, and every subsequent run takes that same amount of time.)
If you were doing a LOT of files like this, I would think you'd really want to run the model in a process where the weights are only loaded once and stay there while the process loops.
(this is all still really useful and fascinating; thanks Justine)
The models are memory-mapped from disk, so the kernel handles reading them into memory. As long as nothing else is requesting that RAM, those pages stay cached between invocations of the command. On my 128 GB workstation, I can use several different 7B models on CPU and they all remain cached.
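For anyone curious what "memory-mapped" means in practice, here's a minimal Python sketch (not llamafile's actual code) showing the idea: mapping the file doesn't copy it into private process memory, it just exposes the kernel's page cache, so a second run touches already-cached pages. The model path is made up; point it at any GGUF file.

```python
import mmap

# Hypothetical path; substitute whatever model file you actually have.
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

with open(MODEL_PATH, "rb") as f:
    # Read-only, file-backed mapping: pages are shared with the kernel
    # page cache rather than copied into this process's private memory.
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    # Touching a page faults it in from disk the first time; on later
    # runs (or from other processes mapping the same file) the page is
    # usually still cached and the access is effectively free.
    magic = mm[:4]  # GGUF files start with the b"GGUF" magic bytes
    print(magic)

    mm.close()
```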
The difference between running llama.cpp main versus the server + a POST HTTP request is fairly substantial but not earth-shattering: roughly 6s versus 2s for a few lines of completion, with models that take about 8 GB of VRAM. I'm running a 3090 with 96 GB of RAM, all inference on the GPU. If you're really doing batch work, you definitely want to persist the model between completions.
OTOH you're stuck with the model you loaded into the server, while if you load on demand you can swap models in and out. This is vital for multimodal image interrogation, since other models don't understand the projected image tokens.
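For the batch case, the loop side is just HTTP calls once the server is up. A rough Python sketch, assuming a llama.cpp-style server (which llamafile also embeds) is already running on localhost:8080 and exposes the /completion endpoint; field names can vary by version, and the renaming prompt is purely illustrative:

```python
import json
import urllib.request

# Assumes you've already started the server, e.g. with llamafile or
# llama.cpp's server binary pointed at your model; check your version's
# flags for the port and model options.
SERVER = "http://127.0.0.1:8080/completion"

def complete(prompt: str, n_predict: int = 64) -> str:
    # POST a JSON body to the server; the weights stay resident in the
    # server process, so each call pays only inference cost, not load cost.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        SERVER, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Illustrative batch loop: one request per file, no model reloads.
for name in ["IMG_0001.jpg", "IMG_0002.jpg"]:
    print(name, "->", complete(f"Suggest a short descriptive filename for {name}:"))
```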