Just to make sure I've got this right: running a llamafile in a shell script to do something like rename files in a directory, it has to open and load that executable every time a new filename is passed to it, right? So all that memory is loaded and unloaded each time? Or is there some fancy caching happening that I don't understand? (The first time I ran the image caption example it took 13s on my M1 Pro; the second time it only took 8s, and every subsequent run takes that same amount of time.)
If you were doing a LOT of files like this, I would think you'd really want to run the model in a process where the weights are only loaded once and stay there while the process loops.
(this is all still really useful and fascinating; thanks Justine)
The models are memory-mapped from disk, so the kernel handles reading them into memory. As long as nothing else is requesting that RAM, those pages stay cached between invocations of the command. On my 128 GB workstation, I can use several different 7B models on CPU and they all remain cached.
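For anyone curious what "memory-mapped" means in practice, here's a minimal Python sketch (not llamafile's actual code) showing the idea: mapping the file doesn't copy it into private process memory, it just exposes the kernel's page cache, so a second run touches already-cached pages. The model path is made up; point it at any GGUF file.

```python
import mmap

# Hypothetical path; substitute whatever model file you actually have.
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

with open(MODEL_PATH, "rb") as f:
    # Read-only, file-backed mapping: pages are shared with the kernel
    # page cache rather than copied into this process's private memory.
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    # Touching a page faults it in from disk the first time; on later
    # runs (or from other processes mapping the same file) the page is
    # usually still cached and the access is effectively free.
    magic = mm[:4]  # GGUF files start with the b"GGUF" magic bytes
    print(magic)

    mm.close()
```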
The difference between running llama.cpp main versus the server + a POST HTTP request is fairly substantial but not earth-shattering: roughly 6s versus 2s for a few lines of completion, with models that take about 8 GB of VRAM. I'm running a 3090 with 96 GB of RAM, all inference on the GPU. If you're really doing batch work, you definitely want to persist the model between completions.
OTOH you're stuck with the model you loaded into the server, while if you load on demand you can swap models in and out. This is vital for multimodal image interrogation, since other models don't understand the projected image tokens.
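For the batch case, the loop side is just HTTP calls once the server is up. A rough Python sketch, assuming a llama.cpp-style server (which llamafile also embeds) is already running on localhost:8080 and exposes the /completion endpoint; field names can vary by version, and the renaming prompt is purely illustrative:

```python
import json
import urllib.request

# Assumes you've already started the server, e.g. with llamafile or
# llama.cpp's server binary pointed at your model; check your version's
# flags for the port and model options.
SERVER = "http://127.0.0.1:8080/completion"

def complete(prompt: str, n_predict: int = 64) -> str:
    # POST a JSON body to the server; the weights stay resident in the
    # server process, so each call pays only inference cost, not load cost.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        SERVER, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Illustrative batch loop: one request per file, no model reloads.
for name in ["IMG_0001.jpg", "IMG_0002.jpg"]:
    print(name, "->", complete(f"Suggest a short descriptive filename for {name}:"))
```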