The difference between running llama.cpp's main binary per request versus keeping the server up and POSTing HTTP requests to it is fairly substantial but not earth-shattering - roughly ~6s vs ~2s for a few lines of completion, with models that fit in 8 GB of VRAM. I'm running a 3090 with 96 GB of RAM, all inference on the GPU. If you're really doing batch work you definitely want to persist the model between completions.
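For reference, here's a minimal sketch of the persistent-server path in Python, assuming llama-server is already running on localhost:8080 and exposing its default /completion endpoint (port, endpoint name, and the "content" response field are assumptions from memory - check your build's docs):

    import requests

    def complete(prompt: str, n_predict: int = 64) -> str:
        # The model stays resident in the server process, so only the
        # first request pays the multi-second load cost.
        resp = requests.post(
            "http://localhost:8080/completion",
            json={"prompt": prompt, "n_predict": n_predict},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["content"]

    for p in ["def fib(n):", "SELECT name FROM users WHERE"]:
        print(complete(p))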
OTOH you're stuck with whatever model you loaded via the server, whereas loading on demand lets you swap models in and out between runs. That's vital for multimodal image interrogation, since other models don't understand the projected image tokens.
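And the load-on-demand path, roughly: spawn the CLI per completion so you can point -m at a different GGUF each call. Model paths below are placeholders, and the binary name varies by llama.cpp version (it's llama-cli in recent builds, formerly main):

    import subprocess

    def run_once(model_path: str, prompt: str, n_predict: int = 64) -> str:
        # Each invocation reloads the model from scratch, but nothing
        # ties you to a single file between calls.
        out = subprocess.run(
            ["./llama-cli", "-m", model_path, "-p", prompt,
             "-n", str(n_predict), "-ngl", "99"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout  # includes the echoed prompt plus the completion

    print(run_once("models/coder-7b.Q5_K_M.gguf", "def fib(n):"))
    print(run_once("models/general-7b.Q5_K_M.gguf", "Summarize: ..."))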