>>2503
Well, the simplest way to get started is to download the latest release of koboldcpp:
https://github.com/LostRuins/koboldcpp/releases/tag/v1.63
It's the simplest of back-ends (and it also comes with a rather basic bitch front-end), but it has an API you can use to connect it to other front-ends such as SillyTavern:
https://github.com/SillyTavern/SillyTavern
And then you're basically looking for GGUF-format models on huggingface.co, of which there are many... countless...
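If you want to poke the koboldcpp API directly instead of going through a front-end, a raw request looks roughly like the sketch below. This assumes the default local port (5001) and the KoboldAI-style /api/v1/generate endpoint that koboldcpp exposes; if your version answers differently, just print the whole response and look at its shape.

import requests

# Prompt and sampler settings go in the JSON body; koboldcpp fills in
# sane defaults for anything you leave out.
payload = {
    "prompt": "Write one sentence about quantization:\n",
    "max_length": 120,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])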
System-requirements-wise, you definitely want some kind of GPU to accelerate prompt processing, but beyond that GGUF is the format for a CPU-optimized tensor library (llama.cpp/GGML), so you can keep as much of the model in plain RAM as you need to (although it will be much faster if you can offload all of it into VRAM).
Then I suppose there's quantization, which needs mentioning. We'll use Llama-3-8B-Instruct as an example.
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main
Generally the models are released in "fp16" form, where the model weights are expressed as 16-bit floating-point numbers (some of them are actually fp32 and some are actually smaller, but that's beside the point).
Basically, quantization is compression, for lack of a better analogy. For example, in Q8_0 most of the tensor values have been reduced down to 8-bit values. For text generation, 8-bit is generally considered lossless (it's a little bit lossy, but typically within the noise of fp16).
4-bit quantization is as low as I'd ever recommend going. At that point, for most tensor values you're basically cramming two model parameters into every single byte. You'll occasionally start getting weird things like reversed possessive clauses and whatnot, a jumbling of close concepts, basically. Q5 is probably the most popular quantization level; it provides a pretty solid accuracy boost over Q4, although I generally prefer to run Q8 or fp16 wherever possible, just for the unvarnished experience.
The majority of a model's memory requirement can be calculated from its parameter count, pretty straightforward: fp16 is 2 bytes per parameter, so an 8-billion-parameter model would need 16 gigabytes to hold all of the parameters at fp16, about 8 gigs at Q8 and about 4 gigs at Q4 (it's not quite that simple, since not all tensors are quantized at the same level). But then you also have to budget overhead for context (how much text the model can see when calculating what to output) and the compute buffer.
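If you want that back-of-the-envelope math in code, here's a rough sketch. The bits-per-weight figures are the nominal ones from above; real GGUF files come out a bit larger because of block scales and the tensors kept at higher precision.

# Rough weight-size math from the paragraph above. Treat these as lower
# bounds, not exact file sizes.
def weights_gb(params_billions, bits_per_weight):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bpw in [("fp16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"8B at {label}: ~{weights_gb(8, bpw):.0f} GB of weights")
# prints ~16, ~8 and ~4 GB; add a chunk on top for context (KV cache)
# and the compute buffer before deciding what fits on your card.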
Llama-3 is a little bleeding-edge at the moment, so there are still a lot of quirks to be ironed out on front-ends and back-ends to make it work nicely, but theoretically a 6-gigabyte GPU would let you fully offload Llama-3-8B in 4-bit. And even if not, even if you just had some shitty little 1650 or something, it would still handle the batch/prompt processing, and because the model is relatively small you'd still get usable speed out of it even with most of the model running off the CPU.
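For partial offload, a crude way to ballpark how many layers to put on the GPU (the --gpulayers setting in koboldcpp) is to split the quantized weight size evenly across Llama-3-8B's 32 transformer layers and leave some headroom. This is a rough sketch under those assumptions, not a real VRAM calculator; the even split ignores the embedding/output tensors, and the reserve for KV cache and compute buffer is a guess.

# Crude layer-count estimate for partial offload.
def layers_that_fit(vram_gb, model_gb, n_layers=32, reserve_gb=1.0):
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

print(layers_that_fit(vram_gb=6, model_gb=4.7))  # 32: full offload in 4-bit
print(layers_that_fit(vram_gb=4, model_gb=4.7))  # 20: the rest stays in RAM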
And don't even get me started on sampler settings and prompt templating.