When I first got into local LLMs nearly 3 years ago, in mid-2023, the frontier closed models were of course impressively capable.

I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).

Fast forward to this month: I'm revisiting local LLMs. (Although I no longer have the gaming PC. Cost-of-living crisis, anyone? 😫)

And the 31B-size models look very sufficient. #Qwen has taken the helm in this class. That's still quite expensive to set up locally, although within grasp.

I'm rooting for the edge-computing models now - the ~2B-size models. Thanks to their low footprint, they are practical for many people to run 24/7 on an SBC at home.

But these edge models are in the 'curiosity category' now.

    • SuspciousCarrot78@lemmy.world · 4 days ago

      It's not just you. But while they may be natively “dumb”, they can be augmented quite significantly. Even adding a simple web-search tool can help a lot.

      So, there are levels of “dumb”. Some - like Qwen3-4B 2507 Instruct - may not have the world knowledge of a SOTA model, but their reasoning abilities can be quite impressive. See HERE for an example of a self-made test suite. You can run something similar yourself.

      I guess it depends what you mean by “dumb” and how that affects what you're trying to do with them. Some are dumb at tool use, some have poor world knowledge, etc. If you dig around, you can find small models that are good at what's important to you. Except for coding - that's rough. Probably the smallest stand-alone model that might make you sit up is something like Qwen2.5-Coder-14B-Instruct or FrogMini-14B-2510…but I wouldn't trust them to go spelunking through a code base.

      • ☂️-@lemmy.ml · 4 days ago

        What are some other ways to make it better, beyond just adding a search tool? Is 16GB VRAM sufficient for usable results?

        And where do you think is the best place to start going down this rabbit hole?

        • SuspciousCarrot78@lemmy.world · 3 days ago

          It's really hard for me to answer this question without pointing to my project, because the project is sort of a direct response to this very problem. So, gauche as it may be, fuck it:

          https://codeberg.org/BobbyLLM/llama-conductor

          I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so…fuck. (It’s FOSS / I’m not trying to sell you anything etc etc).

          With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).

          Basically -

          Small models have problems with how much they can hold internally. There's a finite meta-cognitive “headspace” for them to work with…and the lower the quant, the fuzzier it gets. Sadly, with a weaker GPU, you're almost forced to use lower quants.

          If you can’t upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes some of the load.

          What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it’s bad at outside of its immediate concern.

          Bad inbuilt model priors / knowledge base? No problem; force answers to go through a tiered cascade.

          Inbuilt quick responses that you define yourself as grounding (cheatsheets) --> self-populating wiki-like structure (you drop a .md into one folder, hit >>summ, and it cross-updates everywhere) --> Wikipedia short lookup (the 800-character opening box: most wiki articles put the TL;DR in that section) --> web search (using trusted domains) or web synth (trusted domains plus cross-verification) --> finally…the model's pre-baked priors.

          In my setup, the whole thing cascades from highest trust to lowest trust (human-defined), stops when it hits the info it needs, and tells you where the answer came from.
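          In Python-ish terms, the cascade is basically this (a minimal sketch; all the names here are mine, not the actual llama-conductor API):

```python
from typing import Callable, Optional

# Each tier is a lookup function; ordered highest trust -> lowest trust.
def cheatsheet_lookup(query: str) -> Optional[str]:
    # Human-defined grounding notes: highest trust.
    notes = {"capital of france": "Paris (from cheatsheet)"}
    return notes.get(query.lower())

def wiki_lookup(query: str) -> Optional[str]:
    # Stand-in for the self-populating folder of .md files.
    return None  # a miss in this toy example

def model_prior(query: str) -> Optional[str]:
    # Last resort: the model's own baked-in knowledge.
    return "answer from model priors (lowest trust)"

TIERS: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("cheatsheet", cheatsheet_lookup),
    ("wiki", wiki_lookup),
    ("model", model_prior),
]

def answer(query: str) -> tuple[str, str]:
    """Walk the tiers from highest to lowest trust; stop at the first
    hit and report which tier the answer came from (provenance)."""
    for name, lookup in TIERS:
        result = lookup(query)
        if result is not None:
            return result, name
    return "no answer found", "none"

print(answer("capital of France"))  # hits the cheatsheet tier and stops
```

          The key property is that lower tiers never even run if a higher-trust tier answers first, and the provenance tag tells you who to blame.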

          Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators…tricks on tricks on tricks).

          Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That’s a big claim, and I don’t expect you to take it at face value. It’s a bespoke system with opinions…but I have poked it to death and it refuses to die. So…shrug. I’m sanguine.

          Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to “just add more parameters, bro” or “just get a better rig, bro”, but it was my solution to constrained hardware and hallucinations.

          There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that’s a different locus of control; the model’s still driving…and I’m not a fan of that on principle. Because LLMs are beautiful liars and I don’t trust them.

          The other half of the problem isn’t knowledge - it’s behaviour. Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. The other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You’re not fixing the model; you’re making compliance cheaper than non-compliance.
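          The shaping idea - hallucination = retry loop = cost - can be sketched like this (hypothetical names and a toy validator, not my actual code):

```python
# Harness-level incentive structure: validate the model's output; on
# failure, feed the violation back and retry, so refusing cleanly is
# cheaper for the model than making things up.

def validate(response: str, allowed_sources: set[str]) -> bool:
    # Toy check: every bracketed citation must name a trusted source.
    cited = {w.strip("[]") for w in response.split() if w.startswith("[")}
    return cited <= allowed_sources

def constrained_generate(model, prompt: str, allowed_sources: set[str],
                         max_retries: int = 3) -> str:
    for _ in range(max_retries):
        response = model(prompt)
        if validate(response, allowed_sources):
            return response
        # Non-compliance costs a retry: append the violation and loop.
        prompt += "\n[VIOLATION] Cite only trusted sources or say you don't know."
    return "I don't know."  # refusal becomes the path of least resistance
```

          No weight changes anywhere; the model just learns (within the conversation) that compliance ends the loop.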

          That’s how I solved it for me. YMMV.

          On 16GB VRAM: honestly, that’s decent - don’t let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you’re either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
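          Quick back-of-envelope on why a 14B fits in 16GB (rough numbers: Q4_K_M averages about 4.8 bits per weight, plus KV cache and runtime overhead, which I'm lumping into a flat ~1.5GB here):

```python
# Rough VRAM estimate for a Q4_K_M quant. Approximate by design:
# real usage varies with context length, KV cache type, and runtime.

def est_vram_gb(params_b: float, bits_per_weight: float = 4.8,
                overhead_gb: float = 1.5) -> float:
    weights_gb = params_b * bits_per_weight / 8  # GB just for the weights
    return weights_gb + overhead_gb              # + KV cache / overhead

print(round(est_vram_gb(14), 1))  # ~9.9 GB: a 14B Q4_K_M fits in 16GB
print(round(est_vram_gb(31), 1))  # ~20.1 GB: a 31B quant wants 24GB
```

          Same arithmetic shows why 4-8GB cards end up offloading layers to system RAM.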

          Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box.

          Once you want more control, drop into llama.cpp directly. It’s just…better. Faster. Fiddlier, yes…but worth it.

          For finding good models, Unsloth’s HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it’s just… digging through LocalLLaMA and benchmarking stuff yourself.

          There’s no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you’re insane enough to do that, see my above “rubric” post.
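          A bare-bones version of such an eval harness is tiny - just prompts paired with pass/fail checks, scored against whatever local model you're testing (everything here is hypothetical, to show the shape):

```python
# Minimal self-made eval loop: score a model callable against a list of
# (prompt, check_fn) cases drawn from YOUR corpus, not a public benchmark.

def run_evals(model, cases):
    """model: callable prompt -> str. Returns the fraction of cases passed."""
    passed = 0
    for prompt, check in cases:
        if check(model(prompt)):
            passed += 1
    return passed / len(cases)

# Toy usage with a fake "model" standing in for a real local LLM:
cases = [
    ("What is 2+2?", lambda r: "4" in r),
    ("Capital of France?", lambda r: "paris" in r.lower()),
]
fake_model = lambda p: {"What is 2+2?": "4", "Capital of France?": "Paris"}[p]
print(run_evals(fake_model, cases))
```

          Swap the fake model for a call into llama.cpp (or whatever you run) and the cases for things you actually care about, and you have a rubric nobody can game.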

          Not sure…have I answered your question?

          PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) Twas a mighty bork indeed, worthy of the full “Bart Simpson writes on chalkboard x 1000” hall of shame message. Fucking VSCodium, man…I don't know how sandbox mode got triggered, but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.

        • ntn888@lemmy.ml (OP) · 4 days ago

          I didn't try any 7B ones lately; they may be a better fit for 16GB, I think. I was able to try the 2B ones as I mentioned (on CPU); they are subpar. Like I mentioned, the usable ones were 31B. I think you need at least 24GB VRAM for most models, though. Maybe someone else can suggest better.