32 GB VRAM for less than $1k sounds like a steal these days, and I’m sure it’s not getting cheaper any time soon.

Does anyone here use this GPU? Or any recent Arc Pros? I basically want someone to talk me out of driving to the nearest place that has it in stock and getting $1k poorer.

  • lavember@programming.dev · 9 days ago

    How reliable is this setup for local inference? For instance how many tokens/sec?

    I’m asking because I’d guess sharing bandwidth like that would have some cost in speed.

    • afk_strats@lemmy.world · 9 days ago

      I find llama.cpp with Vulkan EXTREMELY reliable. I can have it running for days at a time without a problem. As for tokens/sec, that’s a complicated question because it depends on the model, quant, speculative decoding, KV-cache quant, context length, and how the model is distributed across cards. Generally:

      Typical model speeds at deep context for agentic use; simple chats will be faster. (To measure your own setup, see the sketch after the table.)

      | Model | Quant | Prompt Processing (tok/s) | Token Generation (tok/s) | Hardware | Quality |
      |---|---|---|---|---|---|
      | Qwen 3.5 397B | Q2_K_M | 100-120 | 18-22 | 2 x 7900 + 4 x MI50 | ★★★★★ |
      | Gemma4 31B or Qwen3.5 27B | Q8_0 | 400-800 | 20-25 | 2 x 7900xtx | ★★★★ |
      | Qwen 3.6 35B | Q5_K_M | 1000-2500 | 60-100 | 2 x 7900xtx | ★★★★ |
      | Qwen 3.5 122B | Q4_0 | 200-300 | 30-35 | 4 x MI50 | ★★★★ |
      | gpt-oss 120b | mxfp4 (native) | 500-800 | 50-60 | 3 x MI50 | ★★ |
      | Nemotron 3 Nano 30B | IQ3_K_XXS | 2500-3000 | 150-180 | 1 x 7900xtx | |
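
      If you want to sanity-check numbers like these on your own cards, here’s a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp). The model path, context size, and tensor split are placeholders for your own setup; the parameters mirror the knobs mentioned above: the quant is baked into the GGUF file you load, plus KV-cache quant, context length, and how layers split across cards.

      ```python
      # Rough tok/s measurement with llama-cpp-python.
      # Placeholder path and split ratios; adjust for your own hardware.
      import time

      from llama_cpp import GGML_TYPE_Q8_0, Llama

      llm = Llama(
          model_path="models/model-Q5_K_M.gguf",  # placeholder; quant comes from the file
          n_gpu_layers=-1,            # offload all layers to GPU
          tensor_split=[0.5, 0.5],    # example: even split across two cards
          n_ctx=32768,                # long context, as in the "deep context" rows above
          flash_attn=True,            # needed for a quantized V cache
          type_k=GGML_TYPE_Q8_0,      # quantize the KV cache to q8_0
          type_v=GGML_TYPE_Q8_0,
      )

      prompt = "Explain speculative decoding in two sentences."
      start = time.perf_counter()
      out = llm(prompt, max_tokens=256)
      elapsed = time.perf_counter() - start

      n_gen = out["usage"]["completion_tokens"]
      print(f"{n_gen} tokens in {elapsed:.1f}s -> {n_gen / elapsed:.1f} tok/s")
      ```

      Note this lumps prompt processing and generation into one wall-clock number; with a short prompt it’s dominated by generation, which is usually the figure people care about.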