Pablo

By Anonymous

via Pablo App

Qwen3 8B: Performance Breakdown Unveiled

Qwen3 8B's performance significantly improves with reasoning enabled, achieving impressive scores on various benchmarks.

Original Question

Can you dive deeper on all of the evals for that Qwen 8-billion-parameter model? I'm assuming that you're talking about the quantized version that's kind of lightweight. You said 4 gigabytes of VRAM. I'm interested in what kind of things it can do and at what level, and also at what speeds on the 3090.


Full Response



Qwen3 8B — Deep Dive on Evals, Benchmarks & Quantization Performance

Great question. There's a lot of data here, so let me walk through it layer by layer — the full-precision benchmarks, what happens when you quantize down to that lightweight ~4–5 GB model, and the real-world speeds on a 3090.


1. Full-Precision (FP16/BF16) Benchmark Scores

These are the official scores from the Qwen3 technical report and third-party evaluators. This is the ceiling — the best the model can do before any compression.

Thinking Mode (Reasoning Enabled)

When you let the model "think" before answering — similar to how OpenAI's o1 works — it scores significantly higher:

      • AIME 2025 (competition math): 66.2%. This is remarkable for an 8B model: it is solving American Invitational Mathematics Examination problems.
      • MMLU-Pro (harder knowledge test): 72.1% — Strong general knowledge across dozens of academic domains
      • GPQA Diamond (PhD-level science): 59.7% — Graduate-level physics, chemistry, and biology questions
      • IFEval (instruction following): 85.6% — Very high compliance with complex multi-constraint instructions
      • LiveCodeBench v6 (fresh coding problems): ~39% — Competitive programming problems it's never seen before
      • MATH 500: 82.8% — Strong mathematical reasoning
      • Humanity's Last Exam: ~2.8%. This is among the hardest public benchmarks; even much larger models score low on it

Non-Thinking Mode (Fast, Direct Answers)

When you turn off the reasoning chain for quick responses:

      • MMLU-Pro: 64.3% — Drops about 8 points without thinking
      • GPQA Diamond: 45.2%
      • IFEval: 85.6% — Instruction following stays strong
      • AIME 2025: 24.3% — Big drop without reasoning, as expected
      • LiveCodeBench: 20.2%
      • SciCode: 16.8%

Key takeaway: Thinking mode roughly doubles or triples performance on math and coding tasks. For general knowledge and instruction following, the gap is smaller.
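To put numbers on the "doubles or triples" claim, here is a minimal sketch that computes the thinking-to-non-thinking ratio from the scores listed above. The dictionary layout and the function name `thinking_gain` are illustrative, not from any source. The LiveCodeBench thinking score is the approximate ~39% quoted above, and the two LiveCodeBench figures may come from different benchmark versions, so treat that ratio as rough.

```python
# Scores in percent, copied from the two lists above as (thinking, non-thinking).
# The LiveCodeBench thinking value is approximate (~39%).
SCORES = {
    "AIME 2025": (66.2, 24.3),
    "LiveCodeBench": (39.0, 20.2),
    "MMLU-Pro": (72.1, 64.3),
    "GPQA Diamond": (59.7, 45.2),
    "IFEval": (85.6, 85.6),
}


def thinking_gain(thinking: float, non_thinking: float) -> tuple[float, float]:
    """Return (ratio, absolute point gain) of thinking over non-thinking mode."""
    return thinking / non_thinking, thinking - non_thinking


for name, (think, direct) in SCORES.items():
    ratio, points = thinking_gain(think, direct)
    print(f"{name:14s} x{ratio:.2f}  (+{points:.1f} points)")
```

On these figures the math and coding rows land near 2x to 3x, while MMLU-Pro and IFEval stay close to 1x, which matches the takeaway.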


2. What Happens When You Quantize to Q4 (~4–5 GB)?

This is the critical question — you're compressing a 15.3 GB model down to under 5 GB. Here's exactly what you lose, based on the empirical quantization study:

Perplexity (Lower is Better)

| Precision | Perplexity (WikiText2) | Size | Loss vs FP16 |
|---|---|---|---|
| FP16 (baseline) | 10.11 | 15.26 GB | — |
| Q4_K_HIFI (best Q4) | 10.41 | 4.93 GB | +3.0% |
| Q4_K_M (standard) | 10.42 | 4.68 GB | +3.1% |
| Q4_K_S (smallest) | 10.68 | 4.47 GB | +5.7% |

With imatrix calibration (a technique that tells the quantizer which weights matter most), Q4_K_HIFI drops to only +1.1% perplexity loss — essentially negligible.
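The "Loss vs FP16" column is a relative increase in perplexity, and the size column implies a compression ratio. A small sketch of both calculations, using only the values in the table; the function names are illustrative. Because the table's perplexities are rounded to two decimals, a recomputed percentage can differ from the listed one by about a tenth of a point (Q4_K_S is the row where this shows up).

```python
# Perplexity and file size as reported in the table above (sizes in GB).
BASELINE = ("FP16", 10.11, 15.26)
QUANTS = [
    ("Q4_K_HIFI", 10.41, 4.93),
    ("Q4_K_M", 10.42, 4.68),
    ("Q4_K_S", 10.68, 4.47),
]


def relative_increase(value: float, baseline: float) -> float:
    """Percentage increase of `value` over `baseline`."""
    return (value / baseline - 1.0) * 100.0


def compression_ratio(size: float, baseline_size: float) -> float:
    """How many times smaller the quantized file is than the baseline."""
    return baseline_size / size


_, base_ppl, base_size = BASELINE
for name, ppl, size in QUANTS:
    print(
        f"{name:10s} perplexity +{relative_increase(ppl, base_ppl):.1f}%  "
        f"{compression_ratio(size, base_size):.2f}x smaller"
    )
```

Each Q4 variant comes out a little over three times smaller than the FP16 file for a single-digit percentage increase in perplexity.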

MMLU Score (Knowledge & Reasoning)

This is the big one from the academic study:

| Precision | MMLU Score | Drop |
|---|---|---|
| FP16 | 74.7% | — |
| 8-bit (AWQ) | ~74.5% | Near-zero loss |
| 4-bit (AWQ) | 69.3% | -5.4 points |
| 3-bit (AWQ) | Significant degradation | Not recommended |

So at 4-bit quantization, you lose about 5 points on MMLU — that's a real but manageable drop. You go from "strong" to "good." At 8-bit, it's essentially lossless.

Commonsense Reasoning (Zero-Shot)

At 4-bit AWQ, the 8B model retains most of its reasoning ability:

      • PIQA (physical intuition): ~78% → stays solid
      • ARC-Easy: ~80% → minimal drop
      • ARC-Challenge: ~54% → slight drop
      • HellaSwag: ~60% → moderate drop
      • WinoGrande: ~72% → holds well
      • BoolQ: ~82% → holds well

The average across all six reasoning benchmarks drops from about 68% to 64% at 4-bit — a roughly 4-point decline.


3. Speed on the RTX 3090 — Real Numbers

Here's where it gets exciting. Multiple sources give us concrete tokens-per-second data:

Quantized Q4 on RTX 3090 (via llama.cpp / Ollama)

| Quantization | Speed | File Size | Quality |
|---|---|---|---|
| Q4_K_M | ~118 tokens/sec | 4.68 GB | Excellent |
| Q4_K_S | ~123 tokens/sec | 4.47 GB | Good |
| Q4_K_HIFI | ~117 tokens/sec | 4.93 GB | Best Q4 quality |
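Tokens-per-second figures like these vary with drivers, runtime version, and prompt length, so it can be worth measuring on your own card. Below is a minimal, runtime-agnostic sketch. The `generate` argument is a hypothetical zero-argument callable that you would wrap around whatever streaming API your runtime exposes, and `fake_stream` is only a stand-in so the example runs without a GPU.

```python
import time
from typing import Callable, Iterable, Iterator


def tokens_per_second(generate: Callable[[], Iterable[str]]) -> float:
    """Time a streaming generation call and return tokens per second.

    `generate` is a hypothetical zero-argument callable that yields one
    token (or text chunk) at a time. The measurement includes the wait
    for the first token, so long prompts will lower the result.
    """
    start = time.perf_counter()
    count = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_stream(n_tokens: int = 50, delay_s: float = 0.002) -> Iterator[str]:
    """Stand-in for a real model stream; sleeps to imitate decode latency."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_stream):.0f} tokens/sec")
```

If your runtime streams text chunks rather than single tokens, count tokens with its tokenizer instead of counting chunks, or the figure will be understated.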

Full Precision and Other Formats on RTX 3090

| Precision | Speed | VRAM Used |
|---|---|---|
| BF16 | ~59–82 tokens/sec | ~16 GB |
| FP8 | ~150 tokens/sec | ~8 GB |
| AWQ-INT4 | ~144 tokens/sec | ~4–5 GB |

Context Length Impact on Speed

This is important — as you feed the model more input text, throughput actually increases due to GPU parallelism on the prefill phase:

| Input Length | BF16 Speed | INT4 Speed |
|---|---|---|
| 1 token | 82 tok/s | 144 tok/s |
| 6K tokens | 296 tok/s | 478 tok/s |
| 14K tokens | 525 tok/s | 770 tok/s |
| 30K tokens | 833 tok/s | 1,076 tok/s |
| 64K tokens | 1,113 tok/s | 1,255 tok/s |
| 128K tokens | 1,173 tok/s | 1,198 tok/s |

These higher numbers reflect total throughput including prompt processing. For pure generation speed (the part you're waiting on), expect 45–50 tokens/sec at extended contexts with Q4_K_M on a 3090, and 117–123 tokens/sec at short contexts.
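To translate those generation speeds into waiting time, divide the number of output tokens by the decode speed. The sketch below does that for the two speed ranges quoted above. The 500-token answer length is a hypothetical figure chosen for illustration, and the estimate ignores prompt processing time.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Seconds spent generating `output_tokens` at a steady decode speed."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


# Decode speed ranges quoted above for Q4_K_M on a 3090, in tokens/sec.
SHORT_CONTEXT = (117.0, 123.0)
LONG_CONTEXT = (45.0, 50.0)

ANSWER_TOKENS = 500  # hypothetical answer length, for illustration only

for label, (slow, fast) in (
    ("short context", SHORT_CONTEXT),
    ("extended context", LONG_CONTEXT),
):
    worst = generation_seconds(ANSWER_TOKENS, slow)
    best = generation_seconds(ANSWER_TOKENS, fast)
    print(f"{label}: {best:.1f} to {worst:.1f} s for {ANSWER_TOKENS} tokens")
```

In thinking mode the reasoning trace is generated as output tokens too, so the same arithmetic applies with a larger token count.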


4. The Practical Bottom Line

Here's how I'd summarize what you can actually do at each quantization level:

Q4 (~4.5–5 GB) — The Sweet Spot for a 3090

      • Math: Still solves most problems; loses the hardest competition-level stuff
      • Coding: Writes functional code in 80+ languages; may miss edge cases more often
      • Knowledge: Drops from ~75% to ~69% on MMLU — still very capable for research, writing, and Q&A
      • Instruction following: Stays at 85%+ — barely affected
      • Speed: 117–123 tokens/sec generation — feels instant in conversation
      • Context: 65–80K tokens comfortably, with roughly 19 GB of the 3090's 24 GB of VRAM left as headroom

Q8 (~8 GB) — Best Balance

      • Near-zero quality loss across all benchmarks
      • ~108 tokens/sec — still very fast
      • 16 GB of headroom for context on a 3090

FP16 (~16 GB) — Maximum Quality

      • Full benchmark performance
      • ~59–82 tokens/sec — still responsive
      • 8 GB headroom — enough for ~32K context
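The context figures above can be sanity-checked with a rough KV-cache estimate. This is a sketch under stated assumptions, not a measurement: the architecture defaults (36 layers, 8 key/value heads, head dimension 128) and the 16-bit cache are assumptions that should be checked against the model's own configuration, the weight sizes are the approximate ones from the headings above (using the upper end for Q4), and runtime overhead such as activation buffers is ignored.

```python
def kv_cache_bytes(
    context_tokens: int,
    layers: int = 36,          # assumed layer count
    kv_heads: int = 8,         # assumed number of key/value heads
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Approximate KV-cache size: keys plus values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


TOTAL_VRAM_GB = 24.0  # RTX 3090
WEIGHTS_GB = {"Q4": 5.0, "Q8": 8.0, "FP16": 16.0}  # approximate weight sizes

for name, weights_gb in WEIGHTS_GB.items():
    headroom_gb = TOTAL_VRAM_GB - weights_gb
    for ctx in (32_000, 80_000):
        cache_gb = kv_cache_bytes(ctx) / 1e9
        verdict = "fits" if cache_gb <= headroom_gb else "does not fit"
        print(
            f"{name:4s} {ctx:>6} tokens: {cache_gb:.1f} GB cache, "
            f"{headroom_gb:.0f} GB headroom, {verdict}"
        )
```

Under these assumptions the cache costs about 147 KB per token, so a 32K context needs a little under 5 GB and an 80K context a little under 12 GB, which lines up with the headroom figures listed for each quantization level.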

5. Where Qwen3 8B Punches Above Its Weight

Based on the evals, these are the tasks where this model is surprisingly strong for its size:

  1. Instruction following — 85.6% on IFEval is competitive with models 4–10x larger

  2. Competition math in thinking mode — 66.2% on AIME 2025 is extraordinary for 8B parameters

  3. Tool calling and agentic tasks — The BFCL benchmark shows strong function-calling reliability

  4. Multilingual — 119 languages supported natively

  5. Structured reasoning: 82.8% on MATH 500 shows strong step-by-step problem solving

And where it's weaker: PhD-level science (GPQA at ~45–60%), the absolute hardest reasoning tasks, and ultra-long multi-step chains where larger models have more capacity to maintain coherence.

The Q4 quantized version is genuinely a remarkable value proposition — you lose about 5 points on knowledge benchmarks but gain a model that fits in under 5 GB, runs at over 100 tokens per second, and leaves your 3090 with massive headroom for context and other tasks.
