Can you dive deeper on all of the evals for that Qwen3 8-billion-parameter model? I'm assuming you're talking about the quantized version, the lightweight one; you said 4 gigabytes of VRAM. I'm interested in what kinds of things it can do and at what level, and also at what speeds on the 3090.

Question

Accepted Answer

Now I have a comprehensive picture. Let me compile all of this into a thorough breakdown.

---

# Qwen3 8B — Deep Dive on Evals, Benchmarks & Quantization Performance

Great question. There's a **lot** of data here, so let me walk through it layer by layer — the full-precision benchmarks, what happens when you quantize down to that lightweight ~4–5 GB model, and the real-world speeds on a 3090.

---

## 1. Full-Precision (FP16/BF16) Benchmark Scores

These are the **official scores** from the Qwen3 technical report and third-party evaluators. This is the ceiling — the best the model can do before any compression.

### Thinking Mode (Reasoning Enabled)
When you let the model "think" before answering — similar to how OpenAI's o1 works — it scores significantly higher:

- **AIME 2025** (competition math): **66.2%**. Remarkable for an 8B model, since these are American Invitational Mathematics Examination problems.
- **MMLU-Pro** (harder knowledge test): **72.1%** — Strong general knowledge across dozens of academic domains
- **GPQA Diamond** (PhD-level science): **59.7%** — Graduate-level physics, chemistry, and biology questions
- **IFEval** (instruction following): **85.6%** — Very high compliance with complex multi-constraint instructions
- **LiveCodeBench v6** (fresh coding problems): **~39%** — Competitive programming problems it's never seen before
- **MATH 500**: **82.8%** — Strong mathematical reasoning
- **Humanity's Last Exam**: **~2.8%**. This is among the hardest public benchmarks; even frontier models initially scored in the single digits.

### Non-Thinking Mode (Fast, Direct Answers)
When you turn off the reasoning chain for quick responses:

- **MMLU-Pro**: **64.3%** — Drops about 8 points without thinking
- **GPQA Diamond**: **45.2%**
- **IFEval**: **85.6%** — Instruction following stays strong
- **AIME 2025**: **24.3%** — Big drop without reasoning, as expected
- **LiveCodeBench**: **20.2%**
- **SciCode**: **16.8%**

**Key takeaway**: Thinking mode roughly **doubles or triples** performance on math and coding tasks. For general knowledge and instruction following, the gap is smaller.
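
To make the takeaway concrete, here is a minimal sketch that recomputes the gap between the two modes from the scores listed above. The dictionary layout and the `thinking_gain` helper are illustrative names, not part of any benchmark tooling, and the LiveCodeBench comparison assumes both lists refer to comparable versions of that benchmark (the thinking-mode figure is the approximate ~39%).

```python
# Scores copied from the two lists above, in percent.
thinking = {
    "AIME 2025": 66.2,
    "MMLU-Pro": 72.1,
    "GPQA Diamond": 59.7,
    "LiveCodeBench": 39.0,  # listed as ~39% (v6), so treat as approximate
}
non_thinking = {
    "AIME 2025": 24.3,
    "MMLU-Pro": 64.3,
    "GPQA Diamond": 45.2,
    "LiveCodeBench": 20.2,
}


def thinking_gain(name):
    """Return (gain in percentage points, ratio of thinking to non-thinking score)."""
    t, n = thinking[name], non_thinking[name]
    return round(t - n, 1), round(t / n, 2)


for name in thinking:
    points, ratio = thinking_gain(name)
    print(f"{name:14s} +{points:4.1f} pts   x{ratio:.2f}")
```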

---

## 2. What Happens When You Quantize to Q4 (~4–5 GB)?

This is the critical question — you're compressing a 15.3 GB model down to under 5 GB. Here's exactly what you lose, based on the empirical quantization study:

### Perplexity (Lower is Better)

| Precision | Perplexity (WikiText2) | Size | Loss vs FP16 |
|---|---|---|---|
| **FP16** (baseline) | **10.11** | 15.26 GB | — |
| **Q4_K_HIFI** (best Q4) | **10.41** | 4.93 GB | **+3.0%** |
| **Q4_K_M** (standard) | **10.42** | 4.68 GB | **+3.1%** |
| **Q4_K_S** (smallest) | **10.68** | 4.47 GB | **+5.7%** |

With **imatrix calibration** (a technique that tells the quantizer which weights matter most), Q4_K_HIFI drops to only **+1.1% perplexity loss** — essentially negligible.
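
The "Loss vs FP16" column can be rechecked from the table itself, and the file sizes also give a rough effective bits-per-weight figure. The sketch below assumes file size scales linearly with bits per weight relative to the 16-bit baseline, which ignores metadata overhead; the function names are illustrative.

```python
# Values copied from the perplexity table above.
FP16_PPL = 10.11
FP16_SIZE_GB = 15.26

quants = {
    # name: (perplexity, file size in GB)
    "Q4_K_HIFI": (10.41, 4.93),
    "Q4_K_M": (10.42, 4.68),
    "Q4_K_S": (10.68, 4.47),
}


def ppl_increase_pct(ppl):
    """Relative perplexity increase over the FP16 baseline, in percent."""
    return (ppl / FP16_PPL - 1.0) * 100.0


def effective_bits_per_weight(size_gb):
    """Assumption: size is proportional to bits per weight, with FP16 = 16 bits."""
    return 16.0 * size_gb / FP16_SIZE_GB


for name, (ppl, size_gb) in quants.items():
    print(
        f"{name:10s} ppl +{ppl_increase_pct(ppl):.1f}%   "
        f"~{effective_bits_per_weight(size_gb):.2f} bits/weight"
    )
```

Because the table's perplexities are rounded to two decimals, the recomputed percentages can differ from the listed ones by about 0.1 point. Under the stated assumption, the "4-bit" files work out to a little under or a little over 5 bits per weight.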

### MMLU Score (Knowledge & Reasoning)

This is the big one from the academic study:

| Precision | MMLU Score | Drop |
|---|---|---|
| **FP16** | **74.7%** | — |
| **8-bit (AWQ)** | **~74.5%** | **Near-zero loss** |
| **4-bit (AWQ)** | **69.3%** | **-5.4 points** |
| **3-bit (AWQ)** | Significant degradation | Not recommended |

So at **4-bit quantization**, you lose about **5 points on MMLU** — that's a real but manageable drop. You go from "strong" to "good." At 8-bit, it's essentially lossless.
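
To see why 8-bit is close to lossless while 3-bit degrades sharply, here is a toy round-trip through a simple quantizer. This is plain symmetric round-to-nearest quantization per group, chosen for illustration; it is not AWQ and not the llama.cpp K-quant scheme, and the synthetic Gaussian weights, `group_size`, and function names are assumptions.

```python
import random


def quantize_roundtrip(weights, bits, group_size=32):
    """Quantize each group to signed integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 3 for 3-bit
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        peak = max(abs(w) for w in group)
        scale = peak / qmax if peak > 0 else 1.0  # one scale per group
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))  # integer code
            restored.append(q * scale)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights

errors = {
    bits: rms_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 4, 3)
}
for bits, err in errors.items():
    print(f"{bits}-bit RMS reconstruction error: {err:.6f}")
```

Each bit removed roughly doubles the spacing between representable values, so the reconstruction error grows quickly below 4 bits. Real schemes reduce that error by protecting the most important weights, which is the idea behind the calibration-based methods mentioned above.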

### Commonsense Reasoning (Zero-Shot)

At 4-bit AWQ, the 8B model retains most of its reasoning ability:

- **PIQA** (physical intuition): ~78% → stays solid
- **ARC-Easy**: ~80% → minimal drop
- **ARC-Challenge**: ~54% → slight drop
- **HellaSwag**: ~60% → moderate drop
- **WinoGrande**: ~72% → holds well
- **BoolQ**: ~82% → holds well

The **average across all six reasoning benchmarks** drops from about **68% to 64%** at 4-bit — a roughly **4-point decline**.

---

## 3. Speed on the RTX 3090 — Real Numbers

Here's where it gets exciting. Multiple sources give us concrete tokens-per-second data:

### Quantized Q4 on RTX 3090 (via llama.cpp / Ollama)

| Quantization | Speed | File Size | Quality |
|---|---|---|---|
| **Q4_K_M** | **~118 tokens/sec** | 4.68 GB | Excellent |
| **Q4_K_S** | **~123 tokens/sec** | 4.47 GB | Good |
| **Q4_K_HIFI** | **~117 tokens/sec** | 4.93 GB | Best Q4 quality |

### BF16, FP8, and AWQ-INT4 on RTX 3090

| Precision | Speed | VRAM Used |
|---|---|---|
| **BF16** | **~59–82 tokens/sec** | ~16 GB |
| **FP8** | **~150 tokens/sec** | ~8 GB |
| **AWQ-INT4** | **~144 tokens/sec** | ~4–5 GB |
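
A quick sanity check for single-stream generation speeds is a memory-bandwidth ceiling: if every generated token has to read all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below uses the RTX 3090's published 936 GB/s memory bandwidth and assumes the file sizes quoted above are in GiB; it ignores KV-cache reads and kernel overhead, so real speeds should land below it.

```python
RTX_3090_BANDWIDTH_GB_S = 936.0  # published memory bandwidth, in GB/s
GIB = 2 ** 30                    # assumption: the quoted sizes are in GiB


def decode_ceiling_tok_s(weight_gib, bandwidth_gb_s=RTX_3090_BANDWIDTH_GB_S):
    """Upper bound on decode speed if each token reads every weight once."""
    weight_bytes = weight_gib * GIB
    return bandwidth_gb_s * 1e9 / weight_bytes


# Sizes copied from the tables above.
sizes_gib = {"BF16": 15.26, "Q4_K_HIFI": 4.93, "Q4_K_M": 4.68, "Q4_K_S": 4.47}

for name, size in sizes_gib.items():
    print(f"{name:10s} ceiling ~{decode_ceiling_tok_s(size):.0f} tok/s")
```

Under these assumptions the Q4 figures above sit at roughly 60 to 70 percent of their ceilings, which is plausible for a well-optimized runtime. The BF16 ceiling comes out near 57 tok/s, so the upper end of the ~59–82 tok/s range is worth double-checking: a number above the ceiling usually points to a different GPU, batched requests, or a metric that also counts prompt tokens.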

### Context Length Impact on Speed

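This is important: the figures below appear to measure overall throughput (prompt tokens plus generated tokens per second). As you feed the model more input text, that number **increases** because the GPU processes the whole prompt in parallel during the prefill phase. It does not mean each new token is generated faster at long context.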

| Input Length | BF16 Speed | INT4 Speed |
|---|---|---|
| **1 token** | 82 tok/s | 144 tok/s |
| **6K tokens** | 296 tok/s | 478 tok/s |
| **14K tokens** | 525 tok/s | 770 tok/s |
| **30K tokens** | 833 tok/s | 1,076 tok/s |
| **64K tokens** | 1,113 tok/s | 1,255 tok/s |
| **128K tokens** | 1,173 tok/s | 1,198 tok/s |
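
One pattern worth pulling out of the table is how the INT4 advantage over BF16 changes with input length. The sketch below just divides the two columns; the likely explanation offered afterwards is an interpretation, not something stated in the source data.

```python
# (input length label, BF16 tok/s, INT4 tok/s), copied from the table above.
rows = [
    ("1", 82, 144),
    ("6K", 296, 478),
    ("14K", 525, 770),
    ("30K", 833, 1076),
    ("64K", 1113, 1255),
    ("128K", 1173, 1198),
]


def int4_speedup(bf16_tok_s, int4_tok_s):
    """How many times faster INT4 is than BF16 at the same input length."""
    return int4_tok_s / bf16_tok_s


speedups = {label: round(int4_speedup(bf16, int4), 2) for label, bf16, int4 in rows}

for label, ratio in speedups.items():
    print(f"{label:>5s} tokens: INT4 is x{ratio:.2f} BF16")
```

The ratio falls steadily from the shortest input to the longest. A plausible reading is that short-input generation is limited by how fast weights can be read from memory, where smaller weights help a lot, while long-prompt processing is limited by raw compute, where weight size matters much less.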

These higher numb
