Benchmark: GLM-4.7 MXFP4 and abliterated Q4_0 on 9950x+256GB 6000MT/s+RTX5090

Squeezing Maximum Performance from a 405B MoE Model – A Practical Benchmark Guide


Introduction

GLM-4.7 is an impressive open-source MoE (Mixture of Experts) model with roughly 405 billion parameters (llama-bench reports 404.87 B) that can still run on consumer hardware thanks to its sparse architecture: only a small subset of experts is active per token. In this benchmark I show how I reached 619 tokens per second for prompt processing and a stable 6.17 t/s for token generation on my system.


Hardware Setup

(base) wisdom@9950x:/data/models$ fastfetch 
             .',;::::;,'.                 wisdom@9950x
         .';:cccccccccccc:;,.             ------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 43 (KDE Plasma Desktop Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Host: X870E AORUS PRO (Default string-CF-WCP-ADO)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.12-300.fc43.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 7 days, 10 hours, 20 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 4485 (rpm), 64 (flatpak), 8 (snap)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.3.0
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (OLED G42P5): 3840x2160 @ 138 Hz (as 3072x1728) in 48" [External, HDR] *
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    Display (HP E243i): 1200x1920 @ 60 Hz (as 960x1536) in 24" [External]
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    DE: KDE Plasma 6.5.4
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM: KWin (Wayland)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     WM Theme: Breeze
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Theme: Breeze (Light) [Qt], Breeze [GTK3]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Icons: Breeze [Qt], breeze-dark [GTK3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK3/4]
ccccccccccccccccccccccccccccc:'.          Cursor: Breeze (24px)
:ccccccccccccccccccccccc:;,..             Terminal: /dev/pts/1
 ':cccccccccccccccc::;,.                  CPU: AMD Ryzen 9 9950X (32) @ 5.76 GHz
                                          GPU: NVIDIA GeForce RTX 5090 [Discrete]
                                          Memory: 39.70 GiB / 251.30 GiB (16%)
                                          Swap: 7.41 GiB / 8.00 GiB (93%)
                                          Disk (/): 2.98 TiB / 3.64 TiB (82%) - btrfs
                                          Disk (/data): 2.62 TiB / 3.58 TiB (73%) - ext4
                                          Local IP (enp15s0): 10.0.0.19/24
                                          Locale: de_AT.UTF-8

                                                                  
                                                                  
(base) wisdom@9950x:/data/models$ 

Models Tested

1. GLM-4.7-MXFP4_MOE

  • Size: 183 GB (12 GGUF shards)
  • Quantization: MXFP4 (Microscaling FP4)
  • Inference Engine: ik_llama.cpp

2. Huihui-GLM-4.7-abliterated-Q4_0

  • Size: 188 GB (21 GGUF shards)
  • Quantization: Q4_0
  • Inference Engine: llama.cpp

The Key to Success: MoE Layer Offloading

The secret to high performance with MoE models lies in intelligent offloading of expert layers. Instead of the classic -ncmoe parameter, I use the -ot (Override Tensor) parameter:

-ot '.*ffn_(up|down|gate)_exps\.weight=CPU'

This regex pattern moves only the MoE expert weights to the CPU while keeping all other tensors on the GPU. The result: significantly higher -ngl values without out-of-memory errors.
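Before loading a 183 GB model, you can sanity-check which tensors a pattern like this will hit by running it against a few tensor names with grep -E. The names below are modeled on typical GGUF MoE layouts (including a shared-expert tensor) and are hypothetical, not dumped from GLM-4.7 – verify against your own file's metadata or llama.cpp's load log:

```shell
#!/bin/sh
# Sample tensor names in the style of GGUF MoE checkpoints (hypothetical,
# not dumped from GLM-4.7 -- check your model's actual metadata).
cat > /tmp/tensors.txt <<'EOF'
blk.0.attn_q.weight
blk.0.ffn_up_exps.weight
blk.0.ffn_down_exps.weight
blk.0.ffn_gate_exps.weight
blk.0.ffn_up_shexp.weight
EOF

# Same regex as in the -ot argument: only the routed expert FFN weights
# match; attention and shared-expert tensors are untouched (stay on GPU).
grep -E '.*ffn_(up|down|gate)_exps\.weight' /tmp/tensors.txt
```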

Comparison: ncmoe vs. -ot Parameter

| Method      | Max ngl | PP (t/s) | TG (t/s) |
| ----------- | ------: | -------: | -------: |
| -ncmoe 999  |      32 |      ~65 |     ~5.1 |
| -ot pattern |     112 |      ~87 |     ~6.2 |

Conclusion: The -ot parameter enables 3.5× higher GPU layer utilization (112 vs. 32 layers)!


Detailed Benchmark Results

MXFP4 with ik_llama.cpp – Top Results

Small Prompts (PP=512, TG=128)

| model                 |       size |     params | backend | ngl | threads |    test |         t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | ----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   pp512 | 87.47 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   tg128 |  6.25 ± 0.00 |

Medium Prompts (PP=4096, TG=512)

| model                 |       size |     params | backend | ngl | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | -----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |  pp4096 | 401.72 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   tg512 |   6.21 ± 0.00 |

Large Prompts (PP=16384, TG=2048)

| model                 |       size |     params | backend | ngl | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | -----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 | pp16384 | 619.54 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |  tg2048 |   6.17 ± 0.00 |

Huihui Q4_0 with llama.cpp – Top Results

Large Prompts (PP=16384, TG=2048)

| model                 |       size |     params | backend | ngl | fa | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | -: | ------: | ------: | -----------: |
| glm4 Q4_0             | 187.56 GiB | 404.87 B   | CUDA    |  96 |  1 |      16 | pp16384 | 599.51 ± 0.00 |
| glm4 Q4_0             | 187.56 GiB | 404.87 B   | CUDA    |  96 |  1 |      16 |  tg2048 |   6.10 ± 0.00 |

Performance Scaling by Prompt Size

An interesting finding: PP performance scales significantly with prompt size, while TG remains constant:

| Prompt Size  | PP (t/s) | TG (t/s) | PP Efficiency |
| -----------: | -------: | -------: | ------------: |
|   512 tokens |    87.47 |     6.25 |      Baseline |
|  1024 tokens |   123.60 |     6.23 |          +41% |
|  2048 tokens |   228.69 |     6.22 |         +161% |
|  4096 tokens |   401.72 |     6.21 |         +359% |
|  8192 tokens |   512.38 |     6.19 |         +486% |
| 16384 tokens |   619.54 |     6.17 |         +608% |

Key Insight: For applications with large contexts (code analysis, document processing), the model reaches its maximum efficiency. Token generation remains constant at ~6.2 t/s.
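The efficiency column is simply each PP rate divided by the 512-token baseline. A quick awk snippet, using the measured values from the tables above, reproduces it:

```shell
#!/bin/sh
# Recompute the "PP Efficiency" column from the measured pp rates;
# baseline is 87.47 t/s at a 512-token prompt.
awk 'BEGIN {
  split("512 1024 2048 4096 8192 16384", n, " ")
  split("87.47 123.60 228.69 401.72 512.38 619.54", pp, " ")
  for (i = 1; i <= 6; i++)
    # (rate / baseline - 1) * 100, rounded to whole percent
    printf "%5d tokens: %6.2f t/s  %+d%%\n", n[i], pp[i], (pp[i] / pp[1] - 1) * 100 + 0.5
}'
```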


ik_llama vs. llama.cpp Comparison

For MXFP4 models, ik_llama.cpp (a fork of llama.cpp with additional optimizations) offers significant advantages:

| Engine       | Model | PP 512 (t/s) | PP 16K (t/s) | TG (t/s) |
| ------------ | ----- | -----------: | -----------: | -------: |
| ik_llama.cpp | MXFP4 |        87.47 |       619.54 |     6.25 |
| llama.cpp    | Q4_0  |        67.38 |       599.51 |     6.10 |

ik_llama.cpp advantage: ~30% faster prompt processing for small prompts, ~3% for large ones. Note that the two runs also differ in quantization (MXFP4 vs. Q4_0), so this is an end-to-end comparison rather than a pure engine benchmark.


Optimal Parameter Configuration

For MXFP4 (ik_llama.cpp)

llama-bench \
    -m GLM-4.7-MXFP4_MOE.gguf \
    -t 16 \
    -ngl 112 \
    -gr 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -b 16384 -ub 8192 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute

For Q4_0 (llama.cpp)

llama-bench \
    -m Huihui-GLM-4.7-abliterated-Q4_0.gguf \
    -t 16 \
    -ngl 96 \
    -fa 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -b 8192 -ub 4096 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute

Parameter Reference

| Parameter         | Value           | Description                              |
| ----------------- | --------------- | ---------------------------------------- |
| -t 16             | 16 threads      | Optimal for the 9950X (more = worse TG!) |
| -ngl 112/96       | GPU layers      | Maximum without OOM thanks to -ot        |
| -gr 1             | graph reuse     | ik_llama.cpp-specific, improves PP       |
| -fa 1             | flash attention | llama.cpp-specific                       |
| -ot '...'         | MoE to CPU      | Enables high -ngl values                 |
| -b/-ub            | batch sizes     | Larger = better PP for large prompts     |
| -ctk/-ctv q4_0    | KV cache        | Quantized, for large contexts            |
| --numa distribute | NUMA            | Optimal RAM distribution                 |
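The same tuning carries over from llama-bench to actual serving. A sketch of a llama-server invocation with the Q4_0 settings above (path, context size, host, and port are placeholders; flag syntax can differ between llama.cpp builds, so verify against your build's --help):

```shell
llama-server \
    -m Huihui-GLM-4.7-abliterated-Q4_0.gguf \
    -t 16 \
    -ngl 96 \
    -fa 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -c 65536 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute \
    --host 127.0.0.1 --port 8080
```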

Key Findings

1. Thread Count is Critical

More threads ≠ better performance! At T=32, TG performance drops dramatically:

| Threads | PP (t/s) | TG (t/s) |
| ------: | -------: | -------: |
|      16 |    67.38 |     6.10 |
|      24 |    68.12 |     5.41 |
|      32 |    65.89 |     3.82 |

16 threads are optimal for the Ryzen 9 9950X.

2. KV Cache Quantization

For large contexts (>32K), q4_0 KV cache is essential:

  • f16 KV: OOM at 16K+ context
  • q4_0 KV: Stable up to 128K context
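A back-of-the-envelope calculation shows why: KV-cache size is roughly 2 × layers × KV heads × head dim × context × bytes per element. The dimensions below are placeholders, not GLM-4.7's actual config (read yours from the GGUF metadata, which llama.cpp prints at load time), but the f16-vs-q4_0 ratio of about 3.6× holds regardless:

```shell
#!/bin/sh
# Rough KV-cache memory estimate. ALL dims here are placeholder values,
# NOT GLM-4.7's real configuration -- substitute your model's metadata.
awk -v layers=92 -v kv_heads=8 -v head_dim=128 -v ctx=131072 'BEGIN {
  elems = 2 * layers * kv_heads * head_dim * ctx     # K and V, every layer
  printf "f16 : %.1f GiB\n", elems * 2       / 2^30  # 2 bytes per element
  printf "q4_0: %.1f GiB\n", elems * 18 / 32 / 2^30  # 18 bytes per 32 elems
}'
```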

3. Batch Size by Application

  • Small prompts (<2K): B=4096, UB=2048
  • Large prompts (>4K): B=16384, UB=8192

Realistic Workload Scenarios

For practical coding assistant scenarios:

| Scenario        | Prompt | Generation | PP (t/s) | TG (t/s) | Wait Time* |
| --------------- | -----: | ---------: | -------: | -------: | ---------: |
| Quick Question  |    512 |        256 |       87 |      6.2 |       ~47s |
| Code Review     |   4096 |        512 |      402 |      6.2 |       ~93s |
| Large Codebase  |  16384 |       1024 |      619 |      6.2 |      ~192s |
| Maximum Context |  32768 |       2048 |     ~620 |      6.2 |      ~383s |

*Estimated total time for prompt processing + token generation
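The wait-time column follows the simple formula total ≈ prompt_tokens / PP + generated_tokens / TG. A small helper, with the rates hardcoded from the benchmark tables above, reproduces it:

```shell
#!/bin/sh
# Estimated total wait = prompt/PP + generation/TG, rounded to seconds.
# Rates are the measured values from the benchmark tables above.
estimate() {  # usage: estimate <prompt_tokens> <pp_rate> <gen_tokens> <tg_rate>
  awk -v p="$1" -v pp="$2" -v g="$3" -v tg="$4" \
      'BEGIN { printf "~%ds\n", p / pp + g / tg + 0.5 }'
}

estimate 512   87  256  6.2   # Quick Question
estimate 4096  402 512  6.2   # Code Review
estimate 16384 619 1024 6.2   # Large Codebase
estimate 32768 620 2048 6.2   # Maximum Context
```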


Conclusion

GLM-4.7 runs impressively on consumer hardware. With the right optimizations, you can achieve:

  • 619 t/s prompt processing for large contexts
  • Stable 6.2 t/s token generation regardless of context size
  • Up to 128K context with q4_0 KV cache

The key lies in intelligent MoE offloading via the -ot parameter and finding the right balance between GPU layers, thread count, and batch sizes.

For coding assistant applications, the TG rate of 6.2 t/s – roughly 370 tokens, or about 280 words, per minute – is absolutely practical, comparable to a brisk reading pace.

