Benchmark: GLM-4.7 MXFP4 and abliterated Q4_0 on 9950x+256GB 6000MT/s+RTX5090

Squeezing Maximum Performance from a 405B MoE Model – A Practical Benchmark Guide


Introduction

GLM-4.7 is an impressive open-source MoE (Mixture of Experts) model with roughly 405 billion parameters (llama-bench reports 404.87 B) that can still run on consumer hardware thanks to its sparse architecture: only a small subset of experts is active per token. In this benchmark I show how I reached 619 tokens per second for prompt processing and a stable 6.17 t/s for token generation on my system.


Hardware Setup

(base) wisdom@9950x:/data/models$ fastfetch 
             .',;::::;,'.                 wisdom@9950x
         .';:cccccccccccc:;,.             ------------
      .;cccccccccccccccccccccc;.          OS: Fedora Linux 43 (KDE Plasma Desktop Edition) x86_64
    .:cccccccccccccccccccccccccc:.        Host: X870E AORUS PRO (Default string-CF-WCP-ADO)
  .;ccccccccccccc;.:dddl:.;ccccccc;.      Kernel: Linux 6.17.12-300.fc43.x86_64
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.     Uptime: 7 days, 10 hours, 20 mins
.:ccccccccccccc;KMMc;cc;xMMc;ccccccc:.    Packages: 4485 (rpm), 64 (flatpak), 8 (snap)
,cccccccccccccc;MMM.;cc;;WW:;cccccccc,    Shell: bash 5.3.0
:cccccccccccccc;MMM.;cccccccccccccccc:    Display (OLED G42P5): 3840x2160 @ 138 Hz (as 3072x1728) in 48" [External, HDR] *
:ccccccc;oxOOOo;MMM000k.;cccccccccccc:    Display (HP E243i): 1200x1920 @ 60 Hz (as 960x1536) in 24" [External]
cccccc;0MMKxdd:;MMMkddc.;cccccccccccc;    DE: KDE Plasma 6.5.4
ccccc;XMO';cccc;MMM.;cccccccccccccccc'    WM: KWin (Wayland)
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;     WM Theme: Breeze
ccccc;0MNc.ccc.xMMd;ccccccccccccccc;      Theme: Breeze (Light) [Qt], Breeze [GTK3]
cccccc;dNMWXXXWM0:;cccccccccccccc:,       Icons: Breeze [Qt], breeze-dark [GTK3/4]
cccccccc;.:odl:.;cccccccccccccc:,.        Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK3/4]
ccccccccccccccccccccccccccccc:'.          Cursor: Breeze (24px)
:ccccccccccccccccccccccc:;,..             Terminal: /dev/pts/1
 ':cccccccccccccccc::;,.                  CPU: AMD Ryzen 9 9950X (32) @ 5.76 GHz
                                          GPU: NVIDIA GeForce RTX 5090 [Discrete]
                                          Memory: 39.70 GiB / 251.30 GiB (16%)
                                          Swap: 7.41 GiB / 8.00 GiB (93%)
                                          Disk (/): 2.98 TiB / 3.64 TiB (82%) - btrfs
                                          Disk (/data): 2.62 TiB / 3.58 TiB (73%) - ext4
                                          Local IP (enp15s0): 10.0.0.19/24
                                          Locale: de_AT.UTF-8

                                                                  
                                                                  
(base) wisdom@9950x:/data/models$ 

Models Tested

1. GLM-4.7-MXFP4_MOE

  • Size: 183 GB (12 GGUF shards)
  • Quantization: MXFP4 (Microscaling FP4)
  • Inference Engine: ik_llama.cpp

2. Huihui-GLM-4.7-abliterated-Q4_0

  • Size: 188 GB (21 GGUF shards)
  • Quantization: Q4_0
  • Inference Engine: llama.cpp

The Key to Success: MoE Layer Offloading

The secret to high performance with MoE models lies in intelligent offloading of expert layers. Instead of the classic -ncmoe parameter, I use the -ot (Override Tensor) parameter:

-ot '.*ffn_(up|down|gate)_exps\.weight=CPU'

This regex pattern moves only the MoE expert weights to the CPU while keeping all other tensors on the GPU. The result: significantly higher -ngl values without out-of-memory errors.
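Before loading a 183 GB model, you can sanity-check which tensors a pattern like this will hit by running it against a few tensor names with grep -E. The names below are modeled on typical GGUF MoE layouts (including a shared-expert tensor) and are hypothetical, not dumped from GLM-4.7 – verify against your own file's metadata or llama.cpp's load log:

```shell
#!/bin/sh
# Sample tensor names in the style of GGUF MoE checkpoints (hypothetical,
# not dumped from GLM-4.7 -- check your model's actual metadata).
cat > /tmp/tensors.txt <<'EOF'
blk.0.attn_q.weight
blk.0.ffn_up_exps.weight
blk.0.ffn_down_exps.weight
blk.0.ffn_gate_exps.weight
blk.0.ffn_up_shexp.weight
EOF

# Same regex as in the -ot argument: only the routed expert FFN weights
# match; attention and shared-expert tensors are untouched (stay on GPU).
grep -E '.*ffn_(up|down|gate)_exps\.weight' /tmp/tensors.txt
```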

Comparison: ncmoe vs. -ot Parameter

| Method      | Max ngl | PP (t/s) | TG (t/s) |
| ----------- | ------: | -------: | -------: |
| -ncmoe 999  |      32 |      ~65 |     ~5.1 |
| -ot pattern |     112 |      ~87 |     ~6.2 |

Conclusion: The -ot parameter enables 3.5× higher GPU layer utilization (112 vs. 32 layers)!


Detailed Benchmark Results

MXFP4 with ik_llama.cpp – Top Results

Small Prompts (PP=512, TG=128)

| model                 |       size |     params | backend | ngl | threads |    test |         t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | ----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   pp512 | 87.47 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   tg128 |  6.25 ± 0.00 |

Medium Prompts (PP=4096, TG=512)

| model                 |       size |     params | backend | ngl | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | -----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |  pp4096 | 401.72 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |   tg512 |   6.21 ± 0.00 |

Large Prompts (PP=16384, TG=2048)

| model                 |       size |     params | backend | ngl | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | -----------: |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 | pp16384 | 619.54 ± 0.00 |
| glm4 MXFP4_MOE        | 182.92 GiB | 404.87 B   | CUDA    | 112 |      16 |  tg2048 |   6.17 ± 0.00 |

Huihui Q4_0 with llama.cpp – Top Results

Large Prompts (PP=16384, TG=2048)

| model                 |       size |     params | backend | ngl | fa | threads |    test |          t/s |
| --------------------- | ---------: | ---------: | ------- | --: | -: | ------: | ------: | -----------: |
| glm4 Q4_0             | 187.56 GiB | 404.87 B   | CUDA    |  96 |  1 |      16 | pp16384 | 599.51 ± 0.00 |
| glm4 Q4_0             | 187.56 GiB | 404.87 B   | CUDA    |  96 |  1 |      16 |  tg2048 |   6.10 ± 0.00 |

Performance Scaling by Prompt Size

An interesting finding: PP performance scales significantly with prompt size, while TG remains constant:

| Prompt Size  | PP (t/s) | TG (t/s) | PP Efficiency |
| -----------: | -------: | -------: | ------------: |
|   512 tokens |    87.47 |     6.25 |      Baseline |
|  1024 tokens |   123.60 |     6.23 |          +41% |
|  2048 tokens |   228.69 |     6.22 |         +161% |
|  4096 tokens |   401.72 |     6.21 |         +359% |
|  8192 tokens |   512.38 |     6.19 |         +486% |
| 16384 tokens |   619.54 |     6.17 |         +608% |

Key Insight: For applications with large contexts (code analysis, document processing), the model reaches its maximum efficiency. Token generation remains constant at ~6.2 t/s.
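The efficiency column is simply each PP rate divided by the 512-token baseline. A quick awk snippet, using the measured values from the tables above, reproduces it:

```shell
#!/bin/sh
# Recompute the "PP Efficiency" column from the measured pp rates;
# baseline is 87.47 t/s at a 512-token prompt.
awk 'BEGIN {
  split("512 1024 2048 4096 8192 16384", n, " ")
  split("87.47 123.60 228.69 401.72 512.38 619.54", pp, " ")
  for (i = 1; i <= 6; i++)
    # (rate / baseline - 1) * 100, rounded to whole percent
    printf "%5d tokens: %6.2f t/s  %+d%%\n", n[i], pp[i], (pp[i] / pp[1] - 1) * 100 + 0.5
}'
```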


ik_llama vs. llama.cpp Comparison

For MXFP4 models, ik_llama.cpp (a fork of llama.cpp with additional optimizations) offers significant advantages:

| Engine       | Model | PP 512 (t/s) | PP 16K (t/s) | TG (t/s) |
| ------------ | ----- | -----------: | -----------: | -------: |
| ik_llama.cpp | MXFP4 |        87.47 |       619.54 |     6.25 |
| llama.cpp    | Q4_0  |        67.38 |       599.51 |     6.10 |

ik_llama.cpp advantage: ~30% faster prompt processing for small prompts, ~3% for large ones. Note that the two runs also differ in quantization (MXFP4 vs. Q4_0), so this is an end-to-end comparison rather than a pure engine benchmark.


Optimal Parameter Configuration

For MXFP4 (ik_llama.cpp)

llama-bench \
    -m GLM-4.7-MXFP4_MOE.gguf \
    -t 16 \
    -ngl 112 \
    -gr 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -b 16384 -ub 8192 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute

For Q4_0 (llama.cpp)

llama-bench \
    -m Huihui-GLM-4.7-abliterated-Q4_0.gguf \
    -t 16 \
    -ngl 96 \
    -fa 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -b 8192 -ub 4096 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute

Parameter Reference

| Parameter         | Value           | Description                              |
| ----------------- | --------------- | ---------------------------------------- |
| -t 16             | 16 threads      | Optimal for the 9950X (more = worse TG!) |
| -ngl 112/96       | GPU layers      | Maximum without OOM thanks to -ot        |
| -gr 1             | graph reuse     | ik_llama.cpp-specific, improves PP       |
| -fa 1             | flash attention | llama.cpp-specific                       |
| -ot '...'         | MoE to CPU      | Enables high -ngl values                 |
| -b/-ub            | batch sizes     | Larger = better PP for large prompts     |
| -ctk/-ctv q4_0    | KV cache        | Quantized, for large contexts            |
| --numa distribute | NUMA            | Optimal RAM distribution                 |
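The same tuning carries over from llama-bench to actual serving. A sketch of a llama-server invocation with the Q4_0 settings above (path, context size, host, and port are placeholders; flag syntax can differ between llama.cpp builds, so verify against your build's --help):

```shell
llama-server \
    -m Huihui-GLM-4.7-abliterated-Q4_0.gguf \
    -t 16 \
    -ngl 96 \
    -fa 1 \
    -ot '.*ffn_(up|down|gate)_exps\.weight=CPU' \
    -c 65536 \
    -ctk q4_0 -ctv q4_0 \
    --numa distribute \
    --host 127.0.0.1 --port 8080
```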

Key Findings

1. Thread Count is Critical

More threads ≠ better performance! At T=32, TG performance drops dramatically:

| Threads | PP (t/s) | TG (t/s) |
| ------: | -------: | -------: |
|      16 |    67.38 |     6.10 |
|      24 |    68.12 |     5.41 |
|      32 |    65.89 |     3.82 |

16 threads are optimal for the Ryzen 9 9950X.

2. KV Cache Quantization

For large contexts (>32K), q4_0 KV cache is essential:

  • f16 KV: OOM at 16K+ context
  • q4_0 KV: Stable up to 128K context
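A back-of-the-envelope calculation shows why: KV-cache size is roughly 2 × layers × KV heads × head dim × context × bytes per element. The dimensions below are placeholders, not GLM-4.7's actual config (read yours from the GGUF metadata, which llama.cpp prints at load time), but the f16-vs-q4_0 ratio of about 3.6× holds regardless:

```shell
#!/bin/sh
# Rough KV-cache memory estimate. ALL dims here are placeholder values,
# NOT GLM-4.7's real configuration -- substitute your model's metadata.
awk -v layers=92 -v kv_heads=8 -v head_dim=128 -v ctx=131072 'BEGIN {
  elems = 2 * layers * kv_heads * head_dim * ctx     # K and V, every layer
  printf "f16 : %.1f GiB\n", elems * 2       / 2^30  # 2 bytes per element
  printf "q4_0: %.1f GiB\n", elems * 18 / 32 / 2^30  # 18 bytes per 32 elems
}'
```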

3. Batch Size by Application

  • Small prompts (<2K): B=4096, UB=2048
  • Large prompts (>4K): B=16384, UB=8192

Realistic Workload Scenarios

For practical coding assistant scenarios:

| Scenario        | Prompt | Generation | PP (t/s) | TG (t/s) | Wait Time* |
| --------------- | -----: | ---------: | -------: | -------: | ---------: |
| Quick Question  |    512 |        256 |       87 |      6.2 |       ~47s |
| Code Review     |   4096 |        512 |      402 |      6.2 |       ~93s |
| Large Codebase  |  16384 |       1024 |      619 |      6.2 |      ~192s |
| Maximum Context |  32768 |       2048 |     ~620 |      6.2 |      ~383s |

*Estimated total time for prompt processing + token generation
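The wait-time column follows the simple formula total ≈ prompt_tokens / PP + generated_tokens / TG. A small helper, with the rates hardcoded from the benchmark tables above, reproduces it:

```shell
#!/bin/sh
# Estimated total wait = prompt/PP + generation/TG, rounded to seconds.
# Rates are the measured values from the benchmark tables above.
estimate() {  # usage: estimate <prompt_tokens> <pp_rate> <gen_tokens> <tg_rate>
  awk -v p="$1" -v pp="$2" -v g="$3" -v tg="$4" \
      'BEGIN { printf "~%ds\n", p / pp + g / tg + 0.5 }'
}

estimate 512   87  256  6.2   # Quick Question
estimate 4096  402 512  6.2   # Code Review
estimate 16384 619 1024 6.2   # Large Codebase
estimate 32768 620 2048 6.2   # Maximum Context
```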


Conclusion

GLM-4.7 runs impressively on consumer hardware. With the right optimizations, you can achieve:

  • 619 t/s prompt processing for large contexts
  • Stable 6.2 t/s token generation regardless of context size
  • Up to 128K context with q4_0 KV cache

The key lies in intelligent MoE offloading via the -ot parameter and finding the right balance between GPU layers, thread count, and batch sizes.

For coding assistant applications, the TG rate of 6.2 t/s – roughly 370 tokens, or about 280 words, per minute – is absolutely practical, comparable to a brisk reading pace.

