Benchmark: GLM-4.7 MXFP4 on a Ryzen 9 9950X + 256 GB DDR5-6000 + RTX 5090
Best result (ik_llama.cpp with the MoE expert weights kept on CPU, full log at the bottom):
pp1024: 127.21 t/s ± 0.00
tg1024: 6.09 t/s ± 0.00
system:
(base) wisdom@9950x:/data/models$ fastfetch
wisdom@9950x
------------
OS: Fedora Linux 43 (KDE Plasma Desktop Edition) x86_64
Host: X870E AORUS PRO (Default string-CF-WCP-ADO)
Kernel: Linux 6.17.12-300.fc43.x86_64
Uptime: 7 days, 10 hours, 20 mins
Packages: 4485 (rpm), 64 (flatpak), 8 (snap)
Shell: bash 5.3.0
Display (OLED G42P5): 3840x2160 @ 138 Hz (as 3072x1728) in 48" [External, HDR] *
Display (HP E243i): 1200x1920 @ 60 Hz (as 960x1536) in 24" [External]
DE: KDE Plasma 6.5.4
WM: KWin (Wayland)
WM Theme: Breeze
Theme: Breeze (Light) [Qt], Breeze [GTK3]
Icons: Breeze [Qt], breeze-dark [GTK3/4]
Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK3/4]
Cursor: Breeze (24px)
Terminal: /dev/pts/1
CPU: AMD Ryzen 9 9950X (32) @ 5.76 GHz
GPU: NVIDIA GeForce RTX 5090 [Discrete]
Memory: 39.70 GiB / 251.30 GiB (16%)
Swap: 7.41 GiB / 8.00 GiB (93%)
Disk (/): 2.98 TiB / 3.64 TiB (82%) - btrfs
Disk (/data): 2.62 TiB / 3.58 TiB (73%) - ext4
Local IP (enp15s0): 10.0.0.19/24
Locale: de_AT.UTF-8
(base) wisdom@9950x:/data/models$
(base) wisdom@9950x:/data/models$ MODEL="/data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf"
(base) wisdom@9950x:/data/models$ BENCH="$HOME/llama.cpp/build/bin/llama-bench"
(base) wisdom@9950x:/data/models$ for ngl in 96 112 120; do
>   echo "=== ngl=$ngl ==="
>   $BENCH -m "$MODEL" -t 32 -b 256 -ub 128 -p 2048 -n 256 -r 1 -ngl $ngl -ncmoe 999 -fa 1
> done
=== ngl=96 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 96 | 32 | 256 | 128 | 1 | pp2048 | 20.82 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 96 | 32 | 256 | 128 | 1 | tg256 | 3.15 ± 0.00 |
build: 7f766929 (6527)
=== ngl=112 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 256 | 128 | 1 | pp2048 | 20.88 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 256 | 128 | 1 | tg256 | 3.60 ± 0.00 |
build: 7f766929 (6527)
=== ngl=120 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 120 | 32 | 256 | 128 | 1 | pp2048 | 19.19 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 120 | 32 | 256 | 128 | 1 | tg256 | 3.21 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$ BEST=112
(base) wisdom@9950x:/data/models$ $BENCH -m "$MODEL" -t 32 -b 512 -ub 256 -p 2048 -n 256 -r 1 -ngl $BEST -ncmoe 999 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 512 | 256 | 1 | pp2048 | 36.92 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 512 | 256 | 1 | tg256 | 3.68 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$ $BENCH -m "$MODEL" -t 32 -b 1024 -ub 256 -p 2048 -n 256 -r 1 -ngl $BEST -ncmoe 999 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 1024 | 256 | 1 | pp2048 | 37.06 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 1024 | 256 | 1 | tg256 | 3.67 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$ $BENCH -m "$MODEL" -t 32 -b 128 -ub 64 -p 2048 -n 256 -r 1 -ngl $BEST -ncmoe 999 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 128 | 64 | 1 | pp2048 | 12.00 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 128 | 64 | 1 | tg256 | 3.58 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$ for t in 16 24 32; do
>   echo "=== threads=$t ==="
>   $BENCH -m "$MODEL" -t $t -b 256 -ub 128 -p 2048 -n 256 -r 1 -ngl $BEST -ncmoe 999 -fa 1
> done
=== threads=16 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 256 | 128 | 1 | pp2048 | 20.78 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 256 | 128 | 1 | tg256 | 6.26 ± 0.00 |
build: 7f766929 (6527)
=== threads=24 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 24 | 256 | 128 | 1 | pp2048 | 20.86 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 24 | 256 | 128 | 1 | tg256 | 6.09 ± 0.00 |
build: 7f766929 (6527)
=== threads=32 ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 256 | 128 | 1 | pp2048 | 20.80 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 32 | 256 | 128 | 1 | tg256 | 4.25 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$
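Takeaway from the sweeps: ngl=112 beat both 96 and 120, b=512/ub=256 roughly tripled prefill over b=256/ub=128, and 16 threads gave the best decode (6.26 t/s), while all 32 threads dropped decode to 4.25 t/s. These ad-hoc loops could be wrapped into one helper; a minimal sketch (not run here, it only reuses the flags and values from the runs above):

#!/usr/bin/env bash
# sweep.sh - hypothetical helper: replay the ngl/batch sweeps into one timestamped log
set -euo pipefail
MODEL="/data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf"
BENCH="$HOME/llama.cpp/build/bin/llama-bench"
LOG="sweep-$(date +%Y%m%d-%H%M%S).log"
for ngl in 96 112 120; do
  # batch:ubatch pairs taken from the manual runs above
  for pair in 256:128 512:256 1024:256; do
    b=${pair%%:*} ub=${pair##*:}
    echo "=== ngl=$ngl b=$b ub=$ub ===" | tee -a "$LOG"
    "$BENCH" -m "$MODEL" -t 16 -b "$b" -ub "$ub" -p 2048 -n 256 -r 1 \
      -ngl "$ngl" -ncmoe 999 -fa 1 | tee -a "$LOG"
  done
done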
(base) wisdom@9950x:/data/models$ cat > b1.sh <<'BASH'
#!/usr/bin/env bash
set -euo pipefail
cd ~/llama.cpp/build/bin
MODEL="/data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf"
echo "=== GLM-4.7-MXFP4_MOE Real-World Benchmark (CPU-only) ==="
echo "threads=16 | ngl=0 | batch=512 ubatch=128 | reps=1"
echo
./llama-bench -m "$MODEL" \
-t 16 \
-ngl 0 \
-b 512 -ub 128 \
-r 1 --no-warmup \
-fa 0 \
-p 512,2048,8192,16384 \
-n 256
BASH
chmod +x b1.sh
bash b1.sh
=== GLM-4.7-MXFP4_MOE Real-World Benchmark (CPU-only) ===
threads=16 | ngl=0 | batch=512 ubatch=128 | reps=1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 0 | 512 | 128 | pp512 | 18.30 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 0 | 512 | 128 | pp2048 | 17.97 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 0 | 512 | 128 | pp8192 | 17.00 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 0 | 512 | 128 | pp16384 | 16.61 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 0 | 512 | 128 | tg256 | 2.14 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$
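Scraping numbers out of these markdown tables gets old quickly. Recent llama-bench builds can emit machine-readable output via -o (check llama-bench --help on your build); a sketch of the same CPU-only run dumped as JSON and summarized with jq (the field names match llama-bench's JSON output as I know it, but may differ between builds):

./llama-bench -m "$MODEL" -t 16 -ngl 0 -b 512 -ub 128 -r 1 --no-warmup \
  -p 512,2048,8192,16384 -n 256 -o json > cpu_bench.json
# one line per test: prompt/gen token counts plus average tokens/s
jq -r '.[] | "\(.n_prompt)p \(.n_gen)g  \(.avg_ts) t/s"' cpu_bench.json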
(base) wisdom@9950x:/data/models$ cat > b1_gpu.sh <<'BASH'
#!/usr/bin/env bash
set -euo pipefail
cd ~/llama.cpp/build/bin
MODEL="/data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf"
# Best values from the sweeps above:
NGL=112
NCMOE=999
FA=1
T=16
echo "=== GLM-4.7-MXFP4_MOE Real-World Benchmark (GPU hybrid) ==="
echo "ngl=$NGL | ncmoe=$NCMOE | threads=$T | fa=$FA | reps=1 | numa=distribute"
echo
# 1) Chat/decode-oriented (TG is king)
# moderate batch and ubatch -> good for TG in the earlier sweeps
./llama-bench -m "$MODEL" \
--numa distribute \
-t $T \
-ngl $NGL -ncmoe $NCMOE \
-fa $FA \
-b 256 -ub 128 \
-r 1 --no-warmup \
-p 2048 \
-n 256
echo
echo "=== Prefill-orientiert (schnell große Prompts reinladen) ==="
echo
# 2) Prefill-oriented (PP goes up, TG usually stays about the same)
./llama-bench -m "$MODEL" \
--numa distribute \
-t $T \
-ngl $NGL -ncmoe $NCMOE \
-fa $FA \
-b 1024 -ub 256 \
-r 1 --no-warmup \
-p 8192 \
-n 128
echo
echo "=== Scaling-Check (mehrere Promptgrößen, gleicher Decode) ==="
echo
# 3) Real-world scaling across prompt sizes
./llama-bench -m "$MODEL" \
--numa distribute \
-t $T \
-ngl $NGL -ncmoe $NCMOE \
-fa $FA \
-b 512 -ub 256 \
-r 1 --no-warmup \
-p 512,2048,8192,16384 \
-n 256
BASH
chmod +x b1_gpu.sh
bash b1_gpu.sh
=== GLM-4.7-MXFP4_MOE Real-World Benchmark (GPU hybrid) ===
ngl=112 | ncmoe=999 | threads=16 | fa=1 | reps=1 | numa=distribute
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 256 | 128 | 1 | pp2048 | 19.80 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 256 | 128 | 1 | tg256 | 5.58 ± 0.00 |
build: 7f766929 (6527)
=== Prefill-oriented (ingest large prompts quickly) ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 1024 | 256 | 1 | pp8192 | 34.89 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 1024 | 256 | 1 | tg128 | 5.72 ± 0.00 |
build: 7f766929 (6527)
=== Scaling check (multiple prompt sizes, same decode) ===
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 512 | 256 | 1 | pp512 | 35.93 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 512 | 256 | 1 | pp2048 | 35.63 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 512 | 256 | 1 | pp8192 | 35.51 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 512 | 256 | 1 | pp16384 | 35.50 ± 0.00 |
| glm4moe 355B.A32B MXFP4 MoE | 186.70 GiB | 358.34 B | CUDA | 112 | 512 | 256 | 1 | tg256 | 5.73 ± 0.00 |
build: 7f766929 (6527)
(base) wisdom@9950x:/data/models$
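To actually serve with the winning hybrid settings, something like this should carry over (a sketch, not run here: --n-cpu-moe is the llama-server counterpart of llama-bench's -ncmoe on recent builds, and the -fa syntax has changed over time, so verify both against llama-server --help):

~/llama.cpp/build/bin/llama-server \
  -m /data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf \
  -t 16 \
  -ngl 112 --n-cpu-moe 999 \
  -fa on \
  -b 1024 -ub 256 -c 16384 \
  --host 127.0.0.1 --port 8080
# -ngl 112 / expert layers on CPU mirrors the benchmark config;
# -b 1024 -ub 256 is the prefill-oriented pairing from b1_gpu.sh;
# -c 16384 (context size) is an assumed value, size it to your use case.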
ik_llama.cpp
(base) wisdom@9950x:/data/models/ik_llama.cpp$ ./build/bin/llama-bench -m /data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf -t 16 -ngl 112 -ot ".*ffn_(up|down|gate)_exps\.weight=CPU" -b 8192 -ub 8192 -p 1024 -n 1024 -r 1 -w 0 -fa 1 -fmoe 1 -nkvo 0 -ctk q4_0 -ctv q4_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32075 MiB
| model | size | params | backend | ngl | n_batch | n_ubatch | type_k | type_v | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --------------- | ------------: | ---------------: |
| glm4moe 355B.A32B MXFP4 - 4.25 bpw | 183.09 GiB | 352.80 B | CUDA | 112 | 8192 | 8192 | q4_0 | q4_0 | .*ffn_(up|down|gate)_exps\.weight=CPU | pp1024 | 127.21 ± 0.00 |
| glm4moe 355B.A32B MXFP4 - 4.25 bpw | 183.09 GiB | 352.80 B | CUDA | 112 | 8192 | 8192 | q4_0 | q4_0 | .*ffn_(up|down|gate)_exps\.weight=CPU | tg1024 | 6.09 ± 0.00 |
build: 5a206e3c (4084)
(base) wisdom@9950x:/data/models/ik_llama.cpp$
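This ik_llama.cpp run is where the headline numbers come from: the -ot ".*ffn_(up|down|gate)_exps\.weight=CPU" override keeps the routed-expert FFN weights in system RAM while attention and everything else sits in VRAM, -fmoe 1 enables fused MoE ops, the large -b/-ub 8192 lets prefill run in much bigger chunks, and the q4_0 K/V cache keeps the cache small. Prefill jumps from ~37 t/s (mainline llama.cpp) to 127 t/s, while decode holds around 6 t/s. A follow-up worth running would be to check where prefill saturates versus ubatch size; a sketch using only the flags from the run above:

# hypothetical follow-up: does prefill keep scaling with ubatch, or saturate earlier?
for ub in 2048 4096 8192; do
  echo "=== b=ub=$ub ==="
  ./build/bin/llama-bench -m /data/models/GLM-4.7-MXFP4_MOE-00001-of-00012.gguf \
    -t 16 -ngl 112 -ot ".*ffn_(up|down|gate)_exps\.weight=CPU" \
    -b "$ub" -ub "$ub" -p 1024 -n 256 -r 1 -w 0 \
    -fa 1 -fmoe 1 -nkvo 0 -ctk q4_0 -ctv q4_0
done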
