Differences

Below are the differences between two revisions of the page.

--- informatique:ai_lm:gpu_bench:llama-cpp_mtp [02/07/2026 18:19] – cyrille
+++ informatique:ai_lm:gpu_bench:llama-cpp_mtp [02/07/2026 18:44] (current version) – cyrille
@@ Line 2: / Line 2: @@
   * [[https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md|llama.cpp docs/speculative.md]]
+This benchmark is not meaningful because of:
+  * a known defect in the draft-mtp implementation
+  * on MoE models
+  * with limited VRAM under CUDA
 With "Nvidia RTX 5060 Ti 16 GB" + "Intel Core Ultra 7 270K +".
 <code>
+MTP model with draft-mtp speculation:
 $ llama-cli Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf --single-turn --spec-type draft-mtp --spec-draft-n-max 3 --predict 2048 -f prompt-1.txt
 [ Prompt: 320.3 t/s | Generation: 59.2 t/s ]
 [ Prompt: 321.5 t/s | Generation: 59.2 t/s ]
 [ Prompt: 322.3 t/s | Generation: 57.2 t/s ]
+MTP model without draft-mtp speculation:
 $ llama-cli Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf --single-turn --predict 2048 -f prompt-1.txt
@@ Line 15: / Line 24: @@
 [ Prompt: 377.0 t/s | Generation: 62.0 t/s ]
 [ Prompt: 372.3 t/s | Generation: 61.9 t/s ]
+Non-MTP model:
+$ llama-cli Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --single-turn --predict 2048 -f prompt-1.txt
+[ Prompt: 386.3 t/s | Generation: 67.6 t/s ]
+[ Prompt: 387.1 t/s | Generation: 67.2 t/s ]
+[ Prompt: 389.5 t/s | Generation: 67.6 t/s ]
+Non-MTP model with MoE offload to CPU:
+$ llama-cli Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --single-turn --predict 2048 -f prompt-1.txt --n-cpu-moe 13 --gpu-layers 99
+[ Prompt: 384.3 t/s | Generation: 69.3 t/s ]
+[ Prompt: 385.0 t/s | Generation: 69.3 t/s ]
+[ Prompt: 385.9 t/s | Generation: 69.4 t/s ]
 </code>
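To compare the runs above at a glance, the result lines can be averaged per configuration. The following is a minimal sketch, not part of the original page: the names `mean_speeds`, `relative_change` and `RUNS` are hypothetical helpers, and the parser assumes the `[ Prompt: … t/s | Generation: … t/s ]` format exactly as it appears in the logs. Only the values visible in this diff are used, so the MTP run without speculation has two samples instead of three.

```python
import re
from statistics import mean

# Matches result lines such as "[ Prompt: 320.3 t/s | Generation: 59.2 t/s ]".
LINE_RE = re.compile(r"\[ Prompt: ([\d.]+) t/s \| Generation: ([\d.]+) t/s \]")


def mean_speeds(log: str) -> tuple[float, float]:
    """Return (mean prompt t/s, mean generation t/s) over all result lines."""
    pairs = [(float(p), float(g)) for p, g in LINE_RE.findall(log)]
    if not pairs:
        raise ValueError("no result lines found")
    prompts, gens = zip(*pairs)
    return mean(prompts), mean(gens)


def relative_change(value: float, baseline: float) -> float:
    """Return the change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Result lines copied from the benchmark logs on this page.
RUNS = {
    "mtp_with_spec": """
[ Prompt: 320.3 t/s | Generation: 59.2 t/s ]
[ Prompt: 321.5 t/s | Generation: 59.2 t/s ]
[ Prompt: 322.3 t/s | Generation: 57.2 t/s ]
""",
    # Only two of the result lines are visible in the diff for this run.
    "mtp_without_spec": """
[ Prompt: 377.0 t/s | Generation: 62.0 t/s ]
[ Prompt: 372.3 t/s | Generation: 61.9 t/s ]
""",
    "non_mtp": """
[ Prompt: 386.3 t/s | Generation: 67.6 t/s ]
[ Prompt: 387.1 t/s | Generation: 67.2 t/s ]
[ Prompt: 389.5 t/s | Generation: 67.6 t/s ]
""",
    "non_mtp_moe_offload": """
[ Prompt: 384.3 t/s | Generation: 69.3 t/s ]
[ Prompt: 385.0 t/s | Generation: 69.3 t/s ]
[ Prompt: 385.9 t/s | Generation: 69.4 t/s ]
""",
}

if __name__ == "__main__":
    # Use the plain non-MTP model as the reference for generation speed.
    baseline_gen = mean_speeds(RUNS["non_mtp"])[1]
    for name, log in RUNS.items():
        prompt, gen = mean_speeds(log)
        delta = relative_change(gen, baseline_gen)
        print(f"{name:22s} prompt {prompt:6.1f} t/s  gen {gen:5.1f} t/s  ({delta:+.1f} %)")
```

With these figures, generation with draft-mtp speculation comes out slower than the plain non-MTP model, which is consistent with the note above that this benchmark is affected by the draft-mtp defect.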