====== GPU Bench ======

  * [[/informatique/ai_lm/gpu_bench/llama-cpp_MTP|Multi-Tokens Prediction]]

  * [[https://blogs.nvidia.com/blog/tag/rtx-ai-garage/|RTX AI Garage]] on the NVIDIA blog

  * Gigabyte Windforce OC 12GB Geforce RTX 3060, **€354 incl. VAT**, new, 2025-11
  * PNY OC 16 GB Geforce RTX 5060 Ti, **€450 incl. VAT**, new, 2025-11

AI benchmark for [[https://lab.cyrille.giquello.fr/Anticor/graphLmExtract.html|name extraction]]:
  * with the Mistral service, Codestral model = ''00d 01h 02m 48s''
  * RTX3060 + Intel-i7, granite-4.0-h-small-Q8_0 model = ''02d 16h 11m 34s''
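
To put the two durations above on a common scale, here is a minimal sketch that converts each one to seconds and divides them. The helper names ''to_seconds'' and ''ratio'' are illustrative and do not come from any tool used on this page.

```shell
# Convert a "days hours minutes seconds" duration to seconds.
# awk is used so that zero-padded fields such as "01" are read as decimal.
to_seconds() {
  awk -v d="$1" -v h="$2" -v m="$3" -v s="$4" \
    'BEGIN { print d * 86400 + h * 3600 + m * 60 + s }'
}

# Ratio of two durations given in seconds, with one decimal.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

mistral=$(to_seconds 00 01 02 48)   # Mistral service, Codestral
rtx3060=$(to_seconds 02 16 11 34)   # RTX3060 + Intel-i7, granite-4.0-h-small-Q8_0
echo "RTX 3060 run took $(ratio "$rtx3060" "$mistral")x longer"
```

The two runs use different models, so the ratio compares the two setups as a whole rather than the hardware alone.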

According to LeChat:

^ Graphics card ^ TOPS (INT8) ^  TOPS (FP16) ^ Architecture ^
| RTX 3060 (12 GB) |  ~120 TOPS |  ~60 TOPS | Ampere |
| RTX 5060 Ti (16 GB) |  ~759 TOPS |  ~380 TOPS | Blackwell |

===== Bench llama.cpp =====

  * ''-p'' : Prompt processing (pp): processing a prompt in batches
  * ''-n'' : Text generation (tg): generating a sequence of tokens
  * ''-pg'' : Prompt processing + text generation (pg): processing a prompt followed by generating a sequence of tokens

  * Text generation: tg128, tg256, tg512 : ''-p 0 -n 128,256,512''
  * Prompt processing: b128, b256, b512 : ''-p 1024 -n 0 -b 128,256,512''

^  model                            ^ test   ^  tokens/second                              ^^^              ^
^                                   ^        ^ i7-1360P         ^ i7-1360P SYCL  ^ RTX 3060  ^ RTX 5060 Ti  ^
| Qwen2.5-coder-7b-instruct-q5_k_m  | tg128  |             5.47 |                |     57.65 |        73.54 |
| //size: 5.07 GiB//                | tg256  |              ... |                |     57.61 |        73.32 |
|                                   | tg512  |              ... |                |     56.24 |        71.80 |
|                                   | b128   |              ... |                |   1825.17 |      2840.57 |
|                                   | b256   |              ... |                |   1924.10 |      3209.52 |
|                                   | b512   |              ... |                |   1959.18 |      3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0    | tg128  |              ... |                |     41.42 |        50.33 |
| //size: 7.54 GiB//                | tg256  |              ... |                |     41.38 |        50.33 |
|                                   | tg512  |              ... |                |     40.70 |        49.62 |
|                                   | b128   |            13.98 |          36.34 |   1952.96 |      2972.52 |
|                                   | b256   |              ... |          42.28 |   2054.09 |      3460.41 |
|                                   | b512   |              ... |          45.99 |   2093.21 |      3511.29 |
| EuroLLM-9B-Instruct-Q4_0          | tg128  |              ... |                |     56.06 |        71.41 |
| //size: 4.94 GiB//                | tg256  |              ... |                |     55.96 |        71.15 |
|                                   | tg512  |              ... |                |     53.87 |        69.45 |
|                                   | b128   |              ... |                |   1433.95 |   CUDA error |
|                                   | b256   |              ... |                |   1535.06 |          ... |
|                                   | b512   |              ... |                |   1559.88 |          ... |
| Qwen3-14B-UD-Q5_K_XL              | tg128  |              ... |                |     30.00 |        37.66 |
| //size: 9.82 GiB//                | tg256  |              ... |                |     29.97 |        38.17 |
|                                   | tg512  |              ... |                |     29.25 |        37.30 |
|                                   | b128   |              ... |                |    903.97 |   CUDA error |
|                                   | b256   |              ... |                |    951.71 |          ... |
|                                   | b512   |              ... |                |    963.76 |          ... |
| Qwen3-4B-UD-Q8_K_XL               | tg128  |             7.37 |                |     56.35 |          ... |
| //size: 4.70 GiB//                | tg256  |             6.63 |                |     56.35 |          ... |
|                                   | tg512  |             6.24 |                |     54.56 |          ... |
|                                   | b128   |            20.66 |                |   2163.17 |          ... |
|                                   | b256   |              ... |                |   2405.27 |          ... |
|                                   | b512   |              ... |                |   2495.35 |          ... |
| GemmaCoder3-12B-IQ4_NL.gguf       | tg128  |              ... |                |     40.70 |          ... |
| //size: 6.41 GiB//                | tg256  |              ... |                |     40.67 |          ... |
|                                   | tg512  |              ... |                |     39.54 |          ... |
|                                   | b128   |              ... |                |   1150.11 |          ... |
|                                   | b256   |              ... |                |   1218.27 |          ... |
|                                   | b512   |              ... |                |   1253.92 |          ... |
| Gemma3-Code-Reasoning-4B.Q8_0     | tg128  |              ... |                |     66.98 |          ... |
| //size: 3.84 GiB//                | tg256  |              ... |                |     66.95 |          ... |
|                                   | tg512  |              ... |                |     65.75 |          ... |
|                                   | b128   |              ... |                |   2885.80 |          ... |
|                                   | b256   |              ... |                |   3266.87 |          ... |
|                                   | b512   |              ... |                |   3457.03 |          ... |
| GemmaCoder3-12B-Q5_K_M            | tg128  |              ... |                |     34.10 |          ... |
| //size: 7.86 GiB//                | tg256  |              ... |                |     34.06 |          ... |
|                                   | tg512  |              ... |                |     33.28 |          ... |
|                                   | b128   |              ... |                |   1045.27 |          ... |
|                                   | b256   |              ... |                |   1108.95 |          ... |
|                                   | b512   |              ... |                |   1144.97 |          ... |
| gpt-oss 20B MXFP4 MoE             | tg128  |              ... |                |     92.86 |          ... |
| gpt-oss-20b-mxfp4.gguf            | tg256  |              ... |                |     92.69 |          ... |
| //size: 11.27 GiB//               | tg512  |              ... |                |     88.17 |          ... |
|                                   | b128   |              ... |                |   1036.08 |          ... |
|                                   | b256   |              ... |                |   1452.01 |          ... |
|                                   | b512   |              ... |                |   1744.71 |          ... |
| gpt-oss 20B Q4_K - Medium         | tg128  |              ... |                |     98.05 |          ... |
| gpt-oss-20b-UD-Q4_K_XL.gguf       | tg256  |              ... |                |     97.20 |          ... |
| //size: 11.04 GiB//               | tg512  |              ... |                |     92.43 |          ... |
|                                   | b128   |              ... |                |   1034.15 |          ... |
|                                   | b256   |              ... |                |   1450.77 |          ... |
|                                   | b512   |              ... |                |   1734.35 |          ... |


  * The "CUDA error" entries occur with the RTX 5060 Ti connected through the "Wikingoo L17" PCIe/THB4 bridge, with NVIDIA driver 580.
  * On CPU, leave the thread count on automatic: the physical cores are used. Forcing more threads reduces performance.
    * Multi-threading across physical cores does help, e.g. 7.37 t/s in automatic mode vs 3.39 t/s with 1 thread.
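
The gain from multi-threading noted above can be expressed as a speed-up factor. A minimal sketch, where ''speedup'' is an illustrative helper name and the two figures are the ones quoted in the note:

```shell
# Speed-up factor between two throughputs in tokens/s, with two decimals.
speedup() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# 7.37 t/s with the automatic thread count vs 3.39 t/s with a single thread
echo "auto vs 1 thread: $(speedup 7.37 3.39)x"
```

To re-measure on other hardware, ''llama-bench'' accepts a comma-separated list for ''--threads'' (as used later on this page with ''--threads 7,8''), so both cases can be run in one invocation.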


===== Intel® Core™ i7-1360P 13th Gen =====

For comparison:

**Qwen2.5-coder-7b-instruct-q5_k_m**:
<code>
./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |
</code>

===== Gigabyte Windforce OC 12GB Geforce RTX 3060 =====

{{ :informatique:ai_coding:ia_rtx_3060_small.jpg?direct&400|}}

With ''sudo nsys-ui'':
^ NVIDIA GeForce RTX 3060 ^^
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |


With llama.cpp and CUDA 12.9.

==== Qwen2.5-coder-7b-instruct-q5_k_m ====

<code>
./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |
</code>

==== GemmaCoder3-12B-Q5_K_M ====

To run ''llama-server'' with the "GemmaCoder3-12B-Q5_K_M.gguf" model (8.4 GB file, 49 layers) using its **maximum context of "131072" via ''<nowiki>--ctx-size 0</nowiki>'' instead of the default "4096"**, some layers must be offloaded to the CPU; otherwise loading fails with ''main: error: unable to load model''. The same applies to ''llama-cli''.

^ n-gpu-layers ^ test ^ tokens/s ^    time    ^ perf loss (%) ^
|  (all) 49 | tg128|    34.15 |  0m25,904s |   0.00%  |
|           | b128 |  1041.60 |  0m13,117s |   0.00%  |
|        44 | tg128|    15.55 |  0m48,049s |  54.47%  |
|           | b128 |   279.26 |  0m28,613s |  73.19%  |
|        39 | tg128|    10.74 |  1m07,555s |  68.55%  |
|           | b128 |   150.49 |  0m46,996s |  85.55%  |
|        30 | tg128|     6.83 |  1m42,221s |  80.01%  |
|           | b128 |    82.91 |  1m19,729s |  92.04%  |
|  full cpu | tg128|     3.12 |  3m28,308s |  90.86%  |
|           | b128 |     4.50 |  22m37,674s |  99.57%  |
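
The loss column in the table above is relative to the run with all 49 layers on the GPU. A minimal sketch of that calculation, where ''perf_loss'' is an illustrative helper name:

```shell
# Performance loss in percent relative to the all-layers-on-GPU baseline.
perf_loss() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%.2f\n", (1 - val / base) * 100 }'
}

perf_loss 34.15 15.55     # tg128, 44 layers on GPU
perf_loss 1041.60 279.26  # b128,  44 layers on GPU
```

The baseline is the first argument, so the same helper can be reused for any row by passing the ''(all) 49'' value for the matching test.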

Settings that allow this model to load:

  * ''llama-cli'':
    * with its maximum context of 131072, only 30 layers fit on the GPU: ''<nowiki>--n-gpu-layers 30</nowiki>'', i.e. an 80% performance loss
    * ''<nowiki>--ctx-size 70000 --n-gpu-layers 41</nowiki>''
    * with all layers on the GPU: ''<nowiki>--ctx-size 42000</nowiki>''
  * ''llama-server'':
    * ''<nowiki>--ctx-size 40000 --n-gpu-layers 44</nowiki>''
    * ''<nowiki>--ctx-size 43500 --n-gpu-layers 43</nowiki>''
    * ''<nowiki>--ctx-size 52500 --n-gpu-layers 42</nowiki>''

With ''<nowiki>--ctx-size 52500 --n-gpu-layers 42</nowiki>'':

<code>
...
NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
...
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_embd_inp       = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = gemma-3-12b-it-codeforces-SFT
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
...
print_info: max token length = 48
...
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1720.59 MiB
load_tensors:        CUDA0 model buffer size =  6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 52736
llama_context: n_ctx_seq     = 52736
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache:        CPU KV buffer size =   412.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells,   8 layers,  4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 4608 cells
llama_kv_cache:        CPU KV buffer size =   180.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1260.00 MiB
llama_kv_cache: size = 1440.00 MiB (  4608 cells,  40 layers,  4/1 seqs), K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =  1307.32 MiB
llama_context:  CUDA_Host compute buffer size =   120.02 MiB
llama_context: graph nodes  = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)

</code>
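
The KV cache sizes in the log above can be cross-checked by hand. The sketch below assumes that each cell stores ''n_embd_k_gqa'' (2048) f16 values of 2 bytes per layer, for K and again for V; this formula is inferred from the figures in the log, not taken from the llama.cpp source, and ''kv_side_mib'' is an illustrative helper name.

```shell
# Size in MiB of one side (K or V) of an f16 KV cache:
# cells x layers x n_embd_gqa x 2 bytes per f16 value.
kv_side_mib() {
  cells=$1; layers=$2; n_embd_gqa=$3
  echo $(( cells * layers * n_embd_gqa * 2 / 1048576 ))
}

kv_side_mib 52736 8 2048   # non-SWA cache: 8 layers over the full context
kv_side_mib 4608 40 2048   # SWA cache: 40 layers over a 4608-cell window
```

The results agree with the ''K (f16)'' figures reported in the log (1648.00 MiB and 720.00 MiB). Under this assumption, only the 8 non-SWA layers grow with ''<nowiki>--ctx-size</nowiki>''.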


===== PNY OC 16 GB Geforce RTX 5060 Ti =====

==== With real PCIe ✅ ====

On a real tower PC with PCIe x16 and an Intel(R) Core(TM) Ultra 7 270K Plus.

**llama.cpp is sensitive to the environment and build options**:
  * https://github.com/ggml-org/llama.cpp/issues/23546#issuecomment-4662239477


^ Model ^ params ^ GPU offload ^ Prompt (t/s) ^ Eval (t/s) ^ Total (ms) ^ Generated tokens ^ Graphs reused ^
| Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL | 24B | 17/41 | 427.81 – 545.85 | 0.80 – 3.19 | 123,500 – 568,458 | 9,629 – 47,241 | 0 |
| Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL | 30B | 49/49 | 590.38 – 591.76 | 28.64 – 30.06 | 4,715 – 12,818 | 19,919 – 22,804 | 294 – 530 |
| Qwen3-Coder-Next-UD-Q4_K_XL | 80B | 49/49 | 29.00 – 400.09 | 18.68 – 32.44 | 25,057 – 87,659 | 719 – 43,214 | 10 – 1,024 |
| DeepSeek-R1-Distill-Qwen-32B-Q4_K_M | 32B | 24/65 | 88.97 – 428.81 | 2.14 – 2.32 | 116,052 – 189,566 | 925 – 3,397 | 228 – 419 |
| DeepSeek-R1-Distill-Qwen-14B-Q8_0 | 14B | 24/49 | 225.55 – 775.01 | 4.10 – 4.13 | 81,383 – 147,476 | 1,307 – 3,858 | 313 – 582 |

=== gpt-oss-20b-UD-Q4_K_XL ===

<code>
$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                     |       size |     params | backend | ngl |    test |            t/s |
| ------------------------- | ---------: | ---------: | ------- | --: | ------: | -------------: |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg128 |  155.79 ± 0.21 |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg256 |  155.81 ± 0.03 |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg512 |  155.15 ± 0.01 |

build: e25a32e98 (9584)

$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                     |       size |  params | backend | ngl | n_batch |    test |             t/s |
| ------------------------- | ---------: | ------: | ------- | --: | ------: | ------: | --------------: |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     128 |  pp1024 | 3308.23 ± 19.28 |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     256 |  pp1024 | 4792.27 ± 39.25 |
| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     512 |  pp1024 | 6048.13 ± 32.16 |

build: e25a32e98 (9584)
</code>

=== Qwen2.5-coder-7b-instruct-q8_0 ===

<code>
$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model            |       size |     params | backend   | ngl |        test |               t/s |
| ---------------- | ---------: | ---------: | --------- | --: | ----------: | ----------------: |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg128 |      54.23 ± 0.02 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg256 |      54.23 ± 0.00 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg512 |      54.12 ± 0.00 |

build: e25a32e98 (9584)

$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 1024 -n 0 -b 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model            |       size |     params | backend   | ngl | n_batch |      test |              t/s |
| ---------------- | ---------: | ---------: | --------- | --: | ------: | --------: | ---------------: |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     128 |    pp1024 |   3746.31 ± 4.80 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     256 |    pp1024 |   4174.39 ± 0.45 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     512 |    pp1024 |   4354.18 ± 5.39 |

build: e25a32e98 (9584)
</code>

=== Qwen2.5-coder-14b-instruct-q5_k_m ===

<code>
$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-14b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                   |       size |   params | backend | ngl |     test |             t/s |
| ----------------------- | ---------: | -------: | ------- | --: | -------: | --------------: |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg128 |    39.54 ± 0.02 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg256 |    39.53 ± 0.01 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg512 |    39.38 ± 0.01 |

build: e25a32e98 (9584)

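$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-14b-instruct-q5_k_m.gguf -p 1024 -n 0 -b 128,256,512
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB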
| model                   |       size |   params | backend | ngl | n_batch |    test |             t/s |
| ----------------------- | ---------: | -------: | ------- | --: | ------: | ------: | --------------: |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     128 |  pp1024 |  1835.16 ± 1.69 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     256 |  pp1024 |  1967.12 ± 1.01 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     512 |  pp1024 |  1995.02 ± 0.84 |

build: e25a32e98 (9584)
</code>

=== gemma-4-26B-A4B-it-qat-UD-Q4_K_XL ===

<code>
prompt eval time =     318.17 ms /   165 tokens (    1.93 ms per token,   518.59 tokens per second)
       eval time =    1338.88 ms /    86 tokens (   15.57 ms per token,    64.23 tokens per second)
      total time =    1657.05 ms /   251 tokens
   graphs reused =       1916
stop processing: n_tokens = 20931, truncated = 0

prompt eval time =    3143.73 ms /  4850 tokens (    0.65 ms per token,  1542.75 tokens per second)
       eval time =   31502.45 ms /  1854 tokens (   16.99 ms per token,    58.85 tokens per second)
      total time =   34646.18 ms /  6704 tokens
   graphs reused =       3762
stop processing: n_tokens = 27604, truncated = 0
</code>


=== Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL ===

RTX 5060 Ti + Ultra 7 270K.

<code>
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB

| model                          |       size |     params | backend    |
| ------------------------------ | ---------: | ---------: | ---------- |
| qwen3moe 30B.A3B Q4_K - Medium |  16.45 GiB |    30.53 B | CUDA       |
| qwen3moe 30B.A3B Q4_K - Medium |  16.45 GiB |    30.53 B | CUDA       |
| qwen3moe 30B.A3B Q4_K - Medium |  16.45 GiB |    30.53 B | CUDA       |
build: 931eb37f8 (9848)
</code>

<code>
$ ./llama.cpp/build/bin/llama-bench -m /data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 0 -n 128,256,512

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| ngl |            test |                  t/s |
| --: | --------------: | -------------------: |
llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf'
</code>

Performance comparison when varying how many MoE layers are offloaded to the CPU.

<code>
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512 --n-cpu-moe 6 --n-gpu-layers 99
| ngl |  n_cpu_moe | n_batch |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |          6 |     128 |          pp1024 |        697.99 ± 4.45 |
|  99 |          6 |     256 |          pp1024 |       1174.76 ± 9.12 |
|  99 |          6 |     512 |          pp1024 |       1886.14 ± 9.84 |

$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512 --n-gpu-layers 44
| ngl | n_batch |            test |                  t/s |
| --: | ------: | --------------: | -------------------: |
|  44 |     128 |          pp1024 |        730.75 ± 4.96 |
|  44 |     256 |          pp1024 |       1228.75 ± 7.90 |
|  44 |     512 |          pp1024 |       1959.76 ± 7.00 |

</code>

<code>
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 0 -n 128,256,512 --n-cpu-moe 6 --n-gpu-layers 99
| ngl |  n_cpu_moe |            test |                  t/s |
| --: | ---------: | --------------: | -------------------: |
|  99 |          6 |           tg128 |         94.79 ± 5.26 |
|  99 |          6 |           tg256 |         95.48 ± 1.26 |
|  99 |          6 |           tg512 |         95.12 ± 3.66 |

$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 0 -n 128,256,512 --n-gpu-layers 44
| ngl |            test |                  t/s |
| --: | --------------: | -------------------: |
|  44 |           tg128 |         98.73 ± 0.53 |
|  44 |           tg256 |         97.58 ± 0.10 |
|  44 |           tg512 |         94.90 ± 0.09 |
</code>

=== Qwen3-Coder-Next-UD-Q4_K_XL (80B) ===

See the [[#qwen3-coder-next_80b|dedicated chapter]].

=== Nemotron-Cascade-2-30B-A3B ===

RTX 5060 Ti + Ultra 7 270K.

<code>
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                          |       size |     params | backend    |
| ------------------------------ | ---------: | ---------: | ---------- |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  23.02 GiB |    31.58 B | CUDA       |
build: 931eb37f8 (9848)
</code>

<code>
$ llama-bench -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
llama_bench: error: failed to load model '/data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf'
</code>

<code>
$ llama-bench -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 0 -n 128,256,512 --n-gpu-layers 32
| ngl |            test |                  t/s |
| --: | --------------: | -------------------: |
|  32 |           tg128 |         32.65 ± 0.24 |
|  32 |           tg256 |         32.69 ± 0.34 |
|  32 |           tg512 |         32.72 ± 0.11 |

$ llama-bench -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 0 -n 128,256,512 --n-cpu-moe 24 --n-gpu-layers 99
| ngl |  n_cpu_moe |            test |                  t/s |
| --: | ---------: | --------------: | -------------------: |
|  99 |         24 |           tg128 |         54.37 ± 0.12 |
|  99 |         24 |           tg256 |         54.44 ± 0.02 |
|  99 |         24 |           tg512 |         54.08 ± 0.30 |
</code>

<code>
$ llama-bench -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 1024 -n 0 -b 128,256,512 --n-gpu-layers 32
| ngl | n_batch |            test |                  t/s |
| --: | ------: | --------------: | -------------------: |
|  32 |     128 |          pp1024 |        214.26 ± 0.55 |
|  32 |     256 |          pp1024 |        370.70 ± 2.07 |
|  32 |     512 |          pp1024 |        638.52 ± 7.24 |

$ llama-bench -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 1024 -n 0 -b 128,256,512 --n-cpu-moe 24 --n-gpu-layers 99
| ngl |  n_cpu_moe | n_batch |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |         24 |     128 |          pp1024 |        242.47 ± 1.27 |
|  99 |         24 |     256 |          pp1024 |        422.45 ± 3.50 |
|  99 |         24 |     512 |          pp1024 |       750.21 ± 11.20 |
</code>

==== Instability with eGPU 😩 ====

Reset NVIDIA and CUDA:
<code>
$ sudo rmmod nvidia_uvm nvidia
</code>

  * Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
  * Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
    * CUDA0 model buffer size =  7605,33 MiB
    * CUDA0 compute buffer size =   258,50 MiB

After 2 months of retries with all sorts of grub and modprobe configurations, with help from forums and assistants (Claude, ChatGPT, LeChat), a solution turned up [[https://github.com/NVIDIA/open-gpu-kernel-modules/issues/974|in this issue]]: force the PCIe link to "Gen 3".

<code>
# Get the PCI address ("0000:05:00.0") of the RTX card:
lspci | grep -i nvidia

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 8GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
</code>

**But no**: it worked with ''llama-bench'' but not with Yolo: 😩

<code>
kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead
</code>
 
===== Qwen3-Coder-Next 80B =====

Focus on the [[https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF|Qwen3-Coder-Next 80B]] model running on the "RTX 5060 Ti" (16 GB of VRAM) installed in an "ASUS ProArt Z890-Creator" with an "Intel Core Ultra 7 270K Plus" and 96 GB of DDR5 (2x48).

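llama.cpp build: 082b326fc (9951)

Parameters explored:
  * ''--threads'': number of CPU threads to use during generation
  * ''--n-cpu-moe'': keep the MoE weights of the first N layers on the CPU
  * ''--override-tensor'': override the buffer type of tensors matching a name pattern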


<code>
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                          |       size |     params |
| ------------------------------ | ---------: | ---------: |
| qwen3next 80B.A3B Q4_K - Medium |  46.20 GiB |    79.67 B |
| qwen3next 80B.A3B Q4_K - Medium |  46.20 GiB |    79.67 B |
| qwen3next 80B.A3B Q4_K - Medium |  46.20 GiB |    79.67 B |
build: 931eb37f8 (9848)
</code>

==== --ubatch-size ====

With a large context: 196k

  * ''--ubatch-size'' sets the size of the compute graph actually executed on the GPU at each pass. It determines the size of the intermediate compute buffers (mul_mat_q, attention).


<code bash>
$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf --mmap 0 --direct-io 1 \
 --n-cpu-moe 39 --n-gpu-layers 99 --fit-ctx 196000 \
 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
 --threads 7,8 -p 8192,16384 -n 256,512 -b 4096,8192 -ub 1024,2048,3072
</code>

^ threads ^ ubatch ^ pp8192 (t/s) ^ pp16384 (t/s) ^ tg256 (t/s) ^ tg512 (t/s) ^
| 7 | 1024 | 873.6 | 866.7 | 37.8 | 36.4 |
| 8 | 1024 | 871.9 | 866.1 | 38.7 | 39.8 |
| 7 | 2048 | 1121.7 | 1108.8 | 40.3 | 38.9 |
| 8 | 2048 | 1125.7 | 1107.6 | 38.0 | 39.4 |
| 7 | 3072 | 1108.6 | crash | — | — |

ggml's CUDA VMM allocator manages a reserved per-device pool that tends to grow as configurations are tested within the same process, without necessarily being freed or shrunk between two successive context settings. So I switch to isolated commands, one configuration per process.

<code>
./llama.cpp/build/bin/llama-bench -m /data/models/Qwen3-Coder-Next-UD-Q4_K_XL.gguf --mmap 0 --direct-io 1 \
 --n-cpu-moe 39 --n-gpu-layers 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
 --threads 8 --fit-ctx 196000

-p 32768 -n 256 -b 4096,8192 -ub 2048,2560

| n_batch | n_ubatch |            test |                  t/s |
| ------: | -------: | --------------: | -------------------: |
|    4096 |     2048 |         pp32768 |       1073.39 ± 1.15 |
|    4096 |     2560 |         pp32768 |       1072.82 ± 1.55 |
|    8192 |     2048 |         pp32768 |       1071.79 ± 1.34 |
|    8192 |     2560 |         pp32768 |       1081.85 ± 1.80 |

The pp throughput plateaus beyond ub=2048.

With -p 65536 it crashes ...

Offloading a MoE layer to the CPU to get a larger micro-batch is not the right path!
|  n_cpu_moe | n_batch | n_ubatch |            test |                  t/s |
| ---------: | ------: | -------: | --------------: | -------------------: |
|         40 |    8192 |     2048 |        pp131072 |        885.22 ± 0.73 |

</code>
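
The isolated runs can be scripted so that each configuration gets its own process. This is only a sketch: the flags are the ones used above, while ''run_isolated'', ''LLAMA_BENCH'', ''MODEL'' and ''DRY_RUN'' are names I made up for this example.

<code bash>
#!/usr/bin/env bash
# One llama-bench process per (batch, ubatch) pair, so that nothing allocated
# for one configuration is carried over to the next one.
LLAMA_BENCH="${LLAMA_BENCH:-./llama.cpp/build/bin/llama-bench}"
MODEL="${MODEL:-/data/models/Qwen3-Coder-Next-UD-Q4_K_XL.gguf}"

run_isolated() {
  local b ub
  for b in 4096 8192; do
    for ub in 2048 2560; do
      # With DRY_RUN=1 the command is printed instead of being executed.
      ${DRY_RUN:+echo} "$LLAMA_BENCH" -m "$MODEL" --mmap 0 --direct-io 1 \
        --n-cpu-moe 39 --n-gpu-layers 99 --flash-attn on \
        --cache-type-k q8_0 --cache-type-v q8_0 \
        --threads 8 --fit-ctx 196000 \
        -p 32768 -n 256 -b "$b" -ub "$ub" \
        || echo "failed: -b $b -ub $ub" >&2
    done
  done
}

# DRY_RUN=1 run_isolated   # print the four commands
# run_isolated             # run them for real
</code>

A crash in one configuration is reported on stderr and the loop moves on to the next pair.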

With ''llama-batched-bench'':

<code>
#!/usr/bin/bash
LLAMA_DIR="$HOME/llama.cpp/build/bin"
MODEL_DIR="/data/models"
# https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
MODEL="Qwen3-Coder-Next-UD-Q4_K_XL.gguf"

#
# does not accept multiple values for "-ub" and "-b"
# -b 2048,4096,6144,8192
# -ub 1024,1536,2048,2560
#

"$LLAMA_DIR/llama-batched-bench" -m "$MODEL_DIR/$MODEL" \
  -c 196000 \
  -b 2048 \
  -ub 1024 \
  -ngl 99 \
  --n-cpu-moe 39 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 8 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0 \
  --flash-attn on \
  -npp 512,1024,2048,4096 \
  -ntg 32,64,128 \
  -npl 1 \
  --output-format md

== Results:

llama_batched_bench: n_kv_max = 196096, n_batch = 2048, n_ubatch = 1024, flash_attn = 1,
is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 99, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    1 |    544 |    1.752 |   292.21 |    0.835 |    38.32 |    2.587 |   210.26 |
|   512 |     64 |    1 |    576 |    1.681 |   304.54 |    1.634 |    39.17 |    3.315 |   173.75 |
|   512 |    128 |    1 |    640 |    1.677 |   305.36 |    3.422 |    37.41 |    5.098 |   125.53 |
|  1024 |     32 |    1 |   1056 |    2.124 |   482.20 |    0.838 |    38.18 |    2.962 |   356.54 |
|  1024 |     64 |    1 |   1088 |    2.139 |   478.71 |    1.669 |    38.35 |    3.808 |   285.70 |
|  1024 |    128 |    1 |   1152 |    2.156 |   474.87 |    3.327 |    38.47 |    5.484 |   210.07 |
|  2048 |     32 |    1 |   2080 |    4.304 |   475.82 |    0.845 |    37.88 |    5.149 |   403.97 |
|  2048 |     64 |    1 |   2112 |    4.422 |   463.18 |    1.675 |    38.20 |    6.097 |   346.41 |
|  2048 |    128 |    1 |   2176 |    4.419 |   463.44 |    3.347 |    38.24 |    7.766 |   280.19 |

</code>
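
Since ''llama-batched-bench'' takes a single value for ''-b'' and ''-ub'', the grid from the comment in the script has to be enumerated by hand. A possible sketch follows. ''batch_pairs'' is a helper name of my own, and skipping the pairs where the micro-batch is larger than the batch is my assumption that such pairs bring nothing.

<code bash>
#!/usr/bin/env bash
# Print every "batch ubatch" pair worth testing, one pair per line.
batch_pairs() {
  local b ub
  for b in 2048 4096 6144 8192; do
    for ub in 1024 1536 2048 2560; do
      # Assumption: a micro-batch larger than the logical batch is pointless.
      [ "$ub" -le "$b" ] && echo "$b $ub"
    done
  done
  return 0
}

# Usage, with the variables of the script above:
# batch_pairs | while read -r b ub; do
#   "$LLAMA_DIR/llama-batched-bench" -m "$MODEL_DIR/$MODEL" -c 196000 \
#     -b "$b" -ub "$ub" -ngl 99 --n-cpu-moe 39 --threads 8 --flash-attn on \
#     -npp 512,1024,2048,4096 -ntg 32,64,128 -npl 1 --output-format md
# done
</code>

Each iteration starts a fresh process, which also avoids the memory growth seen with ''llama-bench''.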

==== --n-cpu-moe ====

Performance comparison when varying how many MoE layers are offloaded to the CPU.

"Generation":
<code>
$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 0 -n 128,256,512 --n-cpu-moe 35 --n-gpu-layers 99

| ngl |  n_cpu_moe |            test |                  t/s |
| --: | ---------: | --------------: | -------------------: |
|  99 |         35 |           tg128 |         36.52 ± 0.03 |
|  99 |         35 |           tg256 |         36.43 ± 0.06 |
|  99 |         35 |           tg512 |         36.40 ± 0.03 |

$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 0 -n 128,256,512 --n-cpu-moe 39 --n-gpu-layers 99

| ngl |  n_cpu_moe |            test |                  t/s |
| --: | ---------: | --------------: | -------------------: |
|  99 |         39 |           tg128 |         34.49 ± 0.03 |
|  99 |         39 |           tg256 |         34.53 ± 0.03 |
|  99 |         39 |           tg512 |         34.52 ± 0.02 |
build: 931eb37f8 (9848)

$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 0 -n 128,256,512 --n-gpu-layers 16
| ngl |            test |                  t/s |
| --: | --------------: | -------------------: |
|  16 |           tg128 |         16.02 ± 0.08 |
|  16 |           tg256 |         16.09 ± 0.10 |
|  16 |           tg512 |         15.92 ± 0.06 |
</code>

"Prompt":
<code>
$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512 --n-cpu-moe 35 --n-gpu-layers 99
| ngl |  n_cpu_moe | n_batch |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |         35 |     128 |          pp1024 |        138.94 ± 1.34 |
|  99 |         35 |     256 |          pp1024 |        212.98 ± 2.30 |
|  99 |         35 |     512 |          pp1024 |        336.39 ± 3.29 |

$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512 --n-cpu-moe 39 --n-gpu-layers 99
| ngl |  n_cpu_moe | n_batch |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |         39 |     128 |          pp1024 |        130.04 ± 0.99 |
|  99 |         39 |     256 |          pp1024 |        199.63 ± 1.95 |
|  99 |         39 |     512 |          pp1024 |        315.40 ± 2.98 |


$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512 --n-gpu-layers 16
| ngl | n_batch |            test |                  t/s |
| --: | ------: | --------------: | -------------------: |
|  16 |     128 |          pp1024 |        124.25 ± 0.77 |
|  16 |     256 |          pp1024 |        195.94 ± 1.72 |
|  16 |     512 |          pp1024 |        314.51 ± 2.39 |

</code>

==== --threads ====

Number of CPU threads to use during generation

Notes:
  * At 8 threads the %GPU curve in ''nvtop'' is no longer flat: square-wave dips appear.
  * llama-server needs a less aggressive ''--n-cpu-moe'' setting, because it needs VRAM for other things.

**--n-cpu-moe 35 :**
<code>
$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 0 -n 512,1024 --n-cpu-moe 35 --n-gpu-layers 99 --threads 1,2,4,8

| ngl |  n_cpu_moe | threads |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |         35 |       1 |           tg512 |         21.25 ± 0.10 |
|  99 |         35 |       1 |          tg1024 |         21.44 ± 0.07 |
|  99 |         35 |       2 |           tg512 |         26.96 ± 0.01 |
|  99 |         35 |       2 |          tg1024 |         27.08 ± 0.00 |
|  99 |         35 |       4 |           tg512 |         37.06 ± 0.03 |
|  99 |         35 |       4 |          tg1024 |         36.70 ± 0.01 |
|  99 |         35 |       8 |           tg512 |         41.51 ± 1.44 | <-- 🚀
|  99 |         35 |       8 |          tg1024 |         39.34 ± 2.49 |

nvtop:
1 thread: GPU=17%, MEM=95%, CPU=100%
2 threads: GPU=22%, MEM=95%, CPU=173%
4 threads: GPU=30%, MEM=95%, CPU=280%
8 threads: GPU=36%, MEM=95%, CPU=370%
</code>

**--n-cpu-moe 39 :**
<code>
$ llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 0 -n 512,1024 --n-cpu-moe 39 --n-gpu-layers 99 --threads 1,2,4,8

| ngl |  n_cpu_moe | threads |            test |                  t/s |
| --: | ---------: | ------: | --------------: | -------------------: |
|  99 |         39 |       1 |           tg512 |         16.67 ± 0.04 |
|  99 |         39 |       1 |          tg1024 |         16.80 ± 0.04 |
|  99 |         39 |       2 |           tg512 |         21.54 ± 0.05 |
|  99 |         39 |       2 |          tg1024 |         21.46 ± 0.03 |
|  99 |         39 |       4 |           tg512 |         30.36 ± 0.10 |
|  99 |         39 |       4 |          tg1024 |         30.46 ± 0.02 |
|  99 |         39 |       8 |           tg512 |         37.54 ± 1.59 |
|  99 |         39 |       8 |          tg1024 |         37.64 ± 0.53 |

nvtop:
1 thread: GPU=15%, MEM=, CPU=100%
2 threads: GPU=20%, MEM=72%, CPU=175%
4 threads: GPU=27%, MEM=72%, CPU=290%
8 threads: GPU=35%, MEM=72%, CPU=420%
</code>
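
To compare the thread scaling of the two settings, the ratio between two rows of the tables above is enough. A small sketch (''speedup'' is a name of my own; ''LC_ALL=C'' keeps the decimal point even with a French locale):

<code bash>
#!/usr/bin/env bash
# speedup <baseline t/s> <t/s>: ratio of the two throughputs, two decimals.
speedup() {
  LC_ALL=C awk -v base="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", tps / base }'
}

speedup 21.25 41.51   # --n-cpu-moe 35, tg512, 1 -> 8 threads: prints 1.95
speedup 16.67 37.54   # --n-cpu-moe 39, tg512, 1 -> 8 threads: prints 2.25
</code>

Going from 1 to 8 threads roughly doubles the generation speed in both cases, and the gain is a bit larger with more experts on the CPU.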

==== --override-tensor ====

I have not found the right recipe for replacing ''--n-cpu-moe'' with a more precise ''--override-tensor'' that saves memory without losing too much performance.

<code>
./llama.cpp/build/bin/llama-bench -m /data/models/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
 --mmap 0 --direct-io 1 --n-gpu-layers 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0  --threads 8 \
 --n-cpu-moe 39 --fit-ctx 196000 \
  -p 16384 -n 256 -b 8192 -ub 2560

... --n-cpu-moe 39

|  n_cpu_moe | n_batch | n_ubatch |            test |                  t/s |
| ---------: | ------: | -------: | --------------: | -------------------: |
|         39 |    8192 |     2560 |         pp16384 |       1117.40 ± 3.07 |
|         39 |    8192 |     2560 |           tg256 |         40.05 ± 1.01 |
|         39 |    8192 |     2560 |         pp32768 |       1082.09 ± 1.67 |
|         39 |    8192 |     2560 |           tg256 |         38.57 ± 2.42 |
|         39 |    8192 |     2560 |         pp65536 |       1007.56 ± 1.14 |
|         39 |    8192 |     2560 |           tg256 |         38.69 ± 1.52 |

... --n-cpu-moe 39 --override-tensor 'token_embd|output_norm=CPU'

|  n_cpu_moe | n_batch | n_ubatch | ot                         |     test |             t/s |
| ---------: | ------: | -------: | -------------------------- | -------: | --------------: |
|         39 |    8192 |     2560 | token_embd|output_norm=CPU |  pp16384 |  1117.62 ± 3.23 |
|         39 |    8192 |     2560 | token_embd|output_norm=CPU |    tg256 |    38.65 ± 0.39 |

... --n-cpu-moe 38

|  n_cpu_moe | n_batch | n_ubatch |            test |                  t/s |
| ---------: | ------: | -------: | --------------: | -------------------: |
|         38 |    8192 |     2560 |         pp32768 |       1081.99 ± 1.69 |
|         38 |    8192 |     2560 |           tg256 |         41.13 ± 0.15 |
|         38 |    8192 |     2560 |         pp65536 |       1007.76 ± 1.09 |
|         38 |    8192 |     2560 |           tg256 |         34.95 ± 1.32 |
|         38 |    8192 |     2560 |        pp131072 |        |
|         38 |    8192 |     2560 |           tg256 |          |

... --n-cpu-moe 38 --override-tensor 'token_embd|output_norm=CPU'

|  n_cpu_moe | n_batch | n_ubatch | ot                         |     test |             t/s |
| ---------: | ------: | -------: | -------------------------- | -------: | --------------: |
|         38 |    8192 |     2560 | token_embd|output_norm=CPU |  pp32768 |  1082.00 ± 1.84 |
|         38 |    8192 |     2560 | token_embd|output_norm=CPU |    tg256 |    40.96 ± 0.35 |

... --n-cpu-moe 38 -ot ffn_up_exps=CPU -ot ffn_gate_exps=CPU

|  n_cpu_moe | n_batch | n_ubatch | ot                         |     test |             t/s |
| ---------: | ------: | -------: | -------------------------- | -------: | --------------: |
|         38 |    8192 |     2560 |  |  pp32768 |   |
|         38 |    8192 |     2560 |  |    tg256 |     |


</code>

It looks like llama.cpp works out by itself which tensors to place on the GPU or on the CPU:

<code>
$ llama-bench --verbose ...

...
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
tensor blk.46.ffn_down_exps.weight (420 MiB q6_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.46.ffn_down_exps.weight
tensor blk.46.ffn_gate_exps.weight (352 MiB q5_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
tensor blk.46.ffn_up_exps.weight (352 MiB q5_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.46.ffn_gate_inp_shexp.weight
create_tensor: loading tensor blk.46.ffn_gate_shexp.weight
create_tensor: loading tensor blk.46.ffn_up_shexp.weight
create_tensor: loading tensor blk.46.ffn_down_shexp.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.post_attention_norm.weight
create_tensor: loading tensor blk.47.attn_q.weight
create_tensor: loading tensor blk.47.attn_k.weight
create_tensor: loading tensor blk.47.attn_v.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_q_norm.weight
create_tensor: loading tensor blk.47.attn_k_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
tensor blk.47.ffn_down_exps.weight (352 MiB q5_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.47.ffn_down_exps.weight
tensor blk.47.ffn_gate_exps.weight (288 MiB q4_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
tensor blk.47.ffn_up_exps.weight (288 MiB q4_K) buffer type overridden to CUDA_Host
create_tensor: loading tensor blk.47.ffn_up_exps.weight
create_tensor: loading tensor blk.47.ffn_gate_inp_shexp.weight
create_tensor: loading tensor blk.47.ffn_gate_shexp.weight
create_tensor: loading tensor blk.47.ffn_up_shexp.weight
create_tensor: loading tensor blk.47.ffn_down_shexp.weight
done_getting_tensors: tensor 'blk.12.ffn_gate_inp.weight' (f32) (and 112 others) cannot be used with preferred buffer type CUDA0, using CUDA_Host instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size = 13378.13 MiB
load_tensors:    CUDA_Host model buffer size = 33926.49 MiB
...
</code>
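
To see how much is actually moved by an override, the ''overridden to'' lines of the verbose log can be summed per buffer type. This is a sketch that assumes the log format shown above. ''overridden_mib'' is a helper name of my own.

<code bash>
#!/usr/bin/env bash
# Read a --verbose log on stdin and sum the size of the overridden tensors,
# grouped by the buffer type named at the end of the line.
overridden_mib() {
  LC_ALL=C awk '
    /buffer type overridden to/ {
      size = substr($3, 2)        # "(420" -> "420"
      total[$NF] += size
      count[$NF]++
    }
    END {
      for (buf in total)
        printf "%s: %d tensors, %d MiB\n", buf, count[buf], total[buf]
    }'
}

# Usage:
# llama-bench --verbose ... 2>&1 | overridden_mib
</code>

On the six ''overridden'' lines of the excerpt above (blk.46 and blk.47) this gives ''CUDA_Host: 6 tensors, 2052 MiB''.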

=== With Opencode ===

With Opencode on a large PHP project.

Context: ~36k to ~42k tokens.

''--ubatch-size 2048 --n-cpu-moe 35 --override-tensor '(token_embd|output_norm|ffn_up_exps)=CPU''' :

<code>
nvtop memory = 91%

prompt eval time =   34637.21 ms / 26409 tokens (    1.31 ms per token,   762.45 tokens per second)
       eval time =   27538.78 ms /   807 tokens (   34.12 ms per token,    29.30 tokens per second)
      total time =   62175.99 ms / 27216 tokens
   graphs reused =       4024

prompt eval time =    7333.79 ms /  4423 tokens (    1.66 ms per token,   603.10 tokens per second)
       eval time =    3952.75 ms /   111 tokens (   35.61 ms per token,    28.08 tokens per second)
      total time =   11286.54 ms /  4534 tokens
   graphs reused =       5107

prompt eval time =    2740.89 ms /  2051 tokens (    1.34 ms per token,   748.30 tokens per second)
       eval time =    8666.17 ms /   227 tokens (   38.18 ms per token,    26.19 tokens per second)
      total time =   11407.06 ms /  2278 tokens
   graphs reused =       5940

</code>

''--ubatch-size 2048 --n-cpu-moe 39'' :
<code>
nvtop memory = 91%

prompt eval time =   20448.15 ms / 15360 tokens (    1.33 ms per token,   751.17 tokens per second)
       eval time =   16597.85 ms /   587 tokens (   28.28 ms per token,    35.37 tokens per second)
      total time =   37046.00 ms / 15947 tokens
   graphs reused =        584

prompt eval time =    8837.18 ms /  6270 tokens (    1.41 ms per token,   709.50 tokens per second)
       eval time =    1544.96 ms /    48 tokens (   32.19 ms per token,    31.07 tokens per second)
      total time =   10382.14 ms /  6318 tokens
   graphs reused =       1582

prompt eval time =   14653.00 ms / 10704 tokens (    1.37 ms per token,   730.50 tokens per second)
       eval time =   22224.00 ms /   817 tokens (   27.20 ms per token,    36.76 tokens per second)
      total time =   36877.00 ms / 11521 tokens
   graphs reused =       3360

</code>

''--ubatch-size 2560 --n-cpu-moe 39'' :
<code>
nvtop memory = 92%

prompt eval time =   19592.18 ms / 17101 tokens (    1.15 ms per token,   872.85 tokens per second)
       eval time =    1747.92 ms /    61 tokens (   28.65 ms per token,    34.90 tokens per second)
      total time =   21340.10 ms / 17162 tokens
   graphs reused =        153

prompt eval time =    8109.03 ms /  7356 tokens (    1.10 ms per token,   907.14 tokens per second)
       eval time =    5224.43 ms /   154 tokens (   33.92 ms per token,    29.48 tokens per second)
      total time =   13333.45 ms /  7510 tokens
   graphs reused =       1198

init sampler, took 3.03 ms, tokens: text = 38663, total = 38663
prompt eval time =    2758.26 ms /  2168 tokens (    1.27 ms per token,   786.00 tokens per second)
       eval time =    1580.45 ms /    52 tokens (   30.39 ms per token,    32.90 tokens per second)
      total time =    4338.71 ms /  2220 tokens
   graphs reused =       9322
stop processing: n_tokens = 38714, truncated = 0

init sampler, took 3.42 ms, tokens: text = 44117, total = 44117
prompt eval time =    3915.89 ms /  2673 tokens (    1.46 ms per token,   682.60 tokens per second)
       eval time =    2496.57 ms /    78 tokens (   32.01 ms per token,    31.24 tokens per second)
      total time =    6412.46 ms /  2751 tokens
   graphs reused =       2677
stop processing: n_tokens = 44194, truncated = 0

init sampler, took 2.71 ms, tokens: text = 35046, total = 35046
prompt eval time =    1804.42 ms /   902 tokens (    2.00 ms per token,   499.88 tokens per second)
       eval time =   24947.27 ms /   739 tokens (   33.76 ms per token,    29.62 tokens per second)
      total time =   26751.69 ms /  1641 tokens
   graphs reused =       4255
stop processing: n_tokens = 35784, truncated = 0
</code>
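
Comparing these settings by eye is tedious, so the timing lines can be aggregated into one token-weighted throughput per phase. This is a sketch that assumes the log format shown above. ''timing_summary'' and the log file name in the usage line are hypothetical.

<code bash>
#!/usr/bin/env bash
# Read llama-server timing lines on stdin and print the overall throughput:
# total tokens divided by total time, for prompt processing (pp) and generation (tg).
timing_summary() {
  LC_ALL=C awk '
    /prompt eval time/ { pp_ms += $5; pp_tok += $8; next }
    /eval time/        { tg_ms += $4; tg_tok += $7 }
    END {
      if (pp_ms > 0) printf "pp: %.2f t/s\n", pp_tok / (pp_ms / 1000)
      if (tg_ms > 0) printf "tg: %.2f t/s\n", tg_tok / (tg_ms / 1000)
    }'
}

# Usage (hypothetical file holding the log of one setting):
# timing_summary < llama-server.log
</code>

Weighting by tokens keeps a short request, such as the 902-token prompt above, from dragging the average down as much as a plain mean of the per-request figures would.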