====== GPU Bench ======
* [[https://blogs.nvidia.com/blog/tag/rtx-ai-garage/|RTX AI Garage]] on the NVIDIA blog
AI benchmark for [[https://lab.cyrille.giquello.fr/Anticor/graphLmExtract.html|name extraction]]:
* with the Mistral service, Codestral model = ''00j 01h 02m 48s''
* RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0 model = ''02j 16h 11m 34s''
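Converting both run times to seconds makes the gap easy to quantify; a quick sketch (durations copied from the two lines above, ''j'' meaning days):

```python
import re

def to_seconds(d: str) -> int:
    """Parse a '00j 01h 02m 48s' duration ('j' = days) into seconds."""
    days, hours, mins, secs = map(int, re.match(
        r"(\d+)j (\d+)h (\d+)m (\d+)s", d).groups())
    return ((days * 24 + hours) * 60 + mins) * 60 + secs

mistral = to_seconds("00j 01h 02m 48s")  # 3768 s
local = to_seconds("02j 16h 11m 34s")    # 231094 s
print(round(local / mistral, 1))         # hosted run is ~61x faster
```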
According to LeChat:
^ Graphics card ^ TOPS (INT8) ^ TOPS (FP16) ^ Architecture ^
| RTX 3060 (12 Go) | ~120 TOPS | ~60 TOPS | Ampere |
| RTX 5060 Ti (16 Go) | ~759 TOPS | ~380 TOPS | Blackwell |
llama.cpp bench:
* Text generation: tg128, tg256, tg512: ''-p 0 -n 128,256,512''
* Prompt processing: b128, b256, b512: ''-p 1024 -n 0 -b 128,256,512''
^ model ^ test ^ tokens/second ^^^
^ ^ ^ i7-1360P ^ RTX 3060 ^ RTX 5060 Ti ^
| Qwen2.5-coder-7b-instruct-q5_k_m | tg128 | 5.47 | 57.65 | 73.54 |
| //size: 5.07 GiB// | tg256 | ... | 57.61 | 73.32 |
| | tg512 | ... | 56.20 | 71.80 |
| | b128 | ... | 1825.17 | 2840.57 |
| | b256 | ... | 1924.10 | 3209.52 |
| | b512 | ... | 1959.18 | 3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 | tg128 | ... | 41.42 | 50.33 |
| //size: 7.54 GiB// | tg256 | ... | 41.38 | 50.33 |
| | tg512 | ... | 40.70 | 49.62 |
| | b128 | 13.98 | 1952.96 | 2972.52 |
| | b256 | ... | 2054.09 | 3460.41 |
| | b512 | ... | 2093.21 | 3511.29 |
| EuroLLM-9B-Instruct-Q4_0 | tg128 | ... | 56.06 | 71.41 |
| //size: 4.94 GiB// | tg256 | ... | 55.96 | 71.15 |
| | tg512 | ... | 53.87 | 69.45 |
| | b128 | ... | 1433.95 | CUDA error |
| | b256 | ... | 1535.06 | ... |
| | b512 | ... | 1559.88 | ... |
| Qwen3-14B-UD-Q5_K_XL | tg128 | ... | 30.00 | 37.66 |
| //size: 9.82 GiB// | tg256 | ... | 29.97 | 38.17 |
| | tg512 | ... | 29.25 | 37.30 |
| | b128 | ... | 903.97 | CUDA error |
| | b256 | ... | 951.71 | ... |
| | b512 | ... | 963.76 | ... |
| Qwen3-4B-UD-Q8_K_XL | tg128 | 7.37 | 56.35 | ... |
| //size: 4.70 GiB// | tg256 | 6.63 | 56.35 | ... |
| | tg512 | 6.24 | 54.56 | ... |
| | b128 | 20.66 | 2163.17 | ... |
| | b256 | ... | 2405.27 | ... |
| | b512 | ... | 2495.35 | ... |
| GemmaCoder3-12B-IQ4_NL.gguf | tg128 | ... | 40.70 | ... |
| //size: 6.41 GiB// | tg256 | ... | 40.67 | ... |
| | tg512 | ... | 39.54 | ... |
| | b128 | ... | 1150.11 | ... |
| | b256 | ... | 1218.27 | ... |
| | b512 | ... | 1253.92 | ... |
| Gemma3-Code-Reasoning-4B.Q8_0 | tg128 | ... | 66.98 | ... |
| //size: 3.84 GiB// | tg256 | ... | 66.95 | ... |
| | tg512 | ... | 65.75 | ... |
| | b128 | ... | 2885.80 | ... |
| | b256 | ... | 3266.87 | ... |
| | b512 | ... | 3457.03 | ... |
| GemmaCoder3-12B-Q5_K_M | tg128 | ... | 34.10 | ... |
| //size: 7.86 GiB// | tg256 | ... | 34.06 | ... |
| | tg512 | ... | 33.28 | ... |
| | b128 | ... | 1045.27 | ... |
| | b256 | ... | 1108.95 | ... |
| | b512 | ... | 1144.97 | ... |
* The "CUDA error" results occur with the RTX 5060 Ti on the "Wikingoo L17" PCIe/THB4 bridge with the NVIDIA 580 driver.
* On CPU, leave the thread count on automatic: only the physical cores are then used. Forcing more threads degrades performance.
* Physical multi-threading does help. Example: 7.37 t/s in auto mode vs 3.39 t/s with a single thread.
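As a quick cross-check of the table, the RTX 5060 Ti / RTX 3060 ratios for one model (tokens/s copied from the Qwen2.5-coder-7b-instruct-q5_k_m rows above):

```python
# tokens/s from the Qwen2.5-coder-7b-instruct-q5_k_m rows of the table
rtx3060 = {"tg128": 57.65, "b512": 1959.18}
rtx5060ti = {"tg128": 73.54, "b512": 3271.22}

for test in rtx3060:
    ratio = rtx5060ti[test] / rtx3060[test]
    print(f"{test}: x{ratio:.2f}")
# text generation gains ~1.3x, prompt processing ~1.7x
```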
===== Intel® Core™ i7-1360P 13th Gen =====
For comparison:
**Qwen2.5-coder-7b-instruct-q5_k_m**:
./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CPU | 4 | tg128 | 5.47 ± 0.72 |
===== Gigabyte Windforce OC 12GB GeForce RTX 3060 =====
{{ :informatique:ai_coding:ia_rtx_3060_small.jpg?direct&400|}}
With ''sudo nsys-ui'':
^ NVIDIA GeForce RTX 3060 ^^
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |
With llama.cpp and CUDA 12.9.
==== Qwen2.5-coder-7b-instruct-q5_k_m ====
./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg128 | 57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg256 | 57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg512 | 56.24 ± 0.05 |
==== GemmaCoder3-12B-Q5_K_M ====
To launch ''llama-server'' with the "GemmaCoder3-12B-Q5_K_M.gguf" model (8.4 GB file, 49 layers) at its **maximum context of 131072, using ''--ctx-size 0'' instead of the default 4096**, some layers must be offloaded to the CPU, otherwise loading fails with ''main: error: unable to load model''. Note that the same applies to ''llama-cli''.
^ n-gpu-layers ^ test ^ tokens/s ^ time ^ % perf loss ^
| (all) 49 | tg128| 34.15 | 0m25,904s | 0.00% |
| | b128 | 1041.60 | 0m13,117s | 0.00% |
| 44 | tg128| 15.55 | 0m48,049s | 54.47% |
| | b128 | 279.26 | 0m28,613s | 73.19% |
| 39 | tg128| 10.74 | 1m07,555s | 68.55% |
| | b128 | 150.49 | 0m46,996s | 85.55% |
| 30 | tg128| 6.83 | 1m42,221s | 80.01% |
| | b128 | 82.91 | 1m19,729s | 92.04% |
| full cpu | tg128| 3.12 | 3m28,308s | 90.86% |
| | b128 | 4.50 | 22m37,674s | 99.57% |
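The "% perf loss" column is simply the throughput drop relative to the all-layers-on-GPU run; a minimal check using the tg128 numbers from the table:

```python
def perf_loss(baseline_tps: float, tps: float) -> float:
    """Percent throughput lost versus the all-layers-on-GPU baseline."""
    return (1 - tps / baseline_tps) * 100

# tg128 values from the table: 49 layers = 34.15 t/s, 44 layers = 15.55 t/s
print(f"{perf_loss(34.15, 15.55):.2f}%")  # 54.47%
print(f"{perf_loss(34.15, 3.12):.2f}%")   # full CPU: 90.86%
```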
Settings that allow this model to load:
* ''llama-cli'':
* at the full 131072 context, 30 layers fit on the GPU: ''--n-gpu-layers 30'', hence an 80% perf loss
* ''--ctx-size 70000 --n-gpu-layers 41''
* and with all layers on the GPU: ''--ctx-size 42000''
* ''llama-server'':
* ''--ctx-size 40000 --n-gpu-layers 44''
* ''--ctx-size 43500 --n-gpu-layers 43''
* ''--ctx-size 52500 --n-gpu-layers 42''
With ''--ctx-size 52500 --n-gpu-layers 42'':
...
NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
...
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_embd_inp = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = gemma-3-12b-it-codeforces-SFT
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
...
print_info: max token length = 48
...
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1720.59 MiB
load_tensors: CUDA0 model buffer size = 6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 52736
llama_context: n_ctx_seq = 52736
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache: CPU KV buffer size = 412.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells, 8 layers, 4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
llama_kv_cache: CPU KV buffer size = 180.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 1260.00 MiB
llama_kv_cache: size = 1440.00 MiB ( 4608 cells, 40 layers, 4/1 seqs), K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1307.32 MiB
llama_context: CUDA_Host compute buffer size = 120.02 MiB
llama_context: graph nodes = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)
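The KV-cache sizes in the log follow directly from the printed dimensions: cells × layers × ''n_embd_k_gqa'' × 2 bytes (f16) for K, and the same again for V. A sketch reproducing the two buffers from the log above:

```python
def kv_cache_mib(cells: int, layers: int, n_embd_kv: int) -> float:
    """Total K+V cache size in MiB, f16 (2 bytes/element)."""
    bytes_total = cells * layers * n_embd_kv * 2 * 2  # K and V, 2 bytes each
    return bytes_total / (1 << 20)

# non-SWA cache: 52736 cells, 8 layers, n_embd_k_gqa = 2048
print(kv_cache_mib(52736, 8, 2048))  # 3296.0 MiB, matching the log
# SWA cache: 4608 cells, 40 layers
print(kv_cache_mib(4608, 40, 2048))  # 1440.0 MiB, matching the log
```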
===== PNY OC 16 Go GeForce RTX 5060 Ti =====
==== Qwen2.5-coder-7b-instruct-q5_k_m ====
$ ./llama.cpp/build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg128 | 73.54 ± 0.01 |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg256 | 73.32 ± 0.40 |
| qwen2 7B Q5_K - Medium | 5.07 GiB | 7.62 B | CUDA | 99 | tg512 | 71.80 ± 0.61 |
build: 3f3a4fb9c (7130)
==== Stability ====
To reset the NVIDIA and CUDA kernel modules:
$ sudo rmmod nvidia_uvm nvidia
* Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
* Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
* CUDA0 model buffer size = 7605.33 MiB
* CUDA0 compute buffer size = 258.50 MiB
After two months of retries with grub and modprobe configs of every kind, helped by forums and assistants (Claude, ChatGPT, LeChat), a solution appears [[https://github.com/NVIDIA/open-gpu-kernel-modules/issues/974|in this ticket]]: force the PCIe link to "Gen 3".
# To find the RTX card's PCI address "0000:05:00.0":
lspci | grep -i nvidia
sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003
sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
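The ''setpci'' write targets ''CAP_EXP+0xC'', the PCIe Link Control 2 register, whose low 4 bits encode the Target Link Speed (standard PCIe encoding: 1 = Gen1, 2 = Gen2, 3 = Gen3, ...). A small decode sketch of the value written above, so the command is not a magic number:

```python
# Decode the Target Link Speed field (bits 3:0) of PCIe Link Control 2,
# per the standard PCIe register encoding.
SPEEDS = {1: "2.5 GT/s (Gen1)", 2: "5 GT/s (Gen2)", 3: "8 GT/s (Gen3)",
          4: "16 GT/s (Gen4)", 5: "32 GT/s (Gen5)"}

def target_link_speed(lnkctl2: int) -> str:
    return SPEEDS.get(lnkctl2 & 0xF, "unknown")

print(target_link_speed(0x0003))  # the value written by setpci: Gen3, 8 GT/s
```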
**But no**: it worked fine with ''llama-bench'' but not with Yolo:
kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead