AI benchmark for name extraction:
According to LeChat:
| Graphics card | TOPS (INT8) | TOPS (FP16) | Architecture |
|---|---|---|---|
| RTX 3060 (12 Go) | ~120 TOPS | ~60 TOPS | Ampere |
| RTX 5060 Ti (16 Go) | ~759 TOPS | ~380 TOPS | Blackwell |
llama.cpp bench (tg runs: `-p 0 -n 128,256,512`; batched runs: `-p 1024 -n 0 -b 128,256,512`), in tokens/second:

| model | test | i7-1360P | RTX 3060 | RTX 5060 Ti |
|---|---|---|---|---|
| Qwen2.5-coder-7b-instruct-q5_k_m (5.07 GiB) | tg128 | 5.47 | 57.65 | 73.54 |
| | tg256 | … | 57.61 | 73.32 |
| | tg512 | … | 56.20 | 71.80 |
| | b128 | … | 1825.17 | 2840.57 |
| | b256 | … | 1924.10 | 3209.52 |
| | b512 | … | 1959.18 | 3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 (7.54 GiB) | tg128 | … | 41.42 | 50.33 |
| | tg256 | … | 41.38 | 50.33 |
| | tg512 | … | 40.70 | 49.62 |
| | b128 | 13.98 | 1952.96 | 2972.52 |
| | b256 | … | 2054.09 | 3460.41 |
| | b512 | … | 2093.21 | 3511.29 |
| EuroLLM-9B-Instruct-Q4_0 (4.94 GiB) | tg128 | … | 56.06 | 71.41 |
| | tg256 | … | 55.96 | 71.15 |
| | tg512 | … | 53.87 | 69.45 |
| | b128 | … | 1433.95 | CUDA error |
| | b256 | … | 1535.06 | … |
| | b512 | … | 1559.88 | … |
| Qwen3-14B-UD-Q5_K_XL (9.82 GiB) | tg128 | … | 30.00 | 37.66 |
| | tg256 | … | 29.97 | 38.17 |
| | tg512 | … | 29.25 | 37.30 |
| | b128 | … | 903.97 | CUDA error |
| | b256 | … | 951.71 | … |
| | b512 | … | 963.76 | … |
| Qwen3-4B-UD-Q8_K_XL (4.70 GiB) | tg128 | 7.37 | 56.35 | … |
| | tg256 | 6.63 | 56.35 | … |
| | tg512 | 6.24 | 54.56 | … |
| | b128 | 20.66 | 2163.17 | … |
| | b256 | … | 2405.27 | … |
| | b512 | … | 2495.35 | … |
| GemmaCoder3-12B-IQ4_NL.gguf (6.41 GiB) | tg128 | … | 40.70 | … |
| | tg256 | … | 40.67 | … |
| | tg512 | … | 39.54 | … |
| | b128 | … | 1150.11 | … |
| | b256 | … | 1218.27 | … |
| | b512 | … | 1253.92 | … |
| Gemma3-Code-Reasoning-4B.Q8_0 (3.84 GiB) | tg128 | … | 66.98 | … |
| | tg256 | … | 66.95 | … |
| | tg512 | … | 65.75 | … |
| | b128 | … | 2885.80 | … |
| | b256 | … | 3266.87 | … |
| | b512 | … | 3457.03 | … |
| GemmaCoder3-12B-Q5_K_M (7.86 GiB) | tg128 | … | 34.10 | … |
| | tg256 | … | 34.06 | … |
| | tg512 | … | 33.28 | … |
| | b128 | … | 1045.27 | … |
| | b256 | … | 1108.95 | … |
| | b512 | … | 1144.97 | … |
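The tg and b rows above come from two separate llama-bench runs per model. A minimal sketch of the loop, assuming the model directory and build layout seen in the logs below (~/Data/AI_Models, ./build/bin):

```shell
# Sketch: one text-generation run and one batched prompt-processing run
# per model; flags match the runs shown in this document.
for m in ~/Data/AI_Models/*.gguf; do
  ./build/bin/llama-bench -m "$m" -p 0 -n 128,256,512           # tg128/tg256/tg512
  ./build/bin/llama-bench -m "$m" -p 1024 -n 0 -b 128,256,512   # b128/b256/b512
done
```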
For comparison …
Qwen2.5-coder-7b-instruct-q5_k_m:
```
./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |
```
With sudo nsys-ui:
| NVIDIA GeForce RTX 3060 | |
|---|---|
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |
With llama.cpp and CUDA 12.9:
```
./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |
```
To launch llama-server with the model "GemmaCoder3-12B-Q5_K_M.gguf" (an 8.4 GB file made of 49 layers) at its maximum context of "131072" (--ctx-size 0 instead of the default "4096"), some layers have to be offloaded to the CPU, otherwise it fails with "main: error: unable to load model". Note that the same applies to llama-cli.
| n-gpu-layers | test | tokens/s | time | % perf loss |
|---|---|---|---|---|
| (all) 49 | tg128 | 34.15 | 0m25.904s | 0.00% |
| | b128 | 1041.60 | 0m13.117s | 0.00% |
| 44 | tg128 | 15.55 | 0m48.049s | 54.47% |
| | b128 | 279.26 | 0m28.613s | 73.19% |
| 39 | tg128 | 10.74 | 1m07.555s | 68.55% |
| | b128 | 150.49 | 0m46.996s | 85.55% |
| 30 | tg128 | 6.83 | 1m42.221s | 80.01% |
| | b128 | 82.91 | 1m19.729s | 92.04% |
| full CPU | tg128 | 3.12 | 3m28.308s | 90.86% |
| | b128 | 4.50 | 22m37.674s | 99.57% |
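The perf-loss column is just the relative drop against the all-on-GPU baseline (49 layers); for example, at 44 layers:

```shell
# Perf loss = (1 - partial / baseline) * 100, values taken from the table above.
awk 'BEGIN {
  printf "tg128 loss at 44 layers: %.2f%%\n", (1 - 15.55/34.15)   * 100
  printf "b128  loss at 44 layers: %.2f%%\n", (1 - 279.26/1041.60) * 100
}'
# tg128 loss at 44 layers: 54.47%
# b128  loss at 44 layers: 73.19%
```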
Values that allow this model to load:
llama-cli:
- --n-gpu-layers 30, hence 80% perf loss
- --ctx-size 70000 --n-gpu-layers 41
- --ctx-size 42000

llama-server:
- --ctx-size 40000 --n-gpu-layers 44
- --ctx-size 43500 --n-gpu-layers 43
- --ctx-size 52500 --n-gpu-layers 42
With --ctx-size 52500 --n-gpu-layers 42:
```
... NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes ...
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_embd_inp       = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = gemma-3-12b-it-codeforces-SFT
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
...
print_info: max token length = 48
...
```
```
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1720.59 MiB
load_tensors:        CUDA0 model buffer size =  6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 52736
llama_context: n_ctx_seq     = 52736
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache:        CPU KV buffer size =   412.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells,  8 layers,  4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
llama_kv_cache:        CPU KV buffer size =   180.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1260.00 MiB
llama_kv_cache: size = 1440.00 MiB (  4608 cells, 40 layers,  4/1 seqs), K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =  1307.32 MiB
llama_context:  CUDA_Host compute buffer size =   120.02 MiB
llama_context: graph nodes  = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)
```
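The two KV-cache sizes in the log can be cross-checked from the printed hyperparameters: in f16, K plus V cost 2 × n_embd_k_gqa × 2 bytes per cell per layer, i.e. 8 KiB here with n_embd_k_gqa = 2048:

```shell
# KV bytes = cells * layers * 2 (K+V) * n_embd_k_gqa * 2 (f16 bytes per value)
awk 'BEGIN {
  kib = 2 * 2048 * 2 / 1024                     # KiB per cell per layer = 8
  printf "non-SWA KV: %.0f MiB\n", 52736 * 8  * kib / 1024
  printf "SWA KV:     %.0f MiB\n", 4608  * 40 * kib / 1024
}'
# non-SWA KV: 3296 MiB
# SWA KV:     1440 MiB
```

Adding the CUDA0 buffers from the log (6327.03 model + 2884 + 1260 KV + 1307.32 compute) gives roughly 11.5 GiB, right at the 3060's 11.63 GiB, which is why raising --ctx-size forces n-gpu-layers down.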
```
$ ./llama.cpp/build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         73.54 ± 0.01 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         73.32 ± 0.40 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         71.80 ± 0.61 |

build: 3f3a4fb9c (7130)
```
Resetting the nvidia and CUDA modules:
$ sudo rmmod nvidia_uvm nvidia
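A minimal reload sequence after the rmmod, assuming no process still holds the GPU (on desktop setups, nvidia_drm and nvidia_modeset may need to be unloaded first):

```shell
# Unload in dependency order, then reload; rmmod fails if the GPU is busy.
sudo rmmod nvidia_uvm nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm
nvidia-smi   # should re-initialize the driver and list the card
```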
After two months of retries with all kinds of grub and modprobe configs, helped by forums and assistants (Claude, ChatGPT, LeChat), a solution surfaced in this ticket: force the PCIe link to "Gen 3".
```
# Get the PCI address "0000:05:00.0" of the RTX:
lspci | grep -i nvidia

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
	LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
	LnkSta:	Speed 8GT/s (downgraded), Width x4 (downgraded)
	LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
	LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
	LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
	LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
	LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
	LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
```
But no: it did work with llama-bench, yet not with Yolo:
```
kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead
```
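To catch these Xid errors live while reproducing the crash, one option is to follow the kernel log in a second terminal:

```shell
# Follow kernel messages and keep only NVRM/Xid lines, unbuffered.
sudo dmesg --follow | grep --line-buffered -i 'NVRM\|Xid'
```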