GPU Bench

AI benchmark for name extraction:

  • with the Mistral service, Codestral model = 00d 01h 02m 48s
  • RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0 model = 02d 16h 11m 34s
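For scale, the two durations above can be converted to seconds and compared (a quick sketch, not part of the original measurements):

```shell
# Ratio between the local run (02d 16h 11m 34s) and the
# Mistral/Codestral run (00d 01h 02m 48s), both in seconds.
awk 'BEGIN {
  mistral = 1*3600 + 2*60 + 48
  local   = 2*86400 + 16*3600 + 11*60 + 34
  printf "%.1fx\n", local / mistral
}'
```

So the hosted Codestral run was roughly 61x faster end to end.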

According to LeChat:

| Graphics card       | TOPS (INT8) | TOPS (FP16) | Architecture |
| ------------------- | ----------: | ----------: | ------------ |
| RTX 3060 (12 GB)    |        ~120 |         ~60 | Ampere       |
| RTX 5060 Ti (16 GB) |        ~759 |        ~380 | Blackwell    |
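For reference, the theoretical INT8 ratio between the two cards can be computed (a rough yardstick only; the measured llama.cpp gains below are far smaller, since token generation is largely memory-bandwidth-bound rather than compute-bound):

```shell
# Theoretical INT8 throughput ratio, RTX 5060 Ti vs RTX 3060
awk 'BEGIN { printf "%.1fx\n", 759 / 120 }'
```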

llama.cpp bench:

  • Text generation: tg128, tg256, tg512: -p 0 -n 128,256,512
  • Prompt processing: b128, b256, b512: -p 1024 -n 0 -b 128,256,512
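The two llama-bench invocations behind the table below (the model path is an example, following the commands shown later on this page):

```shell
MODEL=~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf
# tg128/tg256/tg512 rows: generation only, no prompt processing
./llama-bench -m "$MODEL" -p 0 -n 128,256,512
# b128/b256/b512 rows: prompt processing only, varying batch size
./llama-bench -m "$MODEL" -p 1024 -n 0 -b 128,256,512
```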
| model (size)                                | test  | i7-1360P t/s | RTX 3060 t/s | RTX 5060 Ti t/s |
| ------------------------------------------- | ----- | -----------: | -----------: | --------------: |
| Qwen2.5-coder-7b-instruct-q5_k_m (5.07 GiB) | tg128 |         5.47 |        57.65 |           73.54 |
|                                             | tg256 |              |        57.61 |           73.32 |
|                                             | tg512 |              |        56.20 |           71.80 |
|                                             | b128  |              |      1825.17 |         2840.57 |
|                                             | b256  |              |      1924.10 |         3209.52 |
|                                             | b512  |              |      1959.18 |         3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 (7.54 GiB)   | tg128 |              |        41.42 |           50.33 |
|                                             | tg256 |              |        41.38 |           50.33 |
|                                             | tg512 |              |        40.70 |           49.62 |
|                                             | b128  |        13.98 |      1952.96 |         2972.52 |
|                                             | b256  |              |      2054.09 |         3460.41 |
|                                             | b512  |              |      2093.21 |         3511.29 |
| EuroLLM-9B-Instruct-Q4_0 (4.94 GiB)         | tg128 |              |        56.06 |           71.41 |
|                                             | tg256 |              |        55.96 |           71.15 |
|                                             | tg512 |              |        53.87 |           69.45 |
|                                             | b128  |              |      1433.95 |      CUDA error |
|                                             | b256  |              |      1535.06 |                 |
|                                             | b512  |              |      1559.88 |                 |
| Qwen3-14B-UD-Q5_K_XL (9.82 GiB)             | tg128 |              |        30.00 |           37.66 |
|                                             | tg256 |              |        29.97 |           38.17 |
|                                             | tg512 |              |        29.25 |           37.30 |
|                                             | b128  |              |       903.97 |      CUDA error |
|                                             | b256  |              |       951.71 |                 |
|                                             | b512  |              |       963.76 |                 |
| Qwen3-4B-UD-Q8_K_XL (4.70 GiB)              | tg128 |         7.37 |        56.35 |                 |
|                                             | tg256 |         6.63 |        56.35 |                 |
|                                             | tg512 |         6.24 |        54.56 |                 |
|                                             | b128  |        20.66 |      2163.17 |                 |
|                                             | b256  |              |      2405.27 |                 |
|                                             | b512  |              |      2495.35 |                 |
| GemmaCoder3-12B-IQ4_NL.gguf (6.41 GiB)      | tg128 |              |        40.70 |                 |
|                                             | tg256 |              |        40.67 |                 |
|                                             | tg512 |              |        39.54 |                 |
|                                             | b128  |              |      1150.11 |                 |
|                                             | b256  |              |      1218.27 |                 |
|                                             | b512  |              |      1253.92 |                 |
| Gemma3-Code-Reasoning-4B.Q8_0 (3.84 GiB)    | tg128 |              |        66.98 |                 |
|                                             | tg256 |              |        66.95 |                 |
|                                             | tg512 |              |        65.75 |                 |
|                                             | b128  |              |      2885.80 |                 |
|                                             | b256  |              |      3266.87 |                 |
|                                             | b512  |              |      3457.03 |                 |
| GemmaCoder3-12B-Q5_K_M (7.86 GiB)           | tg128 |              |        34.10 |                 |
|                                             | tg256 |              |        34.06 |                 |
|                                             | tg512 |              |        33.28 |                 |
|                                             | b128  |              |      1045.27 |                 |
|                                             | b256  |              |      1108.95 |                 |
|                                             | b512  |              |      1144.97 |                 |
  • The “CUDA error” results occur with the RTX 5060 Ti behind the “Wikingoo L17” PCIe/TB4 bridge, with nvidia driver 580.
  • On the CPU, leave the thread count on automatic: the physical cores are then used. Forcing more threads lowers performance.
    • Physical multi-threading does help: e.g. 7.37 t/s in auto vs 3.39 t/s with a single thread.

Intel® Core™ i7-1360P 13th Gen

For comparison …

Qwen2.5-coder-7b-instruct-q5_k_m:

./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |

Gigabyte Windforce OC 12GB GeForce RTX 3060

With sudo nsys-ui:

NVIDIA GeForce RTX 3060
Chip Name GA104
SM Count 28
L2 Cache Size 2.25 MiB
Memory Bandwidth 335.32 GiB/s
Memory Size 11.63 GiB
Core Clock 1.79 GHz
Bus Location 0000:05:00.0
GSP firmware version 580.105.08
Video accelerator tracing Supported

With llama.cpp and CUDA 12.9.

Qwen2.5-coder-7b-instruct-q5_k_m

./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |

GemmaCoder3-12B-Q5_K_M

To run llama-server with the model “GemmaCoder3-12B-Q5_K_M.gguf” (an 8.4 GB file made of 49 layers) at its maximum context of “131072” (--ctx-size 0 instead of the default “4096”), some layers must be offloaded to the CPU, otherwise it fails with main: error: unable to load model. Note that the same applies to llama-cli.

| n-gpu-layers | test  | tokens/s | time       | perf loss |
| ------------ | ----- | -------: | ---------- | --------: |
| 49 (all)     | tg128 |    34.15 | 0m25.904s  |     0.00% |
|              | b128  |  1041.60 | 0m13.117s  |     0.00% |
| 44           | tg128 |    15.55 | 0m48.049s  |    54.47% |
|              | b128  |   279.26 | 0m28.613s  |    73.19% |
| 39           | tg128 |    10.74 | 1m07.555s  |    68.55% |
|              | b128  |   150.49 | 0m46.996s  |    85.55% |
| 30           | tg128 |     6.83 | 1m42.221s  |    80.01% |
|              | b128  |    82.91 | 1m19.729s  |    92.04% |
| full CPU     | tg128 |     3.12 | 3m28.308s  |    90.86% |
|              | b128  |     4.50 | 22m37.674s |    99.57% |
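The “perf loss” column is the drop in tokens/s relative to the all-on-GPU (49 layers) baseline. For example, for 44 layers at tg128:

```shell
# perf loss = (1 - t/s at 44 layers / t/s at 49 layers) * 100
awk 'BEGIN { printf "%.2f%%\n", (1 - 15.55 / 34.15) * 100 }'
```

which reproduces the 54.47% in the table.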

Values that allow this model to load:

  • llama-cli:
    • at its max context 131072: 30 layers on GPU (--n-gpu-layers 30), i.e. 80% perf loss
    • --ctx-size 70000 --n-gpu-layers 41
    • and for all layers on the GPU: --ctx-size 42000
  • llama-server:
    • --ctx-size 40000 --n-gpu-layers 44
    • --ctx-size 43500 --n-gpu-layers 43
    • --ctx-size 52500 --n-gpu-layers 42

With --ctx-size 52500 --n-gpu-layers 42:

...
NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
...
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_embd_inp       = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = gemma-3-12b-it-codeforces-SFT
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
...
print_info: max token length = 48
...
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1720.59 MiB
load_tensors:        CUDA0 model buffer size =  6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 52736
llama_context: n_ctx_seq     = 52736
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache:        CPU KV buffer size =   412.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells,   8 layers,  4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 4608 cells
llama_kv_cache:        CPU KV buffer size =   180.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1260.00 MiB
llama_kv_cache: size = 1440.00 MiB (  4608 cells,  40 layers,  4/1 seqs), K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =  1307.32 MiB
llama_context:  CUDA_Host compute buffer size =   120.02 MiB
llama_context: graph nodes  = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)
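The buffer sizes in this log can be cross-checked by hand (a sketch, assuming the f16 KV cache shown in the log):

```shell
# Non-SWA K cache: 52736 cells * n_embd_k_gqa (2048) * 2 bytes (f16) * 8 layers
awk 'BEGIN { printf "K cache: %.2f MiB\n", 52736 * 2048 * 2 * 8 / 1048576 }'
# Total CUDA0 usage: model + KV (non-SWA + SWA) + compute buffers
awk 'BEGIN { printf "CUDA0: %.2f MiB\n", 6327.03 + 2884.00 + 1260.00 + 1307.32 }'
```

The total (~11.5 GiB) is right at the 11.63 GiB the driver reports for this 12 GB card, which is why larger contexts force layers off the GPU.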

PNY OC 16 GB GeForce RTX 5060 Ti

Qwen2.5-coder-7b-instruct-q5_k_m

$ ./llama.cpp/build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         73.54 ± 0.01 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         73.32 ± 0.40 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         71.80 ± 0.61 |

build: 3f3a4fb9c (7130)

Stability

Resetting the nvidia and CUDA kernel modules:

$ sudo rmmod nvidia_uvm nvidia
  • Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
  • Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
    • CUDA0 model buffer size = 7605.33 MiB
    • CUDA0 compute buffer size = 258.50 MiB

After two months of retries with grub and modprobe configs of every kind, with help from forums and assistants (Claude, ChatGPT, LeChat), a solution appeared in this ticket: force the PCIe link to “Gen 3”.

# To get the PCI address "0000:05:00.0" of the RTX card:
lspci | grep -i nvidia

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 8GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

But no: it did work with llama-bench, yet not with Yolo:

kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead