====== GPU Bench ======

  * [[https://blogs.nvidia.com/blog/tag/rtx-ai-garage/|RTX AI Garage]] on the NVIDIA blog

AI benchmark for [[https://lab.cyrille.giquello.fr/Anticor/graphLmExtract.html|name extraction]]:

  * Mistral service, Codestral model = ''00d 01h 02m 48s''
  * RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0 model = ''02d 16h 11m 34s''

According to LeChat:

^ Graphics card ^ TOPS (INT8) ^ TOPS (FP16) ^ Architecture ^
| RTX 3060 (12 GB) | ~120 TOPS | ~60 TOPS | Ampere |
| RTX 5060 Ti (16 GB) | ~759 TOPS | ~380 TOPS | Blackwell |

llama.cpp bench:

  * Text generation: tg128, tg256, tg512: ''-p 0 -n 128,256,512''
  * Prompt processing: b128, b256, b512: ''-p 1024 -n 0 -b 128,256,512''

^ models ^ test ^ tokens/second ^^^
^ ^ ^ i7-1360P ^ RTX 3060 ^ RTX 5060 Ti ^
| Qwen2.5-coder-7b-instruct-q5_k_m | tg128 | 5.47 | 57.65 | 73.54 |
| //size: 5.07 GiB// | tg256 | ... | 57.61 | 73.32 |
| | tg512 | ... | 56.20 | 71.80 |
| | b128 | ... | 1825.17 | 2840.57 |
| | b256 | ... | 1924.10 | 3209.52 |
| | b512 | ... | 1959.18 | 3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 | tg128 | ... | 41.42 | 50.33 |
| //size: 7.54 GiB// | tg256 | ... | 41.38 | 50.33 |
| | tg512 | ... | 40.70 | 49.62 |
| | b128 | 13.98 | 1952.96 | 2972.52 |
| | b256 | ... | 2054.09 | 3460.41 |
| | b512 | ... | 2093.21 | 3511.29 |
| EuroLLM-9B-Instruct-Q4_0 | tg128 | ... | 56.06 | 71.41 |
| //size: 4.94 GiB// | tg256 | ... | 55.96 | 71.15 |
| | tg512 | ... | 53.87 | 69.45 |
| | b128 | ... | 1433.95 | CUDA error |
| | b256 | ... | 1535.06 | ... |
| | b512 | ... | 1559.88 | ... |
| Qwen3-14B-UD-Q5_K_XL | tg128 | ... | 30.00 | 37.66 |
| //size: 9.82 GiB// | tg256 | ... | 29.97 | 38.17 |
| | tg512 | ... | 29.25 | 37.30 |
| | b128 | ... | 903.97 | CUDA error |
| | b256 | ... | 951.71 | ... |
| | b512 | ... | 963.76 | ... |
| Qwen3-4B-UD-Q8_K_XL | tg128 | 7.37 | 56.35 | ... |
| //size: 4.70 GiB// | tg256 | 6.63 | 56.35 | ... |
| | tg512 | 6.24 | 54.56 | ... |
| | b128 | 20.66 | 2163.17 | ... |
| | b256 | ... | 2405.27 | ... |
| | b512 | ... | 2495.35 | ... |
| GemmaCoder3-12B-IQ4_NL.gguf | tg128 | ... | 40.70 | ... |
| //size: 6.41 GiB// | tg256 | ... | 40.67 | ... |
| | tg512 | ... | 39.54 | ... |
| | b128 | ... | 1150.11 | ... |
| | b256 | ... | 1218.27 | ... |
| | b512 | ... | 1253.92 | ... |
| Gemma3-Code-Reasoning-4B.Q8_0 | tg128 | ... | 66.98 | ... |
| //size: 3.84 GiB// | tg256 | ... | 66.95 | ... |
| | tg512 | ... | 65.75 | ... |
| | b128 | ... | 2885.80 | ... |
| | b256 | ... | 3266.87 | ... |
| | b512 | ... | 3457.03 | ... |
| GemmaCoder3-12B-Q5_K_M | tg128 | ... | 34.10 | ... |
| //size: 7.86 GiB// | tg256 | ... | 34.06 | ... |
| | tg512 | ... | 33.28 | ... |
| | b128 | ... | 1045.27 | ... |
| | b256 | ... | 1108.95 | ... |
| | b512 | ... | 1144.97 | ... |

  * The "CUDA error" entries occur with the RTX 5060 Ti behind the "Wikingoo L17" PCIe/TB4 bridge, with NVIDIA driver 580.
  * On CPU, leave the thread count on automatic: the physical cores are the ones used. Forcing more threads lowers performance.
  * Multi-threading across physical cores does help, e.g. 7.37 t/s in auto vs. 3.39 t/s with a single thread.

===== Intel® Core™ i7-1360P 13th Gen =====

For comparison ...
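The thread-count observations can be quantified. Assuming the automatic setting uses the 4 P-cores of the i7-1360P (the ''threads'' column in the run below reports 4), scaling from 1 thread to auto is well below linear, which is expected for memory-bound token generation. A quick sanity check on the reported numbers:

```python
# Numbers taken from the notes above (Qwen3-4B-UD-Q8_K_XL on the i7-1360P):
auto_tps = 7.37    # t/s with the automatic thread count
single_tps = 3.39  # t/s forced to a single thread
threads = 4        # physical P-cores used in the "auto" run (assumption)

speedup = auto_tps / single_tps
efficiency = speedup / threads  # parallel efficiency vs. ideal linear scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# -> speedup: 2.17x, efficiency: 54%
```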
**Qwen2.5-coder-7b-instruct-q5_k_m**:

  ./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
  load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
  load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |

===== Gigabyte Windforce OC 12GB GeForce RTX 3060 =====

{{ :informatique:ai_coding:ia_rtx_3060_small.jpg?direct&400|}}

With ''sudo nsys-ui'':

^ NVIDIA GeForce RTX 3060 ^^
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |

With llama.cpp and CUDA 12.9.
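Token generation is largely memory-bandwidth bound, and the summary table reflects that: the RTX 5060 Ti's tg128 advantage over the RTX 3060 tracks the memory-bandwidth ratio rather than the ~6x TOPS ratio. A rough check (the ~448 GB/s GDDR7 bandwidth for the 5060 Ti is an assumed spec, not measured here; 360 GB/s is the 335.32 GiB/s above converted to GB/s):

```python
# tg128 tokens/s from the summary table (Qwen2.5-coder-7b-instruct-q5_k_m):
rtx3060_tps, rtx5060ti_tps = 57.65, 73.54

# Memory bandwidth in GB/s: 360 for the 3060 (from the nsys report),
# ~448 assumed for the 5060 Ti (GDDR7 spec, not from this document).
bw_3060, bw_5060ti = 360, 448

print(f"measured tg128 speedup: {rtx5060ti_tps / rtx3060_tps:.2f}x")
print(f"bandwidth ratio:        {bw_5060ti / bw_3060:.2f}x")
```

The two ratios land close together (~1.28x vs. ~1.24x), while the INT8 TOPS ratio from the LeChat table is ~6.3x, which is why prompt processing (compute-bound) gains much more than text generation.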
==== Qwen2.5-coder-7b-instruct-q5_k_m ====

  ./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  | model                          |       size |     params | backend    | ngl |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |

==== GemmaCoder3-12B-Q5_K_M ====

To run ''llama-server'' with the "GemmaCoder3-12B-Q5_K_M.gguf" model (an 8.4 GB file with 49 layers) at its **maximum context of 131072, using ''--ctx-size 0'' instead of the default 4096**, some layers have to be offloaded to the CPU, otherwise loading fails with ''main: error: unable to load model''. The same applies to ''llama-cli''.
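The "% perf loss" column in the offload table below is simply the relative drop in tokens/s versus the all-layers-on-GPU baseline; a minimal sketch to recompute it:

```python
def perf_loss(tps: float, baseline_tps: float) -> float:
    """Relative throughput drop (%) vs. the all-layers-on-GPU baseline."""
    return (1 - tps / baseline_tps) * 100

# Values from the offload table (GemmaCoder3-12B-Q5_K_M on the RTX 3060):
print(round(perf_loss(15.55, 34.15), 2))     # tg128, 44 GPU layers -> 54.47
print(round(perf_loss(279.26, 1041.60), 2))  # b128,  44 GPU layers -> 73.19
print(round(perf_loss(3.12, 34.15), 2))      # tg128, full CPU      -> 90.86
```

Note that prompt processing (b128) degrades much faster than text generation when layers leave the GPU, since every offloaded layer's weights must be traversed for each batch.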
^ n-gpu-layers ^ test ^ tokens/s ^ time ^ % perf loss ^
| (all) 49 | tg128 | 34.15 | 0m25,904s | 0.00% |
| | b128 | 1041.60 | 0m13,117s | 0.00% |
| 44 | tg128 | 15.55 | 0m48,049s | 54.47% |
| | b128 | 279.26 | 0m28,613s | 73.19% |
| 39 | tg128 | 10.74 | 1m07,555s | 68.55% |
| | b128 | 150.49 | 0m46,996s | 85.55% |
| 30 | tg128 | 6.83 | 1m42,221s | 80.01% |
| | b128 | 82.91 | 1m19,729s | 92.04% |
| full cpu | tg128 | 3.12 | 3m28,308s | 90.86% |
| | b128 | 4.50 | 22m37,674s | 99.57% |

Settings that allow this model to load:

  * ''llama-cli'':
    * at its maximum context of 131072: 30 layers on the GPU (''--n-gpu-layers 30''), i.e. 80% perf loss
    * ''--ctx-size 70000 --n-gpu-layers 41''
    * and for all layers on the GPU: ''--ctx-size 42000''
  * ''llama-server'':
    * ''--ctx-size 40000 --n-gpu-layers 44''
    * ''--ctx-size 43500 --n-gpu-layers 43''
    * ''--ctx-size 52500 --n-gpu-layers 42''

With ''--ctx-size 52500 --n-gpu-layers 42'':

  ...
  NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  ...
  print_info: n_ctx_train      = 131072
  print_info: n_embd           = 3840
  print_info: n_embd_inp       = 3840
  print_info: n_layer          = 48
  print_info: n_head           = 16
  print_info: n_head_kv        = 8
  print_info: n_rot            = 256
  print_info: n_swa            = 1024
  print_info: is_swa_any       = 1
  print_info: n_embd_head_k    = 256
  print_info: n_embd_head_v    = 256
  print_info: n_gqa            = 2
  print_info: n_embd_k_gqa     = 2048
  print_info: n_embd_v_gqa     = 2048
  print_info: f_norm_eps       = 0.0e+00
  print_info: f_norm_rms_eps   = 1.0e-06
  print_info: f_clamp_kqv      = 0.0e+00
  print_info: f_max_alibi_bias = 0.0e+00
  print_info: f_logit_scale    = 0.0e+00
  print_info: f_attn_scale     = 6.2e-02
  print_info: n_ff             = 15360
  print_info: n_expert         = 0
  print_info: n_expert_used    = 0
  print_info: n_expert_groups  = 0
  print_info: n_group_used     = 0
  print_info: causal attn      = 1
  print_info: pooling type     = 0
  print_info: rope type        = 2
  print_info: rope scaling     = linear
  print_info: freq_base_train  = 1000000.0
  print_info: freq_scale_train = 0.125
  print_info: n_ctx_orig_yarn  = 131072
  print_info: rope_finetuned   = unknown
  print_info: model type       = 12B
  print_info: model params     = 11.77 B
  print_info: general.name     = gemma-3-12b-it-codeforces-SFT
  print_info: vocab type       = SPM
  print_info: n_vocab          = 262208
  print_info: n_merges         = 0
  ...
  print_info: max token length = 48
  ...
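The KV-cache sizes in the load log below can be reproduced from the dimensions printed above (n_embd_k_gqa = 2048, f16 = 2 bytes per value): with ''--n-gpu-layers 42'', 8 of the 48 layers use the full 52736-cell non-SWA cache. Summing the CUDA0 buffers reported in the log also shows why 42 layers is near the 12 GiB limit of the card:

```python
# Reproduce the non-SWA K-cache size from the model dimensions in the log:
n_ctx, n_embd_k_gqa, f16_bytes, full_attn_layers = 52736, 2048, 2, 8

k_cache_mib = n_ctx * n_embd_k_gqa * f16_bytes * full_attn_layers / 2**20
print(k_cache_mib)  # matches "K (f16): 1648.00 MiB" in the log below

# Sum of the CUDA0 buffers reported in the log
# (model weights + non-SWA KV + SWA KV + compute buffer):
cuda0_mib = 6327.03 + 2884.00 + 1260.00 + 1307.32
print(round(cuda0_mib / 1024, 1))  # ~11.5 GiB on a 12 GiB card
```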
  load_tensors: offloading 42 repeating layers to GPU
  load_tensors: offloaded 42/49 layers to GPU
  load_tensors:   CPU_Mapped model buffer size = 1720.59 MiB
  load_tensors:        CUDA0 model buffer size = 6327.03 MiB
  llama_context: constructing llama_context
  llama_context: n_seq_max   = 4
  llama_context: n_ctx       = 52736
  llama_context: n_ctx_seq   = 52736
  llama_context: n_batch     = 2048
  llama_context: n_ubatch    = 512
  llama_context: causal_attn = 1
  llama_context: flash_attn  = auto
  llama_context: kv_unified  = true
  llama_context: freq_base   = 1000000.0
  llama_context: freq_scale  = 0.125
  llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
  llama_context: CPU output buffer size = 4.00 MiB
  llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
  llama_kv_cache:   CPU KV buffer size = 412.00 MiB
  llama_kv_cache: CUDA0 KV buffer size = 2884.00 MiB
  llama_kv_cache: size = 3296.00 MiB ( 52736 cells,  8 layers, 4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
  llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
  llama_kv_cache:   CPU KV buffer size = 180.00 MiB
  llama_kv_cache: CUDA0 KV buffer size = 1260.00 MiB
  llama_kv_cache: size = 1440.00 MiB (  4608 cells, 40 layers, 4/1 seqs), K (f16): 720.00 MiB, V (f16): 720.00 MiB
  llama_context: Flash Attention was auto, set to enabled
  llama_context: CUDA0 compute buffer size = 1307.32 MiB
  llama_context: CUDA_Host compute buffer size = 120.02 MiB
  llama_context: graph nodes  = 1929
  llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)

===== PNY OC 16 GB GeForce RTX 5060 Ti =====

==== Qwen2.5-coder-7b-instruct-q5_k_m ====

  $ ./llama.cpp/build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  | model                          |       size |     params | backend    | ngl |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         73.54 ± 0.01 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         73.32 ± 0.40 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         71.80 ± 0.61 |
  build: 3f3a4fb9c (7130)

==== Stability ====

To reset the NVIDIA and CUDA modules:

  $ sudo rmmod nvidia_uvm nvidia

Models involved:

  * Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
  * Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
    * CUDA0 model buffer size = 7605.33 MiB
    * CUDA0 compute buffer size = 258.50 MiB

After two months of retries with all kinds of GRUB and modprobe configurations, with help from forums and assistants (Claude, ChatGPT, LeChat), a possible solution appeared [[https://github.com/NVIDIA/open-gpu-kernel-modules/issues/974|in this ticket]]: force the PCIe link to "Gen 3".

  # Get the PCI address "0000:05:00.0" of the RTX:
  lspci | grep -i nvidia

  sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
  LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
  LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
  LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

  sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

  sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
  LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
  LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
  LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

**But no**: it worked fine with ''llama-bench'', yet not with Yolo:

  kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
  kernel: NVRM: GPU Board Serial Number: 0
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
  kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead