====== GPU Bench ======

  * [[https://blogs.nvidia.com/blog/tag/rtx-ai-garage/|RTX AI Garage]] on the NVIDIA blog
  * Gigabyte Windforce OC 12GB Geforce RTX 3060, **€354 incl. VAT**, new, 2025-11
  * PNY OC 16 GB Geforce RTX 5060 Ti, **€450 incl. VAT**, new, 2025-11

AI benchmark for [[https://lab.cyrille.giquello.fr/Anticor/graphLmExtract.html|name extraction]]:
  * with the Mistral service, Codestral model = ''00d 01h 02m 48s''
  * RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0 model = ''02d 16h 11m 34s''

According to LeChat:

^ Graphics card ^ TOPS (INT8) ^ TOPS (FP16) ^ Architecture ^
| RTX 3060 (12 GB) | ~120 TOPS | ~60 TOPS | Ampere |
| RTX 5060 Ti (16 GB) | ~759 TOPS | ~380 TOPS | Blackwell |

llama.cpp bench:
  * Text generation: tg128, tg256, tg512: ''-p 0 -n 128,256,512''
  * Prompt processing: b128, b256, b512: ''-p 1024 -n 0 -b 128,256,512''

^ models ^ test ^ tokens/second ^^^^
^ ^ ^ i7-1360P ^ i7-1360P SYCL ^ RTX 3060 ^ RTX 5060 Ti ^
| Qwen2.5-coder-7b-instruct-q5_k_m | tg128 | 5.47 | | 57.65 | 73.54 |
| //size: 5.07 GiB// | tg256 | ... | | 57.61 | 73.32 |
| | tg512 | ... | | 56.24 | 71.80 |
| | b128 | ... | | 1825.17 | 2840.57 |
| | b256 | ... | | 1924.10 | 3209.52 |
| | b512 | ... | | 1959.18 | 3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 | tg128 | ... | | 41.42 | 50.33 |
| //size: 7.54 GiB// | tg256 | ... | | 41.38 | 50.33 |
| | tg512 | ... | | 40.70 | 49.62 |
| | b128 | 13.98 | 36.34 | 1952.96 | 2972.52 |
| | b256 | ... | 42.28 | 2054.09 | 3460.41 |
| | b512 | ... | 45.99 | 2093.21 | 3511.29 |
| EuroLLM-9B-Instruct-Q4_0 | tg128 | ... | | 56.06 | 71.41 |
| //size: 4.94 GiB// | tg256 | ... | | 55.96 | 71.15 |
| | tg512 | ... | | 53.87 | 69.45 |
| | b128 | ... | | 1433.95 | CUDA error |
| | b256 | ... | | 1535.06 | ... |
| | b512 | ... | | 1559.88 | ... |
| Qwen3-14B-UD-Q5_K_XL | tg128 | ... | | 30.00 | 37.66 |
| //size: 9.82 GiB// | tg256 | ... | | 29.97 | 38.17 |
| | tg512 | ... | | 29.25 | 37.30 |
| | b128 | ... | | 903.97 | CUDA error |
| | b256 | ... | | 951.71 | ... |
| | b512 | ... | | 963.76 | ... |
| Qwen3-4B-UD-Q8_K_XL | tg128 | 7.37 | | 56.35 | ... |
| //size: 4.70 GiB// | tg256 | 6.63 | | 56.35 | ... |
| | tg512 | 6.24 | | 54.56 | ... |
| | b128 | 20.66 | | 2163.17 | ... |
| | b256 | ... | | 2405.27 | ... |
| | b512 | ... | | 2495.35 | ... |
| GemmaCoder3-12B-IQ4_NL.gguf | tg128 | ... | | 40.70 | ... |
| //size: 6.41 GiB// | tg256 | ... | | 40.67 | ... |
| | tg512 | ... | | 39.54 | ... |
| | b128 | ... | | 1150.11 | ... |
| | b256 | ... | | 1218.27 | ... |
| | b512 | ... | | 1253.92 | ... |
| Gemma3-Code-Reasoning-4B.Q8_0 | tg128 | ... | | 66.98 | ... |
| //size: 3.84 GiB// | tg256 | ... | | 66.95 | ... |
| | tg512 | ... | | 65.75 | ... |
| | b128 | ... | | 2885.80 | ... |
| | b256 | ... | | 3266.87 | ... |
| | b512 | ... | | 3457.03 | ... |
| GemmaCoder3-12B-Q5_K_M | tg128 | ... | | 34.10 | ... |
| //size: 7.86 GiB// | tg256 | ... | | 34.06 | ... |
| | tg512 | ... | | 33.28 | ... |
| | b128 | ... | | 1045.27 | ... |
| | b256 | ... | | 1108.95 | ... |
| | b512 | ... | | 1144.97 | ... |
| gpt-oss 20B MXFP4 MoE | tg128 | ... | | 92.86 | ... |
| gpt-oss-20b-mxfp4.gguf | tg256 | ... | | 92.69 | ... |
| //size: 11.27 GiB// | tg512 | ... | | 88.17 | ... |
| | b128 | ... | | 1036.08 | ... |
| | b256 | ... | | 1452.01 | ... |
| | b512 | ... | | 1744.71 | ... |
| gpt-oss 20B Q4_K - Medium | tg128 | ... | | 98.05 | ... |
| gpt-oss-20b-UD-Q4_K_XL.gguf | tg256 | ... | | 97.20 | ... |
| //size: 11.04 GiB// | tg512 | ... | | 92.43 | ... |
| | b128 | ... | | 1034.15 | ... |
| | b256 | ... | | 1450.77 | ... |
| | b512 | ... | | 1734.35 | ... |

  * The "CUDA error" entries occur with the RTX 5060 Ti behind the PCIe/THB4 bridge "Wikingoo L17" and NVIDIA driver 580.
  * On the CPU, leave the thread count on automatic: the physical cores are the ones used. Forcing more threads lowers performance.
    * Multi-threading across the physical cores does help. E.g. 7.37 t/s in auto versus 3.39 t/s with 1 thread.

===== Intel® Core™ i7-1360P 13th Gen =====

For comparison ...

**Qwen2.5-coder-7b-instruct-q5_k_m**:

  ./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
  load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
  load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |

===== Gigabyte Windforce OC 12GB Geforce RTX 3060 =====

{{ :informatique:ai_coding:ia_rtx_3060_small.jpg?direct&400|}}

With ''sudo nsys-ui'':

^ NVIDIA GeForce RTX 3060 ^^
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |

With llama.cpp and CUDA 12.9.
==== Qwen2.5-coder-7b-instruct-q5_k_m ====

  ./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  | model                          |       size |     params | backend    | ngl |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
  | qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |

==== GemmaCoder3-12B-Q5_K_M ====

To run ''llama-server'' with the "GemmaCoder3-12B-Q5_K_M.gguf" model (8.4 GB file, 49 layers) at its **maximum context of "131072", using ''--ctx-size 0'' instead of the default "4096"**, some layers have to be offloaded to the CPU; otherwise loading fails with ''main: error: unable to load model''. The same applies to ''llama-cli''.
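The memory pressure behind this can be estimated from the KV-cache geometry. The following is a minimal sketch, not llama.cpp's own accounting: it assumes an f16 cache (2 bytes per element) and takes the layer counts and ''n_embd_k_gqa'' / ''n_embd_v_gqa'' values from the load log of this model shown later in this section. The function name ''kv_cache_mib'' and the constants are illustrative.

```python
# Rough KV-cache size estimate, assuming an f16 cache (2 bytes per element)
# and one K tensor plus one V tensor per layer. Illustrative sketch only.

def kv_cache_mib(cells, layers, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return the KV-cache size in MiB for `cells` cached positions."""
    total_bytes = cells * layers * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
    return total_bytes / (1024 * 1024)

# Geometry printed by llama.cpp for GemmaCoder3-12B (values copied from the log).
N_EMBD_K_GQA = 2048
N_EMBD_V_GQA = 2048
NON_SWA_LAYERS = 8   # full-attention layers: cache grows with the context size
SWA_LAYERS = 40      # sliding-window layers: cache size does not follow the context

if __name__ == "__main__":
    for ctx in (4096, 52736, 131072):
        full = kv_cache_mib(ctx, NON_SWA_LAYERS, N_EMBD_K_GQA, N_EMBD_V_GQA)
        print(f"ctx={ctx:>6}: non-SWA KV cache = {full:8.2f} MiB")
    # SWA cache for the 4608 cells that llama-server allocated in the log.
    swa = kv_cache_mib(4608, SWA_LAYERS, N_EMBD_K_GQA, N_EMBD_V_GQA)
    print(f"SWA KV cache (4608 cells) = {swa:.2f} MiB")
```

For 52736 cells this gives 3296 MiB, matching the figure in the log. At the full 131072-token context the non-SWA cache alone comes to 8192 MiB, which together with 7.86 GiB of weights does not fit in the 11.63 GiB of the RTX 3060; that is consistent with having to move layers to the CPU.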
^ n-gpu-layers ^ test ^ tokens/s ^ time ^ % perf loss ^
| (all) 49 | tg128 | 34.15 | 0m25.904s | 0.00% |
| | b128 | 1041.60 | 0m13.117s | 0.00% |
| 44 | tg128 | 15.55 | 0m48.049s | 54.47% |
| | b128 | 279.26 | 0m28.613s | 73.19% |
| 39 | tg128 | 10.74 | 1m07.555s | 68.55% |
| | b128 | 150.49 | 0m46.996s | 85.55% |
| 30 | tg128 | 6.83 | 1m42.221s | 80.01% |
| | b128 | 82.91 | 1m19.729s | 92.04% |
| CPU only | tg128 | 3.12 | 3m28.308s | 90.86% |
| | b128 | 4.50 | 22m37.674s | 99.57% |

Settings that allow this model to load:
  * ''llama-cli'':
    * with its maximum context of 131072, 30 layers fit on the GPU: ''--n-gpu-layers 30'', i.e. an 80% performance loss
    * ''--ctx-size 70000 --n-gpu-layers 41''
    * and with all layers on the GPU: ''--ctx-size 42000''
  * ''llama-server'':
    * ''--ctx-size 40000 --n-gpu-layers 44''
    * ''--ctx-size 43500 --n-gpu-layers 43''
    * ''--ctx-size 52500 --n-gpu-layers 42''

With ''--ctx-size 52500 --n-gpu-layers 42'':

  ...
  NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  ...
  print_info: n_ctx_train = 131072
  print_info: n_embd = 3840
  print_info: n_embd_inp = 3840
  print_info: n_layer = 48
  print_info: n_head = 16
  print_info: n_head_kv = 8
  print_info: n_rot = 256
  print_info: n_swa = 1024
  print_info: is_swa_any = 1
  print_info: n_embd_head_k = 256
  print_info: n_embd_head_v = 256
  print_info: n_gqa = 2
  print_info: n_embd_k_gqa = 2048
  print_info: n_embd_v_gqa = 2048
  print_info: f_norm_eps = 0.0e+00
  print_info: f_norm_rms_eps = 1.0e-06
  print_info: f_clamp_kqv = 0.0e+00
  print_info: f_max_alibi_bias = 0.0e+00
  print_info: f_logit_scale = 0.0e+00
  print_info: f_attn_scale = 6.2e-02
  print_info: n_ff = 15360
  print_info: n_expert = 0
  print_info: n_expert_used = 0
  print_info: n_expert_groups = 0
  print_info: n_group_used = 0
  print_info: causal attn = 1
  print_info: pooling type = 0
  print_info: rope type = 2
  print_info: rope scaling = linear
  print_info: freq_base_train = 1000000.0
  print_info: freq_scale_train = 0.125
  print_info: n_ctx_orig_yarn = 131072
  print_info: rope_finetuned = unknown
  print_info: model type = 12B
  print_info: model params = 11.77 B
  print_info: general.name = gemma-3-12b-it-codeforces-SFT
  print_info: vocab type = SPM
  print_info: n_vocab = 262208
  print_info: n_merges = 0
  ...
  print_info: max token length = 48
  ...
  load_tensors: offloading 42 repeating layers to GPU
  load_tensors: offloaded 42/49 layers to GPU
  load_tensors: CPU_Mapped model buffer size = 1720.59 MiB
  load_tensors: CUDA0 model buffer size = 6327.03 MiB
  llama_context: constructing llama_context
  llama_context: n_seq_max = 4
  llama_context: n_ctx = 52736
  llama_context: n_ctx_seq = 52736
  llama_context: n_batch = 2048
  llama_context: n_ubatch = 512
  llama_context: causal_attn = 1
  llama_context: flash_attn = auto
  llama_context: kv_unified = true
  llama_context: freq_base = 1000000.0
  llama_context: freq_scale = 0.125
  llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
  llama_context: CPU output buffer size = 4.00 MiB
  llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
  llama_kv_cache: CPU KV buffer size = 412.00 MiB
  llama_kv_cache: CUDA0 KV buffer size = 2884.00 MiB
  llama_kv_cache: size = 3296.00 MiB ( 52736 cells, 8 layers, 4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
  llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
  llama_kv_cache: CPU KV buffer size = 180.00 MiB
  llama_kv_cache: CUDA0 KV buffer size = 1260.00 MiB
  llama_kv_cache: size = 1440.00 MiB ( 4608 cells, 40 layers, 4/1 seqs), K (f16): 720.00 MiB, V (f16): 720.00 MiB
  llama_context: Flash Attention was auto, set to enabled
  llama_context: CUDA0 compute buffer size = 1307.32 MiB
  llama_context: CUDA_Host compute buffer size = 120.02 MiB
  llama_context: graph nodes = 1929
  llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)

===== PNY OC 16 GB Geforce RTX 5060 Ti =====

==== With real PCIe ✅ ====

On a real tower PC with PCIe x16 and an Intel(R) Core(TM) Ultra 7 270K Plus.
**The environment and build options are sensitive** for llama.cpp:
  * https://github.com/ggml-org/llama.cpp/issues/23546#issuecomment-4662239477

^ Model ^ params ^ GPU offload ^ Prompt (t/s) ^ Eval (t/s) ^ Total (ms) ^ Generated tokens ^ Graphs reused ^
| Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL | 24B | 17/41 | 427.81 – 545.85 | 0.80 – 3.19 | 123,500 – 568,458 | 9,629 – 47,241 | 0 |
| Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL | 30B | 49/49 | 590.38 – 591.76 | 28.64 – 30.06 | 4,715 – 12,818 | 19,919 – 22,804 | 294 – 530 |
| Qwen3-Coder-Next-UD-Q4_K_XL | 80B | 49/49 | 29.00 – 400.09 | 18.68 – 32.44 | 25,057 – 87,659 | 719 – 43,214 | 10 – 1,024 |
| DeepSeek-R1-Distill-Qwen-32B-Q4_K_M | 32B | 24/65 | 88.97 – 428.81 | 2.14 – 2.32 | 116,052 – 189,566 | 925 – 3,397 | 228 – 419 |
| DeepSeek-R1-Distill-Qwen-14B-Q8_0 | 14B | 24/49 | 225.55 – 775.01 | 4.10 – 4.13 | 81,383 – 147,476 | 1,307 – 3,858 | 313 – 582 |

=== gpt-oss-20b-UD-Q4_K_XL ===

  $ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
  ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model                     |       size |     params | backend | ngl |    test |            t/s |
  | ------------------------- | ---------: | ---------: | ------- | --: | ------: | -------------: |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg128 |  155.79 ± 0.21 |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg256 |  155.81 ± 0.03 |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg512 |  155.15 ± 0.01 |
  build: e25a32e98 (9584)

  $ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512
  ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model                     |       size |     params | backend | ngl | n_batch |    test |             t/s |
  | ------------------------- | ---------: | ---------: | ------- | --: | ------: | ------: | --------------: |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |     128 |  pp1024 | 3308.23 ± 19.28 |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |     256 |  pp1024 | 4792.27 ± 39.25 |
  | gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |     512 |  pp1024 | 6048.13 ± 32.16 |
  build: e25a32e98 (9584)

=== Qwen2.5-coder-7b-instruct-q8_0 ===

  $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 0 -n 128,256,512
  ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model            |       size |     params | backend   | ngl |        test |               t/s |
  | ---------------- | ---------: | ---------: | --------- | --: | ----------: | ----------------: |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg128 |      54.23 ± 0.02 |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg256 |      54.23 ± 0.00 |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg512 |      54.12 ± 0.00 |
  build: e25a32e98 (9584)

  $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 1024 -n 0 -b 128,256,512
  ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model            |       size |     params | backend   | ngl | n_batch |      test |              t/s |
  | ---------------- | ---------: | ---------: | --------- | --: | ------: | --------: | ---------------: |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     128 |    pp1024 |   3746.31 ± 4.80 |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     256 |    pp1024 |   4174.39 ± 0.45 |
  | qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     512 |    pp1024 |   4354.18 ± 5.39 |
  build: e25a32e98 (9584)

=== Qwen2.5-coder-14b-instruct-q5_k_m ===

  $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-14b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
  ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model                   |       size |   params | backend | ngl |     test |             t/s |
  | ----------------------- | ---------: | -------: | ------- | --: | -------: | --------------: |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg128 |    39.54 ± 0.02 |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg256 |    39.53 ± 0.01 |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg512 |    39.38 ± 0.01 |
  build: e25a32e98 (9584)

    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  | model                   |       size |   params | backend | ngl | n_batch |    test |             t/s |
  | ----------------------- | ---------: | -------: | ------- | --: | ------: | ------: | --------------: |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     128 |  pp1024 |  1835.16 ± 1.69 |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     256 |  pp1024 |  1967.12 ± 1.01 |
  | qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     512 |  pp1024 |  1995.02 ± 0.84 |
  build: e25a32e98 (9584)

=== gemma-4-26B-A4B-it-qat-UD-Q4_K_XL ===

  prompt eval time = 318.17 ms / 165 tokens ( 1.93 ms per token, 518.59 tokens per second)
  eval time = 1338.88 ms / 86 tokens ( 15.57 ms per token, 64.23 tokens per second)
  total time = 1657.05 ms / 251 tokens
  graphs reused = 1916
  stop processing: n_tokens = 20931, truncated = 0

  prompt eval time = 3143.73 ms / 4850 tokens ( 0.65 ms per token, 1542.75 tokens per second)
  eval time = 31502.45 ms / 1854 tokens ( 16.99 ms per token, 58.85 tokens per second)
  total time = 34646.18 ms / 6704 tokens
  graphs reused = 3762
  stop processing: n_tokens = 27604, truncated = 0

=== Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ===

  $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -p 0 -n 128,256,512
  llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf'

  exec llama-server \
    -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 --port 8012 \
    --verbosity $VERBOSITY \
    --threads-http 2 \
    --flash-attn on \
    --no-mmap \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --jinja \
    -c 96000

  common_params_print_info: build 9584 (e25a32e98) with GNU 15.2.0 for Linux x86_64
  log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
  device_info:
    - CUDA0 : NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
    - CPU : Intel(R) Core(TM) Ultra 7 270K Plus (93508 MiB, 93508 MiB free)
  system_info: n_threads = 4 (n_threads_batch = 4) / 24 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
  srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
  ...
  common_params_fit_impl: memory for test allocation by device:
  common_params_fit_impl: id=0, n_layer=49, n_part=24, overflow_type=3, mem= 14787 MiB
  common_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers (24 overflowing), 14678 MiB used, 1034 MiB free
  common_fit_params: successfully fit params to free device memory
  common_fit_params: fitting params to free memory took 6.76 seconds
  llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
  ...
  load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
  load_tensors: offloading output layer to GPU
  load_tensors: offloading 47 repeating layers to GPU
  load_tensors: offloaded 49/49 layers to GPU
  load_tensors: CPU model buffer size = 166.92 MiB
  load_tensors: CUDA0 model buffer size = 9585.43 MiB
  load_tensors: CUDA_Host model buffer size = 7939.00 MiB
  ...
  llama_context: n_ctx_seq (96000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
  llama_context: CUDA_Host output buffer size = 2.32 MiB
  llama_kv_cache: CUDA0 KV buffer size = 4781.25 MiB
  llama_kv_cache: size = 4781.25 MiB ( 96000 cells, 48 layers, 4/1 seqs), K (q8_0): 2390.62 MiB, V (q8_0): 2390.62 MiB
  ...
  sched_reserve: resolving fused Gated Delta Net support:
  sched_reserve: fused Gated Delta Net (autoregressive) enabled
  sched_reserve: fused Gated Delta Net (chunked) enabled
  sched_reserve: CUDA0 compute buffer size = 311.34 MiB
  sched_reserve: CUDA_Host compute buffer size = 101.84 MiB
  sched_reserve: graph nodes = 3606
  sched_reserve: graph splits = 70 (with bs=512), 50 (with bs=1)
  ...
  srv load_model: prompt cache is enabled, size limit: 8192 MiB
  ...
  srv init: init: chat template, thinking = 0
  srv llama_server: model loaded
  srv llama_server: server is listening on http://0.0.0.0:8012
  srv update_slots: all slots are idle

  $ nvidia-smi
  +-----------------------------------------------------------------------------------------+
  | NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
  +-----------------------------------------+------------------------+----------------------+
  | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
  |                                         |                        |               MIG M. |
  |=========================================+========================+======================|
  |   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
  |  0%   29C    P8              6W /  180W |   14856MiB /  16311MiB |      0%      Default |
  |                                         |                        |                  N/A |
  +-----------------------------------------+------------------------+----------------------+
  +-----------------------------------------------------------------------------------------+
  | Processes:                                                                              |
  |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
  |        ID   ID                                                               Usage      |
  |=========================================================================================|
  |    0   N/A  N/A            2643      C   ...ma.cpp/build/bin/llama-server      14830MiB |
  +-----------------------------------------------------------------------------------------+

==== Instability with eGPU 😩 ====

Reset NVIDIA and CUDA:

  $ sudo rmmod nvidia_uvm nvidia

  * Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
  * Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
    * CUDA0 model buffer size = 7605.33 MiB
    * CUDA0 compute buffer size = 258.50 MiB
  * After 2 months of retries with all kinds of grub and modprobe configurations, with help from forums and assistants (Claude, ChatGPT, LeChat), a solution turned up [[https://github.com/NVIDIA/open-gpu-kernel-modules/issues/974|in this issue]]: force the PCI link to "Gen 3"

  # To get the PCI address "0000:05:00.0" of the RTX:
  lspci | grep -i nvidia

  sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
  LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
  LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
  LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

  sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

  sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
  LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
  LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
  LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

**But no**: it worked fine with ''llama-bench'', but not with YOLO: 😩

  kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
  kernel: NVRM: GPU Board Serial Number: 0
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
  kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
  kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead
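The link check done by hand earlier in this section, comparing ''LnkCap'' (what the port supports) with ''LnkSta'' (what was negotiated), can be scripted to re-check the eGPU link after each change. This is only a sketch: ''parse_link'', ''link_is_degraded'' and ''SAMPLE'' are illustrative names, the sample text is copied from the ''lspci'' output above, and in practice the input would be the output of ''sudo lspci -vvv -s <address>''.

```python
import re

# One match per "LnkCap:" / "LnkSta:" line. Requiring ":" right after the
# name keeps the "LnkCap2:" and "LnkSta2:" lines out.
LINK_RE = re.compile(
    r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s(?:\s+\((\w+)\))?"
    r",\s+Width\s+x(\d+)(?:\s+\((\w+)\))?"
)

def parse_link(lspci_text):
    """Extract speed (GT/s), width and downgrade flag for LnkCap and LnkSta."""
    info = {}
    for name, speed, speed_note, width, width_note in LINK_RE.findall(lspci_text):
        info[name] = {
            "speed_gts": float(speed),
            "width": int(width),
            "downgraded": "downgraded" in (speed_note, width_note),
        }
    return info

def link_is_degraded(info):
    """True when the negotiated link is slower or narrower than the capability."""
    cap, sta = info["LnkCap"], info["LnkSta"]
    return sta["speed_gts"] < cap["speed_gts"] or sta["width"] < cap["width"]

# Sample copied from the lspci output shown in this section.
SAMPLE = """\
LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
"""

if __name__ == "__main__":
    link = parse_link(SAMPLE)
    for name, values in link.items():
        print(name, values)
    print("degraded:", link_is_degraded(link))
```

A degraded link is expected behind a Thunderbolt bridge, so the flag by itself is not a fault; the useful part is seeing which speed was actually negotiated after a ''setpci'' change.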