GPU Bench

  • Gigabyte Windforce OC 12GB GeForce RTX 3060, €354 incl. VAT, new, 2025-11
  • PNY OC 16 GB GeForce RTX 5060 Ti, €450 incl. VAT, new, 2025-11

AI benchmark for name extraction:

  • with the Mistral service, Codestral model = 00d 01h 02m 48s
  • RTX3060 + Intel-i7, granite-4.0-h-small-Q8_0 model = 02d 16h 11m 34s
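
To put the two durations above side by side, here is a minimal sketch that converts them to seconds and prints the ratio. The helper name parse_duration is made up for this illustration; it accepts both the "d" and the French "j" day suffix.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a 'DDd HHh MMm SSs' duration (or 'DDj ...') into seconds."""
    match = re.fullmatch(r"(\d+)[dj] (\d+)h (\d+)m (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


codestral = parse_duration("00d 01h 02m 48s")  # Mistral service, Codestral
granite = parse_duration("02d 16h 11m 34s")    # RTX 3060 + Intel i7, granite
print(f"local run took {granite / codestral:.1f}x longer")
```

The two runs use different models, so the ratio compares the two setups as a whole, not the hardware alone.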

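According to LeChat:

| Graphics card       | TOPS (INT8) | TOPS (FP16) | Architecture |
| ------------------- | ----------: | ----------: | ------------ |
| RTX 3060 (12 GB)    |   ~120 TOPS |    ~60 TOPS | Ampere       |
| RTX 5060 Ti (16 GB) |   ~759 TOPS |   ~380 TOPS | Blackwell    |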

llama.cpp benchmark:

  • Text generation: tg128, tg256, tg512 : -p 0 -n 128,256,512
  • Prompt processing: b128, b256, b512 : -p 1024 -n 0 -b 128,256,512
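
As a small sketch, the two test families above can be assembled as argument lists for a given model file. The function name bench_commands, the default binary path and the file name model.gguf are placeholders for this illustration; only the -m, -p, -n and -b flags come from this page.

```python
import shlex


def bench_commands(model_path: str, binary: str = "./llama-bench") -> list[list[str]]:
    """Return the text-generation and prompt-processing command lines."""
    # tg128, tg256, tg512: no prompt, generate 128/256/512 tokens
    text_generation = [binary, "-m", model_path, "-p", "0", "-n", "128,256,512"]
    # b128, b256, b512: 1024-token prompt, no generation, three batch sizes
    prompt_processing = [binary, "-m", model_path,
                         "-p", "1024", "-n", "0", "-b", "128,256,512"]
    return [text_generation, prompt_processing]


for argv in bench_commands("model.gguf"):
    print(shlex.join(argv))
```
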
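
Results in tokens/second:

| model                            | test  | i7-1360P | i7-1360P SYCL | RTX 3060 | RTX 5060 Ti |
| -------------------------------- | ----- | -------: | ------------: | -------: | ----------: |
| Qwen2.5-coder-7b-instruct-q5_k_m | tg128 |     5.47 |               |    57.65 |       73.54 |
| size: 5.07 GiB                   | tg256 |          |               |    57.61 |       73.32 |
|                                  | tg512 |          |               |    56.20 |       71.80 |
|                                  | b128  |          |               |  1825.17 |     2840.57 |
|                                  | b256  |          |               |  1924.10 |     3209.52 |
|                                  | b512  |          |               |  1959.18 |     3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0   | tg128 |          |               |    41.42 |       50.33 |
| size: 7.54 GiB                   | tg256 |          |               |    41.38 |       50.33 |
|                                  | tg512 |          |               |    40.70 |       49.62 |
|                                  | b128  |    13.98 |         36.34 |  1952.96 |     2972.52 |
|                                  | b256  |          |         42.28 |  2054.09 |     3460.41 |
|                                  | b512  |          |         45.99 |  2093.21 |     3511.29 |
| EuroLLM-9B-Instruct-Q4_0         | tg128 |          |               |    56.06 |       71.41 |
| size: 4.94 GiB                   | tg256 |          |               |    55.96 |       71.15 |
|                                  | tg512 |          |               |    53.87 |       69.45 |
|                                  | b128  |          |               |  1433.95 |  CUDA error |
|                                  | b256  |          |               |  1535.06 |             |
|                                  | b512  |          |               |  1559.88 |             |
| Qwen3-14B-UD-Q5_K_XL             | tg128 |          |               |    30.00 |       37.66 |
| size: 9.82 GiB                   | tg256 |          |               |    29.97 |       38.17 |
|                                  | tg512 |          |               |    29.25 |       37.30 |
|                                  | b128  |          |               |   903.97 |  CUDA error |
|                                  | b256  |          |               |   951.71 |             |
|                                  | b512  |          |               |   963.76 |             |
| Qwen3-4B-UD-Q8_K_XL              | tg128 |     7.37 |               |    56.35 |             |
| size: 4.70 GiB                   | tg256 |     6.63 |               |    56.35 |             |
|                                  | tg512 |     6.24 |               |    54.56 |             |
|                                  | b128  |    20.66 |               |  2163.17 |             |
|                                  | b256  |          |               |  2405.27 |             |
|                                  | b512  |          |               |  2495.35 |             |
| GemmaCoder3-12B-IQ4_NL.gguf      | tg128 |          |               |    40.70 |             |
| size: 6.41 GiB                   | tg256 |          |               |    40.67 |             |
|                                  | tg512 |          |               |    39.54 |             |
|                                  | b128  |          |               |  1150.11 |             |
|                                  | b256  |          |               |  1218.27 |             |
|                                  | b512  |          |               |  1253.92 |             |
| Gemma3-Code-Reasoning-4B.Q8_0    | tg128 |          |               |    66.98 |             |
| size: 3.84 GiB                   | tg256 |          |               |    66.95 |             |
|                                  | tg512 |          |               |    65.75 |             |
|                                  | b128  |          |               |  2885.80 |             |
|                                  | b256  |          |               |  3266.87 |             |
|                                  | b512  |          |               |  3457.03 |             |
| GemmaCoder3-12B-Q5_K_M           | tg128 |          |               |    34.10 |             |
| size: 7.86 GiB                   | tg256 |          |               |    34.06 |             |
|                                  | tg512 |          |               |    33.28 |             |
|                                  | b128  |          |               |  1045.27 |             |
|                                  | b256  |          |               |  1108.95 |             |
|                                  | b512  |          |               |  1144.97 |             |
| gpt-oss 20B MXFP4 MoE            | tg128 |          |               |    92.86 |             |
| gpt-oss-20b-mxfp4.gguf           | tg256 |          |               |    92.69 |             |
| size: 11.27 GiB                  | tg512 |          |               |    88.17 |             |
|                                  | b128  |          |               |  1036.08 |             |
|                                  | b256  |          |               |  1452.01 |             |
|                                  | b512  |          |               |  1744.71 |             |
| gpt-oss 20B Q4_K - Medium        | tg128 |          |               |    98.05 |             |
| gpt-oss-20b-UD-Q4_K_XL.gguf      | tg256 |          |               |    97.20 |             |
| size: 11.04 GiB                  | tg512 |          |               |    92.43 |             |
|                                  | b128  |          |               |  1034.15 |             |
|                                  | b256  |          |               |  1450.77 |             |
|                                  | b512  |          |               |  1734.35 |             |
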
  • The “CUDA error” entries occur with the RTX 5060 Ti connected through the “Wikingoo L17” PCIe/THB4 bridge, using NVIDIA driver 580.
  • On CPU, leave the thread count on automatic: the physical cores are the ones used. Forcing more threads lowers performance.
    • Multi-threading across the physical cores does help. E.g. 7.37 t/s in automatic mode versus 3.39 t/s with 1 thread.

Intel® Core™ i7-1360P 13th Gen

For comparison …

Qwen2.5-coder-7b-instruct-q5_k_m:

./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |
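
To collect numbers from many runs, the pipe-delimited table printed by llama-bench can be parsed with a few lines of Python. This is only a sketch: it assumes the output layout shown above, and the function names parse_bench_table and mean_tps are made up for this illustration.

```python
def parse_bench_table(output: str) -> list[dict[str, str]]:
    """Turn the pipe-delimited llama-bench table into one dict per result row."""
    rows: list[dict[str, str]] = []
    header = None
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip backend loading messages and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row made of dashes and colons
        rows.append(dict(zip(header, cells)))
    return rows


def mean_tps(row: dict[str, str]) -> float:
    """Extract the mean from a 't/s' cell such as '5.47 ± 0.72'."""
    return float(row["t/s"].split("±")[0])


sample = """
load_backend: loaded CPU backend from libggml-cpu-alderlake.so
| model                  |     size |  params | backend | threads |  test |         t/s |
| ---------------------- | -------: | ------: | ------- | ------: | ----: | ----------: |
| qwen2 7B Q5_K - Medium | 5.07 GiB |  7.62 B | CPU     |       4 | tg128 | 5.47 ± 0.72 |
"""
for row in parse_bench_table(sample):
    print(row["test"], mean_tps(row))
```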

Gigabyte Windforce OC 12GB Geforce RTX 3060

With sudo nsys-ui:

NVIDIA GeForce RTX 3060
Chip Name GA104
SM Count 28
L2 Cache Size 2,25 MiB
Memory Bandwidth 335,32 GiB/s
Memory Size 11,63 GiB
Core Clock 1,79 GHz
Bus Location 0000:05:00.0
GSP firmware version 580.105.08
Video accelerator tracing Supported

With llama.cpp and CUDA 12.9.

Qwen2.5-coder-7b-instruct-q5_k_m

./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |

GemmaCoder3-12B-Q5_K_M

To run llama-server with the “GemmaCoder3-12B-Q5_K_M.gguf” model (8.4 GB file, 49 layers) at its maximum context of “131072” (--ctx-size 0 instead of the default “4096”), some layers must be offloaded to the CPU; otherwise loading fails with main: error: unable to load model. The same applies to llama-cli.

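| n-gpu-layers | test  | tokens/s | time       | % perf loss |
| ------------ | ----- | -------: | ---------: | ----------: |
| (all) 49     | tg128 |    34.15 |  0m25.904s |       0.00% |
|              | b128  |  1041.60 |  0m13.117s |       0.00% |
| 44           | tg128 |    15.55 |  0m48.049s |      54.47% |
|              | b128  |   279.26 |  0m28.613s |      73.19% |
| 39           | tg128 |    10.74 |  1m07.555s |      68.55% |
|              | b128  |   150.49 |  0m46.996s |      85.55% |
| 30           | tg128 |     6.83 |  1m42.221s |      80.01% |
|              | b128  |    82.91 |  1m19.729s |      92.04% |
| full CPU     | tg128 |     3.12 |  3m28.308s |      90.86% |
|              | b128  |     4.50 | 22m37.674s |      99.57% |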
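
The performance-loss column can be recomputed from the tokens/s values as one minus the ratio to the all-layers-on-GPU baseline. The sketch below does this for three of the rows; the function name perf_loss_percent is made up for this illustration, and the last digit can differ slightly from the table because the table's inputs are already rounded.

```python
def perf_loss_percent(baseline_tps: float, measured_tps: float) -> float:
    """Throughput lost relative to the baseline, as a percentage."""
    return (1.0 - measured_tps / baseline_tps) * 100.0


# Baseline: all 49 layers on the GPU.
BASELINE = {"tg128": 34.15, "b128": 1041.60}

# n-gpu-layers -> measured tokens/s, copied from the table above.
runs = {
    44: {"tg128": 15.55, "b128": 279.26},
    39: {"tg128": 10.74, "b128": 150.49},
    30: {"tg128": 6.83, "b128": 82.91},
}

for layers, results in runs.items():
    for test, tps in results.items():
        loss = perf_loss_percent(BASELINE[test], tps)
        print(f"{layers} layers, {test}: {loss:.2f}% loss")
```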

Settings that allow this model to load:

  • llama-cli:
    • with its maximum context of 131072, 30 layers fit on the GPU: --n-gpu-layers 30, i.e. an 80% performance loss
    • --ctx-size 70000 --n-gpu-layers 41
    • and with all layers on the GPU: --ctx-size 42000
  • llama-server:
    • --ctx-size 40000 --n-gpu-layers 44
    • --ctx-size 43500 --n-gpu-layers 43
    • --ctx-size 52500 --n-gpu-layers 42

With --ctx-size 52500 --n-gpu-layers 42:

...
NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
...
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_embd_inp       = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = gemma-3-12b-it-codeforces-SFT
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
...
print_info: max token length = 48
...
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1720.59 MiB
load_tensors:        CUDA0 model buffer size =  6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 52736
llama_context: n_ctx_seq     = 52736
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache:        CPU KV buffer size =   412.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells,   8 layers,  4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 4608 cells
llama_kv_cache:        CPU KV buffer size =   180.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1260.00 MiB
llama_kv_cache: size = 1440.00 MiB (  4608 cells,  40 layers,  4/1 seqs), K (f16):  720.00 MiB, V (f16):  720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =  1307.32 MiB
llama_context:  CUDA_Host compute buffer size =   120.02 MiB
llama_context: graph nodes  = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)
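
The KV cache sizes in this log can be reproduced from the printed parameters: each K (or V) buffer holds cells × layers × n_embd_k_gqa elements, at 2 bytes per element for f16. This is a minimal sketch using only values from the log above; the function name kv_tensor_mib is made up for this illustration, and the formula is inferred from the log lines, not taken from the llama.cpp source.

```python
MIB = 1024 * 1024


def kv_tensor_mib(cells: int, layers: int, n_embd_gqa: int,
                  bytes_per_element: float = 2.0) -> float:
    """Size in MiB of the K (or V) cache for one group of layers (f16 by default)."""
    return cells * layers * n_embd_gqa * bytes_per_element / MIB


# Values from the log: n_embd_k_gqa = n_embd_v_gqa = 2048, so K and V are equal.
non_swa_k = kv_tensor_mib(cells=52736, layers=8, n_embd_gqa=2048)
swa_k = kv_tensor_mib(cells=4608, layers=40, n_embd_gqa=2048)

print(f"non-SWA: K = {non_swa_k:.2f} MiB, K + V = {2 * non_swa_k:.2f} MiB")
print(f"SWA:     K = {swa_k:.2f} MiB, K + V = {2 * swa_k:.2f} MiB")
```

Only the 8 non-SWA layers scale with the full context; the 40 sliding-window layers keep a fixed 4608 cells, which is why the context size mainly drives the first buffer.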

PNY OC 16 GB GeForce RTX 5060 Ti

With real PCIe ✅

On a real tower PC with PCIe x16 and an Intel(R) Core(TM) Ultra 7 270K Plus.

Environment and sensible build for llama.cpp:

gpt-oss-20b-UD-Q4_K_XL

$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |           tg128 |        155.79 ± 0.21 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |           tg256 |        155.81 ± 0.03 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |           tg512 |        155.15 ± 0.01 |

build: e25a32e98 (9584)

$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |     128 |          pp1024 |      3308.23 ± 19.28 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |     256 |          pp1024 |      4792.27 ± 39.25 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | CUDA       |  -1 |     512 |          pp1024 |      6048.13 ± 32.16 |

build: e25a32e98 (9584)

Qwen2.5-coder-7b-instruct-q8_0

$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model            |       size |     params | backend   | ngl |        test |               t/s |
| ---------------- | ---------: | ---------: | --------- | --: | ----------: | ----------------: |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg128 |      54.23 ± 0.02 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg256 |      54.23 ± 0.00 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |       tg512 |      54.12 ± 0.00 |

build: e25a32e98 (9584)

$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-7b-instruct-q8_0.gguf -p 1024 -n 0 -b 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model            |       size |     params | backend   | ngl | n_batch |      test |              t/s |
| ---------------- | ---------: | ---------: | --------- | --: | ------: | --------: | ---------------: |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     128 |    pp1024 |   3746.31 ± 4.80 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     256 |    pp1024 |   4174.39 ± 0.45 |
| qwen2 7B Q8_0    |   7.54 GiB |     7.62 B | CUDA      |  -1 |     512 |    pp1024 |   4354.18 ± 5.39 |

build: e25a32e98 (9584)

Qwen2.5-coder-14b-instruct-q5_k_m

$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-14b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model                   |       size |   params | backend | ngl |     test |             t/s |
| ----------------------- | ---------: | -------: | ------- | --: | -------: | --------------: |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg128 |    39.54 ± 0.02 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg256 |    39.53 ± 0.01 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |    tg512 |    39.38 ± 0.01 |

build: e25a32e98 (9584)

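$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen2.5-coder-14b-instruct-q5_k_m.gguf -p 1024 -n 0 -b 128,256,512
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB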
| model                   |       size |   params | backend | ngl | n_batch |    test |             t/s |
| ----------------------- | ---------: | -------: | ------- | --: | ------: | ------: | --------------: |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     128 |  pp1024 |  1835.16 ± 1.69 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     256 |  pp1024 |  1967.12 ± 1.01 |
| qwen2 14B Q5_K - Medium |   9.78 GiB |  14.77 B | CUDA    |  -1 |     512 |  pp1024 |  1995.02 ± 0.84 |

build: e25a32e98 (9584)

Qwen3-Coder-30B-A3B-Instruct-Q4_K_M

$ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -p 0 -n 128,256,512

llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf'

exec llama-server \
 -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
 --host 0.0.0.0 --port 8012 \
 --verbosity $VERBOSITY \
 --threads-http 2 \
 --flash-attn on \
 --no-mmap \
 --cache-type-k q8_0 --cache-type-v q8_0 \
 --jinja \
 -c 96000

common_params_print_info: build 9584 (e25a32e98) with GNU 15.2.0 for Linux x86_64
log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
device_info:
  - CUDA0   : NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
  - CPU     : Intel(R) Core(TM) Ultra 7 270K Plus (93508 MiB, 93508 MiB free)
system_info: n_threads = 4 (n_threads_batch = 4) / 24 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
...
common_params_fit_impl: memory for test allocation by device:
common_params_fit_impl: id=0, n_layer=49, n_part=24, overflow_type=3, mem= 14787 MiB
common_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers (24 overflowing),  14678 MiB used,   1034 MiB free
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 6.76 seconds
llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
...
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:          CPU model buffer size =   166.92 MiB
load_tensors:        CUDA0 model buffer size =  9585.43 MiB
load_tensors:    CUDA_Host model buffer size =  7939.00 MiB
...
llama_context: n_ctx_seq (96000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.32 MiB
llama_kv_cache:      CUDA0 KV buffer size =  4781.25 MiB
llama_kv_cache: size = 4781.25 MiB ( 96000 cells,  48 layers,  4/1 seqs), K (q8_0): 2390.62 MiB, V (q8_0): 2390.62 MiB
...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   311.34 MiB
sched_reserve:  CUDA_Host compute buffer size =   101.84 MiB
sched_reserve: graph nodes  = 3606
sched_reserve: graph splits = 70 (with bs=512), 50 (with bs=1)
...
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
...
srv          init: init: chat template, thinking = 0
srv  llama_server: model loaded
srv  llama_server: server is listening on http://0.0.0.0:8012
srv  update_slots: all slots are idle
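
As a sanity check on the log above, the CUDA0 buffers can be summed and compared with the figures reported by common_params_fit_impl. The sketch uses only numbers from the log; that the "used" figure is exactly this sum is an assumption, although the totals do line up.

```python
# CUDA0 buffer sizes in MiB, copied from the llama-server log above.
cuda0_buffers_mib = {
    "model": 9585.43,     # load_tensors: CUDA0 model buffer size
    "kv_cache": 4781.25,  # llama_kv_cache: CUDA0 KV buffer size (q8_0)
    "compute": 311.34,    # sched_reserve: CUDA0 compute buffer size
}
free_before_mib = 15712   # device_info: free VRAM before loading

used_mib = sum(cuda0_buffers_mib.values())
print(f"used: {used_mib:.2f} MiB")
print(f"free after loading: {free_before_mib - round(used_mib)} MiB")
```

The rest of the model (7939.00 MiB in the CUDA_Host buffer) stays in host memory.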

$ nvidia-smi      
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
|  0%   29C    P8              6W /  180W |   14856MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2643      C   ...ma.cpp/build/bin/llama-server      14830MiB |
+-----------------------------------------------------------------------------------------+

Stability with eGPU 😩

Reset NVIDIA and CUDA:

$ sudo rmmod nvidia_uvm nvidia

  • Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
  • Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
    • CUDA0 model buffer size = 7605.33 MiB
    • CUDA0 compute buffer size = 258.50 MiB

After 2 months of retrying all kinds of grub and modprobe configurations, with help from forums and assistants (Claude, ChatGPT, LeChat), a solution turned up in this ticket: force the PCI link to “Gen 3”

# To find the PCI address "0000:05:00.0" of the RTX:
lspci | grep -i nvidia

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 8GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
   LnkCap:	Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
   LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
   LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
   LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
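
To read the LnkSta lines above as PCIe generations, the reported transfer rate can be mapped to a generation number (2.5 GT/s is Gen 1, 5 GT/s Gen 2, 8 GT/s Gen 3, 16 GT/s Gen 4, 32 GT/s Gen 5). This is a minimal sketch that assumes the lspci -vvv line format shown above; the function name parse_link_status is made up for this illustration.

```python
import re

# PCIe per-lane transfer rate in GT/s -> generation number.
PCIE_GENERATION = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def parse_link_status(lspci_output: str) -> tuple[int, int]:
    """Return (generation, lane width) from the LnkSta line of `lspci -vvv`."""
    match = re.search(r"LnkSta:\s+Speed ([\d.]+)GT/s.*?Width x(\d+)", lspci_output)
    if match is None:
        raise ValueError("no LnkSta line found")
    speed, width = match.groups()
    return PCIE_GENERATION[float(speed)], int(width)


before = "LnkSta:\tSpeed 8GT/s (downgraded), Width x4 (downgraded)"
after = "LnkSta:\tSpeed 2.5GT/s (downgraded), Width x4 (downgraded)"
for label, line in (("before setpci", before), ("after setpci", after)):
    generation, width = parse_link_status(line)
    print(f"{label}: Gen {generation} x{width}")
```

LnkSta reports the link state at the moment lspci runs, so a single reading may not show the speed in use under load.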

But no: it worked fine with llama-bench, but not with Yolo: 😩

kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead

informatique/ai_lm/gpu_bench.1781036432.txt.gz · Last modified: by cyrille

Unless otherwise noted, the content of this wiki is licensed under the following license: CC0 1.0 Universal