informatique:ai_lm:gpu_bench
Différences
Ci-dessous, les différences entre deux révisions de la page.
| Les deux révisions précédentes | Révision précédente | Prochaine révision | Révision précédente | ||
| informatique:ai_lm:gpu_bench [15/01/2026 13:21] – supprimée - modification externe (Date inconnue) 127.0.0.1 | informatique:ai_lm:gpu_bench [30/01/2026 14:40] (Version actuelle) – cyrille | ||
|---|---|---|---|
| Ligne 1: | Ligne 1: | ||
| + | ====== GPU Bench ====== | ||
| + | |||
| + | * [[https:// | ||
| + | |||
| + | |||
| + | |||
| + | Benchmark d'IA pour [[https:// | ||
| + | * avec service Mistral, modèle Codestral = '' | ||
| + | * RTX3060 + Intel-i7, modèle granite-4.0-h-small-Q8_0 | ||
| + | |||
| + | Selon LeChat: | ||
| + | |||
| + | ^ Carte graphique ^ TOPS (INT8) ^ TOPS (FP16) ^ Architecture ^ | ||
| + | | RTX 3060 (12 Go) | ~120 TOPS | ~60 TOPS | Ampere | | ||
| + | | RTX 5060 Ti (16 Go) | ~759 TOPS | ~380 TOPS | Blackwell | | ||
| + | |||
| + | Bench llama.cpp : | ||
| + | |||
| + | * Text generation: tg128, tg256, tg512 : '' | ||
| + | * Prompt processing: b128, b256, b512 : '' | ||
| + | |||
| + | ^ models | ||
| + | ^ ^ ^ i7-1360P ^ RTX 3060 ^ RTX 5060 Ti ^ | ||
| + | | Qwen2.5-coder-7b-instruct-q5_k_m | tg128 | 5.47 | 57.65 | 73.54 | | ||
| + | | //size: 5.07 GiB// | tg256 | ... | 57.61 | 73.32 | | ||
| + | | | tg512 | ... | 56.20 | 71.80 | | ||
| + | | | b128 | ... | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | Qwen2.5-coder-7b-instruct-q8_0 | ||
| + | | //size: 7.54 GiB// | tg256 | ... | 41.38 | 50.33 | | ||
| + | | | tg512 | ... | 40.70 | 49.62 | | ||
| + | | | b128 | 13.98 | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | EuroLLM-9B-Instruct-Q4_0 | ||
| + | | //size: 4.94 GiB// | tg256 | ... | 55.96 | 71.15 | | ||
| + | | | tg512 | ... | 53.87 | 69.45 | | ||
| + | | | b128 | ... | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | Qwen3-14B-UD-Q5_K_XL | ||
| + | | //size: 9.82 GiB// | tg256 | ... | 29.97 | 38.17 | | ||
| + | | | tg512 | ... | 29.25 | 37.30 | | ||
| + | | | b128 | ... | 903.97 | CUDA error | | ||
| + | | | b256 | ... | 951.71 | ... | | ||
| + | | | b512 | ... | 963.76 | ... | | ||
| + | | Qwen3-4B-UD-Q8_K_XL | ||
| + | | //size: 4.70 GiB// | tg256 | 6.63 | 56.35 | ... | | ||
| + | | | tg512 | 6.24 | 54.56 | ... | | ||
| + | | | b128 | 20.66 | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | GemmaCoder3-12B-IQ4_NL.gguf | ||
| + | | //size: 6.41 GiB// | tg256 | ... | 40.67 | ... | | ||
| + | | | tg512 | ... | 39.54 | ... | | ||
| + | | | b128 | ... | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | Gemma3-Code-Reasoning-4B.Q8_0 | ||
| + | | //size: 3.84 GiB// | tg256 | ... | 66.95 | ... | | ||
| + | | | tg512 | ... | 65.75 | ... | | ||
| + | | | b128 | ... | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | | GemmaCoder3-12B-Q5_K_M | ||
| + | | //size: 7.86 GiB// | tg256 | ... | 34.06 | ... | | ||
| + | | | tg512 | ... | 33.28 | ... | | ||
| + | | | b128 | ... | | ||
| + | | | b256 | ... | | ||
| + | | | b512 | ... | | ||
| + | |||
| + | |||
| + | * Les "CUDA error" apparaissent avec la RTX 5060 Ti et le bridge PCIe/THB4 " | ||
| + | * Avec le CPU, laisser le nombre de cœurs en automatique, | ||
+ | * le multi-threading physique est utile. Ex: en auto 7.37 t/s, avec 1 thread 3.39 t/s | ||
| + | |||
| + | La '' | ||
| + | * GemmaCoder3-12B-Q5_K_M, | ||
| + | * GemmaCoder3-12B-IQ4_NL, | ||
| + | * GemmaCoder3-12B-IQ4_XS, | ||
| + | Avec un modèle plus petit, comme '' | ||
| + | |||
| + | ===== Intel® Core™ i7-1360P 13th Gen ===== | ||
| + | |||
| + | Pour comparaison ... | ||
| + | |||
| + | **Qwen2.5-coder-7b-instruct-q5_k_m**: | ||
| + | < | ||
| + | ./ | ||
| + | load_backend: | ||
| + | load_backend: | ||
| + | | model | size | | ||
| + | | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | </ | ||
| + | |||
| + | ===== Gigabyte Windforce OC 12GB Geforce RTX 3060 ===== | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | Avec '' | ||
| + | ^ NVIDIA GeForce RTX 3060 ^^ | ||
| + | | Chip Name | GA104 | | ||
| + | | SM Count | 28 | | ||
| + | | L2 Cache Size | 2,25 MiB | | ||
| + | | Memory Bandwidth | 335,32 GiB/s | | ||
| + | | Memory Size | 11,63 GiB | | ||
| + | | Core Clock | 1,79 GHz | | ||
| + | | Bus Location | 0000: | ||
| + | | GSP firmware version | 580.105.08 | | ||
| + | | Video accelerator tracing | Supported | | ||
| + | |||
| + | |||
| + | Avec llama.cpp et CUDA 12.9. | ||
| + | |||
| + | ==== Qwen2.5-coder-7b-instruct-q5_k_m ==== | ||
| + | |||
| + | < | ||
| + | ./ | ||
| + | ggml_cuda_init: | ||
| + | ggml_cuda_init: | ||
| + | ggml_cuda_init: | ||
| + | Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes | ||
| + | | model | size | | ||
| + | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | </ | ||
| + | |||
| + | ==== GemmaCoder3-12B-Q5_K_M ==== | ||
| + | |||
| + | Pour lancer '' | ||
| + | |||
| + | ^ n-gpu-layers ^ test ^ tokens/s ^ time ^ % perte perf | | ||
| + | | (all) 49 | tg128| | ||
| + | | | b128 | 1041.60 | 0m13,117s | | ||
| + | | 44 | tg128| | ||
| + | | | b128 | | ||
| + | | 39 | tg128| | ||
| + | | | b128 | | ||
| + | | 30 | tg128| | ||
| + | | | b128 | 82.91 | 1m19,729s | 92.04% | ||
| + | | full cpu | tg128| | ||
| + | | | b128 | 4.50 | 22m37,674s | 99.57% | ||
| + | |||
| + | Les valeurs qui permettent de charger ce modèle : | ||
| + | |||
| + | * '' | ||
| + | * avec son context max 131072 c'est 30 layers sur GPU : ''< | ||
| + | * ''< | ||
| + | * et pour tous les layers sur le GPU : ''< | ||
| + | * '' | ||
| + | * ''< | ||
| + | * ''< | ||
| + | * ''< | ||
| + | |||
| + | Avec ''< | ||
| + | |||
| + | < | ||
| + | ... | ||
| + | NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes | ||
| + | ... | ||
| + | print_info: n_ctx_train | ||
| + | print_info: n_embd | ||
| + | print_info: n_embd_inp | ||
| + | print_info: n_layer | ||
| + | print_info: n_head | ||
| + | print_info: n_head_kv | ||
| + | print_info: n_rot = 256 | ||
| + | print_info: n_swa = 1024 | ||
| + | print_info: is_swa_any | ||
| + | print_info: n_embd_head_k | ||
| + | print_info: n_embd_head_v | ||
| + | print_info: n_gqa = 2 | ||
| + | print_info: n_embd_k_gqa | ||
| + | print_info: n_embd_v_gqa | ||
| + | print_info: f_norm_eps | ||
| + | print_info: f_norm_rms_eps | ||
| + | print_info: f_clamp_kqv | ||
| + | print_info: f_max_alibi_bias = 0.0e+00 | ||
| + | print_info: f_logit_scale | ||
| + | print_info: f_attn_scale | ||
| + | print_info: n_ff = 15360 | ||
| + | print_info: n_expert | ||
| + | print_info: n_expert_used | ||
| + | print_info: n_expert_groups | ||
| + | print_info: n_group_used | ||
| + | print_info: causal attn = 1 | ||
| + | print_info: pooling type = 0 | ||
| + | print_info: rope type = 2 | ||
| + | print_info: rope scaling | ||
| + | print_info: freq_base_train | ||
| + | print_info: freq_scale_train = 0.125 | ||
| + | print_info: n_ctx_orig_yarn | ||
| + | print_info: rope_finetuned | ||
| + | print_info: model type = 12B | ||
| + | print_info: model params | ||
| + | print_info: general.name | ||
| + | print_info: vocab type = SPM | ||
| + | print_info: n_vocab | ||
| + | print_info: n_merges | ||
| + | ... | ||
| + | print_info: max token length = 48 | ||
| + | ... | ||
| + | load_tensors: | ||
| + | load_tensors: | ||
| + | load_tensors: | ||
| + | load_tensors: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_kv_cache_iswa: | ||
| + | llama_kv_cache: | ||
| + | llama_kv_cache: | ||
| + | llama_kv_cache: | ||
| + | llama_kv_cache_iswa: | ||
| + | llama_kv_cache: | ||
| + | llama_kv_cache: | ||
| + | llama_kv_cache: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | llama_context: | ||
| + | |||
| + | </ | ||
| + | |||
| + | |||
| + | ===== PNY OC 16 Go Geforce RTX 5060 Ti ===== | ||
| + | |||
| + | === Qwen2.5-coder-7b-instruct-q5_k_m === | ||
| + | |||
| + | < | ||
| + | $ ./ | ||
| + | ggml_cuda_init: | ||
| + | ggml_cuda_init: | ||
| + | ggml_cuda_init: | ||
| + | Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes | ||
| + | | model | size | | ||
| + | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | | qwen2 7B Q5_K - Medium | ||
| + | |||
| + | build: 3f3a4fb9c (7130) | ||
| + | </ | ||
| + | |||
| + | === Stabilité === | ||
| + | |||
| + | Reset nvidia et CUDA: | ||
| + | < | ||
| + | $ sudo rmmod nvidia_uvm nvidia | ||
| + | </ | ||
| + | |||
| + | * Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf | ||
| + | * Meta-Llama-3.1-8B-Instruct-Q8_0.gguf | ||
| + | * CUDA0 model buffer size = 7605,33 MiB | ||
| + | * CUDA0 compute buffer size = | ||
| + | * | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | ===== Traduction ===== | ||
| + | |||
| + | ==== 1 phrase ==== | ||
| + | |||
| + | Très rapide avec RTX 3060. | ||
| + | |||
| + | '' | ||
| + | |||
| + | Prompt: | ||
| + | |||
| + | > french to english " | ||
| + | |||
| + | Replies: | ||
| + | |||
| + | EuroLLM-9B-Instruct: | ||
| + | > "This is a rare opportunity for students to be able to discuss the career paths of our former students and, in doing so, to answer many questions and even remove some of the barriers they may be facing." | ||
| + | |||
| + | EuroLLM-9B-q8_0: | ||
| + | > The provided text is in French and needs to be translated into English. | ||
| + | > I'll translate it step-by-step: | ||
| + | > | ||
| + | > 1. **Identify the source language:** The text is in French. | ||
| + | > 2. **Analyze the text:** The text appears to be a sentence describing an event or opportunity. It mentions " | ||
| + | > 3. **Translate each part:** | ||
| + | > * " | ||
| + | > * "et ainsi, de répondre à de nombreuses questions voire à lever des freins" | ||
| + | > 4. **Combine the translated parts:** The complete translation is: "This is a rare opportunity for students to be able to discuss the journeys of our former students and thus, to answer many questions and even to lift barriers." | ||
| + | |||
| + | |||
| + | Qwen2.5-coder-7b-instruct-q5_k_m : | ||
| + | > Here's the English translation: | ||
| + | > " | ||
| + | > This translation conveys the meaning that the situation being described is unusual and provides students with a valuable learning experience by allowing them to share and discuss the career paths of alumni, which can help them answer questions and potentially resolve challenges they may face in their own professional development. | ||
| + | |||
| + | Qwen2.5-coder-7b-instruct-q8_0: | ||
| + | > Here's the English translation of the French sentence: | ||
| + | > "This is a rare opportunity for students to exchange about the paths of our former students, and thus, to answer many questions or even to overcome obstacles." | ||
| + | > This translation preserves the meaning and tone of the original French statement, highlighting the value of this unique educational experience. | ||
| + | |||
| + | Qwen3-14B-UD-Q5_K_XL: | ||
| + | > //< | ||
| + | > Okay, let's tackle this French to English translation. The user provided the sentence: “C' | ||
| + | > First, I need to understand the structure and meaning. The sentence starts with " | ||
| + | > Now, I should check for any nuances. " | ||
| + | > I need to make sure the flow is natural in English. Maybe "a rare opportunity for students to exchange about the career paths of our alumni, and thus answer many questions or even overcome barriers." | ||
| + | > Wait, " | ||
| + | > Putting it all together: " | ||
| + | > I should check if " | ||
| + | > Another point: " | ||
| + | > So revised: " | ||
| + | > That seems accurate. Let me make sure there are no other nuances. The original sentence emphasizes that this opportunity is rare, which is captured. The purpose is to answer questions and overcome barriers, which is clear. I think that's a solid translation. | ||
| + | > //</ | ||
| + | > | ||
| + | > It's a rare opportunity for students to discuss the career paths of our alumni and thereby answer many questions or even overcome barriers. | ||
| + | |||
| + | ==== texte long ==== | ||
| + | |||
| + | === 2.7 ko === | ||
| + | |||
| + | Pour quitter après traduction c'est l' | ||
| + | |||
+ | Certains modèles comme '' | ||
| + | |||
| + | Les 4 modèles ont été jusqu' | ||
| + | |||
| + | === 8.8 ko === | ||
| + | |||
| + | Comme pour le texte précédent. | ||
| + | |||
+ | En passant, avec l'Intel® Core™ 13th Gen i7-1360P et EuroLLM-9B-Instruct-Q4_0 Statistics: 3.96 tokens/s, 1330 tokens, 335.56s soit presque 6 minutes. | ||
| + | === 19 ko === | ||
| + | |||
| + | * EuroLLM-9B-Instruct-Q4_0 : | ||
| + | * context full and context shift is disabled => stopping | ||
| + | * ''< | ||
| + | * Qwen3-14B-UD-Q5_K_XL : | ||
| + | * prompt is too long (4267 tokens, max 4092) | ||
| + | * ''< | ||
| + | * Qwen2.5-coder-7b-instruct-q8_0 | ||
| + | * prompt is too long (4267 tokens, max 4092) | ||
| + | * ''< | ||
| + | |||
| + | ===== Taille du « context » ===== | ||
| + | |||
| + | Avec Llama.cpp CUDA, RTX 3060 12 GB et opencode avec le modèle '' | ||
| + | |||
| + | < | ||
| + | ./ | ||
| + | |||
| + | # puis | ||
| + | |||
| + | time opencode run -m llamacpp/ | ||
| + | </ | ||
| + | |||
| + | --ctx-size 4096 | ||
| + | |||
| + | < | ||
| + | Error: the request exceeds the available context size, try increasing it | ||
| + | </ | ||
| + | |||
| + | --ctx-size 12288 | ||
| + | |||
| + | < | ||
| + | Async/await in JavaScript is used to simplify asynchronous code by making it look more synchronous. It allows you to write asynchronous code in a cleaner, more readable way. | ||
| + | |||
| + | real 0m6, | ||
| + | user 0m1, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | --ctx-size 40000 | ||
| + | |||
| + | < | ||
| + | Async/await in JavaScript is a feature that simplifies the syntax for handling asynchronous operations, making the code more readable and manageable. It allows you to write asynchronous code in a synchronous manner, using the `async` and `await` keywords. | ||
| + | |||
| + | Here's a brief explanation: | ||
| + | |||
| + | - **Async Function**: A function declared with the `async` keyword is an async function. Inside an async function, you can use the `await` keyword to pause the function' | ||
| + | |||
| + | - **Await Keyword**: The `await` keyword is used to pause the execution of the async function until the Promise it's waiting on is settled (either resolved or rejected). This makes it easier to read and understand asynchronous code, as it behaves more like synchronous code. | ||
| + | |||
| + | - **Promise**: | ||
| + | |||
| + | Here's an example to illustrate: | ||
| + | |||
| + | ```javascript | ||
| + | async function fetchData() { | ||
| + | try { | ||
| + | let response = await fetch(' | ||
| + | let data = await response.json(); | ||
| + | console.log(data); | ||
| + | } catch (error) { | ||
| + | console.error(' | ||
| + | } | ||
| + | } | ||
| + | ``` | ||
| + | |||
| + | In this example: | ||
| + | - The `fetchData` function is declared as `async`, allowing the use of `await`. | ||
| + | - `await fetch(' | ||
| + | - `await response.json()` waits for the JSON parsing to complete. | ||
| + | - `try...catch` is used to handle any errors that occur during the fetch or parsing operations. | ||
| + | |||
| + | Async/await makes it easier to handle multiple asynchronous operations in sequence, reducing the need for nested callbacks and improving the overall code structure. | ||
| + | |||
| + | real 0m16, | ||
| + | user 0m2, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | mais ça peut aussi répondre : | ||
| + | |||
| + | < | ||
| + | Async/await in JavaScript is a syntax that simplifies the behavior of using promises synchronously. It makes the code cleaner and easier to read by allowing you to write asynchronous code in a more linear, synchronous style. | ||
| + | |||
| + | real 0m2, | ||
| + | user 0m1, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | --ctx-size 60000 | ||
| + | |||
| + | < | ||
| + | Async/await is a syntax introduced in JavaScript to simplify working with promises, making asynchronous code more readable and easier to write. It allows you to write asynchronous code in a sequential, synchronous style, which can improve the overall flow and readability of your code. Here's a basic explanation: | ||
| + | |||
| + | - **Async Function**: A function declared with the `async` keyword returns a promise. This means that even if the function does not explicitly return a promise, it will implicitly return one. | ||
| + | - **Await Keyword**: The `await` keyword is used inside an async function to pause the execution of the function until a promise is resolved or rejected. This allows you to handle the resolved value directly, making the code look synchronous. | ||
| + | |||
| + | Here's a simple example to illustrate: | ||
| + | |||
| + | ```javascript | ||
| + | // Define an async function | ||
| + | async function fetchData() { | ||
| + | // Await a promise | ||
| + | let response = await fetch(' | ||
| + | let data = await response.json(); | ||
| + | return data; | ||
| + | } | ||
| + | |||
| + | // Call the async function | ||
| + | fetchData().then(data => { | ||
| + | console.log(data); | ||
| + | }).catch(error => { | ||
| + | console.error(' | ||
| + | }); | ||
| + | ``` | ||
| + | |||
| + | In this example: | ||
| + | - `fetchData` is an async function that fetches data from an API. | ||
| + | - The `await` keyword is used to wait for the `fetch` and `response.json()` operations to complete. | ||
| + | - The resolved data is then logged to the console. | ||
| + | |||
| + | Using async/ | ||
| + | |||
| + | real 0m14, | ||
| + | user 0m2, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | ou bien | ||
| + | |||
| + | < | ||
| + | Async/await is a feature in JavaScript that makes it easier to write and read asynchronous code, which is code that performs operations over time, like fetching data from a server or reading a file. It allows you to write asynchronous code in a more sequential and readable way, similar to synchronous code. | ||
| + | |||
| + | Here's a simple example to illustrate: | ||
| + | |||
| + | ```javascript | ||
| + | // Without async/await | ||
| + | function fetchData(callback) { | ||
| + | setTimeout(() => { | ||
| + | callback(' | ||
| + | }, 1000); | ||
| + | } | ||
| + | |||
| + | fetchData((data) => { | ||
| + | console.log(data); | ||
| + | }); | ||
| + | ``` | ||
| + | |||
| + | With async/ | ||
| + | |||
| + | ```javascript | ||
| + | // With async/await | ||
| + | async function fetchData() { | ||
| + | return new Promise((resolve) => { | ||
| + | setTimeout(() => { | ||
| + | resolve(' | ||
| + | }, 1000); | ||
| + | }); | ||
| + | } | ||
| + | |||
| + | async function getData() { | ||
| + | try { | ||
| + | const data = await fetchData(); | ||
| + | console.log(data); | ||
| + | } catch (error) { | ||
| + | console.error(error); | ||
| + | } | ||
| + | } | ||
| + | |||
| + | getData(); | ||
| + | ``` | ||
| + | |||
| + | In the async/await example: | ||
| + | - The `fetchData` function returns a promise that resolves after 1 second. | ||
| + | - The `getData` function is marked with `async`, allowing the use of `await` inside it. | ||
| + | - `await fetchData()` pauses the execution of `getData` until the promise resolves, and then the resolved value is logged to the console. | ||
| + | |||
| + | This makes the asynchronous code easier to understand and maintain, as it looks more like synchronous code. | ||
| + | |||
| + | real 0m15, | ||
| + | user 0m2, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | ou encore | ||
| + | |||
| + | < | ||
| + | `async/ | ||
| + | |||
| + | Here's a brief explanation: | ||
| + | |||
| + | 1. **Async Function**: You declare a function as `async` using the `async` keyword. This makes the function return a Promise. | ||
| + | ```javascript | ||
| + | async function fetchData() { | ||
| + | // function body | ||
| + | } | ||
| + | ``` | ||
| + | |||
| + | 2. **Await Keyword**: You use the `await` keyword inside an `async` function to pause the execution of the function until the Promise is resolved or rejected. | ||
| + | ```javascript | ||
| + | async function fetchData() { | ||
| + | const response = await fetch(' | ||
| + | const data = await response.json(); | ||
| + | return data; | ||
| + | } | ||
| + | ``` | ||
| + | |||
| + | 3. **Error Handling**: You can handle errors using `try...catch` blocks. | ||
| + | ```javascript | ||
| + | async function fetchData() { | ||
| + | try { | ||
| + | const response = await fetch(' | ||
| + | const data = await response.json(); | ||
| + | return data; | ||
| + | } catch (error) { | ||
| + | console.error(' | ||
| + | } | ||
| + | } | ||
| + | ``` | ||
| + | |||
| + | In summary, `async/ | ||
| + | |||
| + | real 0m9, | ||
| + | user 0m2, | ||
| + | sys 0m0, | ||
| + | </ | ||
| + | |||
| + | --ctx-size 74000 | ||
| + | |||
+ | Ça passe bien avec ce prompt ; peut-être limite avec un prompt plus complexe ... | ||
| + | |||
| + | --ctx-size 76000 | ||
| + | |||
| + | < | ||
| + | # Llama-server démarre bien, mais plante au calcul | ||
| + | |||
| + | slot update_slots: | ||
| + | / | ||
| + | CUDA error: out of memory | ||
| + | current device: 0, in function alloc at / | ||
| + | cuMemCreate(& | ||
| + | ... | ||
| + | </ | ||
| + | |||
| + | --ctx-size 80000 | ||
| + | |||
| + | < | ||
| + | # Llama-server plante dès le démarrage : | ||
| + | |||
| + | ggml_backend_cuda_buffer_type_alloc_buffer: | ||
| + | ggml_gallocr_reserve_n: | ||
| + | graph_reserve: | ||
| + | llama_init_from_model: | ||
| + | </ | ||
| + | |||
| + | Avec la **version Q5 medium** '' | ||
| + | * un '' | ||
| + | * mais '' | ||
| + | |||
