Differences

Below are the differences between two revisions of the page.

--- informatique:ai_lm [30/04/2026 16:49] – [Compilation for CPU (SYCL)] cyrille
+++ informatique:ai_lm [23/06/2026 05:40] (Current version) – [Compilation for GPU] cyrille
@@ Line 51: / Line 51: @@
   * https://www.glukhov.org/fr/post/2025/05/ollama-cpu-cores-usage/
-==== Estimates ====
+  * [[/informatique/ai_lm/gpu_bench|GPU Benchmarks]]
-**Devstral with llama.cpp on an RTX 3060 12 GB.**
+Install [[/informatique/nvidia|nvidia-drivers and CUDA]].
-by ChatGPT:
-| Model             | Context (seq_len)  | Recommended batch_size | Notes                                    |
-| ----------------- | ------------------ | ---------------------- | ---------------------------------------- |
-| Devstral Small 7B | 1024               | 4                      | Very safe, ample VRAM                    |
-| Devstral Small 7B | 2048               | 2‑3                    | Good speed/VRAM trade-off                |
-| Devstral Small 7B | 4096               | 1‑2                    | VRAM almost saturated                    |
-| Devstral 13B      | 1024               | 2                      | Limited VRAM                             |
-| Devstral 13B      | 2048               | 1‑2                    | Optimal, watch VRAM                      |
-| Devstral 13B      | 4096               | 1                      | VRAM saturated, CPU offload advised      |
-| Devstral 13B      | 8192               | 1                      | Possible, but long context → OOM risk    |
-by LeChat:
-| Context (tokens) | Model (parameters) | Estimated VRAM (GB) | Optimal batch size | Estimated latency (tok/s) | Notes |
-| 512 | 7B | ~5.5 | 8 | 15-25 | Ideal for short, quick tasks. |
-| 1024 | 7B | ~6.0 | 4 | 10-20 | Good trade-off for medium-length prompts. |
-| 2048 | 7B | ~7.0 | 2 | 5-15 | Requires careful VRAM management. |
-| 4096 | 7B | ~8.5 | 1 | 3-10 | Close to the VRAM limit, risk of slowdown. |
-| 512 | 13B | ~9.0 | 4 | 8-15 | Larger model, higher latency. |
-| 1024 | 13B | ~10.0 | 2 | 4-10 | VRAM almost saturated, reduced batch_size. |
-| 2048 | 13B | ~11.5 | 1 | 2-8 | High risk of exceeding VRAM, high latency. |
 ==== Online services ====
@@ Line 288: / Line 265: @@
 # RTX 5060 : 120
-$ export CUDA_VERSION=12.9 && cmake -B build -DGGML_CUDA=ON \
+$ export CUDA_VERSION=12.9
+$ export CUDA_VERSION=13.3
+$ cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
@@ Line 310: / Line 289: @@
 -- Build files have been written to: /home/cyrille/Code/bronx/AI_Coding/llama.cpp/build
-$ time cmake --build build --config Release -j 10
+$ time cmake --build build --clean-first --config Release -j 10
 # host: i7-1360P + SSD
@@ Line 326: / Line 305: @@
 user	61m37,436s
 sys	2m37,613s
-</code>
-With CUDA 13.1, llama.cpp crashes immediately on the first request, but with no message in syslog: the problem is therefore not the driver but llama.cpp itself, which does not support this CUDA version:
+# host: Core(TM) Ultra 7 270K Plus
-<code>
+real	3m6.637s
-/home/cyrille/Code/bronx/AI_Coding/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
+user	27m13.877s
-CUDA error: invalid argument
+sys	1m24.687s
-  current device: 0, in function ggml_cuda_mul_mat_q at /home/cyrille/Code/bronx/AI_Coding/llama.cpp/ggml/src/ggml-cuda/mmq.cu:179
 </code>
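The `time` figures for the Core Ultra 7 build give a rough idea of how well `-j 10` kept the cores busy: (user + sys) / real. A minimal sketch of that arithmetic, using only the three values reported above; `to_seconds` and `parallelism` are hypothetical helper names, not part of the original page.

```shell
# Convert a `time`-style duration such as "3m6.637s" (or "61m37,436s"
# when printed under a French locale) to seconds.
to_seconds() {
  printf '%s\n' "$1" | tr ',' '.' | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# parallelism REAL USER SYS: average number of busy cores, (user + sys) / real.
parallelism() {
  LC_ALL=C awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v s="$(to_seconds "$3")" \
    'BEGIN { printf "%.2f\n", (u + s) / r }'
}

parallelism 3m6.637s 27m13.877s 1m24.687s   # prints 9.21
```

A result close to the job count (10 here) suggests the build was CPU-bound rather than waiting on disk; this is only an indication, not a benchmark.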
@@ Line 392: / Line 369: @@
 ./examples/sycl/build.sh
 </code>
+The build completes without errors, but ... "what():  can not find preferred GPU platform" 😩
+<code>
+$ ./build/bin/llama-ls-sycl-device
+# same with
+$ ./build/bin/llama-bench -p 0 -n 128,256,512
+[New LWP 35410]
+[New LWP 35409]
+[New LWP 35408]
+[New LWP 35407]
+[New LWP 35406]
+[New LWP 35405]
+[New LWP 35404]
+[New LWP 35403]
+[New LWP 35402]
+[New LWP 35401]
+[New LWP 35400]
+[New LWP 35399]
+[New LWP 35398]
+[New LWP 35397]
+[New LWP 35396]
+This GDB supports auto-downloading debuginfo from the following URLs:
+  <https://debuginfod.ubuntu.com>
+Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
+Debuginfod has been disabled.
+...
+Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
+0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
+warning: 30	../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce nom
+#0  0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
+	in ../sysdeps/unix/sysv/linux/wait4.c
+#1  0x000079304e48aa1a in ggml_print_backtrace () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
+#2  0x000079304e4a3d76 in ggml_uncaught_exception() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
+#3  0x000079304acbb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
+#4  0x000079304aca5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
+#5  0x000079304acbb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
+#6  0x000079304b19e765 in dpct::dev_mgr::dev_mgr() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
+#7  0x000079304b16e8f3 in ggml_backend_sycl_print_sycl_devices () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
+#8  0x0000000000405527 in main ()
+[Inferior 1 (process 35394) detached]
+terminate called after throwing an instance of 'std::runtime_error'
+  what():  can not find preferred GPU platform
+PLEASE submit a bug report to https://software.intel.com/en-us/support/priority-support and include the crash backtrace and instructions to reproduce the bug.
+Abandon (core dumped)
+</code>
+After a reboot, it works. Performance: 2.6 times faster than without SYCL (36.34 vs 13.94).
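The 2.6 figure is simply the ratio of the two measurements quoted above (their unit is not stated on the page). A minimal sketch of that calculation; `speedup` is a hypothetical helper name introduced here for illustration.

```shell
# speedup FAST SLOW: prints FAST / SLOW rounded to two decimals.
# LC_ALL=C forces a dot as the decimal separator regardless of the locale.
speedup() {
  LC_ALL=C awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

speedup 36.34 13.94   # prints 2.61
```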
+==== mistral.rs ====
+Unrelated to Mistral.ai
+https://github.com/EricLBuehler/mistral.rs
+  * Any Hugging Face model, zero config
+  * True multimodality: Text, vision, video, and audio, speech generation, image generation, and embeddings in one engine.
+  * Smart quantization
+  * Built-in web UI
+  * Hardware-aware
+  * Flexible SDKs: Python package and Rust crate to build your projects.
+  * Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.
+During installation:
+  * the build takes a very long time (743 files) and hogs all of the machine's compute power...
+  * plug in the eGPU beforehand, otherwise you will have to reinstall 😩
+    * this enables ''flash-attn'', and building ''candle-flash-attn'' can take 45 minutes!
@@ Line 426: / Line 473: @@
   * https://dusty-nv.github.io/NanoLLM/
   * https://www.jetson-ai-lab.com/tutorial_nano-llm.html
 Todo
   * [[https://towardsdatascience.com/how-to-build-an-openai-compatible-api-87c8edea2f06/|How to build an OpenAI-compatible API]]
+==== ZML ====
+https://github.com/zml/zml/
+===== Token reduction =====
+Headroom
+  * Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.
+  * https://headroom-docs.vercel.app/docs
+  * https://github.com/chopratejas/headroom
+  * https://www.lemondeinformatique.fr/actualites/lire-headroom-un-projet-open-source-pour-reduire-la-facture-des-tokens-100357.html
+RTK
+  * CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies
+  * https://www.rtk-ai.app/
+  * https://github.com/rtk-ai/rtk
+Openwolf
+  * Sharper context. Fewer tokens. Open-source middleware for Claude Code.
+  * https://openwolf.com/
+  * https://github.com/cytostack/openwolf