====== AI Coding ======

Coder avec une IA LLM.

Explications:
  * introduction aux LLM : [[https://www.linagora.ai/introduction-aux-llm|Démystifier les (LLM) et comment les mettre en œuvre chez vous pour les étudier de plus près]]
  * [[https://berdachuk.com/ai/how-to-run-llms-locally|How to Run LLMs Locally: A Complete Step-by-Step Guide]] (//2025-05-27//) sur la quatification, format GGUF, Group Size Suffix (S/M/L)

Autres pages:
  * [[/informatique/ai_coding/samples|Quelques essais perso]]
  * Un peu de [[/informatique/ai_coding/gpu_bench|GPU bench]]
  * Pour d'[[#autres_usages|Autres usages]] que la programmation informatique (codage)

===== Sur les Modèles =====

  * **LoRA** (Low-Rank Adaptation): une méthode permet de "spécialiser" un peu un modèle est ajoutant des couches légères (qlqs Mo) et adaptables à un modèle pré-entraîné, au lieu de modifier tous ses poids ; 0,1% à 1% des paramètres du modèle sont entraînés. Le modèle de base reste inchangé, LoRA peut être désactivé.
    * [[https://github.com/axolotl-ai-cloud/axolotl|Axolotl]] A Free and Open Source LLM Fine-tuning Framework
  * techniques pour améliorer la gestion des longs textes
    * **YaRN** (Yet Another Recurrent Network): une technique pour améliorer la gestion des textes longs par les LLMs  sans nécessiter de réentraînement complet ou de modifications lourdes, seules quelques couches sont ajustées. Vise à étendre la fenêtre de contexte (context window) des LLMs, passer de 4k à 128k tokens. Autres techniques: ALiBi, NTK-aware scaling.
    * **RoPE** (Rotary Position Embedding): technique pour intégrer des informations de position dans les séquences de tokens, tout en permettant une meilleure généralisation à des longueurs de texte variables. Contrairement aux méthodes comme les embeddings de position absolus (BERT), RoPE utilise une représentation relative et rotative des positions qui améliore la capacité des modèles à comprendre les relations entre les tokens, même sur de longues distances.
    * RoPE + YaRN : RoPEE Fournit la base mathématique pour comprendre les positions relatives et YaRN étend cette base pour permettre des fenêtres de contexte encore plus grandes
    * **ALiBi** ... technique plus ancienne que les 2 précédentes ...
  * **GGUF** (GPT-Generated Unified Format):  format binaire optimisé pour l’inférence (exécution de modèles), développé par la communauté open-source, notamment par [[https://github.com/ggerganov|ggerganov]] le créateur de Llama.cpp. Remplace l’ancien format GGML.
    * afficher les metadata du fichier: [[https://github.com/gpustack/gguf-parser-go|gpustack/gguf-parser-go]]
    * Autres formats: PyTorch, ONNX, TensorRT, GGML (déprécié)
  * **MoE** (Mixture of Experts):  architecture de modèle où plusieurs "experts" (sous-réseaux spécialisés) sont activés de manière conditionnelle pour traiter différentes parties des données.
    * Le modèle est composé de plusieurs sous-réseaux appelés "experts".
    * Un générateur de sélection (router) détermine quels experts utiliser pour chaque entrée.
    * permet de réduire le coût de calcul en ne passant les données que par un sous-ensemble des experts.

Classification de modèles ouverts: [[https://www.ibm.com/fr-fr/products/watsonx-ai/foundation-models|Foundation models]] by Ibm

[[https://claude.ai/share/5d0d1604-20cd-4ec9-9f39-c2797197603d|Comment faire pour qu'un appel à un LLM ait un résultat reproductible d'une fois sur l'autre ?]]

===== Sur les Agents =====

  * **LangChain**: un framework open-source conçu pour faciliter la création d’applications alimentées par des modèles de langage (comme GPT, Llama, etc.). Il permet de combiner des LLMs avec d’autres sources de données, outils externes, ou encore des bases de connaissances, pour construire des workflows complexes.
  * [[https://github.com/LLPhant/LLPhant|LLPhant]] : A comprehensive PHP Generative AI Framework, inspired by Langchain, sur lequel est construit [[https://github.com/LLPhant/AutoPHP|AutoPHP]] an agent PHP framework. Avec notamment présentation et usage de [[https://github.com/LLPhant/LLPhant?tab=readme-ov-file#vectorstores|vectorstores]] et [[https://github.com/LLPhant/LLPhant?tab=readme-ov-file#embeddings|embeddings]]
  * **LangSmith**: une plateforme de débogage, de test et de monitoring pour les applications construites avec LangChain ou d’autres frameworks similaires
  * **LangGraph**: une extension de LangChain qui permet de modéliser des workflows d’IA sous forme de graphes. Contrairement à LangChain, qui utilise des chaînes linéaires ou séquentielles, LangGraph permet de créer des processus dynamiques et non linéaires, où les étapes peuvent s’enchaîner de manière conditionnelle ou parallèle.
  * STM (Short Term Memory): permet à un agent IA de se souvenir des entrées récente. Généralement mise en œuvre à l’aide d’une mémoire tampon circulaire ou d’une fenêtre contextuelle (context window), qui contient une quantité limitée de données récentes avant d’être écrasée.
  * LTM (Long Term Memory): permet aux agents IA de stocker et de récupérer des informations entre différentes sessions. souvent mise en œuvre à l’aide de bases de données, de [[https://www.ibm.com/fr-fr/think/topics/knowledge-graph|graphes de connaissances]] ou d’[[https://www.ibm.com/fr-fr/think/topics/vector-embedding|embeddings vectoriels]].
    * **RAG** (Retrieval-Augmented Generation): combine deux capacités de l’IA : la récupération d’informations et la génération de texte.
  * **ACP** (Agent Communication Protocol): transformeles agents d’IA en coéquipiers interconnectés.
  * **MCP** (Model Context Protocol): une couche de standardisation pour permettre aux applications d’IA de communiquer efficacement avec des services externes tels que des outils, des bases de données et des modèles prédéfinis.

  * [[https://docs.mistral.ai/agents/introduction|What are AI agents?]] by Mistral
  * [[https://www.ibm.com/fr-fr/think/ai-agents|guide des agents d’IA]] par Ibm
    * [[https://www.ibm.com/fr-fr/think/insights/top-ai-agent-frameworks|Frameworks pour agents IA]]
  * [[https://www.linkedin.com/posts/godefroy_le-rag-est-mort-voici-pourquoi-en-2022-activity-7387725857659723776-h3ff|Le RAG est mort. Voici pourquoi]]. Article comparent RAG et GREP ; les commentaires sont une bonne source de connaissance.


===== Sur les perfs =====

  * https://cosmo-games.com/quels-modeles-llm-installes-local-8-ou-16-go-vram/
  * https://www.glukhov.org/fr/post/2025/05/ollama-cpu-cores-usage/

==== Estimations ====

**Devstral avec llama.cpp sur RTX 3060 12 Go.**

by ChatGPT :

| Modèle            | Contexte (seq_len) | Batch_size recommandé | Remarques                                |
| ----------------- | ------------------ | --------------------- | ---------------------------------------- |
| Devstral Small 7B | 1024               | 4                     | Très sûr, VRAM ample                     |
| Devstral Small 7B | 2048               | 2‑3                   | Bon compromis vitesse/VRAM               |
| Devstral Small 7B | 4096               | 1‑2                   | VRAM presque saturée                     |
| Devstral 13B      | 1024               | 2                     | VRAM limitée                             |
| Devstral 13B      | 2048               | 1‑2                   | Optimal, attention VRAM                  |
| Devstral 13B      | 4096               | 1                     | VRAM saturée, offload CPU conseillé      |
| Devstral 13B      | 8192               | 1                     | Possible mais contexte long → risque OOM |

by LeChat:

| contexte (tokens) | modèle (paramètres) | VRAM estimée (Go) | Batch size optimal | Latence estimée (tok/s) | Notes |
| 512 | 7B | ~5.5 | 8 | 15-25 | Idéal pour des tâches courtes et rapides. |
| 1024 | 7B | ~6.0 | 4 | 10-20 | Bon compromis pour des prompts moyens. |
| 2048 | 7B | ~7.0 | 2 | 5-15 | Nécessite une gestion fine de la VRAM. |
| 4096 | 7B | ~8.5 | 1 | 3-10 | Proche de la limite VRAM, risque de ralentissement. |
| 512 | 13B | ~9.0 | 4 | 8-15 | Modèle plus gros, latence accrue. |
| 1024 | 13B | ~10.0 | 2 | 4-10 | VRAM presque saturée, batch_size réduit. |
| 2048 | 13B | ~11.5 | 1 | 2-8 | Risque élevé de dépassement VRAM, latence importante. |

==== Online services ====

launch a opencode server :
<code>
opencode serve --port=30781 --print-logs --log-level DEBUG
</code>

Then **prompt : "Explain async/await in JavaScript"**

with:
<code>
time opencode run -m <ProviderId/ModelId> --attach=http://127.0.0.1:30781 --agent=plan "Explain async/await in JavaScript"
</code>

👾 Attention, les résultats peuvent être très différents:
  * d'une simple phrase de définition à un exemple de code
    * mais je n'ai pas modifier la taille du ''context'', ce qui a une grande importance sur la taille/qualité de la réponse ...
  * aussi, le ''system message prompt'' est sélectionné par opencode ...
    * https://github.com/sst/opencode/issues/4861

  * ovhcloud/Qwen3-Coder-30B-A3B-Instruct = 2,008s / 3,100s
  * ovhcloud/gpt-oss-20b = 14,219s / 21,714s
  * ovhcloud/Mistral-Nemo-Instruct-2407 = abandon après 7 minutes d'attente ...
  * ovhcloud/DeepSeek-R1-Distill-Llama-70B = 22,301s / 29,187s
  * opencode/big-pickle = 2,858s / 3,479s
  * mistral-codestral/codestral-latest = 2,320s / 3,427s


===== Cartes IA =====

Hailo
  * [[https://hailo.ai/products/ai-accelerators/hailo-8-ai-accelerator/#hailo8-performance|Hailo 8]]
  * packaging in a box with a [[https://www.seeedstudio.com/reComputer-AI-R2140-12-p-6431.html|Raspberry by SeedStudio]], 26 TOPS, 15 GB RAM - $289

Axelera
  * [[https://axelera.ai/ai-accelerators/metis-pcie-ai-acceleration-card|Metis]]
  * [[https://axelera.ai/ai-accelerators/aipu/europa|Europa]]

seeedstudio
  * [[https://www.seeedstudio.com/reComputer-Mini-J4012-p-6355.html|reComputer Mini J4012]] is a tiny AI computer powered by NVIDIA® Jetson Orin™ NX **16GB** module,delivering up to 100 TOPS AI performance - $900
  * [[https://www.seeedstudio.com/reComputer-J2022-p-5497.html|reComputer J2022 - Edge AI Computer with NVIDIA® Jetson Xavier™ NX 16GB]] - $759

Ollama & Nvidia Jetpack
  * https://www.jetson-ai-lab.com/tutorial_ollama.html
  * pour plus de performance utiliser [[https://www.jetson-ai-lab.com/tutorial_nano-llm.html|NanoLLM - Optimized LLM Inference]]
    * Ollama uses llama.cpp for inference, which various API benchmarks and comparisons are provided for on the Llava page. It gets roughly half of peak performance versus the faster APIs like NanoLLM , but is generally considered fast enough for text chat. 

Nvidia
  * A10
    * https://askgeek.io/en/gpus/NVIDIA/NVIDIA-A10
    * [[https://askgeek.io/en/gpus/vs/NVIDIA_NVIDIA-A10-vs-NVIDIA_GeForce-RTX-3060|NVIDIA A10 vs RTX 3060]]

^                             ^ A 10                ^ A 30                ^ A 40                ^ A 100 SXM4          ^ A 800      ^ H 100 SMX5  ^
| Prix eBay                   | $2,330              | $3,999              | $9,950              | $4,000              | $20,000    | $20,000     |
| Architecture                | Ampere              | Ampere              | Ampere              | Ampere              | Ampere     | Hopper      |
| Code name                   | GA102               | GA100               | GA102               | GA100               | GA100      | GH100       |
| Launch date                 | 2021-04             | 2021-04             | 2020-10             | 2020-05             | 2022-11    | 2022-03     |
| Maximum RAM                 | **24** GB           | **24** GB           | **48** GB           | **40** GB           | **40** GB  | **96** GB   |
| Memory type                 | GDDR6               | HBM2e               | GDDR6               | HBM2e               | HBM2e      | HBM3        |
| Memory bandwidth            | 600.2 GB/s          | 933.1 GB/s          | 695.8 GB/s          | 1555 GB/s           | 1.56 TB/s  | 1,681 GB/s  |
| Memory bus width            | 384 bit             | 3072 bit            | 384 bit             | 5120 bit            | 5120 bit   | 5120 bit    |
| Memory clock speed          | 1563 MHz            | 1215 MHz            | 1812 MHz            | 1215 MHz            | 1215 MHz   | 1313 MHz    |
| Core clock speed            | 885 MHz             | 930 MHz             | 1305 MHz            | 1095 MHz            | 765 MHz    | 1837 MHz    |
| Boost clock speed           | 1695 MHz            | 1440 MHz            | 1740 MHz            | 1410 MHz            | 1410 MHz   | 1665 MHz    |
| Peak Half Precision (FP16)  | 31.24 TFLOPS (1:1)  | 10.32 TFLOPS (1:1)  | 37.42 TFLOPS (1:1)  | 77.97 TFLOPS (4:1)  |            |             |
| Pipelines                   | 9216                | 3584                | 10752               | 6912                | 6912       | 16896       |
| Thermal Design Power        | 150 Watt            | 165 Watt            | 300 Watt            | 400 Watt            | 250 Watt   | 700 Watt    |
| OpenCL                      | 3.0                 | 3.0                 | 3.0                 |                     | 3.0        |             |

  * [[https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/dgx-spark/|NVIDIA DGX Spark]] ($4,000) : GB10 Grace Blackwel, 1 FP4 PFLOPS, 128GB, ConnectX-7 Smart NIC, 4TB NVME.M2 with self-encryption
  * [[https://developer.nvidia.com/buy-jetson?product=all&location=FR|Jetson AI products]]
    * Jetson AGX Orin™ 64GB, 275 TOPS, [[https://fr.rs-online.com/web/p/modules-de-developpement-pour-processeurs/2539662|2500 €]]
    * Jetson Thor: Blackwell GPU, 128 GB, 2070 FP4 TFLOPS, [[https://uk.rs-online.com/web/p/processor-development-tools/0606863?searchId=129e076e-6b25-48c6-a98f-afba08066e18&gb=s|£3200]]


===== Cartes graphiques =====

Nvidia
  * RTX 3060
    * CUDA GPU Compute Capability: 8.6
  * [[https://chipset.fr/boutique/composants-pc/carte-graphique/asus-dual-rtx5060ti-o16g-nvidia-geforce-rtx-5060-ti-16-go-gddr7/|RTX 5060 TI 16 Go]] 475€ TTC chipset.fr
    * CUDA GPU Compute Capability: 12.0
  * [[https://www.grosbill.com/carte-graphique/pny-rtx-5060ti-16go-overclocked-dual-fan-155315.aspx|pny-rtx-5060ti-16go-overclocked]] 445€ TTC grosbill.com

[[/informatique/ai_coding/gpu_bench|gpu_bench]]

Tips: Reset nvidia et CUDA:
<code bash>
# éteindre la carte
# débrancher THB
$ sudo rmmod nvidia_uvm nvidia
</code>

==== Adaptateur GPU externe ====

En anglais "**GPU enclosures**". Nécessite un port Thunderbolt 3, 4 ou à venir 5.

egpu docks

[[https://developer.nvidia.com/blog/accelerating-machine-learning-on-a-linux-laptop-with-an-external-gpu/|Accelerating Machine Learning on a Linux Laptop with an External GPU]] by NVidia (Setting up Ubuntu to use NVIDIA eGPU)

[[/informatique/egpu|eGPU]]


===== Models =====

{{ :informatique:ai_coding:tokens-input-ouput_20251214-172030.png?nolink&200|Il en faut des tokens pour un petit programme}}

Pour de l'assistance au code

[[https://lab.cyrille.giquello.fr/AI-compare/models-metadata/#state=eyJjb2x1bW5zIjpbImFyY2hpdGVjdHVyZS5tYXhpbXVtQ29udGV4dExlbmd0aCIsIm1ldGFkYXRhLmZpbGVUeXBlRGV0YWlsIiwibWV0YWRhdGEuYml0c1BlcldlaWdodCIsInRva2VuaXplci5tZXJnZXNMZW5ndGgiLCJ0b2tlbml6ZXIudG9rZW5zU2l6ZSIsInRva2VuaXplci5tZXJnZXNTaXplIiwiYXJjaGl0ZWN0dXJlLmFyY2hpdGVjdHVyZSIsIm1ldGFkYXRhLnBhcmFtZXRlcnMiLCJhcmNoaXRlY3R1cmUuZW1iZWRkaW5nTGVuZ3RoIiwiYXJjaGl0ZWN0dXJlLmJsb2NrQ291bnQiLCJhcmNoaXRlY3R1cmUuZXhwZXJ0Q291bnQiLCJhcmNoaXRlY3R1cmUuZXhwZXJ0VXNlZENvdW50IiwiYXJjaGl0ZWN0dXJlLmF0dGVudGlvbkhlYWRDb3VudCIsImFyY2hpdGVjdHVyZS5yb3BlU2NhbGluZ1R5cGUiLCJtZXRhZGF0YS5maWxlU2l6ZSIsInRva2VuaXplci5tb2RlbCIsImFyY2hpdGVjdHVyZS5hdHRlbnRpb25TbGlkaW5nV2luZG93Il0sIm9yZGVyIjpbIm1ldGFkYXRhLnBhcmFtZXRlcnMiLCJhcmNoaXRlY3R1cmUubWF4aW11bUNvbnRleHRMZW5ndGgiLCJhcmNoaXRlY3R1cmUuYXJjaGl0ZWN0dXJlIiwidG9rZW5pemVyLm1vZGVsIiwibWV0YWRhdGEuZmlsZVNpemUiLCJtZXRhZGF0YS5maWxlVHlwZURldGFpbCIsIm1ldGFkYXRhLmJpdHNQZXJXZWlnaHQiLCJ0b2tlbml6ZXIubWVyZ2VzTGVuZ3RoIiwidG9rZW5pemVyLnRva2Vuc1NpemUiLCJ0b2tlbml6ZXIubWVyZ2VzU2l6ZSIsImFyY2hpdGVjdHVyZS5lbWJlZGRpbmdMZW5ndGgiLCJhcmNoaXRlY3R1cmUuYmxvY2tDb3VudCIsImFyY2hpdGVjdHVyZS5leHBlcnRDb3VudCIsImFyY2hpdGVjdHVyZS5leHBlcnRVc2VkQ291bnQiLCJhcmNoaXRlY3R1cmUuYXR0ZW50aW9uSGVhZENvdW50IiwiYXJjaGl0ZWN0dXJlLnJvcGVTY2FsaW5nVHlwZSIsImFyY2hpdGVjdHVyZS5hdHRlbnRpb25TbGlkaW5nV2luZG93IiwibWV0YWRhdGEuZmlsZVNpemUiXSwibGVuZ3RoIjotMSwiY29uZmlnQ29sbGFwc2VkIjp0cnVlLCJzb3J0IjpbWzIsImRlc2MiXV19|GGUF Models Metadata Viewer]] : Un viewer des meta-données des modèles que j'essaye en local réalisé sans coder, juste assistant IA et "[[https://www.crackedaiengineering.com/ai-models/opencode-big-pickle|OpenCode Zen Big Pickle]]" et "[[https://mistral.ai/news/devstral-2-vibe-cli|Mistral Devstral 2]]".

  * [[https://qwen.ai|Qwen]]
    * [[https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF|Qwen2.5-Coder-7B-Instruct]]
      * layers=37, n_ctx_train=40960
      * avec 12Go ''<nowiki>--ctx-size 0</nowiki>''
    * Qwen2.5-coder-7b-instruct-q8_0.gguf
      * file: 8.1 Go, n_ctx=131072
      * avec RTX 5060 16Go ''<nowiki>--ctx-size 0</nowiki>'', nvidia-smi Memory-Usage 14920MiB / 16311MiB
    * [[https://huggingface.co/Qwen/Qwen3-8B-GGUF|Qwen3-8B]]
      * default context 40960, 37 layers
      * avec 12Go ''<nowiki>--ctx-size 0</nowiki>''
    * [[https://huggingface.co/Qwen/Qwen3-14B-GGUF|Qwen3-14B]]
      * default context 40960, 40 layers
      * avec 12Go ''<nowiki>--ctx-size 0  --n-gpu-layers 28</nowiki>''
    * [[https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct]]
      * layers=48, n_ctx_train=262144, n_embd=2048, n_rot=128, n_expert=128, n_expert_used=8, n_vocab=151936, n_merges=151387, max token length=256
      * avec 12Go ''<nowiki>--ctx-size 70000  --n-gpu-layers 23</nowiki>''
      * avec 12Go ''<nowiki>--ctx-size 40000  --n-gpu-layers 26</nowiki>''
  * DeepSeek2
    * [[https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF|DeepSeek-Coder-V2-Lite-Instruct]] **16B** by bartowski
      * **layers=28**, **n_ctx_train=163840**, **n_ctx_orig_yarn=4096**, n_embd=2048, n_rot=64, rope scaling=yarn
      * avec 12Go ''<nowiki>--ctx-size 30000  --n-gpu-layers 15</nowiki>''
    * [[https://huggingface.co/second-state/Deepseek-Coder-6.7B-Instruct-GGUF|Deepseek-Coder-6.7B-Instruct]] by second-state
      * default context 16384, 32 layers
      * avec 12 Go ''<nowiki>--ctx-size 0 --n-gpu-layers 30</nowiki>''
      * context trop petit pour projet code
  * Google DeepMind Gemma
    * [[https://huggingface.co/google/gemma-3-4b-it|google/gemma-3-4b-it]], entraîné Web Documents, 140 langages, Code, Mathematics, Images
    * [[https://huggingface.co/GetSoloTech/Gemma3-Code-Reasoning-4B-GGUF|GetSoloTech/Gemma3-Code-Reasoning-4B]]
    * [[https://huggingface.co/bartowski/burtenshaw_GemmaCoder3-12B-GGUF|GemmaCoder3-12B-IQ4_NL]]
      * file 8.4 Go, context 131k, 49 layers,
      * RTX3060 12Go:
        * ''<nowiki>--ctx-size 42000</nowiki>''
        * ''<nowiki>--ctx-size 70000 --n-gpu-layers 41</nowiki>''
      * RTX 5060 16Go:
        * ''<nowiki>--ctx-size 0</nowiki>'', ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1440.00 MiB on device 0: cudaMalloc failed: out of memory, alloc_tensor_range: failed to allocate CUDA0 buffer of size 1509949440
        * ''<nowiki>--ctx-size 0 --n-gpu-layers 42</nowiki>'', model loaded
        * ''<nowiki>--ctx-size 70000</nowiki>'', model loaded, nvidia-smi Memory-Usage: 13060MiB / 16311MiB
    * https://huggingface.co/bartowski/codegemma-7b-it-GGUF
  * Meta
    * [[https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF|Meta-Llama-3.1-8B-Instruct]] by bartowski
      * Cutting Knowledge Date: December 2023
      * layers=33, n_ctx_train=131072
      * Avec 12 Go
        * llama-cli ''<nowiki>--ctx-size 55000</nowiki>'' 
        * llama-server ''<nowiki>--ctx-size 50000</nowiki>'' 
        * ''<nowiki>--ctx-size 65000 --n-gpu-layers 29</nowiki>'' 
    * [[https://huggingface.co/bartowski/Llama-3-8B-Instruct-Coder-v2-GGUF|Llama-3-8B-Instruct-Coder-v2]] by bartowski
      * layers=33, n_ctx_train=8192
    * [[https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF|codellama-13b-instruct.Q4_K_M.gguf]]
      * file: 7.9 Go
      * layers=40, **n_ctx_train=16384**, n_embd=5120, n_rot=128, n_expert=0, n_merges=0, max token length=48, n_vocab=32016
      * ''<nowiki>--ctx-size 0 --n-gpu-layers 22</nowiki>''
      * RTX 5060 16Go
        * ''<nowiki>--ctx-size 0</nowiki>'', ggml_backend_cuda_buffer_type_alloc_buffer: allocating 12800.00 MiB on device 0: cudaMalloc failed: out of memory
        * ''<nowiki>--ctx-size 0 -n-gpu-layers 30</nowiki>'', model loaded
      * context trop petit pour projet code
    * [[https://huggingface.co/TheBloke/CodeLlama-13B-GGUF|CodeLlama-13B]]
      * layers=40, **n_ctx_train=16384**
      * context trop petit pour projet code
  * Mistral
    * Mistral-7B-Instruct-v0.3 https://huggingface.co/lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF
      * Q8_0 7.7 GB
    * Codestral-22B-v0.1 https://huggingface.co/lmstudio-community/Codestral-22B-v0.1-GGUF
      * Q5_K_M 15.7 GB, Q4_K_M 13.3 GB
    * Magistral-Small-2509 https://huggingface.co/bartowski/mistralai_Magistral-Small-2509-GGUF
      * Q4_1 14.9 GB, Q4_K_M 14.3 GB
    * Devstral-Small-2507 https://huggingface.co/unsloth/Devstral-Small-2507-GGUF
      * agentic LLM for software engineering tasks, finetuned from Mistral-Small-3.1, context window of up to 128k tokens
      * [[https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-and-run-llms/devstral-how-to-run-and-fine-tune|Devstral: How to Run & Fine-tune]]
      * Q4_K_XL 14.5 GB
    * [[https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1|Mamba-Codestral-7B-v0.1]]
      * Codestral on the Mamba2 architecture
  * VibeThinker-1.5B (Weibo)
    * https://huggingface.co/Mungert/VibeThinker-1.5B-GGUF
  * OpenAI
    * gpt-oss-20b https://huggingface.co/unsloth/gpt-oss-20b-GGUF
  * IBM
    * [[https://huggingface.co/ibm-granite/granite-4.0-h-tiny-GGUF|Granite-4.0-H-Tiny]]
      * layer=40, n_ctx=1048576 (**1M !**), model type=1B, model params=6.94 B, n_embd=1536, n_merges=100000, max token length=256, n_rot=128, n_expert=64, n_expert_used=6
      * RTX 3060 12 Go
        * ''<nowiki>--ctx-size 500000</nowiki>'', model loaded, nvidia-smi Memory-Usage 9766MiB/12288MiB
    * [[https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k|granite-8b-code-instruct-4k]] (May 6th, 2024)
    * [[https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k|granite-8b-code-instruct-128]]
    * Granite 2.0 Code Model [[https://huggingface.co/aifoundry-org/granite-8b-code-instruct-128k-Q4_K_M-GGUF|granite-8b-code-instruct-128k]]
    * granite-8b-code-instruct-128k-Q5_K_M.gguf
      * file=5.7 Go, n_ctx=128000, n_layer=36
      * RTX 5060 16 Go
        * ''<nowiki>--ctx-size 0</nowiki>'', model loading error: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 18000.00 MiB on device 0: cudaMalloc failed: out of memory, alloc_tensor_range: failed to allocate CUDA0 buffer of size 18874368000
        * ''<nowiki>--ctx-size 70000</nowiki>'', model loaded, nvidia-smi Memory-Usage 15710MiB/16311MiB
    * granite-8b-code-instruct-128k-Q4_K_M.gguf
      * file=4.9 Go, n_ctx=128000, n_layer=36
      * RTX 5060 16 Go
        * ''<nowiki>--ctx-size 0</nowiki>'', model loading error: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 18000.00 MiB on device 0: cudaMalloc failed: out of memory, alloc_tensor_range: failed to allocate CUDA0 buffer of size 18874368000
        * ''<nowiki>--ctx-size 70000</nowiki>'', model loaded, nvidia-smi Memory-Usage 14910MiB/16311MiB
      * RTX 3060 12 Go
        * ''<nowiki>--ctx-size 44000</nowiki>'', model loaded, nvidia-smi Memory-Usage 11136MiB/12288MiB


Plan de test de comparaison par LeChat de Mistral:

[[https://cdn-uploads.huggingface.co/production/uploads/64d1faaa1ed6649d70d1fa2f/jYT1Iq9Jv6vw8Cllr3DuX.png|{{https://cdn-uploads.huggingface.co/production/uploads/64d1faaa1ed6649d70d1fa2f/jYT1Iq9Jv6vw8Cllr3DuX.png?600}}]]

==== API service ====

Mistral
  * IHM: https://console.mistral.ai
  * https://codestral.mistral.ai/v1
    * codestral-2508, Our cutting-edge language model for coding released August 2025.
      * max_context_length=256000, default_model_temperature=0.3
      * capabilities: completion_chat, function_calling, completion_fim, <del>fine_tuning, vision, ocr, classification, moderation, audio</del>
  * https://api.mistral.ai/v1
    * devstral-2512, Official mistral-vibe-cli-latest Mistral AI model
      * max_context_length=262144, default_model_temperature=0.2
      * capabilities: completion_chat, function_calling, <del>completion_fim, fine_tuning, vision, ocr, classification, moderation, audio</del>

==== Autres usages ====

  * [[https://linagora.com/webinaire-openllm-lucie-un-modele-souverain-reellement-open-source|LUCIE, le modèle d’IA Open Source dédié à l’Éducation]]
    * [[https://openllm-france.fr/|Lucie-7B, notre premier modèle fondation entraîné à partir de zéro, est le plus gros modèle fondation qui a été entraîné sur plus de 30 % de données françaises]] sur openllm-france.fr
    * [[https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-human-data|Model Card for Lucie-7B-Instruct-human-data]]
  * [[https://eurollm.io/|EuroLLM, Large language model made in Europe built to support all official 24 EU languages]]
    * [[https://korben.info/eurollm-llm-europeen-local-ollama-laptop.html|EuroLLM - Le LLM européen qui tourne sur votre laptop]]
      * [[https://huggingface.co/utter-project/EuroLLM-9B|huggingface/utter-project/EuroLLM-9B]]
        * https://huggingface.co/bartowski/EuroLLM-9B-Instruct-GGUF
  * [[https://github.com/bofenghuang/vigogne/blob/main/docs/model.md|Vigogne]] modèles réentrainer en français (//2023//)
    * [[https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md|Voilà Voilà: Unleashing Vigogne Chat V2.0]]
  * [[https://www.channelnews.fr/avec-son-moteur-ia-ultra-leger-et-ultra-puissant-lighton-rend-la-deep-research-accessible-et-souveraine-148246|LightOn dévoile Reason-ModernColBERT]]
    * un modèle open source taillé pour la Deep Research et capable de battre des géants du retrieval avec seulement 150 millions de paramètres. L’entraînement complet ne prend que deux heures et moins de 100 lignes de code, ouvrant la voie à un fine-tuning rapide sur des corpus privés

===== Models servers =====

==== llama.cpp ====

https://github.com/ggml-org/llama.cpp

Lancer le serveur avec un modèle en local:
<code bash>
./bin/llama-server -m devstralQ5_K_M.gguf --port 8012 --jinja --ctx-size 20000

~/Code/bronx/AI_Coding/llama.cpp/build/bin/llama-server --port 8012 --chatml -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q8_0.gguf --ctx-size 48000
</code>

Quid des chat formats ? Est-ce lié au modèle ?
  * ''<nowiki>--jinja</nowiki>''
  * ''<nowiki>--chatml</nowiki>''
  * [[https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template|Templates supported by llama_chat_apply_template]]

<code>
$ llama-server --help
...
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3,
                                        gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense,
                                        hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
                                        llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
                                        mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch,
                                        openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss,
                                        smolvlm, vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)

...
</code>

Modèles:
  * Les models au format GGUF, en fichier ou url sur [[https://huggingface.co/|Hugging Face]], [[https://modelscope.cn/|ModelScope]]
  * [[https://github.com/ggml-org/llama.cpp#obtaining-and-quantizing-models|Obtaining and quantizing models]]

<code bash>
$ ./bin/llama-server --jinja -m ./Qwen3-Coder-30B-A3B-Instruct-Q5_K_S.gguf
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
</code>

Élargir la "context window" :
  * Tous les modèles ne supportent pas YaRN (vérifie la documentation).
  * YaRN améliore la gestion des longs textes, mais ne résout pas les problèmes de compréhension profonde
  * ''<nowiki>--rope-scaling {none,linear,yarn}</nowiki>'' RoPE frequency scaling method, defaults to linear unless specified by the model
  * ''<nowiki>--rope-scale N</nowiki>'' RoPE context scaling factor, expands context by a factor of N
  * ''<nowiki>--yarn-orig-ctx N</nowiki>''  YaRN: original context size of model (default: 0 = model training context size)

=== Compilation pour GPU ===

Il faut le compiler avec CUDA. Avec une version >= 11.7 pour [[https://github.com/ggml-org/llama.cpp/issues/11112|compatibilité syntaxe]].

  * [[https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda|Build llama.cpp with CUDA]]

J'ai [[https://linuxcapable.com/how-to-install-cuda-on-debian-linux/|installé CUDA]] le [[https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key|dépot Nvidia]] Cuda et cuda toolkit 13

<code>
$ sudo cat /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg]
 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /
</code>

en option ou @ spécifier pour le ''cmake build'' :
<code bash>
export PATH=$PATH:/usr/local/cuda-<version>/bin/
</code>

Ensuite une longue compilation :

<code>
# DCMAKE_CUDA_ARCHITECTURES :
# CUDA GPU Compute Capability https://developer.nvidia.com/cuda-gpus
# RTX 3060 : 86
# RTX 5060 : 120

$ export CUDA_VERSION=12.9 && cmake -B build -DGGML_CUDA=ON \
 -DCMAKE_CUDA_ARCHITECTURES="86;120" \
 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-${CUDA_VERSION}/bin/nvcc \
 -DCMAKE_INSTALL_RPATH="/usr/local/cuda-${CUDA_VERSION}/lib64;\$ORIGIN"

-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- CUDA Toolkit found
-- Using CUDA architectures: 86;120
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit:  6016d0bd4
-- Configuring done (0.5s)
-- Generating done (0.2s)
-- Build files have been written to: /home/cyrille/Code/bronx/AI_Coding/llama.cpp/build

$ time cmake --build build --config Release -j 10

# host: i7-1360P + SSD
...
real	44m35,149s
user	42m38,100s
sys	1m51,594s
...
# Avec `-j 10` (concurent tasks)
real	11m6,449s
user	104m56,615s
sys	3m45,431s
</code>

==== ollama ====

- https://ollama.com
- https://github.com/ollama/ollama

Chat & build with open models.

Interface utilisateur pour gérer et exécuter des modèles localement, utilise Llama.cpp sous le capot.

Sur linux install un service ''systemd''

==== koboldcpp ====

https://github.com/LostRuins/koboldcpp

==== vLLM ====

vLLM est une bibliothèque open-source optimisée pour servir efficacement des LLMs en production, à la différence de llama.cpp qui est pour le développement ou usage solo sur du matériel standard (RTX ou CPU).

  * https://docs.vllm.ai/en/stable/
  * https://github.com/vllm-project/vllm

==== NanoLLM ====

https://github.com/dusty-nv/NanoLLM

From nvidia ingenier "Dustin Franklin" @dustynv .

  * https://dusty-nv.github.io/NanoLLM/
  * https://www.jetson-ai-lab.com/tutorial_nano-llm.html


Todo
  * [[https://towardsdatascience.com/how-to-build-an-openai-compatible-api-87c8edea2f06/|How to build an OpenAI-compatible API]]

==== LiteLLM ====

https://github.com/BerriAI/litellm

==== Tabby ML ====

Est à la fois le serveur de model et l'[[#tabby|assistant de code]].

https://tabby.tabbyml.com/docs/quick-start/installation/linux/

Fourni llama.cpp.
===== Coding assistant =====

Agentic Capabilities LLMs.

  * [[https://cline.bot/blog/why-cline-doesnt-index-your-codebase-and-why-thats-a-good-thing|Why Cline Doesn't Index Your Codebase (And Why That's a Good Thing)]] ; Code isn't like other data: it's interconnected, constantly evolving ; Cline use AST (Abstract Syntax Tree)

Listes d'agents
  * [[https://apidog.com/fr/blog/ai-coding-agents-3/|Top 20 des agents de codage IA à essayer absolument en 2025]]

> La concurrence est rude entre les entreprises et startups de l’IA. Dernier terrain de bataille, les agents dédiés au développement web et à la programmation. Google, avec [[https://www.blogdumoderateur.com/comment-tester-jules-agent-ia-google-code/|Jules]] ; OpenAI, avec [[https://www.blogdumoderateur.com/openai-codex-agent-ia-automatise-code-chatgpt/|Codex]] ; GitHub, avec [[https://www.blogdumoderateur.com/github-nouvel-agent-code-ia/|Copilot]] ; Anthropic, avec [[https://www.blogdumoderateur.com/anthropic-devoile-claude-sonnet-3-7-claude-code/|Claude Code]], sans oublier les outils comme [[https://www.blogdumoderateur.com/tools/lovable/|Lovable]]. Au tour maintenant du Français Mistral de proposer un « assistant de programmation propulsé par l’IA ». Mais de quoi s’agit-il exactement ?
> -> [[https://www.blogdumoderateur.com/mistral-code-agent-ia-automatiser-developpement/|Mistral Code, un nouvel agent IA pour automatiser le développement logiciel]]
==== continue ====

https://docs.continue.dev/


==== Claude code ====

https://claude.com/product/claude-code

==== Synoptia THÉRÈSE Cli ====

THÉRÈSE (Terminal Helper for Engineering, Research, Editing, Software & Execution) est un assistant de code en ligne de commande, 100% français, inspiré de Claude Code mais propulsé par Mistral AI.

https://github.com/ludovicsanchez38-creator/Synoptia-THERESE-CLI

==== Shai ====

shai is a coding agent, your pair programming buddy that lives in the terminal. Written in rust with love <3 at OVH.

https://github.com/ovh/shai
==== opencode ====

  * https://opencode.ai
  * <del>https://github.com/sst/opencode</del> https://github.com/anomalyco/opencode (yep, encore changé de nom...)

Les prompts system:
  * https://github.com/sst/opencode/tree/dev/packages/opencode/src/session/prompt
  * config
    * agents https://opencode.ai/docs/agents/#json
    * modes https://opencode.ai/docs/modes/#json-configuration

[[https://opencode.ai/docs/models/|Modèles conseillés]] :
  * GPT 5.1
  * GPT 5.1 Codex
  * Claude Sonnet 4.5
  * Claude Haiku 4.5
  * Kimi K2
  * GLM 4.6
  * Qwen3 Coder
  * Gemini 3 Pro

Plus de choses [[/informatique/ai_coding/opencode|OpenCode]]

=== Essais de models ===

''opencode models'' liste les modèles disponibles sur les providers configurés. Bien pratique pour trouver le nom à mettre dans la config.

Modèles on-line essayés avec opencode.

  * Big Pickle (opencode zen) : résultats impressionants ! Un vrai super assistant
  * Codestral (mistral free)
    * baseURL: https://codestral.mistral.ai/v1
    * model : codestral-latest
  * Qwen3-Coder-30B-A3B-Instruct (ovhcloud) : ça fonctionne mais juste le minimum
  * OvhCloud pas stable 😩
    * //mistral-nemo-instruct-2407 (ovhcloud) : Pas de réponse//
    * //Mixtral-8x7B-Instruct-v0.1 (ovhcloud) : Bad request//
    * //Llama-3.1-8B-Instruct (ovhcloud) : Failed with "First, let me check the opencode documentation to see if there's any information about ..."//
    * //Meta-Llama-3_3-70B-Instruct (ovhcloud) : Failed with "Unknown agent type: greeting-responder is not a valid agent type"//

  * GemmaCoder3-12B
    * erreur format de conversation : "Conversation roles must alternate user/assistant/user..."

==== cline ====

  * https://github.com/cline/cline
  * https://docs.cline.bot


==== codex-cli ====

Par OpenAi

  * https://developers.openai.com/codex/cli
  * https://github.com/openai/codex

==== Cursor ====

Par Anysphere Inc

https://cursor.com/pricing

==== Tabby ====

Contient le [[#tabby_ml|serveur de model]] qu'il faut installer.

  * https://www.tabbyml.com/
  * source https://github.com/TabbyML/tabby
  * doc https://tabby.tabbyml.com/docs/

==== Gemini CLI ====

==== LLxprt Code ====

fork de Google's Gemini CLI

  * https://github.com/vybestack/llxprt-code
  * présentation: https://www.aitoolnet.com/fr/llxprt-code

==== Windsurf / Codeium ====

https://windsurf.com/editor

==== Amp Free ====

==== Tabnine ====

https://www.tabnine.com/

==== Mistral Vibe ====

Apache 2.0 license

  * https://docs.mistral.ai/mistral-vibe/introduction
  * https://github.com/mistralai/mistral-vibe

===== MCP server =====

Articles:
  * [[https://tighten.com/insights/let-ai-interact-with-your-app-via-mcp/|Let AI Interact with Your App via MCP]] (//show how to build an MCP server for a task management app//)

Curated lists:
  * [[https://github.com/rohitg00/awesome-devops-mcp-servers|rohitg00/awesome-devops-mcp-servers]] A curated list of awesome MCP servers focused on DevOps tools and capabilities.

==== Demo MCP Server ====

A collection of reference implementations for the Model Context Protocol (MCP), as well as references to community-built servers and additional resources.

  * https://modelcontextprotocol.io/docs/getting-started/intro
  * https://github.com/modelcontextprotocol/servers/

  * Everything - Reference / test server with prompts, resources, and tools.
  * Fetch - Web content fetching and conversion for efficient LLM usage.
  * Filesystem - Secure file operations with configurable access controls.
  * Git - Tools to read, search, and manipulate Git repositories.
  * Memory - Knowledge graph-based persistent memory system.
  * Sequential Thinking - Dynamic and reflective problem-solving through thought sequences.
  * Time - Time and timezone conversion capabilities.
  * ...

==== Serena ====

  * https://github.com/mcp/oraios/serena
  * https://apidog.com/fr/blog/serena-mcp-server-fr/

==== goose ====

A local, extensible, open source AI agent that automates engineering tasks.

  * https://block.github.io/goose/
  * https://github.com/block/goose
  * [[https://block.github.io/goose/docs/category/mcp-servers|integrate and use MCP servers as goose extensions]] like Selenium, Dev.to, BrowserBase, [[https://block.github.io/goose/docs/mcp/autovisualiser-mcp|Auto Visualiser]] ...

==== Apify MCP ====

Apify Actors scrape up-to-date web data from any website for AI apps and agents,
 social media monitoring, competitive intelligence, lead generation, and product research.
Crawl website to feed AI

  * https://apify.com/
  * https://mcp.apify.com/

==== arabold/docs-mcp-server ====

https://grounded.tools/
https://github.com/arabold/docs-mcp-server

==== context7 ====

https://context7.com/


==== laravel boost====

Dédié Php Laravel: https://laravel.com/ai/boost

==== Chrome DevTools MCP ====

https://github.com/ChromeDevTools/chrome-devtools-mcp/

chrome-devtools-mcp permet à votre agent de codage (tel que Gemini, Claude, Cursor ou Copilot) de contrôler et d'inspecter un navigateur Chrome en direct. Il agit comme un serveur MCP (Model-Context-Protocol), donnant à votre assistant de codage IA accès à toute la puissance de Chrome DevTools pour une automatisation fiable, un débogage approfondi et une analyse des performances.


===== LSP Server =====

Intelephense (php) https://intelephense.com/docs

Php Actor
  * https://github.com/phpactor/phpactor
  * https://phpactor.readthedocs.io

===== Vector database =====

  * ChromaDB léger, en mémoire ou sur disque
  * FAISS Facebook AI Similarity Search, optimisé pour la recherche de similarité
  * Qdrant open source, scalable

Solutions plus évoluées en SaaS
  * [[https://www.pinecone.io/|Pinecone]]
  * [[https://weaviate.io/|Weaviate]]
    * Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database​. 
    * https://github.com/weaviate/weaviate
    * https://weaviate.io/pricing


===== LLM Gateway =====

  * [[https://dev.to/varshithvhegde/bifrost-the-llm-gateway-thats-40x-faster-than-litellm-1763|Bifrost: The LLM Gateway That's 40x Faster Than LiteLLM]]

<code>
{
  "fallbacks": {
    "enabled": true,
    "order": [
      "openai/gpt-4o-mini",
      "anthropic/claude-sonnet-4",
      "mistral/mistral-large-latest"
    ]
  }
}
</code>

===== system message =====

Le "prompt système" est un élément essentiel : c'est la feuille de route pour le modèle, en définissant son comportement, ses limites, et même sa "personnalité". Son efficacité dépend de sa formulation et des spécificités du modèle.

  * Exemple de ''system message'' pour un chatbot: [[https://tighten.com/insights/build-private-self-hosted-ai-applications-with-ollama-and-laravel/?utm_source=newsletter&utm_medium=email&utm_campaign=freekdev-newsletter-193#demo-a-chatbot-for-super-spies|Demo: A Chatbot For Super-Spies!]]
  * Les [[https://github.com/sst/opencode/tree/dev/packages/opencode/src/session/prompt|System prompts de OpenCode]]