Ceci est une ancienne révision du document !

AI Coding

Coder avec une IA LLM.

Explications:

How to Run LLMs Locally: A Complete Step-by-Step Guide (2025-05-27) sur la quatification, format GGUF, Group Size Suffix (S/M/L)

Dans les news:

EuroLLM - Le LLM européen qui tourne sur votre laptop
- huggingface/utter-project/EuroLLM-9B
  - https://huggingface.co/bartowski/EuroLLM-9B-Instruct-GGUF
LightOn dévoile Reason-ModernColBERT
- un modèle open source taillé pour la Deep Research et capable de battre des géants du retrieval avec seulement 150 millions de paramètres. L’entraînement complet ne prend que deux heures et moins de 100 lignes de code, ouvrant la voie à un fine-tuning rapide sur des corpus privés

Quelques essais perso

Sur les perfs

Estimations

Devstral avec llama.cpp sur RTX 3060 12 Go.

by ChatGPT :

Modèle	Contexte (seq_len)	Batch_size recommandé	Remarques
—————–	——————	———————	—————————————-
Devstral Small 7B	1024	4	Très sûr, VRAM ample
Devstral Small 7B	2048	2‑3	Bon compromis vitesse/VRAM
Devstral Small 7B	4096	1‑2	VRAM presque saturée
Devstral 13B	1024	2	VRAM limitée
Devstral 13B	2048	1‑2	Optimal, attention VRAM
Devstral 13B	4096	1	VRAM saturée, offload CPU conseillé
Devstral 13B	8192	1	Possible mais contexte long → risque OOM

by LeChat:

contexte (tokens)	modèle (paramètres)	VRAM estimée (Go)	Batch size optimal	Latence estimée (tok/s)	Notes
512	7B	~5.5	8	15-25	Idéal pour des tâches courtes et rapides.
1024	7B	~6.0	4	10-20	Bon compromis pour des prompts moyens.
2048	7B	~7.0	2	5-15	Nécessite une gestion fine de la VRAM.
4096	7B	~8.5	1	3-10	Proche de la limite VRAM, risque de ralentissement.
512	13B	~9.0	4	8-15	Modèle plus gros, latence accrue.
1024	13B	~10.0	2	4-10	VRAM presque saturée, batch_size réduit.
2048	13B	~11.5	1	2-8	Risque élevé de dépassement VRAM, latence importante.

Online services

launch a opencode server :

opencode serve --port=30781 --print-logs --log-level DEBUG

Then prompt : “Explain async/await in JavaScript”

with:

time opencode run -m <ProviderId/ModelId> --attach=http://127.0.0.1:30781 --agent=plan "Explain async/await in JavaScript"

Attention, les résultats peuvent être très différents:

d'une simple phrase de définition à un exemple de code
aussi, le system message prompt est sélectionné par opencode …
- https://github.com/sst/opencode/issues/4861

ovhcloud/Qwen3-Coder-30B-A3B-Instruct = 2,008s / 3,100s
ovhcloud/gpt-oss-20b = 14,219s / 21,714s
ovhcloud/Mistral-Nemo-Instruct-2407 = abandon après 7 minutes d'attente …
opencode/big-pickle = 2,858s / 3,479s
mistral-codestral/codestral-latest = 2,320s / 3,427s

Cartes IA

Hailo

Hailo 8
packaging in a box with a Raspberry by SeedStudio, 26 TOPS, 15 GB RAM - $289

Axelera

Metis
Europa

seeedstudio

reComputer Mini J4012 is a tiny AI computer powered by NVIDIA® Jetson Orin™ NX 16GB module,delivering up to 100 TOPS AI performance - $900
reComputer J2022 - Edge AI Computer with NVIDIA® Jetson Xavier™ NX 16GB - $759

Ollama & Nvidia Jetpack

https://www.jetson-ai-lab.com/tutorial_ollama.html
pour plus de performance utiliser NanoLLM - Optimized LLM Inference
- Ollama uses llama.cpp for inference, which various API benchmarks and comparisons are provided for on the Llava page. It gets roughly half of peak performance versus the faster APIs like NanoLLM , but is generally considered fast enough for text chat.

Nvidia

A10
- https://askgeek.io/en/gpus/NVIDIA/NVIDIA-A10
- NVIDIA A10 vs RTX 3060

	A 10	A 30	A 40	A 100 SXM4	A 800	H 100 SMX5
Prix eBay	$2,330	$3,999	$9,950	$4,000	$20,000	$20,000
Architecture	Ampere	Ampere	Ampere	Ampere	Ampere	Hopper
Code name	GA102	GA100	GA102	GA100	GA100	GH100
Launch date	2021-04	2021-04	2020-10	2020-05	2022-11	2022-03
Maximum RAM	24 GB	24 GB	48 GB	40 GB	40 GB	96 GB
Memory type	GDDR6	HBM2e	GDDR6	HBM2e	HBM2e	HBM3
Memory bandwidth	600.2 GB/s	933.1 GB/s	695.8 GB/s	1555 GB/s	1.56 TB/s	1,681 GB/s
Memory bus width	384 bit	3072 bit	384 bit	5120 bit	5120 bit	5120 bit
Memory clock speed	1563 MHz	1215 MHz	1812 MHz	1215 MHz	1215 MHz	1313 MHz
Core clock speed	885 MHz	930 MHz	1305 MHz	1095 MHz	765 MHz	1837 MHz
Boost clock speed	1695 MHz	1440 MHz	1740 MHz	1410 MHz	1410 MHz	1665 MHz
Peak Half Precision (FP16)	31.24 TFLOPS (1:1)	10.32 TFLOPS (1:1)	37.42 TFLOPS (1:1)	77.97 TFLOPS (4:1)
Pipelines	9216	3584	10752	6912	6912	16896
Thermal Design Power	150 Watt	165 Watt	300 Watt	400 Watt	250 Watt	700 Watt
OpenCL	3.0	3.0	3.0		3.0

Cartes graphiques

Nvidia

RTX 3060
- CUDA GPU Compute Capability: 8.6
RTX 5060 TI 16 Go 475€ TTC chipset.fr
- CUDA GPU Compute Capability: 12.0
pny-rtx-5060ti-16go-overclocked 445€ TTC grosbill.com

gpu_bench

Adaptateur GPU externe

En anglais “GPU enclosures”. Nécessite un port Thunderbolt 3, 4 ou à venir 5.

egpu docks

Accelerating Machine Learning on a Linux Laptop with an External GPU by NVidia (Setting up Ubuntu to use NVIDIA eGPU)

eGPU

Models

Pour de l'assistance au code avec un GPU 16Go

Qwen
- Qwen2.5-Coder-7B-Instruct
  - https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
    - F16 15.2 GB, Q8_0 8.1 GB
- Qwen3-14B
  - https://huggingface.co/unsloth/Qwen3-14B-GGUF
    - Q8_0 15.7 GB, Q6_K_XL 13.3 GB
DeepSeek2
- DeepSeek-Coder-V2-Lite-Instruct
  - https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF
    - Q6_K_L 14.6 GB, Q6_K 14.1 GB
Google
- Gemma3 https://huggingface.co/bartowski/burtenshaw_GemmaCoder3-12B-GGUF
  - Q8_0 12.5 GB
- codegemma-7b-it https://huggingface.co/bartowski/codegemma-7b-it-GGUF
  - Q8_0 9.08 GB
Lama3
- https://huggingface.co/bartowski/Llama-3-8B-Instruct-Coder-v2-GGUF
  - Q8_0 8.54 GB
Mistral
- Mistral-7B-Instruct-v0.3 https://huggingface.co/lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF
  - Q8_0 7.7 GB
- Codestral-22B-v0.1 https://huggingface.co/lmstudio-community/Codestral-22B-v0.1-GGUF
  - Q5_K_M 15.7 GB, Q4_K_M 13.3 GB
- Magistral-Small-2509 https://huggingface.co/bartowski/mistralai_Magistral-Small-2509-GGUF
  - Q4_1 14.9 GB, Q4_K_M 14.3 GB
- Devstral-Small-2507 https://huggingface.co/unsloth/Devstral-Small-2507-GGUF
  - agentic LLM for software engineering tasks, finetuned from Mistral-Small-3.1, context window of up to 128k tokens
  - Devstral: How to Run & Fine-tune
  - Q4_K_XL 14.5 GB
VibeThinker-1.5B (Weibo)
- https://huggingface.co/Mungert/VibeThinker-1.5B-GGUF
  - BF16 3.56 GB, F16_Q 2.77 GB
OpenAI
- gpt-oss-20b https://huggingface.co/unsloth/gpt-oss-20b-GGUF
  - F16 13.8 GB,

Plan de test de comparaison :

LeChat https://chat.mistral.ai/chat/c97f1761-ca39-4b0c-98e4-e8514d9567b9

Models servers

llama.cpp

https://github.com/ggml-org/llama.cpp

Lancer le serveur avec un modèle en local:

./bin/llama-server -m devstralQ5_K_M.gguf --port 8012 --jinja --ctx-size 20000

Models:

Les models au format GGUF, en fichier ou url sur Hugging Face, ModelScope
Obtaining and quantizing models

$ ./bin/llama-server --jinja -m ./Qwen3-Coder-30B-A3B-Instruct-Q5_K_S.gguf
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized

Avec GPU

Il faut le compiler avec CUDA. Avec une version >= 11.7 pour compatibilité syntaxe.

J'ai installé CUDA le dépot Nvidia Cuda et cuda toolkit 13

$ cat /etc/apt/sources.list.d/nvidia-cuda.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /

et aussi

export PATH=$PATH:/usr/local/cuda-13.0/bin/

puis une très longue compilation avec :

# CUDA GPU Compute Capability https://developer.nvidia.com/cuda-gpus
# RTX 3060 : 86
# RTX 5060 : 120
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;120"
cmake --build build --config Release

ollama

- https://ollama.com - https://github.com/ollama/ollama

Chat & build with open models

koboldcpp

https://github.com/LostRuins/koboldcpp

vllm

NanoLLM

https://github.com/dusty-nv/NanoLLM

From nvidia ingenier “Dustin Franklin” @dustynv .

Todo

How to build an OpenAI-compatible API

LiteLLM

https://github.com/BerriAI/litellm

Tabby ML

Est à la fois le serveur de model et l'assistant de code.

https://tabby.tabbyml.com/docs/quick-start/installation/linux/

Coding assistant

Agentic Capabilities LLMs.

continue

https://docs.continue.dev/

Claude code

https://claude.com/product/claude-code

opencode

Les prompts system:

😩 Attention au contenu du fichier configuration opencode.json, la moindre erreur n'est pas signalée, mais pose des problèmes.

Essais de models

opencode models list les modèles disponibles sur les providers configurés. Bien pratique pour trouver le nom à mettre dans la config.

Modèles essayés avec opencode.

Big Pickle (opencode zen) : résultats impressionants ! Un vrai super assistant
Codestral (mistral free)
- baseURL: https://codestral.mistral.ai/v1
- model : codestral-latest
Qwen3-Coder-30B-A3B-Instruct (ovhcloud) : ça fonctionne mais juste le minimum
mistral-nemo-instruct-2407 (ovhcloud) : Pas de réponse
Mixtral-8x7B-Instruct-v0.1 (ovhcloud) : Bad request
Llama-3.1-8B-Instruct (ovhcloud) : Failed with “First, let me check the opencode documentation to see if there's any information about …”
Meta-Llama-3_3-70B-Instruct (ovhcloud) : Failed with “Unknown agent type: greeting-responder is not a valid agent type”

cline

codex-cli

Par OpenAi

Tabby

Contient le serveur de model qu'il faut installer.

Gemini CLI

LLxprt Code

Windsurf

Amp Free

MCP server

Articles:

Let AI Interact with Your App via MCP (show how to build an MCP server for a task management app)

Serena

chrome-devtools-mcp permet à votre agent de codage (tel que Gemini, Claude, Cursor ou Copilot) de contrôler et d'inspecter un navigateur Chrome en direct. Il agit comme un serveur MCP (Model-Context-Protocol), donnant à votre assistant de codage IA accès à toute la puissance de Chrome DevTools pour une automatisation fiable, un débogage approfondi et une analyse des performances.

LSP Server

Intelephense (php) https://intelephense.com/docs

Php Actor

system message

Exemple de system message pour un chatbot:

Demo: A Chatbot For Super-Spies!

Cyrille Giquello

Table des matières