Table des matières

AI Language Model

AI Language Model

Les modèles d’intelligence artificielle (IA), des simples algorithmes de régression jusqu’aux réseaux neuronaux complexes utilisés dans l’apprentissage profond, fonctionnent grâce à une logique mathématique. Toutes les données utilisées par un modèle d’intelligence artificielle, y compris les données non structurées comme le texte, l’audio ou les images, doivent être exprimées sous forme numérique. Le plongement vectoriel, ou représentation vectorielle, est une méthode qui permet de convertir un point de données non structuré en un tableau de nombres, tout en conservant la signification originale des données.

Articles:

introduction aux LLM : Démystifier les (LLM) et comment les mettre en œuvre chez vous pour les étudier de plus près
How to Run LLMs Locally: A Complete Step-by-Step Guide (2025-05-27) sur la quatification, format GGUF, Group Size Suffix (S/M/L)
J’ai lancé un mini ChatGPT en local sur mon CPU avec GPT4All
Ajouter un modèle au format ggml dans GPT4All sur Linux Ubuntu

Autres pages:

AI NLP (Natural Language Processing / traitement automatique du langage naturel)
AI Coding
AI Image
GPU Benchmarks
AI Vision
AI Agent

Glossaire

LLM/SLM Large Language Model / Small Language Model
LoRA (Low-Rank Adaptation): une méthode permet de “spécialiser” un peu un modèle est ajoutant des couches légères (qlqs Mo) et adaptables à un modèle pré-entraîné, au lieu de modifier tous ses poids ; 0,1% à 1% des paramètres du modèle sont entraînés. Le modèle de base reste inchangé, LoRA peut être désactivé.
- Axolotl A Free and Open Source LLM Fine-tuning Framework
techniques pour améliorer la gestion des longs textes
- YaRN (Yet Another Recurrent Network): une technique pour améliorer la gestion des textes longs par les LLMs sans nécessiter de réentraînement complet ou de modifications lourdes, seules quelques couches sont ajustées. Vise à étendre la fenêtre de contexte (context window) des LLMs, passer de 4k à 128k tokens. Autres techniques: ALiBi, NTK-aware scaling.
- RoPE (Rotary Position Embedding): technique pour intégrer des informations de position dans les séquences de tokens, tout en permettant une meilleure généralisation à des longueurs de texte variables. Contrairement aux méthodes comme les embeddings de position absolus (BERT), RoPE utilise une représentation relative et rotative des positions qui améliore la capacité des modèles à comprendre les relations entre les tokens, même sur de longues distances.
- RoPE + YaRN : RoPEE Fournit la base mathématique pour comprendre les positions relatives et YaRN étend cette base pour permettre des fenêtres de contexte encore plus grandes
- ALiBi … technique plus ancienne que les 2 précédentes …
GGUF (GPT-Generated Unified Format): format binaire optimisé pour l’inférence (exécution de modèles), développé par la communauté open-source, notamment par ggerganov le créateur de Llama.cpp. Remplace l’ancien format GGML.
- afficher les metadata du fichier: gpustack/gguf-parser-go
- Autres formats: PyTorch, ONNX, TensorRT, GGML (déprécié)
MoE (Mixture of Experts): architecture de modèle où plusieurs “experts” (sous-réseaux spécialisés) sont activés de manière conditionnelle pour traiter différentes parties des données.
- Le modèle est composé de plusieurs sous-réseaux appelés “experts”.
- Un générateur de sélection (router) détermine quels experts utiliser pour chaque entrée.
- permet de réduire le coût de calcul en ne passant les données que par un sous-ensemble des experts.
MCP Model Context Protocol, voir MCP Server
RAG (Retrieval-Augmented Generation): combine deux capacités de l’IA → la récupération d’informations et la génération de texte.
- ReRanking (nettoyage intelligent) consiste à réévaluer et réorganiser les résultat de la phase de retrieval (RAG) pour ne garder que les éléments les plus pertinents et supprimer les redondances
Agents IA
CoT Chain of Thought - Un modèle en mode CoT répond en exposant ses étapes de raisonnement, en mode no CoT il répond directement

Classification de modèles ouverts: Foundation models by Ibm

Comment faire pour qu'un appel à un LLM ait un résultat reproductible d'une fois sur l'autre ?

Hugging Face entreprise française créée en 2016 → L'IA open source par Hugging Face - Gen AI Nantes 2024-01 par Julien Simon

Sur les perfs

GPU Benchmarks

Installer nvidia-drivers et CUDA.

Online services

launch a opencode server :

opencode serve --port=30781 --print-logs --log-level DEBUG

Then prompt : “Explain async/await in JavaScript”

with:

time opencode run -m <ProviderId/ModelId> --attach=http://127.0.0.1:30781 --agent=plan "Explain async/await in JavaScript"

👾 Attention, les résultats peuvent être très différents:

d'une simple phrase de définition à un exemple de code
- mais je n'ai pas modifier la taille du context, ce qui a une grande importance sur la taille/qualité de la réponse …
aussi, le system message prompt est sélectionné par opencode …
- https://github.com/sst/opencode/issues/4861

ovhcloud/Qwen3-Coder-30B-A3B-Instruct = 2,008s / 3,100s
ovhcloud/gpt-oss-20b = 14,219s / 21,714s
ovhcloud/Mistral-Nemo-Instruct-2407 = abandon après 7 minutes d'attente …
ovhcloud/DeepSeek-R1-Distill-Llama-70B = 22,301s / 29,187s
opencode/big-pickle = 2,858s / 3,479s
mistral-codestral/codestral-latest = 2,320s / 3,427s

Cartes IA

Hailo

Hailo 8
packaging in a box with a Raspberry by SeedStudio, 26 TOPS, 15 GB RAM - $289

Axelera

Metis
Europa

seeedstudio

reComputer Mini J4012 is a tiny AI computer powered by NVIDIA® Jetson Orin™ NX 16GB module,delivering up to 100 TOPS AI performance - $900
reComputer J2022 - Edge AI Computer with NVIDIA® Jetson Xavier™ NX 16GB - $759

Ollama & Nvidia Jetpack

https://www.jetson-ai-lab.com/tutorial_ollama.html
pour plus de performance utiliser NanoLLM - Optimized LLM Inference
- Ollama uses llama.cpp for inference, which various API benchmarks and comparisons are provided for on the Llava page. It gets roughly half of peak performance versus the faster APIs like NanoLLM , but is generally considered fast enough for text chat.

Nvidia

A10
- https://askgeek.io/en/gpus/NVIDIA/NVIDIA-A10
- NVIDIA A10 vs RTX 3060

	A 10	A 30	A 40	A 100 SXM4	A 800	H 100 SMX5
Prix eBay	$2,330	$3,999	$9,950	$4,000	$20,000	$20,000
Architecture	Ampere	Ampere	Ampere	Ampere	Ampere	Hopper
Code name	GA102	GA100	GA102	GA100	GA100	GH100
Launch date	2021-04	2021-04	2020-10	2020-05	2022-11	2022-03
Maximum RAM	24 GB	24 GB	48 GB	40 GB	40 GB	96 GB
Memory type	GDDR6	HBM2e	GDDR6	HBM2e	HBM2e	HBM3
Memory bandwidth	600.2 GB/s	933.1 GB/s	695.8 GB/s	1555 GB/s	1.56 TB/s	1,681 GB/s
Memory bus width	384 bit	3072 bit	384 bit	5120 bit	5120 bit	5120 bit
Memory clock speed	1563 MHz	1215 MHz	1812 MHz	1215 MHz	1215 MHz	1313 MHz
Core clock speed	885 MHz	930 MHz	1305 MHz	1095 MHz	765 MHz	1837 MHz
Boost clock speed	1695 MHz	1440 MHz	1740 MHz	1410 MHz	1410 MHz	1665 MHz
Peak Half Precision (FP16)	31.24 TFLOPS (1:1)	10.32 TFLOPS (1:1)	37.42 TFLOPS (1:1)	77.97 TFLOPS (4:1)
Pipelines	9216	3584	10752	6912	6912	16896
Thermal Design Power	150 Watt	165 Watt	300 Watt	400 Watt	250 Watt	700 Watt
OpenCL	3.0	3.0	3.0		3.0

NVIDIA DGX Spark ($4,000) : GB10 Grace Blackwel, 1 FP4 PFLOPS, 128GB, ConnectX-7 Smart NIC, 4TB NVME.M2 with self-encryption
Jetson AI products
- Jetson AGX Orin™ 64GB, 275 TOPS, 2500 €
- Jetson Thor: Blackwell GPU, 128 GB, 2070 FP4 TFLOPS, £3200

Cartes graphiques

Nvidia

RTX 3060
- CUDA GPU Compute Capability: 8.6
RTX 5060 TI 16 Go 475€ TTC chipset.fr
- CUDA GPU Compute Capability: 12.0
pny-rtx-5060ti-16go-overclocked 445€ TTC grosbill.com

gpu_bench

Tips: Reset nvidia et CUDA:

# éteindre la carte
# débrancher THB
$ sudo rmmod nvidia_uvm nvidia

Adaptateur GPU externe

En anglais “GPU enclosures”. Nécessite un port Thunderbolt 3, 4 ou à venir 5.

egpu docks

Accelerating Machine Learning on a Linux Laptop with an External GPU by NVidia (Setting up Ubuntu to use NVIDIA eGPU)

eGPU

Models servers

llama.cpp

https://github.com/ggml-org/llama.cpp

Lancer le serveur avec un modèle en local:

./bin/llama-server -m devstralQ5_K_M.gguf --port 8012 --jinja --ctx-size 20000
./bin/llama-server --port 8012 --chatml -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q8_0.gguf --ctx-size 48000

nouveautés hiver 2025-26:

la répartition automatique entre GPU et CPU, plus besoin de gérer –n-gpu-layers
host-memory prompt caching : ~~j'ai des scripts qui se sont mis à planter à cause de réponse avec content vide et reasoning_content archi plein. L'utilisation de l'option –cache-ram 0 semble résoudre ces plantages.~~

chat templates

Quid des chat formats ? Est-ce lié au modèle ?

--jinja
--chatml
Templates supported by llama_chat_apply_template

$ llama-server --help
...
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3,
                                        exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2,
                                        hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys,
                                        llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm,
                                        mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7,
                                        mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3,
                                        phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca,
                                        yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)

...

models GGUF format

Modèles:

Les models au format GGUF, en fichier ou url sur Hugging Face, ModelScope
Obtaining and quantizing models

$ ./bin/llama-server --jinja -m ./Qwen3-Coder-30B-A3B-Instruct-Q5_K_S.gguf
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized

Élargir la “context window” :

Tous les modèles ne supportent pas YaRN (vérifie la documentation).
YaRN améliore la gestion des longs textes, mais ne résout pas les problèmes de compréhension profonde
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by the model
--rope-scale N RoPE context scaling factor, expands context by a factor of N
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training context size)

Compilation pour GPU

Il faut le compiler avec CUDA. Avec une version >= 11.7 pour compatibilité syntaxe.

Build llama.cpp with CUDA

J'ai installé CUDA le dépot Nvidia Cuda et cuda toolkit 13

$ sudo cat /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg]
 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /

Ma dernière installation :

sudo apt install nvidia-headless-590-open nvidia-utils-590 nvidia-cuda-toolkit nvidia-cuda-dev
 
Package: nvidia-headless-590-open
Version: 590.48.01-0ubuntu0.24.04.1
APT-Sources: http://fr.archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages
 
Package: nvidia-cuda-toolkit
Version: 12.0.140~12.0.1-4build4
APT-Sources: http://fr.archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages
 
# Je ne comprends pas j'ai pourtant un /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
# qui pointe sur /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list

en option ou @ spécifier pour le cmake build :

export PATH=$PATH:/usr/local/cuda-<version>/bin/

Ensuite une longue compilation :

# DCMAKE_CUDA_ARCHITECTURES :
# CUDA GPU Compute Capability https://developer.nvidia.com/cuda-gpus
# RTX 3060 : 86
# RTX 5060 : 120

$ export CUDA_VERSION=12.9
$ export CUDA_VERSION=13.3
$ cmake -B build -DGGML_CUDA=ON \
 -DCMAKE_CUDA_ARCHITECTURES="86;120" \
 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-${CUDA_VERSION}/bin/nvcc \
 -DCMAKE_INSTALL_RPATH="/usr/local/cuda-${CUDA_VERSION}/lib64;\$ORIGIN"

-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- CUDA Toolkit found
-- Using CUDA architectures: 86;120
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit:  6016d0bd4
-- Configuring done (0.5s)
-- Generating done (0.2s)
-- Build files have been written to: /home/cyrille/Code/bronx/AI_Coding/llama.cpp/build

$ time cmake --build build --clean-first --config Release -j 10

# host: i7-1360P + SSD
...
real	44m35,149s
user	42m38,100s
sys	1m51,594s
...
# Avec `-j 10` (concurent tasks)
real	11m6,449s
user	104m56,615s
sys	3m45,431s
# Plus récemment
real	6m35,663s
user	61m37,436s
sys	2m37,613s

# host: Core(TM) Ultra 7 270K Plus
real	3m6.637s
user	27m13.877s
sys	1m24.687s

Compilation pour CPU (SYCL)

Linux OneApi toolkit

https://www.intel.com/content/www/us/en/docs/oneapi-toolkit/installation-guide-linux/latest/install-oneapi-toolkit-with-apt.html
- 71 paquets pour 2.3 Go
- Relire https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#i-setup-environment pour n'installer que les paquets nécessaire

Par défaut intel-oneapi-toolkit installe tout ce monde :

intel-oneapi-ccl-2022.0 intel-oneapi-ccl-devel intel-oneapi-ccl-devel-2022.0 intel-oneapi-common-licensing intel-oneapi-common-licensing-2026.0
intel-oneapi-common-oneapi-vars intel-oneapi-common-oneapi-vars-2026.0 intel-oneapi-common-vars intel-oneapi-compiler-cpp-eclipse-cfg-2026.0
intel-oneapi-compiler-dpcpp-cpp intel-oneapi-compiler-dpcpp-cpp-2026.0 intel-oneapi-compiler-dpcpp-cpp-common-2026.0
intel-oneapi-compiler-dpcpp-cpp-runtime-2026.0 intel-oneapi-compiler-dpcpp-eclipse-cfg-2026.0 intel-oneapi-compiler-fortran-2026.0
intel-oneapi-compiler-fortran-common-2026.0 intel-oneapi-compiler-fortran-runtime-2026.0 intel-oneapi-compiler-shared-2026.0
intel-oneapi-compiler-shared-common-2026.0 intel-oneapi-compiler-shared-runtime-2026.0 intel-oneapi-dev-utilities intel-oneapi-dev-utilities-2026.0
intel-oneapi-dev-utilities-eclipse-cfg-2026.0 intel-oneapi-dnnl-2026.0 intel-oneapi-dnnl-devel intel-oneapi-dnnl-devel-2026.0
intel-oneapi-dpcpp-cpp-2026.0 intel-oneapi-dpcpp-debugger-2026.0 intel-oneapi-icc-eclipse-plugin-cpp-2026.0 intel-oneapi-ipp-2026.0
intel-oneapi-ipp-devel intel-oneapi-ipp-devel-2026.0 intel-oneapi-ippcp-2026.0 intel-oneapi-ippcp-devel intel-oneapi-ippcp-devel-2026.0
intel-oneapi-libdpstd-devel-2022.12 intel-oneapi-mkl-classic-devel-2026.0 intel-oneapi-mkl-classic-include-2026.0 intel-oneapi-mkl-cluster-2026.0
intel-oneapi-mkl-cluster-devel-2026.0 intel-oneapi-mkl-core-2026.0 intel-oneapi-mkl-core-devel-2026.0 intel-oneapi-mkl-devel
intel-oneapi-mkl-devel-2026.0 intel-oneapi-mkl-sycl-2026.0 intel-oneapi-mkl-sycl-blas-2026.0 intel-oneapi-mkl-sycl-data-fitting-2026.0
intel-oneapi-mkl-sycl-devel-2026.0 intel-oneapi-mkl-sycl-dft-2026.0 intel-oneapi-mkl-sycl-include-2026.0 intel-oneapi-mkl-sycl-lapack-2026.0
intel-oneapi-mkl-sycl-rng-2026.0 intel-oneapi-mkl-sycl-sparse-2026.0 intel-oneapi-mkl-sycl-stats-2026.0 intel-oneapi-mkl-sycl-vm-2026.0
intel-oneapi-mpi-2021.18 intel-oneapi-mpi-devel intel-oneapi-mpi-devel-2021.18 intel-oneapi-openmp-2026.0 intel-oneapi-openmp-common-2026.0
intel-oneapi-tbb-2023.0 intel-oneapi-tbb-devel intel-oneapi-tbb-devel-2023.0 intel-oneapi-tcm-1.5 intel-oneapi-tlt intel-oneapi-tlt-2026.0
intel-oneapi-toolkit intel-oneapi-toolkit-env-2026.0 intel-oneapi-toolkit-getting-started-2026.0 intel-oneapi-umf-1.1 intel-oneapi-vtune

$ source /opt/intel/oneapi/setvars.sh
$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-1360P OpenCL 3.0 (Build 0) [2026.21.3.0.31_160000]

En fait ça ne va pas car

$ ./llama-ls-sycl-device
./llama-ls-sycl-device: error while loading shared libraries: libsycl.so.8: cannot open shared object file: No such file or directory
 
# Probleme de version 😩
$ find /opt/intel/oneapi -name "libsycl.so*"
/opt/intel/oneapi/2026.0/lib/libsycl.so.9.0.0
/opt/intel/oneapi/2026.0/lib/libsycl.so.9.0.0-gdb.py
/opt/intel/oneapi/2026.0/lib/libsycl.so
/opt/intel/oneapi/2026.0/lib/libsycl.so.9
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9.0.0
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9.0.0-gdb.py
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9

Ok, passe à la compilation comme expliqué sur https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#ii-build-llamacpp pour que le binaire utilise la version SYCL installée par intel-oneapi-toolkit.

./examples/sycl/build.sh

Compilation sans erreur, mais … “what(): can not find preferred GPU platform” 😩

$ ./build/bin/llama-ls-sycl-device
# idem avec
$ ./build/bin/llama-bench -p 0 -n 128,256,512

[New LWP 35410]
[New LWP 35409]
[New LWP 35408]
[New LWP 35407]
[New LWP 35406]
[New LWP 35405]
[New LWP 35404]
[New LWP 35403]
[New LWP 35402]
[New LWP 35401]
[New LWP 35400]
[New LWP 35399]
[New LWP 35398]
[New LWP 35397]
[New LWP 35396]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
...
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce nom
#0  0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000079304e48aa1a in ggml_print_backtrace () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
#2  0x000079304e4a3d76 in ggml_uncaught_exception() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
#3  0x000079304acbb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000079304aca5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000079304acbb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x000079304b19e765 in dpct::dev_mgr::dev_mgr() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
#7  0x000079304b16e8f3 in ggml_backend_sycl_print_sycl_devices () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
#8  0x0000000000405527 in main ()
[Inferior 1 (process 35394) detached]
terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
PLEASE submit a bug report to https://software.intel.com/en-us/support/priority-support and include the crash backtrace and instructions to reproduce the bug.
Abandon (core dumped)

Et fait un reboot puis ça fonctionne. Les perfs: 2.6 plus rapide que sans SYCL (36.34 vs 13.94).

mistral.rs

Aucun rapport avec Mistral.ai

https://github.com/EricLBuehler/mistral.rs

Any Hugging Face model, zero config
True multimodality: Text, vision, video, and audio, speech generation, image generation, and embeddings in one engine.
Smart quantization
Built-in web UI
Hardware-aware
Flexible SDKs: Python package and Rust crate to build your projects.
Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.

À l'installation :

la compilation est très longue (743 fichiers) et s'accapare toute la puissance de la machine…
brancher le eGpu avant, sinon faudra re-installer 😩
- ça va activer flash-attn et la compilation de candle-flash-attn peut prendre 45 minutes !!!

ollama

- https://ollama.com - https://github.com/ollama/ollama

Chat & build with open models.

Interface utilisateur pour gérer et exécuter des modèles localement, utilise Llama.cpp sous le capot.

Sur linux install un service systemd

koboldcpp

A single self-contained distributable that builds off llama.cpp and adds many additional powerful features

https://github.com/LostRuins/koboldcpp

vLLM

vLLM est une bibliothèque open-source optimisée pour servir efficacement des LLMs en production, à la différence de llama.cpp qui est pour le développement ou usage solo sur du matériel standard (RTX ou CPU).

NanoLLM

https://github.com/dusty-nv/NanoLLM

From nvidia ingenier “Dustin Franklin” @dustynv .

Todo

How to build an OpenAI-compatible API

ZML

https://github.com/zml/zml/

Réduction de tokens

Headroom

Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.
https://headroom-docs.vercel.app/docs
https://github.com/chopratejas/headroom
https://www.lemondeinformatique.fr/actualites/lire-headroom-un-projet-open-source-pour-reduire-la-facture-des-tokens-100357.html

RTK

CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies
https://www.rtk-ai.app/
https://github.com/rtk-ai/rtk

Openwolf

Sharper context. Fewer tokens. Open-source middleware for Claude Code.
https://openwolf.com/
https://github.com/cytostack/openwolf