====== AI Vision ======
There is YOLO and plenty of purpose-built tools for detecting objects in images. Here I am instead testing general multimodal models, with no task-specific training.
===== Experiment =====
The prompt asks whether there are solar panels in the supplied image, together with a bounding box for each one, and, if "yes", asks for the geographic coordinates of the detected object. Combining the two instructions helps weed out false positives.
For example, the model finds a solar panel in the image below but cannot come up with geographic coordinates, so the detection can be dropped from the positives.
{{:informatique:ai_lm:ai_vision:champ-avec-rayures_18-131487-91478.jpeg?direct&140|field with stripes}}
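A minimal sketch of that filtering step, assuming the model is asked to answer with a JSON object carrying ''bbox'' and ''coordinates'' fields (a hypothetical format, not the exact prompt used here):
<code bash>
# Keep a detection only when the reply contains BOTH fields;
# answer.json stands for the model's JSON reply for one image (hypothetical).
if jq -e '.bbox and .coordinates' answer.json > /dev/null; then
  echo "positive"
else
  echo "discarded: likely false positive"
fi
</code>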
===== llama.cpp =====
* [[https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd|Multimodal Support in llama.cpp]]
Requires a multimodal model and a matching ''mmproj'' file.
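The CLI is built along with the rest of llama.cpp; a minimal build sketch, assuming a checkout in ''~/llama.cpp'' and an NVIDIA GPU (both consistent with the paths and CUDA logs below):
<code bash>
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON           # CUDA backend, optional
cmake --build build --config Release -j
# the tool ends up in build/bin/llama-mtmd-cli
</code>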
With **llama-mtmd-cli** and **gemma-3-4b-it**:
* [[https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf|gemma-3-4b-it-Q4_K_M.gguf]]
* [[https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/resolve/main/mmproj-model-f16.gguf|mmproj-model-f16.gguf]]
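For example, fetching both files into ''~/Data'' (the directory used in the runs below):
<code bash>
# Model weights and the matching multimodal projector (URLs from the links above)
wget -P ~/Data https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf
wget -P ~/Data https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/resolve/main/mmproj-model-f16.gguf
</code>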
<code>
# gemma-3-4b-it-UD-Q8_K_XL
$ time ~/llama.cpp/build/bin/llama-mtmd-cli --log-timestamps \
-m ~/Data/gemma-3-4b-it-UD-Q8_K_XL.gguf \
--mmproj ~/Data/mmproj-model-f16.gguf \
--image ~/Data/screenshot_20260214-141126.png -p 'Vois tu des panneaux solaires sur cette image ?'
main: loading model: ~/Data/gemma-3-4b-it-UD-Q8_K_XL.gguf
WARN: This is an experimental CLI for testing multimodal capability.
For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 789 ms
decoding image batch 1/1, n_tokens_batch = 256
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 109.44 ms, sched copies = 1
image decoded (batch 1/1) in 201 ms
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 188.38 ms, sched copies = 1
Oui, je vois des panneaux solaires sur l'image. Ils sont disposés sur le toit du bâtiment principal au centre de l'image.
llama_perf_context_print: load time = 2846.21 ms
llama_perf_context_print: prompt eval time = 852.69 ms / 278 tokens ( 3.07 ms per token, 326.03 tokens per second)
llama_perf_context_print: eval time = 542.06 ms / 30 runs ( 18.07 ms per token, 55.34 tokens per second)
llama_perf_context_print: total time = 2344.07 ms / 308 tokens
llama_perf_context_print: graphs reused = 29
real 0m8,165s
user 0m4,880s
sys 0m3,269s
# gemma-3-4b-it-Q4_K_M.gguf
Oui, je vois des panneaux solaires sur l'image. Ils sont installés sur le toit du bâtiment principal, qui est une grande structure rectangulaire.
llama_perf_context_print: load time = 1614.35 ms
llama_perf_context_print: prompt eval time = 856.85 ms / 278 tokens ( 3.08 ms per token, 324.45 tokens per second)
llama_perf_context_print: eval time = 359.10 ms / 33 runs ( 10.88 ms per token, 91.90 tokens per second)
llama_perf_context_print: total time = 2049.84 ms / 311 tokens
llama_perf_context_print: graphs reused = 32
real 0m6,531s
user 0m3,426s
sys 0m3,041s
</code>
With **llama-mtmd-cli** and **SmolVLM2-2.2B-Instruct**:
<code>
# SmolVLM2-2.2B-Instruct-Q4_0
$ time ~/llama.cpp/build/bin/llama-mtmd-cli \
  -m ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf \
  --mmproj ~/Data/mmproj-SmolVLM2-2.2B-Instruct-f16.gguf \
  --image ~/Data/screenshot_20260214-141126.png \
  -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Entr', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.10 seconds
Erreur de segmentation (core dumped)
</code>
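The ''invalid magic characters'' error means the file on disk does not start with the GGUF magic bytes, so what was downloaded is not a model at all. The first bytes reading ''Entr'' here (and ''Repo'' in the MobileVLM run below) strongly suggest the files are Hugging Face error responses such as "Entry not found" or "Repository Not Found" saved to disk. A quick check of what actually landed in ''~/Data'':
<code bash>
# A valid GGUF file starts with the 4-byte magic "GGUF"
head -c 16 ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf ; echo
# "file" will report ASCII/HTML text for a saved error page
file ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf ~/Data/Qwen2-VL-2B-Instruct-Q4_0.gguf
</code>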
With **llama-mtmd-cli** and **Qwen2-VL-2B-Instruct**:
<code>
# Qwen2-VL-2B-Instruct-Q4_0
$ time ~/llama.cpp/build/bin/llama-mtmd-cli \
  -m ~/Data/Qwen2-VL-2B-Instruct-Q4_0.gguf \
  --mmproj ~/Data/mmproj-Qwen2-VL-2B-Instruct-f16.gguf \
  --image ~/Data/screenshot_20260214-141126.png \
  -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps -ngl 99
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Entr', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/Qwen2-VL-2B-Instruct-Q4_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.09 seconds
Erreur de segmentation (core dumped)
</code>
With **llama-mtmd-cli** and **MobileVLM-3B**:
<code>
$ time ~/llama.cpp/build/bin/llama-mtmd-cli \
  -m ~/Data/MobileVLM-3B-q3_K_S.gguf \
  --mmproj ~/Data/MobileVLM-3B-mmproj-f16.gguf \
  --image ~/Data/screenshot_20260214-141126.png \
  -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps -ngl 99
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Repo', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/MobileVLM-3B-q3_K_S.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.09 seconds
Erreur de segmentation (core dumped)
</code>
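In all three failing runs llama.cpp never receives a valid model, so the fix is on the download side. With curl, ''--fail'' keeps HTTP error bodies from being written to disk in the first place (the repository path below is a placeholder, not a verified URL):
<code bash>
# -f/--fail: exit with an error on HTTP 4xx/5xx instead of saving the error page
# -L: follow redirects, which Hugging Face uses for file downloads
curl -fL -o ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf \
  'https://huggingface.co/<org>/<repo>/resolve/main/SmolVLM2-2.2B-Instruct-Q4_0.gguf'
</code>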