There's YOLO and plenty of tools dedicated to object detection in images. Here I'm testing with multimodal models instead, without any task-specific training.
The prompt asks whether there are solar panels in the supplied image, along with their bounding box, and if "yes", to compute the geographic coordinates of the detected object. Together, the two instructions help weed out false positives.
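The cross-check can be sketched as follows. This is a minimal sketch, assuming the model is asked to answer in JSON; the field names (`panneaux_solaires`, `bbox`, `coordonnees`) are illustrative, not something llama-mtmd-cli enforces or the post specifies:

```python
import json


def accept_detection(model_reply: str):
    """Keep a detection only if the reply contains BOTH a bounding box
    and geographic coordinates; anything else is treated as a false
    positive. Returns the detection dict, or None to discard it."""
    try:
        reply = json.loads(model_reply)
    except json.JSONDecodeError:
        return None  # free-form text, no structured detection
    if reply.get("panneaux_solaires") is not True:
        return None
    bbox = reply.get("bbox")
    coords = reply.get("coordonnees")
    if not bbox or not coords:
        return None  # one of the two instructions failed: discard
    return {"bbox": bbox, "coords": coords}
```

A reply that asserts a panel but omits the coordinates is dropped, which is exactly the filtering effect described above.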
For example, the model finds a solar panel in this image but fails to produce its geographic coordinates, so we can drop it from the positives.
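One simple way to turn a pixel bounding box into geographic coordinates is linear interpolation against the image's known corner coordinates. A sketch, assuming a north-up screenshot whose corner latitudes/longitudes are known; note that interpolating latitude linearly is only an approximation on Mercator map tiles:

```python
def bbox_to_latlon(bbox, img_w, img_h, north, south, west, east):
    """Map the centre of bbox = (x_min, y_min, x_max, y_max), in pixels,
    to (lat, lon), given the geographic bounds of the image."""
    x_min, y_min, x_max, y_max = bbox
    cx = (x_min + x_max) / 2 / img_w  # horizontal fraction, 0 = left edge
    cy = (y_min + y_max) / 2 / img_h  # vertical fraction, 0 = top edge
    lat = north + cy * (south - north)  # y grows downward, latitude decreases
    lon = west + cx * (east - west)
    return lat, lon
```

For instance, a bbox centred in a 400x400 image bounded by latitudes 50/49 and longitudes 2/3 maps to the midpoint (49.5, 2.5).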
Requires a multimodal model and a matching mmproj file.
With llama-mtmd-cli and gemma-3-4b-it:
```
# gemma-3-4b-it-UD-Q8_K_XL
$ time ~/llama.cpp/build/bin/llama-mtmd-cli --log-timestamps \
    -m ~/Data/gemma-3-4b-it-UD-Q8_K_XL.gguf \
    --mmproj ~/Data/mmproj-model-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?'
main: loading model: ~/Data/gemma-3-4b-it-UD-Q8_K_XL.gguf
WARN: This is an experimental CLI for testing multimodal capability. For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 789 ms
decoding image batch 1/1, n_tokens_batch = 256
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 109.44 ms, sched copies = 1
image decoded (batch 1/1) in 201 ms
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 188.38 ms, sched copies = 1

Oui, je vois des panneaux solaires sur l'image. Ils sont disposés sur le toit du bâtiment principal au centre de l'image.

llama_perf_context_print:        load time =    2846.21 ms
llama_perf_context_print: prompt eval time =     852.69 ms /   278 tokens (    3.07 ms per token,   326.03 tokens per second)
llama_perf_context_print:        eval time =     542.06 ms /    30 runs   (   18.07 ms per token,    55.34 tokens per second)
llama_perf_context_print:       total time =    2344.07 ms /   308 tokens
llama_perf_context_print: graphs reused = 29

real	0m8,165s
user	0m4,880s
sys	0m3,269s

# gemma-3-4b-it-Q4_K_M.gguf
Oui, je vois des panneaux solaires sur l'image. Ils sont installés sur le toit du bâtiment principal, qui est une grande structure rectangulaire.

llama_perf_context_print:        load time =    1614.35 ms
llama_perf_context_print: prompt eval time =     856.85 ms /   278 tokens (    3.08 ms per token,   324.45 tokens per second)
llama_perf_context_print:        eval time =     359.10 ms /    33 runs   (   10.88 ms per token,    91.90 tokens per second)
llama_perf_context_print:       total time =    2049.84 ms /   311 tokens
llama_perf_context_print: graphs reused = 32

real	0m6,531s
user	0m3,426s
sys	0m3,041s
```
With llama-mtmd-cli and SmolVLM2-2.2B-Instruct:
```
# SmolVLM2-2.2B-Instruct-Q4_0
time ~/llama.cpp/build/bin/llama-mtmd-cli \
    -m ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf \
    --mmproj ~/Data/mmproj-SmolVLM2-2.2B-Instruct-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Entr', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.10 seconds
Erreur de segmentation (core dumped)
```
With llama-mtmd-cli and Qwen2-VL-2B-Instruct:
```
# Qwen2-VL-2B-Instruct-Q4_0
time ~/llama.cpp/build/bin/llama-mtmd-cli \
    -m ~/Data/Qwen2-VL-2B-Instruct-Q4_0.gguf \
    --mmproj ~/Data/mmproj-Qwen2-VL-2B-Instruct-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps -ngl 99
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Entr', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/Qwen2-VL-2B-Instruct-Q4_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.09 seconds
Erreur de segmentation (core dumped)
```
With llama-mtmd-cli and MobileVLM-3B:
```
$ time ~/llama.cpp/build/bin/llama-mtmd-cli \
    -m ~/Data/MobileVLM-3B-q3_K_S.gguf \
    --mmproj ~/Data/MobileVLM-3B-mmproj-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps -ngl 99
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Repo', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from ~/Data/MobileVLM-3B-q3_K_S.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.09 seconds
Erreur de segmentation (core dumped)
```
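The three failures above all share the same root cause: the files on disk do not start with the 4-byte GGUF magic (`'Entr'` and `'Repo'` look like the start of a Hugging Face error page such as "Entry not found." or "Repository not found" saved in place of the model, though that is an inference from the magic bytes, not confirmed here). A quick sanity check before launching llama-mtmd-cli, as a sketch:

```python
def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes.
    A text error page saved as a .gguf file will fail this check."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

If the check fails, re-downloading the file (e.g. making sure Git LFS content was actually fetched, not just the pointer or an error page) is the fix to try first.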