AI Vision
There is YOLO, along with plenty of other tools dedicated to detection in images.
llama.cpp
Requires a multimodal model and a matching mmproj (multimodal projector) file.
- gemma-3-4b
With llama-mtmd-cli and gemma-3-4b-it:
# gemma-3-4b-it-UD-Q8_K_XL
$ time ~/llama.cpp/build/bin/llama-mtmd-cli --log-timestamps \
    -m ~/Data/AI_ModelsVision/gemma-3-4b-it-UD-Q8_K_XL.gguf \
    --mmproj ~/Data/AI_ModelsVision/mmproj-model-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?'
main: loading model: /home/cyrille/Data/AI_ModelsVision/gemma-3-4b-it-UD-Q8_K_XL.gguf
WARN: This is an experimental CLI for testing multimodal capability. For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 789 ms
decoding image batch 1/1, n_tokens_batch = 256
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 109.44 ms, sched copies = 1
image decoded (batch 1/1) in 201 ms
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 517.12 MiB
sched_reserve: CUDA_Host compute buffer size = 269.02 MiB
sched_reserve: graph nodes = 1369
sched_reserve: graph splits = 2
sched_reserve: reserve took 188.38 ms, sched copies = 1

Oui, je vois des panneaux solaires sur l'image. Ils sont disposés sur le toit du bâtiment principal au centre de l'image.

llama_perf_context_print: load time = 2846.21 ms
llama_perf_context_print: prompt eval time = 852.69 ms / 278 tokens ( 3.07 ms per token, 326.03 tokens per second)
llama_perf_context_print: eval time = 542.06 ms / 30 runs ( 18.07 ms per token, 55.34 tokens per second)
llama_perf_context_print: total time = 2344.07 ms / 308 tokens
llama_perf_context_print: graphs reused = 29

real    0m8,165s
user    0m4,880s
sys     0m3,269s

# gemma-3-4b-it-Q4_K_M.gguf
Oui, je vois des panneaux solaires sur l'image. Ils sont installés sur le toit du bâtiment principal, qui est une grande structure rectangulaire.

llama_perf_context_print: load time = 1614.35 ms
llama_perf_context_print: prompt eval time = 856.85 ms / 278 tokens ( 3.08 ms per token, 324.45 tokens per second)
llama_perf_context_print: eval time = 359.10 ms / 33 runs ( 10.88 ms per token, 91.90 tokens per second)
llama_perf_context_print: total time = 2049.84 ms / 311 tokens
llama_perf_context_print: graphs reused = 32

real    0m6,531s
user    0m3,426s
sys     0m3,041s
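To compare quantizations (here Q8_K_XL vs Q4_K_M), the throughput figures can be scraped out of the perf lines. A minimal sketch; the regex simply assumes the `llama_perf_context_print` format shown in the logs above:

```python
import re

# Perf lines as printed by llama_perf_context_print (copied from the Q8_K_XL run)
log = (
    "llama_perf_context_print: prompt eval time = 852.69 ms / 278 tokens "
    "( 3.07 ms per token, 326.03 tokens per second)\n"
    "llama_perf_context_print: eval time = 542.06 ms / 30 runs "
    "( 18.07 ms per token, 55.34 tokens per second)\n"
)

def tokens_per_second(text: str) -> list[float]:
    """Return every 'N tokens per second' figure found in a llama.cpp log."""
    return [float(m) for m in re.findall(r"([\d.]+) tokens per second", text)]

print(tokens_per_second(log))  # [326.03, 55.34]
```

The first figure is prompt (prefill) throughput, the second is generation throughput, which is what you feel interactively.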
With llama-mtmd-cli and SmolVLM2-2.2B-Instruct:
# time ~/llama.cpp/build/bin/llama-mtmd-cli \
    -m ~/Data/SmolVLM2-2.2B-Instruct-Q4_0.gguf \
    --mmproj ~/Data/mmproj-SmolVLM2-2.2B-Instruct-f16.gguf \
    --image ~/Data/screenshot_20260214-141126.png \
    -p 'Vois tu des panneaux solaires sur cette image ?' --log-timestamps
build: 7971 (5fa1c190d) with GNU 13.3.0 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file_impl: invalid magic characters: 'Entr', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/cyrille/Data/AI_ModelsVision/SmolVLM2-2.2B-Instruct-Q4_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.10 seconds
Erreur de segmentation (core dumped)
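The `invalid magic characters: 'Entr', expected 'GGUF'` line means this file is not a GGUF at all: a valid GGUF file starts with the 4-byte magic `GGUF`, and starting with `Entr` suggests the download saved something else (for example an HTML error page) under the .gguf name. A quick sanity check before loading, as a minimal sketch (the demo files are throwaway examples, not real models):

```python
import os
import tempfile

def is_gguf(path: str) -> bool:
    """A valid GGUF file starts with the 4-byte magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Demo with two throwaway files: a GGUF-style header vs. a saved HTML page
good = tempfile.NamedTemporaryFile(suffix=".gguf", delete=False)
good.write(b"GGUF" + b"\x00" * 12)  # magic + padding, enough for the check
good.close()

bad = tempfile.NamedTemporaryFile(suffix=".gguf", delete=False)
bad.write(b"<html><body>Entry not found</body></html>")  # masquerading as a model
bad.close()

print(is_gguf(good.name), is_gguf(bad.name))  # True False
os.unlink(good.name)
os.unlink(bad.name)
```

The shell equivalent is `head -c 4 model.gguf`, which should print `GGUF`; anything else means the file needs to be re-downloaded.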
informatique/ai_lm/ai_vision.1771080715.txt.gz · Last modified: by cyrille
