Differences

Below are the differences between two revisions of the page.

--- informatique:ai_lm:ai_nlp [13/01/2026 11:23] – [AI (NLP) Natural Language Processing] cyrille
+++ informatique:ai_lm:ai_nlp [18/01/2026 10:16] (current version) – [Wikidata] cyrille
@@ Line 8: / Line 8: @@
   * [[https://www.ibm.com/fr-fr/think/topics/named-entity-recognition|What is named entity recognition (NER)? (IBM, in French)]]
-===== Models embedding =====
+The steps
+  * Tokenization
+    * splits a text into smaller units called tokens. These tokens can be words, punctuation marks, or other linguistic units.
+  * Part-of-speech (POS) tagging
+    * assigns a grammatical word class, such as noun, verb, or adjective, to each token.
+  * Named entity recognition (NER)
+    * aims to recognize and classify named entities such as people, places, organizations, and other specific information.
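The first step above can be sketched with the standard library alone. This is a minimal illustration, not how a production NLP pipeline tokenizes: the `TOKEN_PATTERN` regular expression and the `tokenize` helper are assumptions made for this example. POS tagging and NER normally rely on a trained model and are not shown here.

```python
import re

# Hypothetical pattern: a run of word characters (optionally joined by an
# apostrophe or a hyphen), or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split a text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Marie Curie worked in Paris, France.")
print(tokens)
# ['Marie', 'Curie', 'worked', 'in', 'Paris', ',', 'France', '.']
```

A POS tagger would then label each token (for example `worked` as a verb), and an NER model would group `Marie Curie` as a person and `Paris` as a place.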
+==== Glossary ====
+  * STS (Semantic Textual Similarity): computes the similarity between the embeddings of texts.
+  * Embedding: fixed-size vector representation
+  * Cross Encoder (a.k.a. reranker): calculates a similarity score given pairs of texts. Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model.
+  * Sparse Encoder: produces a sparse vector representation, i.e. a list of ''token: weight'' key-value pairs, each representing an entry and its weight.
+  * RAG (Retrieval-Augmented Generation): combines two AI capabilities → information retrieval and text generation
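To make two of the glossary entries concrete, here is a small sketch in plain Python. Cosine similarity is one common way to compare dense embeddings for STS, and a dot product over shared tokens is one common way to compare sparse ''token: weight'' vectors. The vectors, weights, and function names below are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as token -> weight maps."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


# Made-up 3-dimensional "embeddings" (illustrative values only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True

# Made-up sparse vectors: only the shared token "wikidata" contributes.
query = {"wikidata": 1.2, "dump": 0.8}
document = {"wikidata": 0.9, "json": 0.4}
print(sparse_dot(query, document))
```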
+===== Models embedding =====
   * [[https://www.ibm.com/fr-fr/think/topics/vector-embedding|What is a vector embedding? (IBM, in French)]]
+===== Sentence Transformers =====
+https://www.sbert.net/
+This library can be used to compute embeddings using Sentence Transformer models ([[https://www.sbert.net/docs/quickstart.html#sentence-transformer|quickstart]]), to calculate similarity scores using Cross-Encoder (a.k.a. reranker) models ([[https://www.sbert.net/docs/quickstart.html#cross-encoder|quickstart]]), or to generate sparse embeddings using Sparse Encoder models ([[https://www.sbert.net/docs/quickstart.html#sparse-encoder|quickstart]]).
+===== Vector databases =====
+{{ :informatique:ai_lm:vectors-database.png?300|https://docs.trychroma.com/docs/overview/introduction}}
+  * FAISS (Facebook AI Similarity Search), optimized for similarity search
+  * [[https://qdrant.tech/|Qdrant]], open source, scalable
+  * [[https://milvus.io/docs/fr/overview.md|Milvus]]
+More advanced SaaS solutions
+  * [[https://www.pinecone.io/|Pinecone]]
+  * [[https://weaviate.io/|Weaviate]]
+    * Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
+    * https://github.com/weaviate/weaviate
+    * https://weaviate.io/pricing
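At their core, the databases listed above store vectors and return the nearest ones to a query vector. The sketch below shows that idea with a brute-force scan in plain Python. `ToyVectorStore` is a hypothetical class written for this note, not the API of any product listed here; real engines use approximate indexes to avoid scanning every vector.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorStore:
    """Hypothetical in-memory store: brute-force nearest-neighbor search."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, vector))

    def query(self, vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k stored ids most similar to the query vector."""
        scored = [(doc_id, _cosine(vector, stored)) for doc_id, stored in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore()
store.add("cat", [0.9, 0.1, 0.0])     # made-up embeddings
store.add("kitten", [0.8, 0.2, 0.1])
store.add("car", [0.0, 0.1, 0.9])

results = store.query([1.0, 0.0, 0.0], k=2)
print([doc_id for doc_id, _ in results])  # ['cat', 'kitten']
```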
+==== ChromaDB ====
+  * [[https://github.com/chroma-core/chroma|ChromaDB]]: lightweight, in memory or on disk
+    * [[https://docs.trychroma.com/docs/run-chroma/client-server|Running Chroma in Client-Server Mode]]
+    * [[https://cookbook.chromadb.dev|Chroma Cookbook]]
+    * [[https://blog.stephane-robert.info/docs/developper/programmation/python/chroma/#exemple-complet--recherche-de-documents-internes|Chroma: complete guide to the vector database (in French)]] by Stéphane Robert, 2025
+Client API
+  * PHP: https://github.com/CodeWithKyrian/chromadb-php
+==== Wikidata ====
+Using two different methods makes processing more efficient:
+  * one to extract labels, aliases, and statements (claims)
+  * one to extract the P31/P279 graph
+=== Wikidata Dumps ===
+Wikidata dumps are available (prefer a mirror, as a courtesy to the main servers).
+JSON dump, streamable (GZ):
+  * https://files.scatter.red/wikimedia/other/wikibase/wikidatawiki/20260105/wikidata-20260105-all.json.gz
+  * 151 GB, more than ''118 654 999'' lines
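Because the JSON dump is a single array with one entity per line, it can be read line by line without loading it into memory. The sketch below assumes that layout (a `[` line, then one entity per line with a trailing comma, then a `]` line) and the usual entity keys `labels`, `aliases`, and `claims`. The helper names `iter_entities` and `summarize` are hypothetical, and the demo runs on a tiny hand-made file rather than on the real dump.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one entity dict per line of a gzipped Wikidata JSON dump."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # array delimiters and blank lines carry no entity
            yield json.loads(line)


def summarize(entity: dict, lang: str = "en") -> dict:
    """Keep only the id, label, aliases, and property ids of an entity."""
    label = entity.get("labels", {}).get(lang, {}).get("value")
    aliases = [alias["value"] for alias in entity.get("aliases", {}).get(lang, [])]
    return {
        "id": entity["id"],
        "label": label,
        "aliases": aliases,
        "properties": sorted(entity.get("claims", {})),
    }


# Demo on a hand-made sample that mimics the dump layout (claims left empty).
sample = [
    {"id": "Q42",
     "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
     "aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]},
     "claims": {"P31": [], "P106": []}},
    {"id": "Q5",
     "labels": {"en": {"language": "en", "value": "human"}},
     "aliases": {},
     "claims": {}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("[\n" + ",\n".join(json.dumps(e) for e in sample) + "\n]\n")
    summaries = [summarize(entity) for entity in iter_entities(path)]

print(summaries[0])
```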
+RDF N-Triples dump (raw), streamable (GZ):
+  * https://files.scatter.red/wikimedia/other/wikibase/wikidatawiki/latest-all.nt.gz
+  * 246 GB
+RDF N-Triples dump (raw), streamable (GZ) and cleaned of ''deprecated statements'', ''unnecessary duplicates'', and ''certain redundancies''; it keeps only the “reliable direct claims”:
+  * https://files.scatter.red/wikimedia/other/wikibase/wikidatawiki/latest-truthy.nt.gz
+  * 69.6 GB 👌 for ''8 128 295 676'' lines!
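The N-Triples format has one triple per line, so the P31/P279 graph can be pulled out with a single streaming pass and a regular expression. The sketch below assumes that direct claims use the `http://www.wikidata.org/prop/direct/` predicate prefix; `build_graph` and `load_graph` are hypothetical helper names, and the demo uses a few sample lines written for this note instead of the real dump.

```python
import gzip
import re
from collections import defaultdict
from typing import Iterable

# Matches only item-to-item triples whose predicate is direct P31 or P279.
TRIPLE = re.compile(
    r"^<http://www\.wikidata\.org/entity/(Q\d+)> "
    r"<http://www\.wikidata\.org/prop/direct/(P31|P279)> "
    r"<http://www\.wikidata\.org/entity/(Q\d+)> \.$"
)


def build_graph(lines: Iterable[str]) -> dict[str, dict[str, set[str]]]:
    """Collect P31 (instance of) and P279 (subclass of) edges per subject."""
    graph = {"P31": defaultdict(set), "P279": defaultdict(set)}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            subject, prop, target = match.groups()
            graph[prop][subject].add(target)
    return graph


def load_graph(path: str) -> dict[str, dict[str, set[str]]]:
    """Stream a gzipped N-Triples file through build_graph."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        return build_graph(lines)


# Illustrative sample lines in the dump's format (not taken from the dump).
sample_lines = [
    '<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .',
    '<http://www.wikidata.org/entity/Q5> <http://www.wikidata.org/prop/direct/P279> <http://www.wikidata.org/entity/Q215627> .',
    '<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .',
]
graph = build_graph(sample_lines)
print(dict(graph["P31"]), dict(graph["P279"]))
```

Every other line, such as the label triple in the sample, is skipped by the pattern.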
+Reading:
+  * PDF [[https://wikidataworkshop.github.io/2022/papers/Wikidata_Workshop_2022_paper_4558.pdf|Getting and hosting your own copy of Wikidata]] from Wikidata’22: Wikidata workshop at ISWC 2022
+Query services:
+  * Original: https://query.wikidata.org
+  * The graph was split in two some time ago. The scholarly articles must be queried on https://query-scholarly.wikidata.org/
+  * QLever demo: https://qlever.dev/wikidata/ (//data current as of 2026-01-16, [[https://qlever.dev/wikidata/OoKng8|check here]]//)
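These services accept SPARQL over HTTP. As a hedged sketch, the snippet below only builds a GET URL for the original service, assuming its endpoint is `https://query.wikidata.org/sparql` with `query` and `format` parameters; it does not send the request. `build_query_url` is a hypothetical helper, and the endpoints of the other services are not assumed here.

```python
from urllib.parse import urlencode

# Assumed SPARQL endpoint of the original Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


# Classes that Q42 is an instance of (P31), using the service's built-in prefixes.
QUERY = "SELECT ?class WHERE { wd:Q42 wdt:P31 ?class }"
url = build_query_url(QUERY)
print(url)
```

When actually sending such a request, set a descriptive `User-Agent` header and keep the request rate low.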