informatique:ai_lm:ai_nlp_rag
Differences
Below are the differences between two revisions of the page.
| Both previous revisions / Previous revision / Next revision | Previous revision | ||
| informatique:ai_lm:ai_nlp_rag [21/04/2026 08:58] – deleted - external edit (unknown date) 127.0.0.1 | informatique:ai_lm:ai_nlp_rag [04/06/2026 08:21] (current version) – [AI NLP and RAG] cyrille | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== AI NLP and RAG ====== | ||
| + | |||
| + | * NLP: Natural Language Processing | ||
| + | * RAG: Retrieval-Augmented Generation, which combines information retrieval with text generation | ||
| + | |||
| + | See also: [[/ | ||
| + | ||
| + | RAG tools: | ||
| + | * https:// | ||
| + | * [[https:// | ||
| + | * Langchain integration as a [[https:// | ||
| + | * [[https:// | ||
| + | * One of the easiest ways to use ColBERT in applications nowadays is the semi-official, | ||
| + | |||
| + | Named entity recognition (NER), also called entity chunking or entity extraction, | ||
| + | |||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | Steps | ||
| + | * Tokenization | ||
| + | * Breaks a text into smaller units called tokens. These tokens can be words, punctuation marks or | ||
| + | * Part-of-speech (POS) tagging | ||
| + | * Assigns each token a grammatical word class, | ||
| + | * Detection of | ||
| + | * Aims to recognize and classify named entities such as people, places, organizations and | ||
| + | |||
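The tokenization step above can be illustrated with a minimal sketch. It assumes a simple regular-expression rule (words, optionally joined by an apostrophe or hyphen, and single punctuation marks); the pattern and the name `tokenize` are illustrative only. Real pipelines generally rely on an NLP library, and POS tagging and NER need trained models, so they are not shown here.

```python
import re

# Assumed rule: a token is either a word (letters/digits, optionally joined
# by an apostrophe or hyphen) or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split a text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Paris is in France."))
```

Whitespace is discarded and punctuation is kept as separate tokens, which is what the later tagging steps usually expect.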
| + | Reranking | ||
| + | * Reranking models: specialized models (such as cross-encoders) that compare the question directly with each chunk to compute a more precise relevance score. | ||
| + | * Score fusion: combining several criteria (vector relevance, | ||
| + | * Redundancy filtering: removing chunks that overlap too much, to avoid repeating the same information. | ||
| + | |||
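As a rough illustration of redundancy filtering, the sketch below keeps a chunk only if its word overlap with every chunk already kept stays under a threshold. Jaccard similarity on lowercase word sets, the `max_overlap` value and the function names are assumptions made for the example; the input is assumed to be sorted by relevance already, so the best-ranked copy of duplicated content survives.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_redundant(chunks: list[str], max_overlap: float = 0.8) -> list[str]:
    """Greedily keep chunks that do not overlap too much with kept ones.

    `chunks` is assumed to be ordered from most to least relevant.
    """
    kept: list[str] = []
    kept_words: list[set[str]] = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        if all(jaccard(words, seen) <= max_overlap for seen in kept_words):
            kept.append(chunk)
            kept_words.append(words)
    return kept


if __name__ == "__main__":
    ranked = [
        "the cat sat on the mat",
        "the cat sat on the mat today",
        "dogs bark loudly",
    ]
    print(filter_redundant(ranked))
```

Embedding-based similarity could replace the word-set overlap; the greedy structure would stay the same.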
| + | |||
| + | SEQUOIA (Semantic-Evolved QUery-Optimized Iterative Abstraction) is a novel RAG architecture that combines four techniques into a unified retrieval pipeline: | ||
| + | - Semantic Chunking -- splits documents by embedding similarity boundaries instead of fixed-size windows | ||
| + | - RAPTOR Tree -- recursively clusters chunks and summarizes via LLM, building a hierarchy | ||
| + | - Step-Back Prompting -- LLM generates a more abstract query; both queries used for retrieval across all tree levels | ||
| + | - Confidence-Gated Adaptive Depth -- retrieval starts at leaf level, ascends tree only if confidence is below threshold | ||
| + | |||
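The confidence-gated idea in the last item can be sketched as a loop over tree levels. This is only an illustration under stated assumptions: each level is represented by a hypothetical search function returning `(text, score)` pairs, and "confidence" is taken to be the best score at that level. The description above does not say how SEQUOIA actually measures confidence.

```python
from typing import Callable

# Hypothetical type: one search function per tree level, leaf level first.
Search = Callable[[str], list[tuple[str, float]]]


def adaptive_retrieve(
    query: str, levels: list[Search], threshold: float = 0.7
) -> tuple[int, list[tuple[str, float]]]:
    """Start at the leaf level and ascend only while confidence is too low.

    Returns the index of the level that was finally used and its results.
    Confidence is assumed to be the highest score returned at a level.
    """
    level, results = 0, []
    for level, search in enumerate(levels):
        results = search(query)
        confidence = max((score for _, score in results), default=0.0)
        if confidence >= threshold:
            break  # confident enough: do not climb further
    return level, results


if __name__ == "__main__":
    leaves = lambda q: [("leaf chunk", 0.4)]
    summaries = lambda q: [("cluster summary", 0.9)]
    print(adaptive_retrieve("example query", [leaves, summaries]))
```

If no level reaches the threshold, the sketch simply returns the results of the highest level it searched.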
| + | < | ||
| + | query | ||
| + | → multi-query expansion (2 rewrites + 1 step-back, via LLM) | ||
| + | → hybrid retrieval per variant (BM25 + dense + RRF, top-20 each) | ||
| + | → RRF merge across all variants | ||
| + | → cross-encoder rerank (top-50 → top-5) | ||
| + | → context compression (sentence-level filtering by cosine sim to query, | ||
| + | keep top 12 sentences, collapse into one chunk) | ||
| + | → LLM with short-answer prompt | ||
| + | </ | ||
| + | |||
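The RRF steps of the pipeline above rely on Reciprocal Rank Fusion, which scores each document by summing `1 / (k + rank)` over every ranking in which it appears. The sketch below is a minimal version; `k = 60` is a commonly used default, not a value taken from this page, and the function name and document ids are illustrative.

```python
from collections import defaultdict
from typing import Optional


def rrf_merge(
    rankings: list[list[str]], k: int = 60, top_n: Optional[int] = None
) -> list[str]:
    """Merge several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    merged = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return merged if top_n is None else merged[:top_n]


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]
    dense = ["b", "c", "d"]
    print(rrf_merge([bm25, dense]))
```

Because only ranks are used, BM25 and dense scores never need to be normalized to a common scale. The same function can merge the per-variant lists in the "RRF merge across all variants" step.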
| + | |||
| + | Articles | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | |||
| + | ==== Glossary ==== | ||
| + | |||
| + | * STS Semantic Textual Similarity: | ||
| + | * Embedding: fixed-size vector representation | ||
| + | * Cross Encoder (a.k.a. reranker): Calculates a similarity score given pairs of texts. Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model. | ||
| + | * Sparse Encoder: a sparse vector representation is a list of '' | ||
| + | * RAG (Retrieval-Augmented Generation): | ||
| + | |||
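Since embeddings are fixed-size vectors, two texts are commonly compared through the cosine similarity of their embeddings. The sketch below shows that computation in plain Python; the vectors in the usage example are made up and do not come from a real model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same size."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up 3-dimensional "embeddings", for illustration only.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A bi-encoder computes each embedding separately and compares them this way, whereas a cross-encoder reads both texts together and outputs the score directly.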
| + | ===== Embedding models ===== | ||
| + | |||
| + | * [[https:// | ||
| + | |||
| + | ===== Sentence Transformers ===== | ||
| + | |||
| + | https:// | ||
| + | |||
| + | Used to compute embeddings with Sentence Transformer models ([[https:// | ||
| + | |||
| + | ===== Vector databases ===== | ||
| + | |||
| + | {{ : | ||
| + | * FAISS (Facebook AI Similarity Search), optimized for similarity search | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | More advanced SaaS solutions | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database. | ||
| + | * https:// | ||
| + | * https:// | ||
| + | |||
| + | ==== ChromaDB ==== | ||
| + | |||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | API clients | ||
| + | * PHP https:// | ||
| + | |||
| + | ==== Wikidata ==== | ||
| + | |||
| + | Use two different methods: | ||
| + | * To extract the labels, aliases and statements (claims) | ||
| + | * To extract the P31/P279 graph | ||
| + | makes it possible to | ||
| + | |||
| + | === Wikidata Dumps === | ||
| + | |||
| + | Wikidata dumps are available (prefer a mirror, as a courtesy). | ||
| + | |||
| + | JSON dump, streamable (GZ): | ||
| + | * https:// | ||
| + | * 151 GB, more than '' | ||
| + | |||
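Because the JSON dump is one very large array with one entity per line, it can be read as a stream without loading it into memory. The sketch below assumes that layout (a line containing `[`, then one entity object per line followed by a comma, then `]`); the function name and the file path in the usage example are placeholders.

```python
import gzip
import json
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield Wikidata entities one by one from a gzipped JSON dump."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # array brackets and blank lines carry no entity
            yield json.loads(line)


if __name__ == "__main__":
    import sys

    # Usage: python script.py <path to a .json.gz dump> (placeholder path).
    for entity in iter_entities(sys.argv[1]):
        label = entity.get("labels", {}).get("en", {}).get("value")
        print(entity["id"], label)
        break  # only show the first entity
```

Labels, aliases and claims can then be read from each entity dictionary as it streams by, which keeps memory use flat regardless of dump size.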
| + | RDF N-Triples dump (raw), streamable (GZ): | ||
| + | * https:// | ||
| + | * 246 GB | ||
| + | |||
| + | RDF N-Triples dump (raw), streamable (GZ) AND cleaned of '' | ||
| + | * https:// | ||
| + | * 69.6 GB 👌 for '' | ||
| + | |||
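An N-Triples dump holds one triple per line, so the P31 (instance of) and P279 (subclass of) graph can be extracted with a line filter. The sketch below assumes the dump uses the direct-property IRIs (`http://www.wikidata.org/prop/direct/...`); this page does not say which dump that applies to, so treat the pattern as an assumption to check against real lines. The function name is illustrative.

```python
import re
from collections import defaultdict
from typing import Iterable

# Assumed IRI layout: entity IRIs and direct-property IRIs.
TRIPLE = re.compile(
    r"^<http://www\.wikidata\.org/entity/(Q\d+)> "
    r"<http://www\.wikidata\.org/prop/direct/(P31|P279)> "
    r"<http://www\.wikidata\.org/entity/(Q\d+)> \.$"
)


def build_type_graph(lines: Iterable[str]) -> dict[str, dict[str, set[str]]]:
    """Collect P31 and P279 edges as {property: {subject: {objects}}}."""
    graph: dict[str, dict[str, set[str]]] = {
        "P31": defaultdict(set),
        "P279": defaultdict(set),
    }
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            subject, prop, obj = match.groups()
            graph[prop][subject].add(obj)
    return graph


if __name__ == "__main__":
    sample = [
        "<http://www.wikidata.org/entity/Q42> "
        "<http://www.wikidata.org/prop/direct/P31> "
        "<http://www.wikidata.org/entity/Q5> .",
    ]
    print(dict(build_type_graph(sample)["P31"]))
```

The `lines` argument can be any iterable of strings, for example a file opened with `gzip.open(path, "rt", encoding="utf-8")`, so the dump is still processed as a stream.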
| + | Further reading: | ||
| + | * PDF [[https:// | ||
| + | |||
| + | Query services: | ||
| + | * Original https:// | ||
| + | * The graph was split in two some time ago. The scholarly articles must be queried on https:// | ||
| + | * QLever demo https:// | ||
