# DSP Retrieval Strategies
Document Type: Knowledge Base - Retrieval Research
Created: 2025-11-24
Last Updated: 2025-11-24
Confidence Level: High
Source: Information Retrieval research, RAG best practices, Knowledge Graph theory
Purpose: Comprehensive research on Vector, Keyword, and Graph-based retrieval for DSP implementation
## Overview
The Search operation in the DSP framework is critical for knowledge-intensive tasks. Unlike simple retrieve-then-read systems, DSP enables strategic placement of retrieval within multi-stage pipelines. This document provides deep research on three fundamental retrieval paradigms that AirsDSP must support.
## 1. Vector Search (Dense Retrieval)

### Core Concept
Vector search (also called semantic search or dense retrieval) represents documents and queries as high-dimensional embeddings, then finds similar items using distance metrics in vector space.
### How It Works

1. Embedding Generation: Convert text into dense vectors (typically 384-1536 dimensions)
   - Models: `sentence-transformers`, `text-embedding-ada-002`, `voyage-ai`, `cohere-embed`
   - Each document chunk → `Vec<f32>` of fixed dimension
2. Index Construction: Build efficient search structures
   - HNSW (Hierarchical Navigable Small World): Graph-based, fast approximate search
   - IVF (Inverted File Index): Clustering-based, good for large datasets
   - Flat Index: Brute force, exact but slow
3. Query Processing: Embed the query with the same model, search the index using a distance metric, and return the top-k nearest documents (see the sketch after the table below)
### Distance Metrics

| Metric | Formula | Use Case | Range |
|---|---|---|---|
| Cosine Similarity | `dot(a,b) / (norm(a) * norm(b))` | Most common, normalized | [-1, 1] |
| Euclidean Distance | `sqrt(sum((a-b)^2))` | Absolute distance | [0, ∞) |
| Dot Product | `sum(a * b)` | When vectors are pre-normalized | (-∞, ∞) |
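As a concrete reference point, here is a minimal sketch of the flat-index path: exact brute-force top-k retrieval using the cosine formula from the table. It assumes embeddings are already computed; `top_k` and its types are illustrative, not an AirsDSP API.

```rust
/// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Flat-index retrieval: score every document, keep the k best.
/// Exact but O(n) per query; HNSW/IVF trade exactness for speed.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort descending by similarity; assumes no NaN scores.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```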
### Strengths
- ✅ Semantic Understanding: Captures meaning, not just keywords
- ✅ Multilingual: Works across languages with multilingual embeddings
- ✅ Synonym Handling: "car" and "automobile" are close in vector space
- ✅ Context Aware: Understands "bank" (river) vs "bank" (finance)
### Weaknesses
- ❌ Exact Match Fails: Poor at finding specific IDs, codes, or rare terms
- ❌ Computational Cost: Embedding generation adds latency
- ❌ Black Box: Hard to explain why a result was retrieved
- ❌ Knowledge Cutoff: Struggles with very recent or niche information the embedding model was not trained on
### Key Research Papers

- DPR (Dense Passage Retrieval) - Karpukhin et al., 2020
  - Showed dense retrieval can outperform BM25 on open-domain QA
  - Bi-encoder architecture: separate encoders for query and document
- ColBERT - Khattab & Zaharia, 2020
  - Late interaction: computes similarity at the token level, not the document level
  - Better accuracy than DPR, still efficient
### Implementation Considerations for AirsDSP

Rust Ecosystem:

- `qdrant-client`: Client for the Qdrant vector DB
- `pgvector`: PostgreSQL extension (accessible via `sqlx`)
- `faiss-rs`: Rust bindings for Facebook's FAISS
- `hnsw`: Pure Rust HNSW implementation
Design Questions:

- Should AirsDSP handle embedding generation, or assume pre-embedded documents?
- Should the framework provide a default embedding model, or require a user-supplied one?
## 2. Keyword Search (Sparse Retrieval)

### Core Concept
Keyword search (also called lexical search or sparse retrieval) matches documents based on exact term overlap, weighted by statistical importance.
### How It Works

1. Inverted Index Construction: Map each term to the posting list of documents that contain it (see the sketch after this list)
2. Query Processing:
   - Parse the query into terms
   - Look up each term in the inverted index
   - Score the matching documents using a ranking algorithm
3. Ranking Algorithms: BM25 and TF-IDF, described below
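As a minimal illustration of step 1, here is a sketch of inverted-index construction over whitespace-tokenized documents, using the `HashMap<String, Vec<DocId>>` shape mentioned later in this section; the names are illustrative.

```rust
use std::collections::HashMap;

type DocId = usize;

/// Build an inverted index: each term maps to a posting list of the
/// documents that contain it.
fn build_inverted_index(docs: &[&str]) -> HashMap<String, Vec<DocId>> {
    let mut index: HashMap<String, Vec<DocId>> = HashMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        for term in text.to_lowercase().split_whitespace() {
            let postings = index.entry(term.to_string()).or_default();
            // Postings are pushed in document order, so checking the last
            // entry is enough to avoid duplicates within one document.
            if postings.last() != Some(&doc_id) {
                postings.push(doc_id);
            }
        }
    }
    index
}
```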
#### BM25 (Best Match 25)

The gold standard for keyword search. Formula:

`score(D, Q) = Σ_{qi in Q} IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))`

Where:

- f(qi, D): Frequency of term qi in document D
- |D|: Length of document D
- avgdl: Average document length in the collection
- k1: Term frequency saturation parameter (typically 1.2-2.0)
- b: Length normalization parameter (typically 0.75)
- IDF(qi): Inverse document frequency of term qi

Key Insight: BM25 has diminishing returns for term frequency (preventing keyword stuffing) and normalizes scores by document length.
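A minimal sketch of per-term BM25 scoring under these definitions; the full document score sums this over all query terms. The smoothed IDF variant used here is an assumption (the common Lucene-style form), not mandated by the formula above.

```rust
/// BM25 contribution of a single query term to a document's score.
/// tf = f(qi, D), doc_len = |D|, df = number of docs containing qi.
fn bm25_term_score(tf: f64, doc_len: f64, avgdl: f64, n_docs: f64, df: f64) -> f64 {
    let k1 = 1.2; // term-frequency saturation (typically 1.2-2.0)
    let b = 0.75; // length normalization (typically 0.75)
    // Smoothed IDF; the +1.0 keeps the weight non-negative for common terms.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))
}
```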
#### TF-IDF (Term Frequency - Inverse Document Frequency)

A simpler, older algorithm:

`TF-IDF(qi, D) = f(qi, D) * log(N / df(qi))`

Where:

- N: Total number of documents
- df(qi): Number of documents containing term qi
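The same formula in code, as a sketch under the definitions above:

```rust
/// TF-IDF weight of term qi in document D.
/// tf = f(qi, D), n_docs = N, df = df(qi).
fn tf_idf(tf: f64, n_docs: f64, df: f64) -> f64 {
    tf * (n_docs / df).ln()
}
```

Note the contrast with BM25: here term frequency grows linearly, so a keyword-stuffed document can score arbitrarily high, which is exactly what BM25's saturation term prevents.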
### Strengths
- ✅ Exact Match: Perfect for IDs, codes, specific names
- ✅ Explainable: Clear why a document matched (term overlap)
- ✅ Fast: Inverted indices are extremely efficient
- ✅ No Training: Works out-of-the-box, no ML models needed
- ✅ Deterministic: Same query always returns same results
### Weaknesses
- ❌ Vocabulary Mismatch: "car" won't match "automobile"
- ❌ No Semantic Understanding: "bank" (river) and "bank" (finance) treated identically
- ❌ Sensitive to Wording: Query must use document's exact terms
### Advanced Techniques

Query Expansion:

- Add synonyms to the query: "car" → "car OR automobile OR vehicle"
- Use WordNet or a custom thesaurus

Stemming/Lemmatization:

- Reduce words to their root form: "running" → "run"
- Libraries: Porter Stemmer, Snowball

Stopword Removal:

- Remove common words: "the", "a", "is"
- Reduces index size and improves precision
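A minimal sketch of a preprocessing pipeline combining lowercasing, tokenization, and stopword removal; a stemmer would slot in after the filter step. The function name and stopword list are illustrative.

```rust
/// Lowercase, split on non-alphanumeric characters, drop stopwords.
fn preprocess(text: &str, stopwords: &[&str]) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty() && !stopwords.contains(t))
        .map(str::to_string)
        .collect()
}

fn main() {
    let tokens = preprocess("The car is running", &["the", "a", "is"]);
    assert_eq!(tokens, vec!["car", "running"]); // stemming would map "running" -> "run"
}
```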
### Implementation Considerations for AirsDSP

Rust Ecosystem:

- `tantivy`: Pure Rust full-text search engine (comparable to Lucene)
- `meilisearch`: Fast, typo-tolerant search engine (has a Rust SDK)
- `sonic`: Lightweight alternative to Elasticsearch
- A manual BM25 implementation over `HashMap<String, Vec<DocId>>` (as sketched above)
Design Questions:

- Should AirsDSP provide built-in tokenization, or require pre-tokenized input?
- Should stemming and stopword removal be configurable or opinionated?
## 3. Graph Retrieval (Knowledge Graphs & DAGs)

### Core Concept
Graph retrieval treats knowledge as a network of entities (nodes) and relationships (edges), enabling traversal-based and structural queries.
### Knowledge Graph Fundamentals
Structure:
(Entity) --[Relationship]--> (Entity)
Example:
(Paris) --[capital_of]--> (France)
(France) --[located_in]--> (Europe)
(Paris) --[has_population]--> (2.2M)
Representation:

- RDF Triples: (subject, predicate, object)
  - (Paris, capital_of, France)
- Property Graphs: Nodes and edges can have properties
  - Node: `{id: "Paris", type: "City", population: 2.2M}`
  - Edge: `{from: "Paris", to: "France", type: "capital_of", since: 987}`
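A minimal Rust shape for the property-graph form, sketched with string-keyed properties for simplicity; these names are illustrative, not a proposed AirsDSP API.

```rust
use std::collections::HashMap;

/// Node: {id, type, properties} as in the example above.
struct Node {
    id: String,
    node_type: String, // `type` is a Rust keyword, hence the rename
    properties: HashMap<String, String>,
}

/// Edge: {from, to, type, properties}.
struct Edge {
    from: String,
    to: String,
    edge_type: String,
    properties: HashMap<String, String>,
}

/// Property graph stored as nodes plus an outgoing adjacency list.
struct PropertyGraph {
    nodes: HashMap<String, Node>,
    out_edges: HashMap<String, Vec<Edge>>,
}
```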
### Graph Query Patterns

#### 1. Direct Lookup

Query: "What is the population of Paris?"
Graph: (Paris) --[has_population]--> (?)
Result: 2.2M
#### 2. Multi-Hop Traversal
Query: "What continent is Paris in?"
Graph: (Paris) --[capital_of]--> (France) --[located_in]--> (?)
Result: Europe
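A sketch of this traversal over a simple adjacency map (node → outgoing (relation, target) pairs); the representation is hypothetical and follows one matching edge per hop.

```rust
use std::collections::HashMap;

/// node -> outgoing (relation, target) pairs
type Adjacency = HashMap<String, Vec<(String, String)>>;

/// Follow a fixed chain of relations from a start node, e.g.
/// follow_path(&g, "Paris", &["capital_of", "located_in"]) -> Some("Europe")
fn follow_path(graph: &Adjacency, start: &str, relations: &[&str]) -> Option<String> {
    let mut current = start.to_string();
    for rel in relations {
        let edges = graph.get(&current)?;
        // Take the first edge matching this relation; a real system
        // would handle multiple matches (branching paths) here.
        let (_, target) = edges.iter().find(|(r, _)| r == rel)?;
        current = target.clone();
    }
    Some(current)
}
```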
#### 3. Subgraph Matching
Query: "Find all cities in Europe with population > 1M"
Pattern:
(City) --[located_in]--> (Country) --[located_in]--> (Europe)
WHERE City.population > 1M
### GraphRAG (Graph-based RAG)
Recent research shows combining graphs with LLMs improves complex reasoning.
Microsoft's GraphRAG Approach (2024):

1. Graph Construction: Extract entities and relationships from documents
2. Community Detection: Cluster related entities
3. Hierarchical Summarization: Summarize each community
4. Query-Time Retrieval:
   - Map the query to relevant communities
   - Retrieve community summaries plus the local subgraph
   - The LLM synthesizes the answer
Benefits over Pure Vector RAG:

- ✅ Better at multi-hop reasoning
- ✅ Captures explicit relationships
- ✅ Enables "explain the connection" queries
- ✅ Reduces hallucination (answers are grounded in graph structure)
### DAG-Based Retrieval
Use Case: Document hierarchies, dependency graphs, citation networks
Example - Academic Papers:
Paper A --[cites]--> Paper B --[cites]--> Paper C
Query: "Find all papers that build on Paper C"
Traversal: Reverse edges from C
Topological Ordering:

- Process nodes in dependency order
- Useful for "what must I read before X?" queries (see the sketch below)
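A sketch using the `petgraph` crate (listed under the Rust ecosystem below): reverse-edge traversal answers "what cites Paper C?", and a topological sort yields a reading order. Only direct citers are collected here; a transitive "builds on" query would need a reverse BFS.

```rust
use petgraph::algo::toposort;
use petgraph::graph::DiGraph;
use petgraph::Direction;

fn main() {
    let mut g = DiGraph::<&str, &str>::new();
    let a = g.add_node("Paper A");
    let b = g.add_node("Paper B");
    let c = g.add_node("Paper C");
    g.add_edge(a, b, "cites");
    g.add_edge(b, c, "cites");

    // Reverse traversal: who cites Paper C directly?
    let citers: Vec<_> = g
        .neighbors_directed(c, Direction::Incoming)
        .map(|n| g[n])
        .collect();
    println!("direct citers of C: {:?}", citers); // ["Paper B"]

    // Topological order on `cites` edges puts citers first; reverse it
    // so prerequisites come before the papers that cite them.
    let mut order = toposort(&g, None).expect("citation DAGs are acyclic");
    order.reverse();
    let titles: Vec<_> = order.iter().map(|&n| g[n]).collect();
    println!("reading order: {:?}", titles); // ["Paper C", "Paper B", "Paper A"]
}
```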
### Graph Query Languages
| Language | Description | Example |
|---|---|---|
| SPARQL | RDF query language | `SELECT ?capital WHERE { ?capital :capitalOf :France }` |
| Cypher | Neo4j's query language | `MATCH (c:City)-[:CAPITAL_OF]->(:Country {name: "France"}) RETURN c` |
| Gremlin | Apache TinkerPop traversal | `g.V().has('name','France').in('capital_of')` |
### Strengths
- ✅ Relationship-Aware: Explicitly models "who knows whom", "what causes what"
- ✅ Multi-Hop Reasoning: Natural for "friend of a friend" queries
- ✅ Explainable Paths: Can show reasoning chain
- ✅ Structured Knowledge: Enforces schema and constraints
### Weaknesses

- ❌ Construction Cost: Building accurate KGs is expensive
- ❌ Incompleteness: Real-world graphs are never complete, and a missing edge looks the same as a true negative
- ❌ Query Complexity: Writing graph queries requires expertise
- ❌ Scalability: Large graphs (billions of edges) are hard to query efficiently
### Implementation Considerations for AirsDSP

Rust Ecosystem:

- `petgraph`: Pure Rust graph data structure library
- `indradb`: Graph database written in Rust
- `neo4rs`: Neo4j driver for Rust
- A custom adjacency list: `HashMap<NodeId, Vec<Edge>>`
Design Questions:
- Should AirsDSP assume pre-built graphs, or provide graph construction from text?
- Should graph queries be expressed in a DSL, or via Rust API?
- How to integrate graph retrieval with the Search operation in DSP pipelines?
## Comparative Analysis

### When to Use Each Method
| Scenario | Best Method | Rationale |
|---|---|---|
| "Find documents about climate change" | Vector | Semantic understanding needed |
| "Find document with ID ABC-123" | Keyword | Exact match required |
| "Who is the CEO of Apple?" | Graph | Structured relationship query |
| "Explain the connection between X and Y" | Graph | Multi-hop reasoning |
| "Find similar research papers" | Vector | Semantic similarity |
| "Search for error code E4502" | Keyword | Exact term match |
### Hybrid Approaches (Future Research)
While this document focuses on individual methods, real-world systems often combine them:
- Vector + Keyword: Use BM25 for initial filtering, re-rank with vector similarity
- Graph + Vector: Embed graph neighborhoods, search in vector space
- All Three: Route query to appropriate method based on query type
Note: Hybrid search design will be addressed in a separate research document after foundational understanding is established.
## Implications for AirsDSP Architecture

### Core Requirements

- Pluggable Retrieval: The `Search` trait must accommodate all three paradigms (see the sketch after this list)
- Pipeline Awareness: The retrieval strategy may change based on the pipeline stage
- Explicit Control: Users should choose the retrieval method explicitly rather than have it auto-selected
- Performance: Retrieval is often the bottleneck in RAG systems
### Open Questions for Architecture Phase
- Should AirsDSP provide default implementations, or just traits?
- How to handle retrieval that requires external services (e.g., Qdrant, Neo4j)?
- Should the framework support retrieval result caching?
- How to represent retrieved context in the DSP `Context` object?
## References

### Vector Search
- Karpukhin et al. (2020) - "Dense Passage Retrieval for Open-Domain Question Answering"
- Khattab & Zaharia (2020) - "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT"
### Keyword Search
- Robertson & Zaragoza (2009) - "The Probabilistic Relevance Framework: BM25 and Beyond"
- Manning et al. (2008) - "Introduction to Information Retrieval" (Chapter 6: Scoring and Ranking)
### Graph Retrieval
- Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (Microsoft Research)
- Hogan et al. (2021) - "Knowledge Graphs" (ACM Computing Surveys)
### Rust Libraries
- Tantivy: https://github.com/quickwit-oss/tantivy
- Qdrant: https://qdrant.tech/
- Petgraph: https://github.com/petgraph/petgraph