DSP Retrieval Strategies

Document Type: Knowledge Base - Retrieval Research
Created: 2025-11-24
Last Updated: 2025-11-24
Confidence Level: High
Source: Information Retrieval research, RAG best practices, Knowledge Graph theory
Purpose: Comprehensive research on Vector, Keyword, and Graph-based retrieval for DSP implementation

Overview

The Search operation in the DSP framework is critical for knowledge-intensive tasks. Unlike simple retrieve-then-read systems, DSP enables strategic placement of retrieval within multi-stage pipelines. This document provides deep research on three fundamental retrieval paradigms that AirsDSP must support.

1. Vector Search (Dense Retrieval)

Core Concept

Vector search (also called semantic search or dense retrieval) represents documents and queries as high-dimensional embeddings, then finds similar items using distance metrics in vector space.

How It Works

  1. Embedding Generation: Convert text into dense vectors (typically 384-1536 dimensions)
     • Models: sentence-transformers, text-embedding-ada-002, voyage-ai, cohere-embed
     • Each document chunk → Vec<f32> of fixed dimension
  2. Index Construction: Build efficient search structures
     • HNSW (Hierarchical Navigable Small World): Graph-based, fast approximate search
     • IVF (Inverted File Index): Clustering-based, good for large datasets
     • Flat Index: Brute force, exact but slow
  3. Query Processing:

    query_text → embedding_model → query_vector
    query_vector → index.search(k=10) → top_k_documents

Distance Metrics

Metric              Formula                          Use Case                      Range
Cosine Similarity   dot(a,b) / (norm(a) * norm(b))   Most common, normalized       [-1, 1]
Euclidean Distance  sqrt(sum((a-b)^2))               Absolute distance             [0, ∞)
Dot Product         sum(a * b)                       When vectors pre-normalized   (-∞, ∞)
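
A minimal sketch of these metrics over raw Vec<f32> embeddings, with a brute-force "flat index" for the top-k step; a real deployment would swap in an ANN index such as HNSW. All names here are illustrative, and euclidean is included only for completeness:

    fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot(a, b) / (norm(a) * norm(b))
    }

    fn euclidean(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
    }

    // Brute-force flat-index search: exact, but O(n) per query.
    fn top_k<'a>(query: &[f32], docs: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
        let mut scored: Vec<_> = docs
            .iter()
            .map(|(id, emb)| (id.as_str(), cosine(query, emb)))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending similarity
        scored.truncate(k);
        scored
    }

    fn main() {
        let docs = vec![
            ("doc1".to_string(), vec![0.1, 0.9, 0.0]),
            ("doc2".to_string(), vec![0.8, 0.1, 0.1]),
        ];
        println!("{:?}", top_k(&[0.2, 0.8, 0.0], &docs, 1)); // doc1 is closest
    }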

Strengths

  • Semantic Understanding: Captures meaning, not just keywords
  • Multilingual: Works across languages with multilingual embeddings
  • Synonym Handling: "car" and "automobile" are close in vector space
  • Context Aware: Understands "bank" (river) vs "bank" (finance)

Weaknesses

  • Exact Match Fails: Poor at finding specific IDs, codes, or rare terms
  • Computational Cost: Embedding generation adds latency
  • Black Box: Hard to explain why a result was retrieved
  • Coverage Gaps: Struggles with very recent or niche information absent from the embedding model's training data

Key Research Papers

  • DPR (Dense Passage Retrieval) - Karpukhin et al., 2020
    • Showed dense retrieval can outperform BM25 on open-domain QA
    • Bi-encoder architecture: separate encoders for query and document
  • ColBERT - Khattab & Zaharia, 2020
    • Late interaction: compute similarity at token level, not document level
    • Better accuracy than DPR, still efficient

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • qdrant-client: Client for the Qdrant vector DB
  • pgvector: PostgreSQL extension (via sqlx)
  • faiss-rs: Rust bindings for Facebook's FAISS
  • hnsw: Pure Rust HNSW implementation

Design Questions:

  • Should AirsDSP handle embedding generation, or assume pre-embedded documents?
  • Should the framework provide a default embedding model, or require a user-supplied one?


2. Keyword Search (Sparse Retrieval)

Core Concept

Keyword search (also called lexical search or sparse retrieval) matches documents based on exact term overlap, weighted by statistical importance.

How It Works

  1. Inverted Index Construction (a minimal Rust sketch follows this list):

    Document 1: "The quick brown fox"
    Document 2: "The lazy brown dog"

    Inverted Index:
    "quick" → [Doc1]
    "brown" → [Doc1, Doc2]
    "fox"   → [Doc1]
    "lazy"  → [Doc2]
    "dog"   → [Doc2]

  2. Query Processing:
     • Parse query into terms
     • Look up each term in inverted index
     • Score documents using a ranking algorithm
  3. Ranking Algorithms: BM25 and TF-IDF, detailed below
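
The index-construction sketch referenced above. It assumes naive whitespace tokenization and lowercasing; a real engine such as tantivy would add proper analyzers, stemming, stopwords, and positional postings:

    use std::collections::HashMap;

    // Toy inverted index: term -> sorted list of document ids.
    fn build_index(docs: &[&str]) -> HashMap<String, Vec<usize>> {
        let mut index: HashMap<String, Vec<usize>> = HashMap::new();
        for (doc_id, text) in docs.iter().enumerate() {
            for term in text.split_whitespace() {
                let postings = index.entry(term.to_lowercase()).or_default();
                // Docs are processed in order, so checking the tail
                // records each doc at most once per term.
                if postings.last() != Some(&doc_id) {
                    postings.push(doc_id);
                }
            }
        }
        index
    }

    fn main() {
        let docs = ["The quick brown fox", "The lazy brown dog"];
        let index = build_index(&docs);
        // "brown" appears in both documents.
        println!("{:?}", index.get("brown")); // Some([0, 1])
    }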

BM25 (Best Match 25)

The gold standard for keyword search. Formula:

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where:

  • f(qi, D): Frequency of term qi in document D
  • |D|: Length of document D
  • avgdl: Average document length in the collection
  • k1: Term frequency saturation parameter (typically 1.2-2.0)
  • b: Length normalization parameter (typically 0.75)
  • IDF(qi): Inverse document frequency of term qi

Key Insight: BM25 has diminishing returns for term frequency (prevents keyword stuffing) and normalizes by document length.
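
A direct transcription of the formula into Rust, assuming the per-term statistics (frequencies, lengths, IDF) have already been computed from an inverted index; summing this over all query terms gives score(D, Q). The IDF variant in main is one common choice, shown only for illustration:

    fn bm25_term_score(
        tf: f32,      // f(qi, D): frequency of the term in document D
        doc_len: f32, // |D|: document length in tokens
        avgdl: f32,   // average document length in the collection
        idf: f32,     // IDF(qi), precomputed
        k1: f32,      // term-frequency saturation, typically 1.2-2.0
        b: f32,       // length normalization, typically 0.75
    ) -> f32 {
        idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))
    }

    fn main() {
        let (k1, b, avgdl) = (1.5, 0.75, 100.0);
        // One common IDF variant: ln((N - df + 0.5) / (df + 0.5) + 1), with N = 1000, df = 10.
        let idf = ((1000.0_f32 - 10.0 + 0.5) / (10.0 + 0.5) + 1.0).ln();
        println!("{:.3}", bm25_term_score(3.0, 80.0, avgdl, idf, k1, b));
        // Same term frequency in a longer document scores lower: length normalization at work.
        println!("{:.3}", bm25_term_score(3.0, 400.0, avgdl, idf, k1, b));
    }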

TF-IDF (Term Frequency - Inverse Document Frequency)

Simpler, older algorithm:

score(D, Q) = Σ TF(qi, D) * IDF(qi)

TF(qi, D) = count(qi in D) / |D|
IDF(qi) = log(N / df(qi))

Where:

  • N: Total number of documents
  • df(qi): Number of documents containing term qi
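
A quick worked example with illustrative numbers (natural log assumed):

    N = 1000 documents, df("fox") = 10
    IDF("fox") = log(1000 / 10) = log(100) ≈ 4.61

    Document D has 200 terms; "fox" appears 4 times
    TF("fox", D) = 4 / 200 = 0.02

    Contribution of "fox" to score(D, Q) = 0.02 * 4.61 ≈ 0.092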

Strengths

  • Exact Match: Perfect for IDs, codes, specific names
  • Explainable: Clear why a document matched (term overlap)
  • Fast: Inverted indices are extremely efficient
  • No Training: Works out-of-the-box, no ML models needed
  • Deterministic: Same query always returns same results

Weaknesses

  • Vocabulary Mismatch: "car" won't match "automobile"
  • No Semantic Understanding: "bank" (river) and "bank" (finance) treated identically
  • Sensitive to Wording: Query must use document's exact terms

Advanced Techniques

Query Expansion:

  • Add synonyms to query: "car" → "car OR automobile OR vehicle"
  • Use WordNet or a custom thesaurus

Stemming/Lemmatization:

  • Reduce words to root form: "running" → "run"
  • Libraries: Porter Stemmer, Snowball

Stopword Removal:

  • Remove common words: "the", "a", "is"
  • Reduces index size, improves precision

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • tantivy: Pure Rust full-text search engine (like Lucene)
  • meilisearch: Fast, typo-tolerant search (has a Rust SDK)
  • sonic: Lightweight alternative to Elasticsearch
  • Manual BM25 implementation using HashMap<String, Vec<DocId>>

Design Questions:

  • Should AirsDSP provide built-in tokenization, or require pre-tokenized input?
  • Should stemming/stopword removal be configurable or opinionated?


3. Graph Retrieval (Knowledge Graphs & DAGs)

Core Concept

Graph retrieval treats knowledge as a network of entities (nodes) and relationships (edges), enabling traversal-based and structural queries.

Knowledge Graph Fundamentals

Structure:

(Entity) --[Relationship]--> (Entity)

Example:
(Paris) --[capital_of]--> (France)
(France) --[located_in]--> (Europe)
(Paris) --[has_population]--> (2.2M)

Representation:

  • RDF Triples: (subject, predicate, object)
    • (Paris, capital_of, France)
  • Property Graphs: Nodes and edges can have properties
    • Node: {id: "Paris", type: "City", population: 2.2M}
    • Edge: {from: "Paris", to: "France", type: "capital_of", since: 987}
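
A sketch of how a property graph might be represented in plain Rust; the Node and Edge shapes are illustrative, not any particular database's model:

    use std::collections::HashMap;

    struct Node {
        id: String,
        label: String,                  // e.g. "City"
        props: HashMap<String, String>, // e.g. "population" -> "2200000"
    }

    struct Edge {
        from: String,
        to: String,
        label: String,                  // e.g. "capital_of"
        props: HashMap<String, String>, // e.g. "since" -> "987"
    }

    fn main() {
        let paris = Node {
            id: "Paris".into(),
            label: "City".into(),
            props: HashMap::from([("population".into(), "2200000".into())]),
        };
        let cap = Edge {
            from: "Paris".into(),
            to: "France".into(),
            label: "capital_of".into(),
            props: HashMap::from([("since".into(), "987".into())]),
        };
        println!("({}) --[{}]--> ({}) since {:?}", cap.from, cap.label, cap.to, cap.props.get("since"));
        println!("{} ({}): population {:?}", paris.id, paris.label, paris.props.get("population"));
    }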

Graph Query Patterns

1. Direct Lookup

Query: "What is the capital of France?"
Graph: (?) --[capital_of]--> (France)
Result: Paris

2. Multi-Hop Traversal

Query: "What continent is Paris in?"
Graph: (Paris) --[capital_of]--> (France) --[located_in]--> (?)
Result: Europe

3. Subgraph Matching

Query: "Find all cities in Europe with population > 1M"
Pattern:
  (City) --[located_in]--> (Country) --[located_in]--> (Europe)
  WHERE City.population > 1M
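
A minimal sketch of pattern 2 (multi-hop traversal) over a toy adjacency-list triple store; the Graph alias and traverse function are illustrative only:

    use std::collections::HashMap;

    // Toy triple store: subject -> [(predicate, object)].
    type Graph<'a> = HashMap<&'a str, Vec<(&'a str, &'a str)>>;

    // Follow a fixed chain of predicates from a start node.
    fn traverse<'a>(g: &Graph<'a>, start: &'a str, path: &[&str]) -> Vec<&'a str> {
        let mut frontier: Vec<&'a str> = vec![start];
        for pred in path {
            let mut next = Vec::new();
            for node in &frontier {
                if let Some(edges) = g.get(node) {
                    for (p, o) in edges {
                        if p == pred {
                            next.push(*o);
                        }
                    }
                }
            }
            frontier = next; // one hop completed
        }
        frontier
    }

    fn main() {
        let mut g: Graph<'static> = HashMap::new();
        g.entry("Paris").or_default().push(("capital_of", "France"));
        g.entry("France").or_default().push(("located_in", "Europe"));

        // "What continent is Paris in?" (two hops)
        println!("{:?}", traverse(&g, "Paris", &["capital_of", "located_in"])); // ["Europe"]
    }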

GraphRAG (Graph-based RAG)

Recent research shows combining graphs with LLMs improves complex reasoning.

Microsoft's GraphRAG Approach (2024):

  1. Graph Construction: Extract entities and relationships from documents
  2. Community Detection: Cluster related entities
  3. Hierarchical Summarization: Summarize each community
  4. Query-Time Retrieval:
     • Map query to relevant communities
     • Retrieve community summaries + local subgraph
     • LLM synthesizes answer

Benefits over Pure Vector RAG:

  • ✅ Better at multi-hop reasoning
  • ✅ Captures explicit relationships
  • ✅ Enables "explain the connection" queries
  • ✅ Reduces hallucination (grounded in graph structure)

DAG-Based Retrieval

Use Case: Document hierarchies, dependency graphs, citation networks

Example - Academic Papers:

Paper A --[cites]--> Paper B --[cites]--> Paper C

Query: "Find all papers that build on Paper C"
Traversal: Reverse edges from C

Topological Ordering (a petgraph sketch follows):

  • Process nodes in dependency order
  • Useful for "what must I read before X?" queries
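
A sketch of both queries above using petgraph (listed in the ecosystem notes below); the citation DAG and paper names are illustrative:

    use petgraph::algo::toposort;
    use petgraph::graph::DiGraph;
    use petgraph::Direction;

    fn main() {
        // Citation DAG: an edge A -> B means "A cites B".
        let mut g = DiGraph::<&str, ()>::new();
        let a = g.add_node("Paper A");
        let b = g.add_node("Paper B");
        let c = g.add_node("Paper C");
        g.add_edge(a, b, ());
        g.add_edge(b, c, ());

        // "What must I read before Paper A?": topological order lists citing
        // papers before cited ones, so reverse it for a reading list ending at A.
        let order = toposort(&g, None).expect("citation graph must be acyclic");
        let reading_list: Vec<_> = order.iter().rev().map(|&n| g[n]).collect();
        println!("{:?}", reading_list); // ["Paper C", "Paper B", "Paper A"]

        // "Find all papers that build on Paper C": follow incoming edges.
        // (Direct citers only; repeat transitively for the full set.)
        let builders: Vec<_> = g.neighbors_directed(c, Direction::Incoming).map(|n| g[n]).collect();
        println!("{:?}", builders); // ["Paper B"]
    }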

Graph Query Languages

Language   Description                  Example
SPARQL     RDF query language           SELECT ?capital WHERE { ?capital :capitalOf :France }
Cypher     Neo4j's query language       MATCH (c:City)-[:CAPITAL_OF]->(:Country {name: 'France'}) RETURN c
Gremlin    Apache TinkerPop traversal   g.V().has('name','France').in('capital_of')

Strengths

  • Relationship-Aware: Explicitly models "who knows whom", "what causes what"
  • Multi-Hop Reasoning: Natural for "friend of a friend" queries
  • Explainable Paths: Can show reasoning chain
  • Structured Knowledge: Enforces schema and constraints

Weaknesses

  • Construction Cost: Building accurate KGs is expensive
  • Incompleteness: Graphs are never complete
  • Query Complexity: Writing graph queries requires expertise
  • Scalability: Large graphs (billions of edges) are hard to query efficiently

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • petgraph: Pure Rust graph data structure library
  • indradb: Graph database written in Rust
  • neo4rs: Neo4j driver for Rust
  • Custom adjacency list: HashMap<NodeId, Vec<Edge>>

Design Questions:

  • Should AirsDSP assume pre-built graphs, or provide graph construction from text?
  • Should graph queries be expressed in a DSL, or via a Rust API?
  • How to integrate graph retrieval with the Search operation in DSP pipelines?


Comparative Analysis

When to Use Each Method

Scenario                                   Best Method   Rationale
"Find documents about climate change"      Vector        Semantic understanding needed
"Find document with ID ABC-123"            Keyword       Exact match required
"Who is the CEO of Apple?"                 Graph         Structured relationship query
"Explain the connection between X and Y"   Graph         Multi-hop reasoning
"Find similar research papers"             Vector        Semantic similarity
"Search for error code E4502"              Keyword       Exact term match

Hybrid Approaches (Future Research)

While this document focuses on individual methods, real-world systems often combine them:

  • Vector + Keyword: Use BM25 for initial filtering, re-rank with vector similarity
  • Graph + Vector: Embed graph neighborhoods, search in vector space
  • All Three: Route query to appropriate method based on query type

Note: Hybrid search design will be addressed in a separate research document after foundational understanding is established.


Implications for AirsDSP Architecture

Core Requirements

  1. Pluggable Retrieval: The Search trait must accommodate all three paradigms (a sketch follows this list)
  2. Pipeline Awareness: Retrieval strategy may change based on pipeline stage
  3. Explicit Control: Users should choose the retrieval method explicitly rather than have it auto-selected
  4. Performance: Retrieval is often the bottleneck in RAG systems
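
A hypothetical sketch of what such a pluggable trait could look like; the names and shape are illustrative, not a committed AirsDSP API:

    // A retrieved unit of context, regardless of backend.
    pub struct RetrievedPassage {
        pub doc_id: String,
        pub text: String,
        pub score: f32,
    }

    pub trait Search {
        type Error;

        // Return the top-k passages for a query. Vector, keyword, and graph
        // backends each implement this over their own index.
        fn search(&self, query: &str, k: usize) -> Result<Vec<RetrievedPassage>, Self::Error>;
    }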

Open Questions for Architecture Phase

  1. Should AirsDSP provide default implementations, or just traits?
  2. How to handle retrieval that requires external services (e.g., Qdrant, Neo4j)?
  3. Should the framework support retrieval result caching?
  4. How to represent retrieved context in the DSP Context object?

References

Vector & Keyword Retrieval

  • Karpukhin et al. (2020) - "Dense Passage Retrieval for Open-Domain Question Answering"
  • Khattab & Zaharia (2020) - "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT"
  • Robertson & Zaragoza (2009) - "The Probabilistic Relevance Framework: BM25 and Beyond"
  • Manning et al. (2008) - "Introduction to Information Retrieval" (Chapter 6: Scoring, Term Weighting, and the Vector Space Model)

Graph Retrieval

  • Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (Microsoft Research)
  • Hogan et al. (2021) - "Knowledge Graphs" (ACM Computing Surveys)

Rust Libraries

  • Tantivy: https://github.com/quickwit-oss/tantivy
  • Qdrant: https://qdrant.tech/
  • Petgraph: https://github.com/petgraph/petgraph