DSP Retrieval Strategies

Document Type: Knowledge Base - Retrieval Research
Created: 2025-11-24
Last Updated: 2025-11-24
Confidence Level: High
Source: Information Retrieval research, RAG best practices, Knowledge Graph theory
Purpose: Comprehensive research on Vector, Keyword, and Graph-based retrieval for DSP implementation

Overview

The Search operation in the DSP framework is critical for knowledge-intensive tasks. Unlike simple retrieve-then-read systems, DSP enables strategic placement of retrieval within multi-stage pipelines. This document provides deep research on three fundamental retrieval paradigms that AirsDSP must support.

1. Vector Search (Dense Retrieval)

Core Concept

Vector search (also called semantic search or dense retrieval) represents documents and queries as high-dimensional embeddings, then finds similar items using distance metrics in vector space.

How It Works

  1. Embedding Generation: Convert text into dense vectors (typically 384-1536 dimensions)
     • Models: sentence-transformers, text-embedding-ada-002, voyage-ai, cohere-embed
     • Each document chunk → Vec<f32> of fixed dimension
  2. Index Construction: Build efficient search structures
     • HNSW (Hierarchical Navigable Small World): Graph-based, fast approximate search
     • IVF (Inverted File Index): Clustering-based, good for large datasets
     • Flat Index: Brute force, exact but slow
  3. Query Processing:

    query_text → embedding_model → query_vector
    query_vector → index.search(k=10) → top_k_documents

Distance Metrics

Metric              Formula                          Use Case                      Range
Cosine Similarity   dot(a,b) / (norm(a) * norm(b))   Most common, normalized       [-1, 1]
Euclidean Distance  sqrt(sum((a-b)^2))               Absolute distance             [0, ∞)
Dot Product         sum(a * b)                       When vectors pre-normalized   (-∞, ∞)
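
A minimal sketch of these metrics over raw Vec<f32> embeddings, with a brute-force "flat index" for the top-k step; a real deployment would swap in an ANN index such as HNSW. All names here are illustrative, and euclidean is included only for completeness:

    fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot(a, b) / (norm(a) * norm(b))
    }

    fn euclidean(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
    }

    // Brute-force flat-index search: exact, but O(n) per query.
    fn top_k<'a>(query: &[f32], docs: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
        let mut scored: Vec<_> = docs
            .iter()
            .map(|(id, emb)| (id.as_str(), cosine(query, emb)))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending similarity
        scored.truncate(k);
        scored
    }

    fn main() {
        let docs = vec![
            ("doc1".to_string(), vec![0.1, 0.9, 0.0]),
            ("doc2".to_string(), vec![0.8, 0.1, 0.1]),
        ];
        println!("{:?}", top_k(&[0.2, 0.8, 0.0], &docs, 1)); // doc1 is closest
    }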

Strengths

  • Semantic Understanding: Captures meaning, not just keywords
  • Multilingual: Works across languages with multilingual embeddings
  • Synonym Handling: "car" and "automobile" are close in vector space
  • Context Aware: Understands "bank" (river) vs "bank" (finance)

Weaknesses

  • Exact Match Fails: Poor at finding specific IDs, codes, or rare terms
  • Computational Cost: Embedding generation adds latency
  • Black Box: Hard to explain why a result was retrieved
  • Coverage Gaps: Struggles with very recent or niche information absent from the embedding model's training data

Key Research Papers

  • DPR (Dense Passage Retrieval) - Karpukhin et al., 2020
    • Showed dense retrieval can outperform BM25 on open-domain QA
    • Bi-encoder architecture: separate encoders for query and document
  • ColBERT - Khattab & Zaharia, 2020
    • Late interaction: compute similarity at token level, not document level
    • Better accuracy than DPR, still efficient

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • qdrant-client: Client for the Qdrant vector DB
  • pgvector: PostgreSQL extension (via sqlx)
  • faiss-rs: Rust bindings for Facebook's FAISS
  • hnsw: Pure Rust HNSW implementation

Design Questions:

  • Should AirsDSP handle embedding generation, or assume pre-embedded documents?
  • Should the framework provide a default embedding model, or require a user-supplied one?


2. Keyword Search (Sparse Retrieval)

Core Concept

Keyword search (also called lexical search or sparse retrieval) matches documents based on exact term overlap, weighted by statistical importance.

How It Works

  1. Inverted Index Construction (a minimal Rust sketch follows this list):

    Document 1: "The quick brown fox"
    Document 2: "The lazy brown dog"

    Inverted Index:
    "quick" → [Doc1]
    "brown" → [Doc1, Doc2]
    "fox"   → [Doc1]
    "lazy"  → [Doc2]
    "dog"   → [Doc2]

  2. Query Processing:
     • Parse query into terms
     • Look up each term in inverted index
     • Score documents using a ranking algorithm
  3. Ranking Algorithms: BM25 and TF-IDF, detailed below
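
The index-construction sketch referenced above. It assumes naive whitespace tokenization and lowercasing; a real engine such as tantivy would add proper analyzers, stemming, stopwords, and positional postings:

    use std::collections::HashMap;

    // Toy inverted index: term -> sorted list of document ids.
    fn build_index(docs: &[&str]) -> HashMap<String, Vec<usize>> {
        let mut index: HashMap<String, Vec<usize>> = HashMap::new();
        for (doc_id, text) in docs.iter().enumerate() {
            for term in text.split_whitespace() {
                let postings = index.entry(term.to_lowercase()).or_default();
                // Docs are processed in order, so checking the tail
                // records each doc at most once per term.
                if postings.last() != Some(&doc_id) {
                    postings.push(doc_id);
                }
            }
        }
        index
    }

    fn main() {
        let docs = ["The quick brown fox", "The lazy brown dog"];
        let index = build_index(&docs);
        // "brown" appears in both documents.
        println!("{:?}", index.get("brown")); // Some([0, 1])
    }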

BM25 (Best Match 25)

The gold standard for keyword search. Formula:

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where:

  • f(qi, D): Frequency of term qi in document D
  • |D|: Length of document D
  • avgdl: Average document length in the collection
  • k1: Term frequency saturation parameter (typically 1.2-2.0)
  • b: Length normalization parameter (typically 0.75)
  • IDF(qi): Inverse document frequency of term qi

Key Insight: BM25 has diminishing returns for term frequency (prevents keyword stuffing) and normalizes by document length.
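
A direct transcription of the formula into Rust, assuming the per-term statistics (frequencies, lengths, IDF) have already been computed from an inverted index; summing this over all query terms gives score(D, Q). The IDF variant in main is one common choice, shown only for illustration:

    fn bm25_term_score(
        tf: f32,      // f(qi, D): frequency of the term in document D
        doc_len: f32, // |D|: document length in tokens
        avgdl: f32,   // average document length in the collection
        idf: f32,     // IDF(qi), precomputed
        k1: f32,      // term-frequency saturation, typically 1.2-2.0
        b: f32,       // length normalization, typically 0.75
    ) -> f32 {
        idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))
    }

    fn main() {
        let (k1, b, avgdl) = (1.5, 0.75, 100.0);
        // One common IDF variant: ln((N - df + 0.5) / (df + 0.5) + 1), with N = 1000, df = 10.
        let idf = ((1000.0_f32 - 10.0 + 0.5) / (10.0 + 0.5) + 1.0).ln();
        println!("{:.3}", bm25_term_score(3.0, 80.0, avgdl, idf, k1, b));
        // Same term frequency in a longer document scores lower: length normalization at work.
        println!("{:.3}", bm25_term_score(3.0, 400.0, avgdl, idf, k1, b));
    }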

TF-IDF (Term Frequency - Inverse Document Frequency)

Simpler, older algorithm:

score(D, Q) = Σ TF(qi, D) * IDF(qi)

TF(qi, D) = count(qi in D) / |D|
IDF(qi) = log(N / df(qi))

Where:

  • N: Total number of documents
  • df(qi): Number of documents containing term qi
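
A quick worked example with illustrative numbers (natural log assumed):

    N = 1000 documents, df("fox") = 10
    IDF("fox") = log(1000 / 10) = log(100) ≈ 4.61

    Document D has 200 terms; "fox" appears 4 times
    TF("fox", D) = 4 / 200 = 0.02

    Contribution of "fox" to score(D, Q) = 0.02 * 4.61 ≈ 0.092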

Strengths

  • Exact Match: Perfect for IDs, codes, specific names
  • Explainable: Clear why a document matched (term overlap)
  • Fast: Inverted indices are extremely efficient
  • No Training: Works out-of-the-box, no ML models needed
  • Deterministic: Same query always returns same results

Weaknesses

  • Vocabulary Mismatch: "car" won't match "automobile"
  • No Semantic Understanding: "bank" (river) and "bank" (finance) treated identically
  • Sensitive to Wording: Query must use document's exact terms

Advanced Techniques

Query Expansion:

  • Add synonyms to query: "car" → "car OR automobile OR vehicle"
  • Use WordNet or a custom thesaurus

Stemming/Lemmatization:

  • Reduce words to root form: "running" → "run"
  • Libraries: Porter Stemmer, Snowball

Stopword Removal:

  • Remove common words: "the", "a", "is"
  • Reduces index size, improves precision

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • tantivy: Pure Rust full-text search engine (like Lucene)
  • meilisearch: Fast, typo-tolerant search (has a Rust SDK)
  • sonic: Lightweight alternative to Elasticsearch
  • Manual BM25 implementation using HashMap<String, Vec<DocId>>

Design Questions:

  • Should AirsDSP provide built-in tokenization, or require pre-tokenized input?
  • Should stemming/stopword removal be configurable or opinionated?


3. Graph Retrieval (Knowledge Graphs & DAGs)

Core Concept

Graph retrieval treats knowledge as a network of entities (nodes) and relationships (edges), enabling traversal-based and structural queries.

Knowledge Graph Fundamentals

Structure:

(Entity) --[Relationship]--> (Entity)

Example:
(Paris) --[capital_of]--> (France)
(France) --[located_in]--> (Europe)
(Paris) --[has_population]--> (2.2M)

Representation:

  • RDF Triples: (subject, predicate, object)
    • (Paris, capital_of, France)
  • Property Graphs: Nodes and edges can have properties
    • Node: {id: "Paris", type: "City", population: 2.2M}
    • Edge: {from: "Paris", to: "France", type: "capital_of", since: 987}
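
A sketch of how a property graph might be represented in plain Rust; the Node and Edge shapes are illustrative, not any particular database's model:

    use std::collections::HashMap;

    struct Node {
        id: String,
        label: String,                  // e.g. "City"
        props: HashMap<String, String>, // e.g. "population" -> "2200000"
    }

    struct Edge {
        from: String,
        to: String,
        label: String,                  // e.g. "capital_of"
        props: HashMap<String, String>, // e.g. "since" -> "987"
    }

    fn main() {
        let paris = Node {
            id: "Paris".into(),
            label: "City".into(),
            props: HashMap::from([("population".into(), "2200000".into())]),
        };
        let cap = Edge {
            from: "Paris".into(),
            to: "France".into(),
            label: "capital_of".into(),
            props: HashMap::from([("since".into(), "987".into())]),
        };
        println!("({}) --[{}]--> ({}) since {:?}", cap.from, cap.label, cap.to, cap.props.get("since"));
        println!("{} ({}): population {:?}", paris.id, paris.label, paris.props.get("population"));
    }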

Graph Query Patterns

1. Direct Lookup

Query: "What is the capital of France?"
Graph: (?) --[capital_of]--> (France)
Result: Paris

2. Multi-Hop Traversal

Query: "What continent is Paris in?"
Graph: (Paris) --[capital_of]--> (France) --[located_in]--> (?)
Result: Europe

3. Subgraph Matching

Query: "Find all cities in Europe with population > 1M"
Pattern:
  (City) --[located_in]--> (Country) --[located_in]--> (Europe)
  WHERE City.population > 1M
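
A minimal sketch of pattern 2 (multi-hop traversal) over a toy adjacency-list triple store; the Graph alias and traverse function are illustrative only:

    use std::collections::HashMap;

    // Toy triple store: subject -> [(predicate, object)].
    type Graph<'a> = HashMap<&'a str, Vec<(&'a str, &'a str)>>;

    // Follow a fixed chain of predicates from a start node.
    fn traverse<'a>(g: &Graph<'a>, start: &'a str, path: &[&str]) -> Vec<&'a str> {
        let mut frontier: Vec<&'a str> = vec![start];
        for pred in path {
            let mut next = Vec::new();
            for node in &frontier {
                if let Some(edges) = g.get(node) {
                    for (p, o) in edges {
                        if p == pred {
                            next.push(*o);
                        }
                    }
                }
            }
            frontier = next; // one hop completed
        }
        frontier
    }

    fn main() {
        let mut g: Graph<'static> = HashMap::new();
        g.entry("Paris").or_default().push(("capital_of", "France"));
        g.entry("France").or_default().push(("located_in", "Europe"));

        // "What continent is Paris in?" (two hops)
        println!("{:?}", traverse(&g, "Paris", &["capital_of", "located_in"])); // ["Europe"]
    }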

GraphRAG (Graph-based RAG)

Recent research shows combining graphs with LLMs improves complex reasoning.

Microsoft's GraphRAG Approach (2024):

  1. Graph Construction: Extract entities and relationships from documents
  2. Community Detection: Cluster related entities
  3. Hierarchical Summarization: Summarize each community
  4. Query-Time Retrieval:
     • Map query to relevant communities
     • Retrieve community summaries + local subgraph
     • LLM synthesizes answer

Benefits over Pure Vector RAG:

  • ✅ Better at multi-hop reasoning
  • ✅ Captures explicit relationships
  • ✅ Enables "explain the connection" queries
  • ✅ Reduces hallucination (grounded in graph structure)

DAG-Based Retrieval

Use Case: Document hierarchies, dependency graphs, citation networks

Example - Academic Papers:

Paper A --[cites]--> Paper B --[cites]--> Paper C

Query: "Find all papers that build on Paper C"
Traversal: Reverse edges from C

Topological Ordering (a petgraph sketch follows):

  • Process nodes in dependency order
  • Useful for "what must I read before X?" queries
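
A sketch of both queries above using petgraph (listed in the ecosystem notes below); the citation DAG and paper names are illustrative:

    use petgraph::algo::toposort;
    use petgraph::graph::DiGraph;
    use petgraph::Direction;

    fn main() {
        // Citation DAG: an edge A -> B means "A cites B".
        let mut g = DiGraph::<&str, ()>::new();
        let a = g.add_node("Paper A");
        let b = g.add_node("Paper B");
        let c = g.add_node("Paper C");
        g.add_edge(a, b, ());
        g.add_edge(b, c, ());

        // "What must I read before Paper A?": topological order lists citing
        // papers before cited ones, so reverse it for a reading list ending at A.
        let order = toposort(&g, None).expect("citation graph must be acyclic");
        let reading_list: Vec<_> = order.iter().rev().map(|&n| g[n]).collect();
        println!("{:?}", reading_list); // ["Paper C", "Paper B", "Paper A"]

        // "Find all papers that build on Paper C": follow incoming edges.
        // (Direct citers only; repeat transitively for the full set.)
        let builders: Vec<_> = g.neighbors_directed(c, Direction::Incoming).map(|n| g[n]).collect();
        println!("{:?}", builders); // ["Paper B"]
    }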

Graph Query Languages

Language   Description                  Example
SPARQL     RDF query language           SELECT ?capital WHERE { ?capital :capitalOf :France }
Cypher     Neo4j's query language       MATCH (c:City)-[:CAPITAL_OF]->(:Country {name: 'France'}) RETURN c
Gremlin    Apache TinkerPop traversal   g.V().has('name','France').in('capital_of')

Strengths

  • Relationship-Aware: Explicitly models "who knows whom", "what causes what"
  • Multi-Hop Reasoning: Natural for "friend of a friend" queries
  • Explainable Paths: Can show reasoning chain
  • Structured Knowledge: Enforces schema and constraints

Weaknesses

  • Construction Cost: Building accurate KGs is expensive
  • Incompleteness: Graphs are never complete
  • Query Complexity: Writing graph queries requires expertise
  • Scalability: Large graphs (billions of edges) are hard to query efficiently

Implementation Considerations for AirsDSP

Rust Ecosystem:

  • petgraph: Pure Rust graph data structure library
  • indradb: Graph database written in Rust
  • neo4rs: Neo4j driver for Rust
  • Custom adjacency list: HashMap<NodeId, Vec<Edge>>

Design Questions:

  • Should AirsDSP assume pre-built graphs, or provide graph construction from text?
  • Should graph queries be expressed in a DSL, or via a Rust API?
  • How to integrate graph retrieval with the Search operation in DSP pipelines?


Comparative Analysis

When to Use Each Method

Scenario                                   Best Method   Rationale
"Find documents about climate change"      Vector        Semantic understanding needed
"Find document with ID ABC-123"            Keyword       Exact match required
"Who is the CEO of Apple?"                 Graph         Structured relationship query
"Explain the connection between X and Y"   Graph         Multi-hop reasoning
"Find similar research papers"             Vector        Semantic similarity
"Search for error code E4502"              Keyword       Exact term match

Hybrid Approaches (Future Research)

While this document focuses on individual methods, real-world systems often combine them:

  • Vector + Keyword: Use BM25 for initial filtering, re-rank with vector similarity
  • Graph + Vector: Embed graph neighborhoods, search in vector space
  • All Three: Route query to appropriate method based on query type

Note: Hybrid search design will be addressed in a separate research document after foundational understanding is established.


Implications for AirsDSP Architecture

Core Requirements

  1. Pluggable Retrieval: The Search trait must accommodate all three paradigms (a sketch follows this list)
  2. Pipeline Awareness: Retrieval strategy may change based on pipeline stage
  3. Explicit Control: Users should choose the retrieval method explicitly rather than have it auto-selected
  4. Performance: Retrieval is often the bottleneck in RAG systems
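
A hypothetical sketch of what such a pluggable trait could look like; the names and shape are illustrative, not a committed AirsDSP API:

    // A retrieved unit of context, regardless of backend.
    pub struct RetrievedPassage {
        pub doc_id: String,
        pub text: String,
        pub score: f32,
    }

    pub trait Search {
        type Error;

        // Return the top-k passages for a query. Vector, keyword, and graph
        // backends each implement this over their own index.
        fn search(&self, query: &str, k: usize) -> Result<Vec<RetrievedPassage>, Self::Error>;
    }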

Open Questions for Architecture Phase

  1. Should AirsDSP provide default implementations, or just traits?
  2. How to handle retrieval that requires external services (e.g., Qdrant, Neo4j)?
  3. Should the framework support retrieval result caching?
  4. How to represent retrieved context in the DSP Context object?

References

Vector & Keyword Retrieval

  • Karpukhin et al. (2020) - "Dense Passage Retrieval for Open-Domain Question Answering"
  • Khattab & Zaharia (2020) - "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT"
  • Robertson & Zaragoza (2009) - "The Probabilistic Relevance Framework: BM25 and Beyond"
  • Manning et al. (2008) - "Introduction to Information Retrieval" (Chapter 6: Scoring, Term Weighting, and the Vector Space Model)

Graph Retrieval

  • Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (Microsoft Research)
  • Hogan et al. (2021) - "Knowledge Graphs" (ACM Computing Surveys)

Rust Libraries

  • Tantivy: https://github.com/quickwit-oss/tantivy
  • Qdrant: https://qdrant.tech/
  • Petgraph: https://github.com/petgraph/petgraph