MongoDB Vector Search for Semantic Applications: Building AI-Powered Search with SQL-Style Vector Operations
Modern applications increasingly need search that understands meaning rather than just matching keywords. Traditional text-based search struggles with context, synonyms, and queries that depend on conceptual understanding rather than exact text matches.
MongoDB Atlas Vector Search provides native vector database capabilities that enable semantic similarity search, recommendation systems, and retrieval-augmented generation (RAG) applications. Unlike standalone vector databases that require separate infrastructure, Atlas Vector Search integrates seamlessly with MongoDB's document model, allowing developers to combine traditional database operations with advanced AI-powered search in a single, unified platform.
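To make "semantic similarity" concrete, here is a minimal sketch of the cosine similarity measure that vector search commonly relies on. The three-dimensional vectors and the `rankBySimilarity` helper are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and Atlas computes similarity inside the index rather than in application code.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as dissimilar to everything
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical toy "embeddings" (assumption: hand-picked values, not model output)
const toyEmbeddings = {
  'machine learning': [0.9, 0.1, 0.0],
  'neural networks': [0.8, 0.3, 0.1],
  'banana bread recipe': [0.0, 0.2, 0.9]
};

// Rank every stored text by similarity to the query vector, best first
function rankBySimilarity(queryVector, embeddings) {
  return Object.entries(embeddings)
    .map(([text, vector]) => ({ text, score: cosineSimilarity(queryVector, vector) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankBySimilarity(toyEmbeddings['machine learning'], toyEmbeddings);
console.log(ranked.map(r => `${r.text}: ${r.score.toFixed(3)}`));
```

In this toy data, "neural networks" ranks well above "banana bread recipe" for the "machine learning" query even though the texts share no words, which is the behavior keyword search cannot provide.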
The Traditional Search Limitations Challenge
Conventional approaches to search and content discovery have significant limitations for modern intelligent applications:
-- Traditional relational search - limited semantic understanding
-- PostgreSQL full-text search with performance and relevance challenges
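-- array_to_string() is only STABLE, so a generated column needs an IMMUTABLE wrapper
CREATE FUNCTION immutable_array_to_string(arr TEXT[], sep TEXT)
RETURNS TEXT LANGUAGE sql IMMUTABLE AS
$$ SELECT array_to_string(arr, sep) $$;
CREATE TABLE documents (
document_id SERIAL PRIMARY KEY,
title VARCHAR(500) NOT NULL,
content TEXT NOT NULL,
category VARCHAR(100),
tags TEXT[],
author VARCHAR(200),
created_at TIMESTAMP DEFAULT NOW(),
-- Full-text search vector (keyword-based only)
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', content), 'B') ||
-- COALESCE keeps the whole vector from becoming NULL when tags is NULL
setweight(to_tsvector('english', COALESCE(immutable_array_to_string(tags, ' '), '')), 'C')
) STORED
);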
-- Create full-text search index
CREATE INDEX idx_documents_fts ON documents USING GIN(search_vector);
-- Additional indexes for filtering
CREATE INDEX idx_documents_category ON documents(category);
CREATE INDEX idx_documents_created_at ON documents(created_at DESC);
CREATE INDEX idx_documents_author ON documents(author);
-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
SELECT
document_id,
title,
content,
category,
tags, -- needed later by the related_documents CTE
author,
created_at,
-- Basic relevance scoring (keyword-based only)
ts_rank_cd(search_vector, plainto_tsquery('english', 'machine learning algorithms')) as relevance_score,
-- Highlight matching text
ts_headline('english', content, plainto_tsquery('english', 'machine learning algorithms'),
'MaxWords=50, MinWords=20, ShortWord=3, HighlightAll=false') as highlighted_content,
-- Basic similarity using trigram matching (requires the pg_trgm extension; very limited)
similarity(title, 'machine learning algorithms') as title_similarity,
-- Category boosting (manual relevance adjustment)
CASE category
WHEN 'AI' THEN 1.5
WHEN 'Technology' THEN 1.2
ELSE 1.0
END as category_boost
FROM documents
WHERE search_vector @@ plainto_tsquery('english', 'machine learning algorithms')
OR similarity(title, 'machine learning algorithms') > 0.1
),
ranked_results AS (
SELECT
*,
-- Combined relevance scoring (still keyword-dependent)
(relevance_score * category_boost *
CASE WHEN title_similarity > 0.3 THEN 2.0 ELSE 1.0 END) as final_score,
-- Manual semantic grouping (limited effectiveness)
CASE
WHEN content ILIKE '%neural network%' OR content ILIKE '%deep learning%' THEN 'Deep Learning'
WHEN content ILIKE '%statistics%' OR content ILIKE '%data science%' THEN 'Data Science'
WHEN content ILIKE '%algorithm%' OR content ILIKE '%optimization%' THEN 'Algorithms'
ELSE 'General'
END as semantic_category,
-- Time decay factor
CASE
WHEN created_at >= NOW() - INTERVAL '30 days' THEN 1.2
WHEN created_at >= NOW() - INTERVAL '90 days' THEN 1.0
WHEN created_at >= NOW() - INTERVAL '1 year' THEN 0.8
ELSE 0.6
END as recency_boost
FROM search_query
WHERE relevance_score > 0.01
),
related_documents AS (
-- Attempt to find related documents (very basic approach)
SELECT DISTINCT
r1.document_id,
r2.document_id as related_id,
r2.title as related_title,
-- Basic relatedness calculation
(COALESCE(array_length(array(SELECT UNNEST(r1.tags) INTERSECT SELECT UNNEST(r2.tags)), 1), 0) /
GREATEST(array_length(r1.tags, 1), array_length(r2.tags, 1), 1)::numeric) as tag_similarity,
CASE WHEN r1.category = r2.category THEN 0.3 ELSE 0 END as category_match,
CASE WHEN r1.author = r2.author THEN 0.2 ELSE 0 END as author_match
FROM ranked_results r1
JOIN documents r2 ON r1.document_id != r2.document_id
WHERE r1.final_score > 0.5
),
final_results AS (
SELECT
r.document_id,
r.title,
LEFT(r.content, 200) || '...' as content_preview,
r.highlighted_content,
r.category,
r.semantic_category,
r.author,
r.created_at,
-- Final ranking with all factors
ROUND((r.final_score * r.recency_boost)::numeric, 4) as final_relevance_score,
-- Related documents (limited by keyword overlap)
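COALESCE(
(SELECT json_agg(json_build_object(
'id', top_related.related_id,
'title', top_related.related_title,
'similarity', ROUND(top_related.total_similarity::numeric, 3)
))
FROM (
SELECT
rd.related_id,
rd.related_title,
(rd.tag_similarity + rd.category_match + rd.author_match) as total_similarity
FROM related_documents rd
WHERE rd.document_id = r.document_id
AND (rd.tag_similarity + rd.category_match + rd.author_match) > 0.1
ORDER BY total_similarity DESC
-- LIMIT must run before json_agg, otherwise it only limits the single aggregated row
LIMIT 5
) top_related),
'[]'::json
) as related_documents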
FROM ranked_results r
)
SELECT
document_id,
title,
content_preview,
highlighted_content,
category,
semantic_category,
author,
final_relevance_score,
related_documents,
-- Search result metadata
COUNT(*) OVER () as total_results,
ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as result_rank
FROM final_results
ORDER BY final_relevance_score DESC, created_at DESC
LIMIT 20;
-- Problems with traditional keyword-based search:
-- 1. No understanding of semantic meaning or context
-- 2. Cannot handle synonyms, related concepts, or conceptual queries
-- 3. Limited relevance scoring based only on keyword frequency and position
-- 4. Poor handling of multilingual content and cross-language search
-- 5. No support for similarity search across different content types
-- 6. Manual and error-prone relevance tuning with limited effectiveness
-- 7. Cannot understand user intent beyond explicit keyword matches
-- 8. Poor recommendation capabilities based only on metadata overlap
-- 9. Limited support for complex search patterns and AI-powered features
-- 10. No integration with modern machine learning and embedding models
-- MySQL approach (even more limited)
SELECT
document_id,
title,
content,
category,
-- Basic full-text search (MySQL limitations)
MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE) as relevance,
-- Simple keyword highlighting
REPLACE(
REPLACE(title, 'machine', '<mark>machine</mark>'),
'learning', '<mark>learning</mark>'
) as highlighted_title
FROM mysql_documents
WHERE MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC
LIMIT 10;
-- MySQL limitations:
-- - Very basic full-text search with limited relevance algorithms
-- - No semantic understanding or contextual matching
-- - Limited text processing and language support
-- - Basic relevance scoring without advanced ranking factors
-- - No support for vector embeddings or similarity search
-- - Limited customization of search behavior and ranking
-- - Poor performance with large text corpora
-- - No integration with modern AI/ML search techniques
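The synonym problem listed above is easy to reproduce. The sketch below is a deliberately simplified stand-in for keyword matching (no stemming, ranking, or stop words, all of which a real full-text engine adds); `tokenize` and `keywordMatch` are hypothetical helpers written for this illustration.

```javascript
// Split text into lowercase word tokens
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Naive keyword match: every query term must appear as a token in the document
function keywordMatch(query, documentText) {
  const docTokens = new Set(tokenize(documentText));
  return tokenize(query).every(term => docTokens.has(term));
}

const sampleDocument = 'A practical guide to automobile maintenance and engine repair';

// The literal terms are present, so the document is found
const exactTermsFound = keywordMatch('engine repair', sampleDocument);

// Same intent expressed with different words, so the document is missed
const synonymQueryFound = keywordMatch('car servicing', sampleDocument);

console.log({ exactTermsFound, synonymQueryFound });
```

Stemming and synonym dictionaries can narrow this gap, but they have to be maintained by hand, which is the tuning burden the list above describes.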
MongoDB Atlas Vector Search provides intelligent semantic search capabilities:
// MongoDB Atlas Vector Search - AI-powered semantic search and similarity matching
const { MongoClient, ObjectId } = require('mongodb'); // ObjectId is used when ingesting documents
const client = new MongoClient('mongodb+srv://your-cluster.mongodb.net/');
const db = client.db('intelligent_search_platform');
// Advanced vector search and semantic similarity platform
class VectorSearchManager {
constructor(db) {
this.db = db;
this.collections = {
documents: db.collection('documents'),
vectorIndex: db.collection('vector_index_metadata'),
searchAnalytics: db.collection('search_analytics'),
userProfiles: db.collection('user_profiles'),
recommendations: db.collection('recommendations')
};
// Vector search configuration
this.vectorConfig = {
dimensions: 1536, // OpenAI text-embedding-ada-002
similarity: 'cosine',
indexType: 'vectorSearch' // Matches the 'vector' field type used below; knnVector is the older Atlas Search mapping
};
this.embeddingModel = 'text-embedding-ada-002'; // Can be configured for different models
}
async initializeVectorSearchIndexes() {
console.log('Initializing Atlas Vector Search indexes...');
// Create vector search index for document content
const contentVectorIndex = {
name: 'content_vector_index',
definition: {
fields: [
{
type: 'vector',
path: 'contentVector',
numDimensions: this.vectorConfig.dimensions,
similarity: this.vectorConfig.similarity
},
{
type: 'filter',
path: 'category'
},
{
type: 'filter',
path: 'tags'
},
{
type: 'filter',
path: 'publishedDate'
},
{
type: 'filter',
path: 'author'
},
{
type: 'filter',
path: 'contentType'
}
]
}
};
// Create vector search index for title embeddings
const titleVectorIndex = {
name: 'title_vector_index',
definition: {
fields: [
{
type: 'vector',
path: 'titleVector',
numDimensions: this.vectorConfig.dimensions,
similarity: this.vectorConfig.similarity
}
]
}
};
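// Create the Atlas Search index used for the keyword side of hybrid search.
// Vector fields belong in a vectorSearch index (see content_vector_index above);
// an Atlas Search index describes its fields under `mappings` instead.
const hybridSearchIndex = {
name: 'hybrid_search_index',
type: 'search',
definition: {
mappings: {
dynamic: false,
fields: {
// title is indexed twice: as text for the text operator and for autocomplete
title: [
{ type: 'string', analyzer: 'lucene.standard' },
{
type: 'autocomplete',
tokenization: 'edgeGram',
minGrams: 2,
maxGrams: 15
}
],
content: { type: 'string', analyzer: 'lucene.standard' },
// token mappings support the equals / in filter clauses built later
tags: [
{ type: 'string', analyzer: 'lucene.keyword' },
{ type: 'token' }
],
category: { type: 'token' }
}
}
}
};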
try {
// Note: this method only records the index definitions for reference. Create the indexes
// themselves through the Atlas UI, the Atlas CLI, or the driver's collection.createSearchIndex().
console.log('Vector search indexes configured:');
console.log('- Content Vector Index:', contentVectorIndex.name);
console.log('- Title Vector Index:', titleVectorIndex.name);
console.log('- Hybrid Search Index:', hybridSearchIndex.name);
// Store index metadata for application reference
await this.collections.vectorIndex.insertMany([
{ ...contentVectorIndex, createdAt: new Date(), status: 'active' },
{ ...titleVectorIndex, createdAt: new Date(), status: 'active' },
{ ...hybridSearchIndex, createdAt: new Date(), status: 'active' }
]);
return {
contentVectorIndex: contentVectorIndex.name,
titleVectorIndex: titleVectorIndex.name,
hybridSearchIndex: hybridSearchIndex.name
};
} catch (error) {
console.error('Vector index initialization failed:', error);
throw error;
}
}
async ingestDocumentsWithVectorization(documents) {
console.log(`Processing ${documents.length} documents for vector search ingestion...`);
const processedDocuments = [];
const batchSize = 10;
// Process documents in batches to manage API rate limits
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
console.log(`Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(documents.length / batchSize)}`);
const batchPromises = batch.map(async (doc) => {
try {
// Generate embeddings for title and content
const [titleEmbedding, contentEmbedding] = await Promise.all([
this.generateEmbedding(doc.title),
this.generateEmbedding(doc.content)
]);
// Extract key phrases and entities for enhanced searchability
const extractedEntities = await this.extractEntities(doc.content);
const keyPhrases = await this.extractKeyPhrases(doc.content);
// Calculate content characteristics for better matching
const contentCharacteristics = this.analyzeContentCharacteristics(doc.content);
return {
_id: doc._id || new ObjectId(),
// Original document content
title: doc.title,
content: doc.content,
summary: doc.summary || this.generateSummary(doc.content),
// Document metadata
category: doc.category,
tags: doc.tags || [],
author: doc.author,
publishedDate: doc.publishedDate || new Date(),
contentType: doc.contentType || 'article',
language: doc.language || 'en',
// Vector embeddings for semantic search
titleVector: titleEmbedding,
contentVector: contentEmbedding,
// Enhanced searchability features
entities: extractedEntities,
keyPhrases: keyPhrases,
contentCharacteristics: contentCharacteristics,
// Search optimization metadata
searchMetadata: {
wordCount: doc.content.split(/\s+/).length,
readingTime: Math.ceil(doc.content.split(/\s+/).length / 200), // minutes
complexity: contentCharacteristics.complexity,
topicDistribution: contentCharacteristics.topics,
sentimentScore: contentCharacteristics.sentiment
},
// Document quality and authority signals
qualitySignals: {
authorityScore: doc.authorityScore || 0.5,
freshnessScore: this.calculateFreshnessScore(doc.publishedDate || new Date()),
engagementScore: doc.engagementScore || 0.5,
accuracyScore: doc.accuracyScore || 0.8
},
// Indexing and processing metadata
indexed: true,
indexedAt: new Date(),
vectorModelVersion: this.embeddingModel,
processingVersion: '1.0'
};
} catch (error) {
console.error(`Failed to process document ${doc._id}:`, error);
return null;
}
});
const batchResults = await Promise.all(batchPromises);
const validResults = batchResults.filter(result => result !== null);
processedDocuments.push(...validResults);
// Rate limiting pause between batches
if (i + batchSize < documents.length) {
await new Promise(resolve => setTimeout(resolve, 1000));
}
}
// Bulk insert processed documents
if (processedDocuments.length > 0) {
const insertResult = await this.collections.documents.insertMany(processedDocuments, {
ordered: false
});
console.log(`Successfully indexed ${insertResult.insertedCount} documents with vector embeddings`);
return {
totalProcessed: documents.length,
successfullyIndexed: insertResult.insertedCount,
failed: documents.length - processedDocuments.length,
indexedDocuments: processedDocuments
};
}
return {
totalProcessed: documents.length,
successfullyIndexed: 0,
failed: documents.length,
indexedDocuments: []
};
}
async performSemanticSearch(query, options = {}) {
console.log(`Performing semantic search for: "${query}"`);
const {
limit = 20,
filters = {},
includeScore = true,
similarityThreshold = 0.7,
searchType = 'semantic', // 'semantic' or 'hybrid' (a keyword-only mode is not implemented here)
userContext = null
} = options;
try {
// Generate query embedding for semantic search
const queryEmbedding = await this.generateEmbedding(query);
let pipeline = [];
if (searchType === 'semantic' || searchType === 'hybrid') {
// Vector similarity search stage
pipeline.push({
$vectorSearch: {
index: 'content_vector_index',
path: 'contentVector',
queryVector: queryEmbedding,
numCandidates: limit * 10, // Search more candidates for better results
limit: limit * 2, // Get more results for reranking
filter: this.buildFilterExpression(filters)
}
});
// Add vector search score
pipeline.push({
$addFields: {
vectorScore: { $meta: 'vectorSearchScore' },
searchMethod: 'vector'
}
});
}
if (searchType === 'hybrid') {
// Combine with text search for hybrid approach
pipeline.push({
$unionWith: {
coll: 'documents',
pipeline: [
{
$search: {
index: 'hybrid_search_index',
compound: {
should: [
{
text: {
query: query,
path: ['title', 'content'],
score: { boost: { value: 2.0 } }
}
},
{
autocomplete: {
query: query,
path: 'title',
score: { boost: { value: 1.5 } }
}
}
],
filter: this.buildSearchFilterClauses(filters)
}
}
},
{
$addFields: {
textScore: { $meta: 'searchScore' },
searchMethod: 'text'
}
},
{ $limit: limit }
]
}
});
}
// Enhanced result processing and ranking
pipeline.push({
$addFields: {
// Calculate comprehensive relevance score
relevanceScore: {
$switch: {
branches: [
{
case: { $eq: ['$searchMethod', 'vector'] },
then: {
$multiply: [
{ $ifNull: ['$vectorScore', 0] },
{ $add: [
{ $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.2] },
{ $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.1] },
{ $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.15] },
0.55 // Base score weight
]}
]
}
},
{
case: { $eq: ['$searchMethod', 'text'] },
then: {
$multiply: [
{ $ifNull: ['$textScore', 0] },
0.8 // Weight text search lower than semantic
]
}
}
],
default: 0
}
},
// Extract relevant snippets
contentSnippet: {
$substrCP: [
'$content',
0,
300
]
},
// Calculate query-document semantic similarity
semanticRelevance: {
$cond: {
if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold] },
then: 'high',
else: {
$cond: {
if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold * 0.8] },
then: 'medium',
else: 'low'
}
}
}
}
}
});
// User personalization if context provided
if (userContext) {
pipeline.push({
$addFields: {
personalizedScore: {
$multiply: [
'$relevanceScore',
{
$add: [
// Category preference boost
{
$cond: {
if: { $in: ['$category', userContext.preferredCategories || []] },
then: 0.2,
else: 0
}
},
// Author preference boost
{
$cond: {
if: { $in: ['$author', userContext.followedAuthors || []] },
then: 0.15,
else: 0
}
},
// Language preference
{
$cond: {
if: { $eq: ['$language', userContext.preferredLanguage || 'en'] },
then: 0.1,
else: -0.05
}
},
1.0 // Base multiplier
]
}
]
}
}
});
}
// Filter by similarity threshold and finalize results
pipeline.push(
{
$match: {
relevanceScore: { $gte: similarityThreshold * 0.5 }
}
},
{
$sort: {
[userContext ? 'personalizedScore' : 'relevanceScore']: -1,
publishedDate: -1
}
},
{
$limit: limit
},
{
$project: {
_id: 1,
title: 1,
contentSnippet: 1,
category: 1,
tags: 1,
author: 1,
publishedDate: 1,
contentType: 1,
language: 1,
entities: 1,
keyPhrases: 1,
searchMetadata: 1,
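// An inclusion $project cannot also exclude fields (other than _id),
// so score fields are added only when they were requested
...(includeScore ? { relevanceScore: 1, vectorScore: 1, textScore: 1 } : {}),
...(includeScore && userContext ? { personalizedScore: 1 } : {}),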
semanticRelevance: 1,
searchMethod: 1
}
}
);
const searchStart = Date.now();
const results = await this.collections.documents.aggregate(pipeline).toArray();
const searchTime = Date.now() - searchStart;
// Log search analytics
await this.logSearchAnalytics({
query: query,
searchType: searchType,
filters: filters,
resultCount: results.length,
searchTime: searchTime,
userContext: userContext,
timestamp: new Date()
});
console.log(`Semantic search completed in ${searchTime}ms, found ${results.length} results`);
return {
query: query,
searchType: searchType,
results: results,
metadata: {
totalResults: results.length,
searchTime: searchTime,
similarityThreshold: similarityThreshold,
filtersApplied: Object.keys(filters).length > 0
}
};
} catch (error) {
console.error('Semantic search failed:', error);
throw error;
}
}
async findSimilarDocuments(documentId, options = {}) {
console.log(`Finding documents similar to: ${documentId}`);
const {
limit = 10,
similarityThreshold = 0.75,
excludeCategories = [],
includeScore = true
} = options;
// Get the source document and its vector
const sourceDocument = await this.collections.documents.findOne(
{ _id: documentId },
{ projection: { contentVector: 1, title: 1, category: 1, tags: 1 } }
);
if (!sourceDocument || !sourceDocument.contentVector) {
throw new Error('Source document not found or not vectorized');
}
// Find similar documents using vector search
const pipeline = [
{
$vectorSearch: {
index: 'content_vector_index',
path: 'contentVector',
queryVector: sourceDocument.contentVector,
numCandidates: limit * 20,
limit: limit * 2,
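// $vectorSearch filters may only reference fields indexed with type 'filter'
...(excludeCategories.length > 0 ?
{ filter: { category: { $nin: excludeCategories } } } :
{})
}
},
{
// Exclude the source document itself; _id is not declared as a filter field
// in content_vector_index, so this is done in a regular $match stage
$match: { _id: { $ne: documentId } }
},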
{
$addFields: {
similarityScore: { $meta: 'vectorSearchScore' },
// Calculate additional similarity factors
tagSimilarity: {
$let: {
vars: {
commonTags: {
$size: {
$setIntersection: [{ $ifNull: ['$tags', []] }, sourceDocument.tags || []]
}
},
totalTags: {
$add: [
{ $size: { $ifNull: ['$tags', []] } },
(sourceDocument.tags || []).length // Known on the client, so no expression is needed
]
}
},
in: {
$cond: {
if: { $gt: ['$$totalTags', 0] },
then: { $divide: ['$$commonTags', '$$totalTags'] },
else: 0
}
}
}
},
categorySimilarity: {
$cond: {
if: { $eq: ['$category', sourceDocument.category] },
then: 0.2,
else: 0
}
}
}
},
{
$addFields: {
combinedSimilarity: {
$add: [
{ $multiply: ['$similarityScore', 0.7] },
{ $multiply: ['$tagSimilarity', 0.2] },
'$categorySimilarity'
]
}
}
},
{
$match: {
combinedSimilarity: { $gte: similarityThreshold }
}
},
{
$sort: { combinedSimilarity: -1 }
},
{
$limit: limit
},
{
$project: {
_id: 1,
title: 1,
contentSnippet: { $substrCP: ['$content', 0, 200] },
category: 1,
tags: 1,
author: 1,
publishedDate: 1,
// Add score fields only when requested; mixing 0 and 1 in one $project is rejected
...(includeScore ? { similarityScore: 1, combinedSimilarity: 1 } : {}),
searchMetadata: 1
}
}
];
const similarDocuments = await this.collections.documents.aggregate(pipeline).toArray();
return {
sourceDocumentId: documentId,
sourceTitle: sourceDocument.title,
similarDocuments: similarDocuments,
metadata: {
totalSimilar: similarDocuments.length,
similarityThreshold: similarityThreshold,
searchMethod: 'vector_similarity'
}
};
}
async generateRecommendations(userId, options = {}) {
console.log(`Generating personalized recommendations for user: ${userId}`);
const {
limit = 15,
diversityFactor = 0.3,
includeExplanations = true
} = options;
// Get user profile and interaction history
const userProfile = await this.collections.userProfiles.findOne({ userId: userId });
if (!userProfile) {
console.log('User profile not found, using general recommendations');
return this.generateGeneralRecommendations(limit);
}
// Build user preference vector from interaction history
const userVector = await this.buildUserPreferenceVector(userProfile);
if (!userVector) {
return this.generateGeneralRecommendations(limit);
}
// Find documents matching user preferences
const pipeline = [
{
$vectorSearch: {
index: 'content_vector_index',
path: 'contentVector',
queryVector: userVector,
numCandidates: limit * 10,
limit: limit * 3,
filter: {
$and: [
// Fresh content preference (last 90 days)
{
publishedDate: {
$gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
}
},
// Restrict to preferred categories only when the profile lists any
...(userProfile.preferredCategories && userProfile.preferredCategories.length > 0 ?
[{ category: { $in: userProfile.preferredCategories } }] :
[])
]
}
}
},
{
// Exclude already read documents; _id is not declared as a filter field
// in content_vector_index, so this is done in a regular $match stage
$match: { _id: { $nin: userProfile.readDocuments || [] } }
},
{
$addFields: {
preferenceScore: { $meta: 'vectorSearchScore' },
// Category affinity scoring
categoryScore: {
$switch: {
branches: (userProfile.categoryAffinities || []).map(affinity => ({
case: { $eq: ['$category', affinity.category] },
then: affinity.score
})),
default: 0.5
}
},
// Author following boost
authorScore: {
$cond: {
if: { $in: ['$author', userProfile.followedAuthors || []] },
then: 0.8,
else: 0.4
}
},
// Freshness scoring
freshnessScore: {
$divide: [
// Age in 30-day units; $subtract needs a Date on the left to subtract a date field
{ $subtract: [new Date(), '$publishedDate'] },
(30 * 24 * 60 * 60 * 1000) // 30 days in milliseconds
]
}
}
},
{
$addFields: {
recommendationScore: {
$add: [
{ $multiply: ['$preferenceScore', 0.4] },
{ $multiply: ['$categoryScore', 0.25] },
{ $multiply: ['$authorScore', 0.2] },
{ $multiply: [{ $max: [0, { $subtract: [1, '$freshnessScore'] }] }, 0.15] }
]
}
}
}
];
// Apply diversity to avoid filter bubble
if (diversityFactor > 0) {
pipeline.push({
$group: {
_id: '$category',
documents: {
$push: {
_id: '$_id',
title: '$title',
recommendationScore: '$recommendationScore',
category: '$category',
author: '$author',
publishedDate: '$publishedDate',
tags: '$tags'
}
},
maxScore: { $max: '$recommendationScore' }
}
});
pipeline.push({
$sort: { maxScore: -1 }
});
// Select diverse recommendations
pipeline.push({
$project: {
documents: {
$slice: [
{ $sortArray: { input: '$documents', sortBy: { recommendationScore: -1 } } },
Math.ceil(limit * diversityFactor)
]
}
}
});
pipeline.push({
$unwind: '$documents'
});
pipeline.push({
$replaceRoot: { newRoot: '$documents' }
});
}
pipeline.push(
{
$sort: { recommendationScore: -1 }
},
{
$limit: limit
}
);
const recommendations = await this.collections.documents.aggregate(pipeline).toArray();
// Generate explanations if requested
if (includeExplanations) {
for (const rec of recommendations) {
rec.explanation = this.generateRecommendationExplanation(rec, userProfile);
}
}
// Store recommendations for future analysis
await this.collections.recommendations.insertOne({
userId: userId,
recommendations: recommendations.map(r => ({
documentId: r._id,
score: r.recommendationScore,
explanation: r.explanation
})),
generatedAt: new Date(),
algorithm: 'vector_preference_matching',
diversityFactor: diversityFactor
});
return {
userId: userId,
recommendations: recommendations,
metadata: {
totalRecommendations: recommendations.length,
algorithm: 'vector_preference_matching',
diversityApplied: diversityFactor > 0,
generatedAt: new Date()
}
};
}
// Helper methods for vector search operations
async generateEmbedding(text) {
// In production, this would call OpenAI API or other embedding service
// For this example, we'll simulate embeddings
// Simulate API call delay
await new Promise(resolve => setTimeout(resolve, 100));
// Generate mock embedding vector (in production, use actual embedding API)
const mockEmbedding = Array.from({ length: this.vectorConfig.dimensions }, () =>
Math.random() * 2 - 1 // Values between -1 and 1
);
return mockEmbedding;
}
async extractEntities(text) {
// Simulate entity extraction (in production, use NLP service)
const entities = [];
// Basic keyword extraction simulation
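const lowerText = text.toLowerCase();
const words = new Set(lowerText.split(/\W+/));
const entityKeywords = ['mongodb', 'database', 'javascript', 'python', 'ai', 'machine learning'];
entityKeywords.forEach(keyword => {
// Single words match whole tokens; multi-word phrases can never equal one token,
// so they are matched as substrings of the text instead
if (keyword.includes(' ') ? lowerText.includes(keyword) : words.has(keyword)) {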
entities.push({
text: keyword,
type: 'technology',
confidence: 0.8
});
}
});
return entities;
}
async extractKeyPhrases(text) {
// Simulate key phrase extraction
const sentences = text.split(/[.!?]+/);
const keyPhrases = [];
sentences.forEach(sentence => {
const words = sentence.trim().split(/\s+/);
if (words.length >= 3 && words.length <= 8) {
keyPhrases.push({
phrase: sentence.trim(),
relevance: Math.random()
});
}
});
return keyPhrases.sort((a, b) => b.relevance - a.relevance).slice(0, 10);
}
analyzeContentCharacteristics(content) {
const wordCount = content.split(/\s+/).length;
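// Ignore the empty fragment left after the final punctuation mark and avoid dividing by zero
const sentenceCount = Math.max(1, content.split(/[.!?]+/).filter(s => s.trim().length > 0).length);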
const avgWordsPerSentence = wordCount / sentenceCount;
return {
complexity: avgWordsPerSentence > 20 ? 'high' : avgWordsPerSentence > 15 ? 'medium' : 'low',
topics: ['general'], // Would use topic modeling in production
sentiment: Math.random() * 2 - 1, // -1 to 1 scale
readabilityScore: Math.max(0, Math.min(100, 100 - (avgWordsPerSentence * 2)))
};
}
calculateFreshnessScore(publishedDate) {
// Accept Date objects as well as ISO date strings
const ageInDays = (Date.now() - new Date(publishedDate).getTime()) / (24 * 60 * 60 * 1000);
return Math.max(0, Math.min(1, 1 - (ageInDays / 365))); // Decay over 1 year
}
generateSummary(content) {
// Simple summary generation (first 200 characters)
return content.length > 200 ? content.substring(0, 197) + '...' : content;
}
buildFilterExpression(filters) {
const filterExpression = { $and: [] };
if (filters.category) {
filterExpression.$and.push({ category: { $eq: filters.category } });
}
if (filters.author) {
filterExpression.$and.push({ author: { $eq: filters.author } });
}
if (filters.tags && filters.tags.length > 0) {
filterExpression.$and.push({ tags: { $in: filters.tags } });
}
if (filters.dateRange) {
filterExpression.$and.push({
publishedDate: {
$gte: new Date(filters.dateRange.start),
$lte: new Date(filters.dateRange.end)
}
});
}
return filterExpression.$and.length > 0 ? filterExpression : {};
}
buildSearchFilterClauses(filters) {
const clauses = [];
if (filters.category) {
clauses.push({ equals: { path: 'category', value: filters.category } });
}
if (filters.tags && filters.tags.length > 0) {
clauses.push({ in: { path: 'tags', value: filters.tags } });
}
return clauses;
}
async logSearchAnalytics(analyticsData) {
try {
await this.collections.searchAnalytics.insertOne({
...analyticsData,
sessionId: analyticsData.userContext?.sessionId,
userId: analyticsData.userContext?.userId
});
} catch (error) {
console.warn('Failed to log search analytics:', error.message);
}
}
async buildUserPreferenceVector(userProfile) {
if (!userProfile.interactionHistory || userProfile.interactionHistory.length === 0) {
return null;
}
// Get vectors for user's previously interacted documents
const interactedDocuments = await this.collections.documents.find(
{
_id: { $in: userProfile.interactionHistory.slice(-20).map(h => h.documentId) }
},
{ projection: { contentVector: 1 } }
).toArray();
if (interactedDocuments.length === 0) {
return null;
}
// Calculate a weighted average vector based on interaction types
let avgVector = null;
let totalWeight = 0;
for (const doc of interactedDocuments) {
// Skip documents that were never vectorized
if (!Array.isArray(doc.contentVector)) continue;
const interaction = userProfile.interactionHistory.find(h =>
h.documentId.toString() === doc._id.toString()
);
const weight = this.getInteractionWeight(interaction?.type);
if (!avgVector) {
avgVector = Array(doc.contentVector.length).fill(0);
}
doc.contentVector.forEach((val, i) => {
avgVector[i] += val * weight;
});
totalWeight += weight;
}
// Divide by the total weight (not the document count) so the result is a true weighted mean
return avgVector ? avgVector.map(val => val / totalWeight) : null;
}
getInteractionWeight(interactionType) {
const weights = {
'view': 0.1,
'like': 0.3,
'share': 0.5,
'bookmark': 0.7,
'comment': 0.8
};
return weights[interactionType] || 0.1;
}
generateRecommendationExplanation(recommendation, userProfile) {
const explanations = [];
if (userProfile.preferredCategories && userProfile.preferredCategories.includes(recommendation.category)) {
explanations.push(`Matches your interest in ${recommendation.category}`);
}
if (userProfile.followedAuthors && userProfile.followedAuthors.includes(recommendation.author)) {
explanations.push(`By ${recommendation.author}, an author you follow`);
}
if (recommendation.tags) {
const matchingTags = recommendation.tags.filter(tag =>
userProfile.interests && userProfile.interests.includes(tag)
);
if (matchingTags.length > 0) {
explanations.push(`Related to ${matchingTags.slice(0, 2).join(' and ')}`);
}
}
if (explanations.length === 0) {
explanations.push('Similar to content you\'ve previously engaged with');
}
return explanations.join('; ');
}
async generateGeneralRecommendations(limit) {
// Fallback recommendations based on popularity and quality
const pipeline = [
{
$addFields: {
popularityScore: {
$add: [
{ $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.4] },
{ $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.3] },
{ $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.3] }
]
}
}
},
{
$sort: { popularityScore: -1 }
},
{
$limit: limit
},
{
$project: {
_id: 1,
title: 1,
contentSnippet: { $substrCP: ['$content', 0, 200] },
category: 1,
author: 1,
publishedDate: 1,
popularityScore: 1
}
}
];
const recommendations = await this.collections.documents.aggregate(pipeline).toArray();
return {
recommendations: recommendations,
metadata: {
algorithm: 'popularity_based',
totalRecommendations: recommendations.length
}
};
}
}
// Benefits of MongoDB Atlas Vector Search:
// - Native vector database capabilities within MongoDB Atlas infrastructure
// - Seamless integration with existing MongoDB documents and operations
// - Support for multiple vector similarity algorithms (cosine, euclidean, dot product)
// - Hybrid search combining vector similarity with traditional text search
// - Scalable vector indexing with automatic optimization and maintenance
// - Built-in filtering capabilities for combining semantic search with metadata filters
// - Approximate nearest neighbor search designed for low-latency queries over large collections
// - Works with embeddings from any provider (OpenAI, Cohere, Hugging Face) generated by the application
// - Support for multiple vector dimensions and embedding types
// - Advanced ranking and personalization capabilities for AI-powered applications
module.exports = {
VectorSearchManager
};
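The hybrid branch of `performSemanticSearch` unions vector and text results, so the same document can appear twice with scores on different scales. One common way to merge such lists is reciprocal rank fusion (RRF), which is a different technique from the weighted scoring used in the class above. The sketch below assumes each result carries `_id`, `title`, and `searchMethod` fields as in that pipeline; the function name, the sample data, and the constant `k = 60` are illustrative choices, not part of the original code.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) for every document it contains.
// Documents found by several search methods accumulate a higher fused score.
function reciprocalRankFusion(rankedLists, k = 60) {
  const fused = new Map();
  for (const list of rankedLists) {
    list.forEach((doc, index) => {
      const id = String(doc._id);
      const entry = fused.get(id) || { _id: doc._id, title: doc.title, rrfScore: 0, sources: [] };
      entry.rrfScore += 1 / (k + index + 1); // index is 0-based, ranks are 1-based
      entry.sources.push(doc.searchMethod);
      fused.set(id, entry);
    });
  }
  return [...fused.values()].sort((a, b) => b.rrfScore - a.rrfScore);
}

// Hypothetical result lists, each already sorted best-first by its own score
const vectorResults = [
  { _id: 'a', title: 'Intro to embeddings', searchMethod: 'vector' },
  { _id: 'b', title: 'Neural network basics', searchMethod: 'vector' },
  { _id: 'c', title: 'Gradient descent', searchMethod: 'vector' }
];
const textResults = [
  { _id: 'b', title: 'Neural network basics', searchMethod: 'text' },
  { _id: 'd', title: 'Keyword search tuning', searchMethod: 'text' }
];

const merged = reciprocalRankFusion([vectorResults, textResults]);
console.log(merged.map(doc => `${doc._id} (${doc.sources.join('+')})`));
```

Because RRF uses only rank positions, it sidesteps the problem of comparing a bounded vector score with an unbounded text score; the trade-off is that the magnitude of each original score is discarded.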
Understanding MongoDB Vector Search Architecture
Advanced Vector Search Patterns and Optimization
Implement sophisticated vector search optimization techniques for production applications:
// Advanced vector search optimization and performance tuning
class VectorSearchOptimizer {
constructor(db) {
this.db = db;
this.performanceMetrics = new Map();
// Illustrative strategy labels; the precision values are rough placeholders, not measurements
this.indexStrategies = {
exactSearch: { type: 'exactSearch', precision: 1.0, speed: 'slow' },
approximateSearch: { type: 'approximateSearch', precision: 0.95, speed: 'fast' },
hierarchicalSearch: { type: 'hierarchicalSearch', precision: 0.98, speed: 'medium' }
};
}
async optimizeVectorIndexConfiguration(collectionName, vectorField, options = {}) {
console.log(`Optimizing vector index configuration for ${collectionName}.${vectorField}`);
const {
dimensions = 1536,
similarityMetric = 'cosine',
numCandidates = 1000,
performanceTarget = 'balanced' // 'speed', 'accuracy', 'balanced'
} = options;
// Analyze existing data distribution
const dataAnalysis = await this.analyzeVectorDataDistribution(collectionName, vectorField);
// Determine optimal index configuration
const indexConfig = this.calculateOptimalIndexConfig(
dataAnalysis,
performanceTarget,
dimensions
);
// Create optimized vector search index configuration
const optimizedIndex = {
name: `optimized_${vectorField}_index`,
definition: {
fields: [
{
type: 'vector',
path: vectorField,
numDimensions: dimensions,
similarity: similarityMetric
},
// Add filter fields based on common query patterns
...this.generateFilterFieldsFromAnalysis(dataAnalysis)
]
},
configuration: {
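// Tuning parameters used by this class for planning. They are illustrative and are not
// all fields of the Atlas index definition (numCandidates, for example, is passed to
// $vectorSearch at query time)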
numCandidates: this.calculateOptimalCandidates(dataAnalysis.documentCount),
ef: indexConfig.ef, // Search accuracy parameter
efConstruction: indexConfig.efConstruction, // Build-time parameter
maxConnections: indexConfig.maxConnections, // Graph connectivity
// Performance optimizations
vectorCompression: indexConfig.compressionEnabled,
quantization: indexConfig.quantizationLevel,
cachingStrategy: indexConfig.cachingStrategy
}
};
console.log('Optimized vector index configuration:', optimizedIndex);
return optimizedIndex;
}
async performVectorSearchBenchmark(collectionName, testQueries, indexConfigurations) {
console.log(`Benchmarking vector search performance with ${testQueries.length} test queries`);
const benchmarkResults = [];
for (const config of indexConfigurations) {
console.log(`Testing configuration: ${config.name}`);
const configResults = {
configurationName: config.name,
queryResults: [],
performanceMetrics: {
avgLatency: 0,
p95Latency: 0,
p99Latency: 0,
throughput: 0,
accuracy: 0
}
};
const latencies = [];
const accuracyScores = [];
const startTime = Date.now();
for (let i = 0; i < testQueries.length; i++) {
const query = testQueries[i];
const queryStart = Date.now();
try {
const results = await this.db.collection(collectionName).aggregate([
{
$vectorSearch: {
index: config.indexName,
path: config.vectorField,
queryVector: query.vector,
numCandidates: config.numCandidates || 100,
limit: query.limit || 10
}
},
{
$addFields: {
score: { $meta: 'vectorSearchScore' }
}
}
]).toArray();
const queryLatency = Date.now() - queryStart;
latencies.push(queryLatency);
// Calculate accuracy if ground truth available
if (query.expectedResults) {
const accuracy = this.calculateSearchAccuracy(results, query.expectedResults);
accuracyScores.push(accuracy);
}
configResults.queryResults.push({
queryIndex: i,
resultCount: results.length,
latency: queryLatency,
topScore: results[0]?.score || 0
});
} catch (error) {
console.error(`Query ${i} failed:`, error.message);
configResults.queryResults.push({
queryIndex: i,
error: error.message,
latency: null
});
}
}
const totalTime = Date.now() - startTime;
// Calculate performance metrics
const validLatencies = latencies.filter(l => l !== null);
if (validLatencies.length > 0) {
configResults.performanceMetrics.avgLatency =
validLatencies.reduce((sum, l) => sum + l, 0) / validLatencies.length;
const sortedLatencies = validLatencies.sort((a, b) => a - b);
configResults.performanceMetrics.p95Latency =
sortedLatencies[Math.floor(sortedLatencies.length * 0.95)];
configResults.performanceMetrics.p99Latency =
sortedLatencies[Math.floor(sortedLatencies.length * 0.99)];
configResults.performanceMetrics.throughput =
(validLatencies.length / totalTime) * 1000; // queries per second
}
if (accuracyScores.length > 0) {
configResults.performanceMetrics.accuracy =
accuracyScores.reduce((sum, a) => sum + a, 0) / accuracyScores.length;
}
benchmarkResults.push(configResults);
}
// Analyze and rank configurations
const rankedConfigurations = this.rankConfigurationsByPerformance(benchmarkResults);
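return {
benchmarkResults: benchmarkResults,
recommendations: rankedConfigurations,
testMetadata: {
totalQueries: testQueries.length,
configurationsTested: indexConfigurations.length,
// startTime is scoped to each loop iteration above, so the total is derived
// from the recorded per-query latencies (failed queries count as 0 ms)
benchmarkDuration: benchmarkResults.reduce(
(total, config) => total + config.queryResults.reduce(
(sum, query) => sum + (query.latency || 0), 0
),
0
)
}
};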
}
async implementAdvancedVectorSearchPatterns(collectionName, searchPattern, options = {}) {
console.log(`Implementing advanced vector search pattern: ${searchPattern}`);
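// Note: only multiModalSearch and temporalVectorSearch are implemented in this listing;
// the remaining handlers must be added before those patterns are used
const patterns = {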
multiModalSearch: () => this.implementMultiModalSearch(collectionName, options),
hierarchicalSearch: () => this.implementHierarchicalSearch(collectionName, options),
temporalVectorSearch: () => this.implementTemporalVectorSearch(collectionName, options),
facetedVectorSearch: () => this.implementFacetedVectorSearch(collectionName, options),
clusterBasedSearch: () => this.implementClusterBasedSearch(collectionName, options)
};
if (!patterns[searchPattern]) {
throw new Error(`Unknown search pattern: ${searchPattern}`);
}
return await patterns[searchPattern]();
}
async implementMultiModalSearch(collectionName, options) {
// Multi-modal search combining text, image, and other vector embeddings
const {
textVector,
imageVector,
audioVector,
weights = { text: 0.5, image: 0.3, audio: 0.2 },
limit = 20
} = options;
const collection = this.db.collection(collectionName);
// Combine multiple vector searches
const pipeline = [
{
$vectorSearch: {
index: 'multi_modal_index',
path: 'textVector',
queryVector: textVector,
numCandidates: limit * 5,
limit: limit * 2
}
},
{
$addFields: {
textScore: { $meta: 'vectorSearchScore' }
}
}
];
if (imageVector) {
pipeline.push({
$unionWith: {
coll: collectionName,
pipeline: [
{
$vectorSearch: {
index: 'image_vector_index',
path: 'imageVector',
queryVector: imageVector,
numCandidates: limit * 5,
limit: limit * 2
}
},
{
$addFields: {
imageScore: { $meta: 'vectorSearchScore' }
}
}
]
}
});
}
if (audioVector) {
pipeline.push({
$unionWith: {
coll: collectionName,
pipeline: [
{
$vectorSearch: {
index: 'audio_vector_index',
path: 'audioVector',
queryVector: audioVector,
numCandidates: limit * 5,
limit: limit * 2
}
},
{
$addFields: {
audioScore: { $meta: 'vectorSearchScore' }
}
}
]
}
});
}
// Combine scores from different modalities
pipeline.push({
$group: {
_id: '$_id',
doc: { $first: '$$ROOT' },
textScore: { $max: { $ifNull: ['$textScore', 0] } },
imageScore: { $max: { $ifNull: ['$imageScore', 0] } },
audioScore: { $max: { $ifNull: ['$audioScore', 0] } }
}
});
pipeline.push({
$addFields: {
combinedScore: {
$add: [
{ $multiply: ['$textScore', weights.text] },
{ $multiply: ['$imageScore', weights.image] },
{ $multiply: ['$audioScore', weights.audio] }
]
}
}
});
pipeline.push({
$sort: { combinedScore: -1 }
});
pipeline.push({
$limit: limit
});
const results = await collection.aggregate(pipeline).toArray();
return {
searchType: 'multi_modal',
results: results,
weights: weights,
metadata: {
modalities: Object.keys(weights).filter(k => options[k + 'Vector']),
totalResults: results.length
}
};
}
async implementTemporalVectorSearch(collectionName, options) {
// Time-aware vector search with temporal relevance
const {
queryVector,
timeWindow = { days: 30 },
temporalWeight = 0.3,
limit = 20
} = options;
const collection = this.db.collection(collectionName);
const cutoffDate = new Date(Date.now() - timeWindow.days * 24 * 60 * 60 * 1000);
const pipeline = [
{
$vectorSearch: {
index: 'temporal_vector_index',
path: 'contentVector',
queryVector: queryVector,
numCandidates: limit * 10,
limit: limit * 3,
filter: {
publishedDate: { $gte: cutoffDate }
}
}
},
{
$addFields: {
vectorScore: { $meta: 'vectorSearchScore' },
// Calculate temporal relevance
temporalScore: {
$divide: [
{ $subtract: ['$publishedDate', cutoffDate] },
{ $subtract: [new Date(), cutoffDate] }
]
}
}
},
{
$addFields: {
combinedScore: {
$add: [
{ $multiply: ['$vectorScore', 1 - temporalWeight] },
{ $multiply: ['$temporalScore', temporalWeight] }
]
}
}
},
{
$sort: { combinedScore: -1 }
},
{
$limit: limit
}
];
const results = await collection.aggregate(pipeline).toArray();
return {
searchType: 'temporal_vector',
results: results,
temporalWindow: timeWindow,
temporalWeight: temporalWeight
};
}
// Helper methods for vector search optimization
async analyzeVectorDataDistribution(collectionName, vectorField) {
const collection = this.db.collection(collectionName);
// Sample documents to analyze distribution
const sampleSize = 1000;
const pipeline = [
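// Skip documents without an embedding array; $size fails on missing or non-array fields
{ $match: { [vectorField]: { $type: 'array' } } },
{ $sample: { size: sampleSize } },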
{
$project: {
vectorLength: { $size: `$${vectorField}` },
vectorMagnitude: {
$sqrt: {
$reduce: {
input: `$${vectorField}`,
initialValue: 0,
in: { $add: ['$$value', { $multiply: ['$$this', '$$this'] }] }
}
}
}
}
}
];
const samples = await collection.aggregate(pipeline).toArray();
const totalDocs = await collection.countDocuments();
// Guard against an empty sample to avoid dividing by zero
const avgMagnitude = samples.length > 0
? samples.reduce((sum, doc) => sum + doc.vectorMagnitude, 0) / samples.length
: 0;
return {
documentCount: totalDocs,
sampleSize: samples.length,
avgVectorMagnitude: avgMagnitude,
vectorDimensions: samples[0]?.vectorLength || 0,
magnitudeDistribution: this.calculateDistributionStats(
samples.map(s => s.vectorMagnitude)
)
};
}
calculateOptimalIndexConfig(dataAnalysis, performanceTarget, dimensions) {
const baseConfig = {
ef: 200,
efConstruction: 400,
maxConnections: 32,
compressionEnabled: false,
quantizationLevel: 'none',
cachingStrategy: 'adaptive'
};
// Adjust based on data characteristics and performance target
if (dataAnalysis.documentCount > 1000000) {
baseConfig.compressionEnabled = true;
baseConfig.quantizationLevel = 'int8';
}
switch (performanceTarget) {
case 'speed':
baseConfig.ef = 100;
baseConfig.efConstruction = 200;
baseConfig.quantizationLevel = 'int8';
break;
case 'accuracy':
baseConfig.ef = 400;
baseConfig.efConstruction = 800;
baseConfig.maxConnections = 64;
break;
case 'balanced':
default:
// Use base configuration
break;
}
return baseConfig;
}
generateFilterFieldsFromAnalysis(dataAnalysis) {
// Generate common filter fields based on data analysis
return [
{ type: 'filter', path: 'category' },
{ type: 'filter', path: 'publishedDate' },
{ type: 'filter', path: 'tags' }
];
}
calculateOptimalCandidates(documentCount) {
// Calculate optimal numCandidates based on collection size
if (documentCount < 10000) return Math.min(documentCount, 100);
if (documentCount < 100000) return 200;
if (documentCount < 1000000) return 500;
return 1000;
}
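calculateSearchAccuracy(results, expectedResults) {
// Overlap between returned and expected IDs, divided by the shorter list length
// (equals precision@k when both lists contain k items)
const denominator = Math.min(results.length, expectedResults.length);
if (denominator === 0) return 0; // Nothing to compare; avoid 0 / 0
const actualIds = new Set(results.map(r => r._id.toString()));
const expectedIds = new Set(expectedResults.map(r => r._id.toString()));
let matches = 0;
for (const id of actualIds) {
if (expectedIds.has(id)) matches++;
}
return matches / denominator;
}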
rankConfigurationsByPerformance(benchmarkResults) {
// Rank configurations based on composite performance score
return benchmarkResults
.map(result => ({
...result,
compositeScore: this.calculateCompositeScore(result.performanceMetrics)
}))
.sort((a, b) => b.compositeScore - a.compositeScore)
.map((result, index) => ({
rank: index + 1,
configurationName: result.configurationName,
compositeScore: result.compositeScore,
metrics: result.performanceMetrics,
recommendation: this.generateConfigurationRecommendation(result)
}));
}
calculateCompositeScore(metrics) {
// Weighted composite score combining latency, throughput, and accuracy
const latencyScore = metrics.avgLatency ? Math.max(0, 1 - (metrics.avgLatency / 1000)) : 0;
const throughputScore = Math.min(1, metrics.throughput / 100);
const accuracyScore = metrics.accuracy || 0.8;
return (latencyScore * 0.4 + throughputScore * 0.3 + accuracyScore * 0.3);
}
generateConfigurationRecommendation(result) {
const metrics = result.performanceMetrics;
const recommendations = [];
if (metrics.avgLatency > 500) {
recommendations.push('Consider reducing numCandidates or enabling quantization for better latency');
}
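// accuracy stays at 0 when no ground truth was supplied, so only flag measured values
if (metrics.accuracy > 0 && metrics.accuracy < 0.8) {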
recommendations.push('Increase ef parameter or numCandidates to improve search accuracy');
}
if (metrics.throughput < 10) {
recommendations.push('Optimize index configuration or consider horizontal scaling');
}
return recommendations.length > 0 ? recommendations : ['Configuration performs within acceptable parameters'];
}
calculateDistributionStats(values) {
const sorted = values.slice().sort((a, b) => a - b);
const mean = values.reduce((sum, val) => sum + val, 0) / values.length;
return {
mean: mean,
median: sorted[Math.floor(sorted.length / 2)],
min: sorted[0],
max: sorted[sorted.length - 1],
stddev: Math.sqrt(values.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / values.length)
};
}
}
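The multi-modal pipeline above unions the per-modality result sets, keeps the best score per document and modality, and then computes a weighted sum. The same arithmetic can be sketched in plain JavaScript, which may be convenient for unit-testing weight choices without a cluster. The helper name fuseModalityScores and its input shape are assumptions made for illustration; they are not part of the VectorSearchOptimizer class.

```javascript
// Hypothetical helper: mirrors the $group + $addFields stages of
// implementMultiModalSearch on in-memory data.
function fuseModalityScores(resultsByModality, weights) {
  const merged = new Map();

  for (const [modality, results] of Object.entries(resultsByModality)) {
    for (const { id, score } of results) {
      const entry = merged.get(id) || { id, scores: {} };
      // Keep the best score per modality, like $max in the $group stage
      entry.scores[modality] = Math.max(entry.scores[modality] || 0, score);
      merged.set(id, entry);
    }
  }

  return [...merged.values()]
    .map(entry => ({
      ...entry,
      // A modality with no hit contributes 0, like $ifNull in the pipeline
      combinedScore: Object.entries(weights).reduce(
        (sum, [modality, weight]) => sum + (entry.scores[modality] || 0) * weight,
        0
      )
    }))
    .sort((a, b) => b.combinedScore - a.combinedScore);
}

// Document 'b' matches in two modalities, so it outranks 'a'
const fused = fuseModalityScores(
  {
    text: [{ id: 'a', score: 0.9 }, { id: 'b', score: 0.8 }],
    image: [{ id: 'b', score: 0.9 }]
  },
  { text: 0.5, image: 0.3, audio: 0.2 }
);
console.log(fused.map(r => `${r.id}: ${r.combinedScore.toFixed(2)}`));
```

One consequence of this scheme is that a document found by only one modality is capped at that modality's weight. If that is not the intended behavior, the weights could be renormalized over the modalities actually supplied in the query.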
SQL-Style Vector Operations with QueryLeaf
QueryLeaf provides familiar SQL syntax for MongoDB vector search operations:
-- QueryLeaf vector search operations with SQL-familiar syntax
-- Create vector search index with SQL DDL
CREATE VECTOR INDEX content_embeddings_idx ON documents (
content_vector VECTOR(1536) USING cosine_similarity
WITH (
num_candidates = 1000,
index_type = 'hnsw',
ef_construction = 400,
max_connections = 32
)
)
INCLUDE (category, tags, published_date, author) AS filters;
-- Advanced semantic search with SQL-style vector operations
WITH semantic_query AS (
-- Generate query embedding (integrated with embedding services)
SELECT embed_text('machine learning algorithms for natural language processing') as query_vector
),
vector_search_results AS (
SELECT
d.document_id,
d.title,
d.content,
d.category,
d.tags,
d.author,
d.published_date,
-- Vector similarity search with cosine similarity
VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') as similarity_score,
-- Vector distance calculations
VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'euclidean') as euclidean_distance,
VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'manhattan') as manhattan_distance,
-- Vector magnitude and normalization
VECTOR_MAGNITUDE(d.content_vector) as vector_magnitude,
VECTOR_NORMALIZE(d.content_vector) as normalized_vector
FROM documents d
CROSS JOIN semantic_query sq
WHERE
-- Vector similarity threshold filtering
VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') > 0.75
-- Traditional filters combined with vector search
AND d.category IN ('AI', 'Technology', 'Data Science')
AND d.published_date >= CURRENT_DATE - INTERVAL '1 year'
-- Vector search with K-nearest neighbors
AND d.document_id IN (
SELECT document_id
FROM VECTOR_KNN_SEARCH(
table_name => 'documents',
vector_column => 'content_vector',
query_vector => sq.query_vector,
k => 50,
distance_function => 'cosine'
)
)
),
enhanced_results AS (
SELECT
vsr.*,
-- Advanced similarity calculations
VECTOR_DOT_PRODUCT(vsr.normalized_vector, sq.query_vector) as dot_product_similarity,
-- Multi-vector comparison for hybrid matching
GREATEST(
VECTOR_SIMILARITY(d.title_vector, sq.query_vector, 'cosine'),
vsr.similarity_score * 0.8
) as hybrid_similarity_score,
-- Vector clustering and topic modeling
VECTOR_CLUSTER_ID(d.content_vector, 'kmeans', 10) as topic_cluster,
VECTOR_TOPIC_PROBABILITY(d.content_vector, ARRAY['AI', 'ML', 'NLP', 'Data Science']) as topic_probabilities,
-- Temporal vector decay for freshness
vsr.similarity_score * EXP(-0.1 * EXTRACT(DAYS FROM (CURRENT_DATE - vsr.published_date))) as time_decayed_similarity,
-- Content quality boosting based on tag count (LOG(1) = 0, so a single tag gives no boost)
vsr.similarity_score * (1 + LOG(GREATEST(1, ARRAY_LENGTH(vsr.tags, 1))) / 10.0) as quality_boosted_similarity,
-- Personalization using user preference vectors
COALESCE(
VECTOR_SIMILARITY(d.content_vector, user_preference_vector('user_123'), 'cosine') * 0.3,
0
) as personalization_boost
FROM vector_search_results vsr
CROSS JOIN semantic_query sq
LEFT JOIN documents d ON vsr.document_id = d.document_id
WHERE vsr.similarity_score > 0.70
),
final_ranked_results AS (
SELECT
document_id,
title,
SUBSTRING(content, 1, 300) || '...' as content_preview,
category,
tags,
author,
published_date,
-- Comprehensive relevance scoring
ROUND((
hybrid_similarity_score * 0.4 +
time_decayed_similarity * 0.25 +
quality_boosted_similarity * 0.2 +
personalization_boost * 0.15
)::numeric, 4) as final_relevance_score,
-- Individual score components for analysis
ROUND(similarity_score::numeric, 4) as base_similarity,
ROUND(hybrid_similarity_score::numeric, 4) as hybrid_score,
ROUND(time_decayed_similarity::numeric, 4) as freshness_score,
ROUND(personalization_boost::numeric, 4) as personal_score,
-- Vector metadata
topic_cluster,
topic_probabilities,
vector_magnitude,
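-- Search result ranking (the score expression is repeated because a column alias
-- defined in this SELECT list cannot be referenced inside its window functions)
ROW_NUMBER() OVER (ORDER BY (
hybrid_similarity_score * 0.4 +
time_decayed_similarity * 0.25 +
quality_boosted_similarity * 0.2 +
personalization_boost * 0.15
) DESC) as search_rank,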
COUNT(*) OVER () as total_results
FROM enhanced_results
WHERE (
hybrid_similarity_score * 0.4 +
time_decayed_similarity * 0.25 +
quality_boosted_similarity * 0.2 +
personalization_boost * 0.15
) > 0.6
)
SELECT
search_rank,
document_id,
title,
content_preview,
category,
STRING_AGG(DISTINCT tag, ', ' ORDER BY tag) as tags_summary,
author,
published_date,
final_relevance_score,
-- Explanation of ranking factors
JSON_BUILD_OBJECT(
'base_similarity', base_similarity,
'hybrid_boost', hybrid_score - base_similarity,
'freshness_impact', freshness_score - base_similarity,
'personalization_impact', personal_score,
'topic_cluster', topic_cluster,
'primary_topics', (
SELECT ARRAY_AGG(topic ORDER BY probability DESC)
FROM UNNEST(topic_probabilities) WITH ORDINALITY AS t(probability, topic)
WHERE probability > 0.1
LIMIT 3
)
) as ranking_explanation
FROM final_ranked_results
CROSS JOIN UNNEST(tags) as tag
GROUP BY search_rank, document_id, title, content_preview, category, author,
published_date, final_relevance_score, base_similarity, hybrid_score,
freshness_score, personal_score, topic_cluster, topic_probabilities
ORDER BY final_relevance_score DESC
LIMIT 20;
-- Advanced vector aggregation and analytics
WITH vector_analysis AS (
SELECT
category,
author,
DATE_TRUNC('month', published_date) as month_bucket,
-- Vector aggregation functions
VECTOR_AVG(content_vector) as category_centroid_vector,
VECTOR_STDDEV(content_vector) as vector_spread,
-- Vector clustering within groups
VECTOR_KMEANS_CENTROIDS(content_vector, 5) as sub_clusters,
-- Similarity analysis within categories
AVG(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as avg_internal_similarity,
MIN(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as min_internal_similarity,
MAX(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as max_internal_similarity,
-- Document count and metadata
COUNT(*) as document_count,
AVG(ARRAY_LENGTH(tags, 1)) as avg_tags_per_doc,
AVG(LENGTH(content)) as avg_content_length,
-- Vector quality metrics
AVG(VECTOR_MAGNITUDE(content_vector)) as avg_vector_magnitude,
STDDEV(VECTOR_MAGNITUDE(content_vector)) as vector_magnitude_stddev
FROM documents
WHERE published_date >= CURRENT_DATE - INTERVAL '2 years'
AND content_vector IS NOT NULL
GROUP BY category, author, DATE_TRUNC('month', published_date)
),
cross_category_analysis AS (
SELECT
va1.category as category_a,
va2.category as category_b,
-- Cross-category vector similarity
VECTOR_SIMILARITY(va1.category_centroid_vector, va2.category_centroid_vector, 'cosine') as category_similarity,
-- Content overlap analysis
OVERLAP_COEFFICIENT(va1.category, va2.category, 'tags') as tag_overlap,
OVERLAP_COEFFICIENT(va1.category, va2.category, 'authors') as author_overlap,
-- Temporal correlation
CORRELATION(va1.document_count, va2.document_count) OVER (
PARTITION BY va1.category, va2.category
ORDER BY va1.month_bucket
) as temporal_correlation
FROM vector_analysis va1
CROSS JOIN vector_analysis va2
WHERE va1.category != va2.category
AND va1.month_bucket = va2.month_bucket
AND va1.document_count >= 5
AND va2.document_count >= 5
),
semantic_recommendations AS (
SELECT
category_a as category,
-- Find most similar categories for recommendation
ARRAY_AGG(
category_b ORDER BY category_similarity DESC
) FILTER (WHERE category_similarity > 0.7) as similar_categories,
-- Trending analysis
CASE
WHEN temporal_correlation > 0.8 THEN 'strongly_correlated'
WHEN temporal_correlation > 0.5 THEN 'moderately_correlated'
WHEN temporal_correlation < -0.5 THEN 'inversely_correlated'
ELSE 'independent'
END as trend_relationship,
-- Content strategy recommendations
CASE
WHEN AVG(category_similarity) > 0.8 THEN 'High content overlap - consider specialization'
WHEN AVG(category_similarity) < 0.3 THEN 'Low overlap - good content differentiation'
ELSE 'Moderate overlap - balanced content strategy'
END as content_strategy_recommendation
FROM cross_category_analysis
GROUP BY category_a, temporal_correlation
)
SELECT
va.category,
va.document_count,
ROUND(va.avg_internal_similarity::numeric, 3) as content_consistency_score,
ROUND(va.avg_vector_magnitude::numeric, 3) as content_richness_score,
-- Vector-based content insights
CASE
WHEN va.avg_internal_similarity > 0.8 THEN 'Highly consistent content'
WHEN va.avg_internal_similarity > 0.6 THEN 'Moderately consistent content'
ELSE 'Diverse content range'
END as content_consistency_assessment,
-- Similar categories for cross-promotion
sr.similar_categories,
sr.trend_relationship,
sr.content_strategy_recommendation,
-- Growth and engagement potential
CASE
WHEN va.document_count > LAG(va.document_count) OVER (
PARTITION BY va.category ORDER BY va.month_bucket
) THEN 'Growing'
WHEN va.document_count < LAG(va.document_count) OVER (
PARTITION BY va.category ORDER BY va.month_bucket
) THEN 'Declining'
ELSE 'Stable'
END as content_trend,
-- Vector search optimization recommendations
CASE
WHEN va.vector_magnitude_stddev > 0.5 THEN 'Consider vector normalization for consistent search performance'
WHEN va.avg_vector_magnitude < 0.1 THEN 'Low vector magnitudes may indicate embedding quality issues'
ELSE 'Vector embeddings appear well-distributed'
END as search_optimization_advice
FROM vector_analysis va
LEFT JOIN semantic_recommendations sr ON va.category = sr.category
WHERE va.month_bucket >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months')
ORDER BY va.document_count DESC, va.avg_internal_similarity DESC;
-- Real-time vector search performance monitoring
WITH search_performance_metrics AS (
SELECT
DATE_TRUNC('hour', search_timestamp) as hour_bucket,
search_type,
-- Query performance metrics
COUNT(*) as total_searches,
AVG(response_time_ms) as avg_response_time,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95_response_time,
MAX(response_time_ms) as max_response_time,
-- Result quality metrics
AVG(result_count) as avg_results_returned,
AVG(CASE WHEN result_count > 0 THEN top_similarity_score ELSE NULL END) as avg_top_similarity,
AVG(user_satisfaction_score) as avg_user_satisfaction,
-- Vector search specific metrics
AVG(vector_candidates_examined) as avg_candidates_examined,
AVG(vector_index_hit_ratio) as avg_index_hit_ratio,
COUNT(*) FILTER (WHERE similarity_threshold_met = true) as threshold_met_count,
-- Error and timeout analysis
COUNT(*) FILTER (WHERE search_timeout = true) as timeout_count,
COUNT(*) FILTER (WHERE search_error IS NOT NULL) as error_count,
STRING_AGG(DISTINCT search_error, '; ') as error_types
FROM vector_search_log
WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),
performance_alerts AS (
SELECT
hour_bucket,
search_type,
total_searches,
avg_response_time,
p95_response_time,
avg_user_satisfaction,
-- Performance alerting logic
CASE
WHEN avg_response_time > 1000 THEN 'CRITICAL - High average latency'
WHEN p95_response_time > 2000 THEN 'WARNING - High P95 latency'
WHEN avg_user_satisfaction < 0.7 THEN 'WARNING - Low user satisfaction'
WHEN timeout_count > total_searches * 0.05 THEN 'WARNING - High timeout rate'
ELSE 'NORMAL'
END as performance_status,
-- Optimization recommendations
CASE
WHEN avg_candidates_examined > 10000 THEN 'Consider reducing numCandidates for better performance'
WHEN avg_index_hit_ratio < 0.8 THEN 'Index may need rebuilding - low hit ratio detected'
WHEN error_count > 0 THEN 'Investigate errors: ' || error_types
ELSE 'Performance within normal parameters'
END as optimization_recommendation,
-- Trending analysis
avg_response_time - LAG(avg_response_time) OVER (
PARTITION BY search_type
ORDER BY hour_bucket
) as latency_trend,
total_searches - LAG(total_searches) OVER (
PARTITION BY search_type
ORDER BY hour_bucket
) as volume_trend
FROM search_performance_metrics
)
SELECT
hour_bucket,
search_type,
total_searches,
ROUND(avg_response_time::numeric, 1) as avg_latency_ms,
ROUND(p95_response_time::numeric, 1) as p95_latency_ms,
ROUND(avg_user_satisfaction::numeric, 2) as satisfaction_score,
performance_status,
optimization_recommendation,
-- Trend indicators
CASE
WHEN latency_trend > 200 THEN 'DEGRADING'
WHEN latency_trend < -200 THEN 'IMPROVING'
ELSE 'STABLE'
END as latency_trend_status,
CASE
WHEN volume_trend > total_searches * 0.2 THEN 'HIGH_GROWTH'
WHEN volume_trend > total_searches * 0.1 THEN 'GROWING'
WHEN volume_trend < -total_searches * 0.1 THEN 'DECLINING'
ELSE 'STABLE'
END as volume_trend_status
FROM performance_alerts
WHERE performance_status != 'NORMAL' OR hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
ORDER BY hour_bucket DESC, total_searches DESC;
-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. SQL-familiar vector operations with VECTOR_SIMILARITY, VECTOR_DISTANCE functions
-- 2. Advanced K-nearest neighbors search with customizable distance functions
-- 3. Hybrid search combining vector similarity with traditional text search
-- 4. Vector aggregation functions for analytics and clustering
-- 5. Real-time performance monitoring and optimization recommendations
-- 6. Multi-modal vector search across text, image, and audio embeddings
-- 7. Temporal vector search with time-aware relevance scoring
-- 8. Vector-based recommendation systems with personalization
-- 9. Integration with MongoDB's native vector search optimizations
-- 10. Familiar SQL patterns for complex vector analytics and reporting
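The queries above lean on similarity and distance functions such as VECTOR_SIMILARITY, VECTOR_DISTANCE, VECTOR_MAGNITUDE, and VECTOR_NORMALIZE. As a rough reference for what such measures compute, here is a plain JavaScript sketch. The function names are illustrative, and the mapping to the SQL-style functions is an assumption rather than a description of how QueryLeaf implements them.

```javascript
// Plain JavaScript versions of common vector measures (illustrative names)
function assertSameLength(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
}

function dotProduct(a, b) {
  assertSameLength(a, b);
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

function magnitude(v) {
  return Math.sqrt(dotProduct(v, v));
}

function normalize(v) {
  const length = magnitude(v);
  if (length === 0) throw new Error('Cannot normalize a zero vector');
  return v.map(value => value / length);
}

function cosineSimilarity(a, b) {
  const denominator = magnitude(a) * magnitude(b);
  // Treat a zero vector as having no similarity to anything
  return denominator === 0 ? 0 : dotProduct(a, b) / denominator;
}

function euclideanDistance(a, b) {
  assertSameLength(a, b);
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

function manhattanDistance(a, b) {
  assertSameLength(a, b);
  return a.reduce((sum, value, i) => sum + Math.abs(value - b[i]), 0);
}

const docVector = [3, 4];
const queryVector = [4, 3];
console.log('cosine:', cosineSimilarity(docVector, queryVector));     // 24 / 25
console.log('euclidean:', euclideanDistance(docVector, queryVector)); // sqrt(2)
console.log('manhattan:', manhattanDistance(docVector, queryVector)); // 2
```

On unit-length vectors the dot product equals the cosine similarity, which is why embeddings are often normalized once at write time. Note also that Atlas reports a normalized vectorSearchScore rather than the raw measure (for cosine, the documentation describes it as (1 + cosine) / 2), so a threshold such as 0.75 on raw cosine similarity does not carry over directly to the score returned by $vectorSearch.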
Best Practices for Vector Search Implementation
Vector Index Design Strategy
Essential principles for optimal MongoDB vector search design:
- Embedding Selection: Choose appropriate embedding models based on content type and use case requirements
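- Index Configuration: Tune numCandidates and limit per query; a higher numCandidates generally improves recall at the cost of latency
- Filtering Strategy: Index metadata fields as filter fields and pass them in the $vectorSearch filter option so candidates are narrowed before similarity scoring
- Dimensionality Management: Keep numDimensions in the index identical to the embedding model's output size (1536 in the examples above); smaller embeddings reduce memory and latency but may lower retrieval quality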
- Update Patterns: Plan for efficient vector updates and re-indexing as content changes
- Quality Assurance: Implement vector quality validation and monitoring for embedding consistency
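The index design points above can be made concrete with a small builder that produces a vector index definition with filter fields and rejects obviously invalid input. This is a minimal sketch: the helper name buildVectorIndexDefinition, the field paths, and the index name in the trailing comment are assumptions for illustration.

```javascript
// Hypothetical helper: builds a vector index definition with optional filter fields
function buildVectorIndexDefinition({ path, numDimensions, similarity = 'cosine', filterPaths = [] }) {
  const supportedSimilarities = ['cosine', 'euclidean', 'dotProduct'];

  if (typeof path !== 'string' || path.length === 0) {
    throw new Error('path must be a non-empty string');
  }
  if (!Number.isInteger(numDimensions) || numDimensions <= 0) {
    throw new Error('numDimensions must be a positive integer');
  }
  if (!supportedSimilarities.includes(similarity)) {
    throw new Error(`Unsupported similarity: ${similarity}`);
  }

  return {
    fields: [
      // The vector field must match the embedding model's output size
      { type: 'vector', path, numDimensions, similarity },
      // Filter fields allow pre-filtering inside $vectorSearch
      ...filterPaths.map(filterPath => ({ type: 'filter', path: filterPath }))
    ]
  };
}

const definition = buildVectorIndexDefinition({
  path: 'contentVector',
  numDimensions: 1536,
  filterPaths: ['category', 'publishedDate']
});
console.log(JSON.stringify(definition, null, 2));

// With a recent Node.js driver, a definition like this is typically passed to
// collection.createSearchIndex({ name: 'content_vector_index', type: 'vectorSearch', definition })
```

Validating the definition in application code catches dimension or similarity mistakes before an index build is started, which can matter on large collections where rebuilding takes time.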
Performance and Scalability
Optimize MongoDB vector search for production workloads:
- Index Optimization: Monitor and tune vector index parameters based on actual query patterns
- Hybrid Search: Combine vector and traditional search for optimal relevance and performance
- Caching Strategy: Implement intelligent caching for frequently accessed vectors and query results
- Resource Planning: Plan memory and compute resources for vector search operations at scale
- Monitoring Setup: Implement comprehensive vector search performance and quality monitoring
- Testing Strategy: Develop thorough testing for vector search accuracy and performance characteristics
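As one possible take on the caching point above, query embeddings can be cached so that repeated searches skip the embedding model call. The sketch below uses a small LRU cache built on Map insertion order. The names LruCache and withEmbeddingCache are illustrative, and embedFn stands in for whatever embedding client the application already uses.

```javascript
// Minimal LRU cache built on Map insertion order (illustrative, not a library API)
class LruCache {
  constructor(maxEntries = 1000) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new Error('maxEntries must be a positive integer');
    }
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert to mark the key as most recently used
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}

// Wraps an async embedding function so repeated query texts reuse the cached vector.
// embedFn is a placeholder for the application's own embedding call.
function withEmbeddingCache(embedFn, cache) {
  return async function cachedEmbed(text) {
    const key = text.trim().toLowerCase();
    const cached = cache.get(key);
    if (cached !== undefined) return cached;
    const vector = await embedFn(text);
    cache.set(key, vector);
    return vector;
  };
}

const cache = new LruCache(2);
cache.set('vector search', [0.1, 0.2]);
cache.set('semantic search', [0.3, 0.4]);
cache.get('vector search');           // marks this key as recently used
cache.set('hybrid search', [0.5, 0.6]); // evicts 'semantic search'
console.log([...cache.entries.keys()]);
```

Caching query embeddings is usually safer than caching search results, since results go stale as documents change while an embedding for a given text and model stays the same. If the embedding model is swapped, the cache should be cleared.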
Conclusion
MongoDB Atlas Vector Search provides native vector database capabilities that eliminate the complexity and infrastructure overhead of separate vector databases while enabling sophisticated semantic search and AI-powered applications. The seamless integration with MongoDB's document model allows developers to combine traditional database operations with advanced vector search in a unified platform.
Key MongoDB Vector Search benefits include:
- Native Integration: Built-in vector search capabilities within MongoDB Atlas infrastructure
- Semantic Understanding: Advanced similarity search that understands meaning and context
- Hybrid Search: Combining vector similarity with traditional text search and metadata filtering
- Scalable Performance: Production-ready vector indexing with sub-second response times
- AI-Ready Platform: Direct integration with popular embedding models and AI frameworks
- Familiar Operations: Vector search operations integrated with standard MongoDB query patterns
Whether you're building recommendation systems, semantic search applications, RAG implementations, or any application requiring intelligent content discovery, MongoDB Atlas Vector Search with QueryLeaf's familiar SQL interface provides the foundation for modern AI-powered applications.
QueryLeaf Integration: QueryLeaf automatically manages MongoDB vector search operations while providing SQL-familiar vector query syntax, similarity functions, and performance optimization. Advanced vector search patterns, multi-modal search, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI-powered search both powerful and accessible to SQL-oriented development teams.
The integration of native vector search capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both intelligent semantic search and familiar database interaction patterns, ensuring your AI-powered applications remain both innovative and maintainable as they scale and evolve.