MongoDB Vector Search and AI Applications: Building Semantic Search and Similarity Systems for Modern AI-Powered Applications
Modern artificial intelligence applications require sophisticated search capabilities that understand semantic meaning beyond traditional keyword matching, enabling natural language queries, content recommendation systems, and intelligent document retrieval based on conceptual similarity rather than exact text matches. Traditional full-text search approaches struggle with understanding context, synonyms, and conceptual relationships, limiting their effectiveness for AI-powered applications that need to comprehend user intent and content meaning.
MongoDB Vector Search provides comprehensive vector similarity capabilities that enable semantic search, recommendation engines, and AI-powered content discovery through high-dimensional vector embeddings and advanced similarity algorithms. Unlike traditional search systems that rely on exact keyword matching, MongoDB Vector Search leverages machine learning embeddings to understand content semantics, enabling applications to find conceptually similar documents, perform natural language search, and power intelligent recommendation systems.
The Traditional Search Limitation Challenge
Conventional text-based search approaches have significant limitations for modern AI applications:
-- Traditional PostgreSQL full-text search - limited semantic understanding and context awareness
-- Basic full-text search setup with limited semantic capabilities
CREATE TABLE documents (
document_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title VARCHAR(500) NOT NULL,
content TEXT NOT NULL,
category VARCHAR(100),
author VARCHAR(200),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Traditional metadata
tags TEXT[],
keywords VARCHAR(1000),
summary TEXT,
document_type VARCHAR(50),
language VARCHAR(10) DEFAULT 'en',
-- Basic search vectors (limited functionality)
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(content, '')), 'B') ||
setweight(to_tsvector('english', coalesce(summary, '')), 'C') ||
setweight(to_tsvector('english', array_to_string(coalesce(tags, '{}'), ' ')), 'D')
) STORED
);
-- Additional tables for recommendation attempts
CREATE TABLE user_interactions (
interaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
document_id UUID NOT NULL REFERENCES documents(document_id),
interaction_type VARCHAR(50) NOT NULL, -- 'view', 'like', 'share', 'download'
interaction_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
duration_seconds INTEGER,
rating INTEGER CHECK (rating BETWEEN 1 AND 5)
);
CREATE TABLE document_similarity (
document_id_1 UUID NOT NULL REFERENCES documents(document_id),
document_id_2 UUID NOT NULL REFERENCES documents(document_id),
similarity_score DECIMAL(5,4) NOT NULL CHECK (similarity_score BETWEEN 0 AND 1),
similarity_type VARCHAR(50) NOT NULL, -- 'keyword', 'category', 'manual'
calculated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (document_id_1, document_id_2)
);
-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
SELECT
'machine learning artificial intelligence neural networks deep learning' as query_text,
to_tsquery('english',
'machine & learning & artificial & intelligence & neural & networks & deep & learning'
) as search_tsquery
),
basic_search AS (
SELECT
d.document_id,
d.title,
d.content,
d.category,
d.author,
d.created_at,
d.tags,
-- Basic relevance scoring (limited effectiveness)
ts_rank(d.search_vector, sq.search_tsquery) as basic_relevance,
ts_rank_cd(d.search_vector, sq.search_tsquery) as weighted_relevance,
-- Simple keyword matching
array_length(
string_to_array(
regexp_replace(
lower(d.title || ' ' || d.content),
'[^a-z0-9\s]', ' ', 'g'
),
' '
) & string_to_array(lower(sq.query_text), ' '), 1
) as keyword_matches,
-- Category-based scoring
CASE d.category
WHEN 'AI/ML' THEN 2.0
WHEN 'Technology' THEN 1.5
WHEN 'Science' THEN 1.2
ELSE 1.0
END as category_boost,
-- Recency scoring
CASE
WHEN d.created_at > CURRENT_DATE - INTERVAL '30 days' THEN 1.5
WHEN d.created_at > CURRENT_DATE - INTERVAL '90 days' THEN 1.2
WHEN d.created_at > CURRENT_DATE - INTERVAL '365 days' THEN 1.1
ELSE 1.0
END as recency_boost
FROM documents d
CROSS JOIN search_query sq
WHERE d.search_vector @@ sq.search_tsquery
),
popularity_metrics AS (
SELECT
ui.document_id,
COUNT(*) as interaction_count,
COUNT(*) FILTER (WHERE ui.interaction_type = 'view') as view_count,
COUNT(*) FILTER (WHERE ui.interaction_type = 'like') as like_count,
COUNT(*) FILTER (WHERE ui.interaction_type = 'share') as share_count,
AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,
AVG(ui.duration_seconds) FILTER (WHERE ui.duration_seconds IS NOT NULL) as avg_duration
FROM user_interactions ui
WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '90 days'
GROUP BY ui.document_id
),
similarity_expansion AS (
-- Manual similarity relationships (limited and maintenance-heavy)
SELECT DISTINCT
ds.document_id_2 as document_id,
MAX(ds.similarity_score) as max_similarity,
COUNT(*) as similar_document_count
FROM basic_search bs
JOIN document_similarity ds ON bs.document_id = ds.document_id_1
WHERE ds.similarity_score > 0.3
GROUP BY ds.document_id_2
),
final_search_results AS (
SELECT
bs.document_id,
bs.title,
SUBSTRING(bs.content, 1, 300) || '...' as content_preview,
bs.category,
bs.author,
bs.created_at,
bs.tags,
-- Complex relevance calculation (limited effectiveness)
(
(bs.basic_relevance * 10) +
(bs.weighted_relevance * 15) +
(COALESCE(bs.keyword_matches, 0) * 5) +
(bs.category_boost * 3) +
(bs.recency_boost * 2) +
(COALESCE(pm.like_count, 0) * 0.1) +
(COALESCE(pm.avg_rating, 0) * 2) +
(COALESCE(se.max_similarity, 0) * 8)
) as final_relevance_score,
-- Metrics for debugging
bs.basic_relevance,
bs.keyword_matches,
pm.interaction_count,
pm.like_count,
pm.avg_rating,
se.similar_document_count,
-- Formatted information
CASE
WHEN pm.interaction_count > 100 THEN 'Popular'
WHEN pm.interaction_count > 50 THEN 'Moderately Popular'
WHEN pm.interaction_count > 10 THEN 'Some Interest'
ELSE 'New/Limited Interest'
END as popularity_status
FROM basic_search bs
LEFT JOIN popularity_metrics pm ON bs.document_id = pm.document_id
LEFT JOIN similarity_expansion se ON bs.document_id = se.document_id
)
SELECT
document_id,
title,
content_preview,
category,
author,
created_at,
tags,
ROUND(final_relevance_score::numeric, 2) as relevance_score,
popularity_status,
-- Limited recommendation capability
CASE
WHEN final_relevance_score > 25 THEN 'Highly Relevant'
WHEN final_relevance_score > 15 THEN 'Relevant'
WHEN final_relevance_score > 8 THEN 'Potentially Relevant'
ELSE 'Low Relevance'
END as relevance_category
FROM final_search_results
WHERE final_relevance_score > 5 -- Filter low-relevance results
ORDER BY final_relevance_score DESC
LIMIT 20;
-- Problems with traditional search approaches:
-- 1. No semantic understanding - "ML" vs "machine learning" treated as completely different
-- 2. Limited context awareness - cannot understand conceptual relationships
-- 3. Poor synonym handling - requires manual synonym dictionaries
-- 4. No natural language query support - requires exact keyword matching
-- 5. Complex manual similarity calculations that don't scale
-- 6. No understanding of document embeddings or vector representations
-- 7. Limited recommendation capabilities based on simple collaborative filtering
-- 8. Poor handling of multilingual content and cross-language search
-- 9. No support for image, audio, or multi-modal content search
-- 10. Maintenance-heavy similarity relationships that become stale
-- Attempt at content-based recommendation (ineffective)
CREATE OR REPLACE FUNCTION calculate_basic_similarity(doc1_id UUID, doc2_id UUID)
RETURNS DECIMAL AS $$
DECLARE
doc1_vector tsvector;
doc2_vector tsvector;
similarity_score DECIMAL;
BEGIN
SELECT search_vector INTO doc1_vector FROM documents WHERE document_id = doc1_id;
SELECT search_vector INTO doc2_vector FROM documents WHERE document_id = doc2_id;
-- Extremely limited similarity calculation
SELECT ts_rank(doc1_vector, plainto_tsquery('english',
array_to_string(
string_to_array(
regexp_replace(doc2_vector::text, '[^a-zA-Z0-9\s]', ' ', 'g'),
' '
),
' '
)
)) INTO similarity_score;
RETURN COALESCE(similarity_score, 0);
END;
$$ LANGUAGE plpgsql;
-- Manual batch similarity calculation (expensive and inaccurate)
INSERT INTO document_similarity (document_id_1, document_id_2, similarity_score, similarity_type)
SELECT
d1.document_id,
d2.document_id,
calculate_basic_similarity(d1.document_id, d2.document_id),
'keyword'
FROM documents d1
CROSS JOIN documents d2
WHERE d1.document_id != d2.document_id
AND d1.category = d2.category -- Only calculate within same category
AND NOT EXISTS (
SELECT 1 FROM document_similarity ds
WHERE ds.document_id_1 = d1.document_id
AND ds.document_id_2 = d2.document_id
)
LIMIT 10000; -- Batch processing required due to computational cost
-- Traditional approach limitations:
-- 1. No understanding of semantic meaning or context
-- 2. Poor performance with large document collections
-- 3. Manual maintenance of similarity relationships
-- 4. Limited multilingual and cross-domain search capabilities
-- 5. No support for natural language queries or conversational search
-- 6. Inability to handle synonyms and conceptual relationships
-- 7. No integration with modern AI/ML embedding models
-- 8. Poor recommendation quality based on simple keyword overlap
-- 9. No support for multi-modal content (images, videos, audio)
-- 10. Scalability issues with growing content collections
MongoDB Vector Search provides sophisticated AI-powered semantic capabilities:
// MongoDB Vector Search - advanced AI-powered semantic search with comprehensive embedding management
const { MongoClient } = require('mongodb');
const { OpenAI } = require('openai');
const tf = require('@tensorflow/tfjs-node');
const client = new MongoClient('mongodb+srv://username:[email protected]');
const db = client.db('advanced_ai_search_platform');
// Advanced AI-powered search and recommendation engine
class AdvancedVectorSearchEngine {
constructor(db, aiConfig = {}) {
this.db = db;
this.collections = {
documents: db.collection('documents'),
embeddings: db.collection('document_embeddings'),
userProfiles: db.collection('user_profiles'),
searchLogs: db.collection('search_logs'),
recommendations: db.collection('recommendations'),
modelMetadata: db.collection('model_metadata')
};
// AI model configuration
this.aiConfig = {
embeddingModel: aiConfig.embeddingModel || 'text-embedding-3-large',
embeddingDimensions: aiConfig.embeddingDimensions || 3072,
maxTokens: aiConfig.maxTokens || 8191,
batchSize: aiConfig.batchSize || 50,
similarityThreshold: aiConfig.similarityThreshold || 0.7,
// Advanced AI configurations
useMultimodalEmbeddings: aiConfig.useMultimodalEmbeddings || false,
enableSemanticCaching: aiConfig.enableSemanticCaching || true,
enableQueryExpansion: aiConfig.enableQueryExpansion || true,
enablePersonalization: aiConfig.enablePersonalization || true,
// Model providers
openaiApiKey: aiConfig.openaiApiKey || process.env.OPENAI_API_KEY,
huggingFaceApiKey: aiConfig.huggingFaceApiKey || process.env.HUGGINGFACE_API_KEY,
cohereApiKey: aiConfig.cohereApiKey || process.env.COHERE_API_KEY
};
// Initialize AI clients
this.openai = new OpenAI({ apiKey: this.aiConfig.openaiApiKey });
this.embeddingCache = new Map();
this.searchCache = new Map();
this.setupVectorSearchIndexes();
this.initializeEmbeddingModels();
}
async setupVectorSearchIndexes() {
console.log('Setting up MongoDB Vector Search indexes...');
try {
// Primary document embedding index
await this.collections.documents.createSearchIndex({
name: 'document_vector_index',
definition: {
fields: [
{
type: 'vector',
path: 'embedding',
numDimensions: this.aiConfig.embeddingDimensions,
similarity: 'cosine'
},
{
type: 'filter',
path: 'category'
},
{
type: 'filter',
path: 'language'
},
{
type: 'filter',
path: 'contentType'
},
{
type: 'filter',
path: 'accessLevel'
},
{
type: 'filter',
path: 'createdAt'
}
]
}
});
// Multi-modal content index for images and multimedia
await this.collections.documents.createSearchIndex({
name: 'multimodal_vector_index',
definition: {
fields: [
{
type: 'vector',
path: 'multimodalEmbedding',
numDimensions: 1536, // Different dimension for multi-modal models
similarity: 'cosine'
},
{
type: 'filter',
path: 'mediaType'
}
]
}
});
// User profile vector index for personalization
await this.collections.userProfiles.createSearchIndex({
name: 'user_profile_vector_index',
definition: {
fields: [
{
type: 'vector',
path: 'interestEmbedding',
numDimensions: this.aiConfig.embeddingDimensions,
similarity: 'cosine'
}
]
}
});
console.log('Vector Search indexes created successfully');
} catch (error) {
console.error('Error setting up Vector Search indexes:', error);
throw error;
}
}
async generateDocumentEmbedding(document, options = {}) {
console.log(`Generating embeddings for document: ${document.title}`);
try {
// Prepare content for embedding generation
const embeddingContent = this.prepareContentForEmbedding(document, options);
// Check cache first
const cacheKey = this.generateCacheKey(embeddingContent);
if (this.embeddingCache.has(cacheKey) && this.aiConfig.enableSemanticCaching) {
console.log('Using cached embedding');
return this.embeddingCache.get(cacheKey);
}
// Generate embedding using OpenAI
const embeddingResponse = await this.openai.embeddings.create({
model: this.aiConfig.embeddingModel,
input: embeddingContent,
dimensions: this.aiConfig.embeddingDimensions
});
const embedding = embeddingResponse.data[0].embedding;
// Cache the embedding
if (this.aiConfig.enableSemanticCaching) {
this.embeddingCache.set(cacheKey, embedding);
}
// Store embedding with comprehensive metadata
const embeddingDocument = {
documentId: document._id,
embedding: embedding,
// Embedding metadata
model: this.aiConfig.embeddingModel,
dimensions: this.aiConfig.embeddingDimensions,
contentLength: embeddingContent.length,
tokensUsed: embeddingResponse.usage?.total_tokens || 0,
// Content characteristics
contentType: document.contentType || 'text',
language: document.language || 'en',
category: document.category,
// Processing metadata
generatedAt: new Date(),
modelVersion: embeddingResponse.model,
processingTime: Date.now() - (options.startTime || Date.now()),
// Quality metrics
contentQuality: this.assessContentQuality(document),
embeddingNorm: this.calculateVectorNorm(embedding),
// Optimization metadata
batchProcessed: options.batchProcessed || false,
cacheHit: false
};
// Store in embedding collection for tracking
await this.collections.embeddings.insertOne(embeddingDocument);
// Update main document with embedding
await this.collections.documents.updateOne(
{ _id: document._id },
{
$set: {
embedding: embedding,
embeddingMetadata: {
model: this.aiConfig.embeddingModel,
generatedAt: new Date(),
dimensions: this.aiConfig.embeddingDimensions,
contentHash: this.generateContentHash(embeddingContent)
}
}
}
);
return embedding;
} catch (error) {
console.error(`Error generating embedding for document ${document._id}:`, error);
throw error;
}
}
prepareContentForEmbedding(document, options = {}) {
// Intelligent content preparation for optimal embedding generation
let content = '';
// Title with higher weight
if (document.title) {
content += `Title: ${document.title}\n\n`;
}
// Summary if available
if (document.summary) {
content += `Summary: ${document.summary}\n\n`;
}
// Main content with intelligent truncation
if (document.content) {
const maxContentLength = this.aiConfig.maxTokens * 0.7; // Reserve space for title/metadata
let mainContent = document.content;
if (mainContent.length > maxContentLength) {
// Intelligent content truncation - keep beginning and key sections
const beginningChunk = mainContent.substring(0, maxContentLength * 0.6);
const endingChunk = mainContent.substring(mainContent.length - maxContentLength * 0.2);
mainContent = beginningChunk + '\n...\n' + endingChunk;
}
content += `Content: ${mainContent}\n\n`;
}
// Metadata context
if (document.category) {
content += `Category: ${document.category}\n`;
}
if (document.tags && document.tags.length > 0) {
content += `Tags: ${document.tags.join(', ')}\n`;
}
if (document.keywords) {
content += `Keywords: ${document.keywords}\n`;
}
return content.trim();
}
async performSemanticSearch(query, options = {}) {
console.log(`Performing semantic search for: "${query}"`);
const startTime = Date.now();
try {
// Generate query embedding
const queryEmbedding = await this.generateQueryEmbedding(query, options);
// Build comprehensive search pipeline
const searchPipeline = await this.buildSemanticSearchPipeline(queryEmbedding, query, options);
// Execute vector search with MongoDB Atlas Vector Search
const searchResults = await this.collections.documents.aggregate(searchPipeline).toArray();
// Post-process and enhance results
const enhancedResults = await this.enhanceSearchResults(searchResults, query, options);
// Log search for analytics and improvement
await this.logSearchQuery(query, queryEmbedding, enhancedResults, options);
// Generate personalized recommendations if user context available
let personalizedRecommendations = [];
if (options.userId && this.aiConfig.enablePersonalization) {
personalizedRecommendations = await this.generatePersonalizedRecommendations(
options.userId,
enhancedResults.slice(0, 5),
options
);
}
return {
query: query,
results: enhancedResults,
personalizedRecommendations: personalizedRecommendations,
// Search metadata
metadata: {
totalResults: enhancedResults.length,
searchTime: Date.now() - startTime,
queryEmbeddingDimensions: queryEmbedding.length,
embeddingModel: this.aiConfig.embeddingModel,
similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
filtersApplied: this.extractAppliedFilters(options),
personalizationEnabled: this.aiConfig.enablePersonalization && !!options.userId
},
// Query insights
insights: {
queryComplexity: this.assessQueryComplexity(query),
semanticCategories: this.identifySemanticCategories(enhancedResults),
resultDiversity: this.calculateResultDiversity(enhancedResults),
averageSimilarity: this.calculateAverageSimilarity(enhancedResults)
},
// Related queries and suggestions
relatedQueries: await this.generateRelatedQueries(query, enhancedResults),
searchSuggestions: await this.generateSearchSuggestions(query, options)
};
} catch (error) {
console.error(`Semantic search error for query "${query}":`, error);
throw error;
}
}
async buildSemanticSearchPipeline(queryEmbedding, query, options = {}) {
const pipeline = [];
// Stage 1: Vector similarity search
pipeline.push({
$vectorSearch: {
index: options.multimodal ? 'multimodal_vector_index' : 'document_vector_index',
path: options.multimodal ? 'multimodalEmbedding' : 'embedding',
queryVector: queryEmbedding,
numCandidates: options.numCandidates || 1000,
limit: options.vectorSearchLimit || 100,
// Apply filters for performance and relevance
filter: this.buildSearchFilters(options)
}
});
// Stage 2: Add similarity score and metadata
pipeline.push({
$addFields: {
vectorSimilarityScore: { $meta: 'vectorSearchScore' },
searchMetadata: {
searchTime: new Date(),
searchQuery: query,
searchModel: this.aiConfig.embeddingModel
}
}
});
// Stage 3: Hybrid scoring combining vector similarity with text relevance
if (options.enableHybridSearch !== false) {
pipeline.push({
$addFields: {
// Text match scoring for hybrid approach
textMatchScore: {
$cond: {
if: { $regexMatch: { input: '$title', regex: query, options: 'i' } },
then: 0.3,
else: {
$cond: {
if: { $regexMatch: { input: '$content', regex: query, options: 'i' } },
then: 0.2,
else: 0
}
}
}
},
// Recency scoring
recencyScore: {
$switch: {
branches: [
{
case: { $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] },
then: 0.1
},
{
case: { $gte: ['$createdAt', new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)] },
then: 0.05
}
],
default: 0
}
},
// Popularity scoring based on user interactions
popularityScore: {
$multiply: [
{ $log10: { $add: [{ $ifNull: ['$metrics.viewCount', 0] }, 1] } },
0.05
]
},
// Content quality scoring
qualityScore: {
$multiply: [
{ $divide: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 10000] },
0.02
]
}
}
});
// Combined hybrid score
pipeline.push({
$addFields: {
hybridScore: {
$add: [
{ $multiply: ['$vectorSimilarityScore', 0.7] }, // Vector similarity weight
'$textMatchScore',
'$recencyScore',
'$popularityScore',
'$qualityScore'
]
}
}
});
}
// Stage 4: Apply similarity threshold filtering
pipeline.push({
$match: {
vectorSimilarityScore: {
$gte: options.similarityThreshold || this.aiConfig.similarityThreshold
}
}
});
// Stage 5: Lookup related collections for rich context
pipeline.push({
$lookup: {
from: 'users',
localField: 'createdBy',
foreignField: '_id',
as: 'authorInfo',
pipeline: [
{ $project: { name: 1, avatar: 1, expertise: 1, reputation: 1 } }
]
}
});
// Stage 6: Add computed fields for result enhancement
pipeline.push({
$addFields: {
// Content preview generation
contentPreview: {
$cond: {
if: { $gt: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 300] },
then: { $concat: [{ $substr: ['$content', 0, 300] }, '...'] },
else: '$content'
}
},
// Relevance category
relevanceCategory: {
$switch: {
branches: [
{ case: { $gte: ['$vectorSimilarityScore', 0.9] }, then: 'Highly Relevant' },
{ case: { $gte: ['$vectorSimilarityScore', 0.8] }, then: 'Very Relevant' },
{ case: { $gte: ['$vectorSimilarityScore', 0.7] }, then: 'Relevant' },
{ case: { $gte: ['$vectorSimilarityScore', 0.6] }, then: 'Moderately Relevant' }
],
default: 'Potentially Relevant'
}
},
// Author information
authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
authorExpertise: { $arrayElemAt: ['$authorInfo.expertise', 0] },
// Formatted metadata
formattedCreatedAt: {
$dateToString: {
format: '%Y-%m-%d',
date: '$createdAt'
}
}
}
});
// Stage 7: Final projection for clean output
pipeline.push({
$project: {
_id: 1,
title: 1,
contentPreview: 1,
category: 1,
tags: 1,
language: 1,
contentType: 1,
createdAt: 1,
formattedCreatedAt: 1,
// Scoring information
vectorSimilarityScore: { $round: ['$vectorSimilarityScore', 4] },
hybridScore: { $round: [{ $ifNull: ['$hybridScore', '$vectorSimilarityScore'] }, 4] },
relevanceCategory: 1,
// Author information
authorName: 1,
authorExpertise: 1,
// Access and metadata
accessLevel: 1,
downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] },
// Analytics metadata
metrics: {
viewCount: { $ifNull: ['$metrics.viewCount', 0] },
likeCount: { $ifNull: ['$metrics.likeCount', 0] },
shareCount: { $ifNull: ['$metrics.shareCount', 0] }
},
searchMetadata: 1
}
});
// Stage 8: Sort by hybrid score or vector similarity
const sortField = options.enableHybridSearch !== false ? 'hybridScore' : 'vectorSimilarityScore';
pipeline.push({ $sort: { [sortField]: -1 } });
// Stage 9: Apply final limit
pipeline.push({ $limit: options.limit || 20 });
return pipeline;
}
buildSearchFilters(options) {
const filters = {};
// Category filtering
if (options.category) {
filters.category = { $eq: options.category };
}
// Language filtering
if (options.language) {
filters.language = { $eq: options.language };
}
// Content type filtering
if (options.contentType) {
filters.contentType = { $eq: options.contentType };
}
// Access level filtering
if (options.accessLevel) {
filters.accessLevel = { $eq: options.accessLevel };
}
// Date range filtering
if (options.dateFrom || options.dateTo) {
filters.createdAt = {};
if (options.dateFrom) filters.createdAt.$gte = new Date(options.dateFrom);
if (options.dateTo) filters.createdAt.$lte = new Date(options.dateTo);
}
// Author filtering
if (options.authorId) {
filters.createdBy = { $eq: options.authorId };
}
// Tags filtering
if (options.tags && options.tags.length > 0) {
filters.tags = { $in: options.tags };
}
return filters;
}
async generateQueryEmbedding(query, options = {}) {
console.log(`Generating query embedding for: "${query}"`);
try {
// Enhance query with expansion if enabled
let enhancedQuery = query;
if (this.aiConfig.enableQueryExpansion && options.expandQuery !== false) {
enhancedQuery = await this.expandQuery(query, options);
}
// Generate embedding
const embeddingResponse = await this.openai.embeddings.create({
model: this.aiConfig.embeddingModel,
input: enhancedQuery,
dimensions: this.aiConfig.embeddingDimensions
});
return embeddingResponse.data[0].embedding;
} catch (error) {
console.error(`Error generating query embedding for "${query}":`, error);
throw error;
}
}
async expandQuery(query, options = {}) {
console.log(`Expanding query: "${query}"`);
try {
// Use GPT to expand the query with related terms and concepts
const expansionPrompt = `
Given the search query: "${query}"
Generate an expanded version that includes:
1. Synonyms and related terms
2. Alternative phrasings
3. Conceptually related topics
4. Common variations and abbreviations
Keep the expansion focused and relevant. Return only the expanded query text.
Original query: ${query}
Expanded query:`;
const completion = await this.openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: expansionPrompt }],
max_tokens: 150,
temperature: 0.3
});
const expandedQuery = completion.choices[0].message.content.trim();
console.log(`Query expanded to: "${expandedQuery}"`);
return expandedQuery;
} catch (error) {
console.error(`Error expanding query "${query}":`, error);
return query; // Fall back to original query
}
}
async generatePersonalizedRecommendations(userId, searchResults, options = {}) {
console.log(`Generating personalized recommendations for user: ${userId}`);
try {
// Get user profile and interaction history
const userProfile = await this.collections.userProfiles.findOne({ userId: userId });
if (!userProfile) {
console.log('No user profile found, returning general recommendations');
return this.generateGeneralRecommendations(searchResults, options);
}
// Generate personalized recommendations based on user interests
const recommendationPipeline = [
{
$vectorSearch: {
index: 'document_vector_index',
path: 'embedding',
queryVector: userProfile.interestEmbedding,
numCandidates: 500,
limit: 50,
filter: {
_id: { $nin: searchResults.map(r => r._id) }, // Exclude current results
accessLevel: { $in: ['public', 'user'] }
}
}
},
{
$addFields: {
personalizedScore: { $meta: 'vectorSearchScore' },
recommendationReason: 'Based on your interests and reading history'
}
},
{
$lookup: {
from: 'users',
localField: 'createdBy',
foreignField: '_id',
as: 'authorInfo',
pipeline: [{ $project: { name: 1, expertise: 1 } }]
}
},
{
$project: {
_id: 1,
title: 1,
category: 1,
tags: 1,
createdAt: 1,
personalizedScore: { $round: ['$personalizedScore', 4] },
recommendationReason: 1,
authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] }
}
},
{ $sort: { personalizedScore: -1 } },
{ $limit: options.recommendationLimit || 10 }
];
const recommendations = await this.collections.documents.aggregate(recommendationPipeline).toArray();
return recommendations;
} catch (error) {
console.error(`Error generating personalized recommendations for user ${userId}:`, error);
return [];
}
}
async enhanceSearchResults(results, query, options = {}) {
console.log(`Enhancing ${results.length} search results`);
try {
// Add result enhancements
const enhancedResults = await Promise.all(results.map(async (result, index) => {
// Calculate additional metadata
const enhancedResult = {
...result,
// Result ranking
rank: index + 1,
// Enhanced content preview with query highlighting
highlightedPreview: this.highlightQueryInText(result.contentPreview || '', query),
// Semantic category classification
semanticCategory: await this.classifyContentSemantics(result),
// Reading time estimation
estimatedReadingTime: this.estimateReadingTime(result.content || result.contentPreview || ''),
// Related concepts extraction
extractedConcepts: this.extractKeyConcepts(result.title + ' ' + (result.contentPreview || '')),
// Confidence scoring
confidenceScore: this.calculateConfidenceScore(result),
// Access recommendations
accessRecommendation: this.generateAccessRecommendation(result, options)
};
return enhancedResult;
}));
return enhancedResults;
} catch (error) {
console.error('Error enhancing search results:', error);
return results; // Return original results if enhancement fails
}
}
highlightQueryInText(text, query) {
if (!text || !query) return text;
// Simple highlighting - in production, use more sophisticated highlighting
const queryWords = query.toLowerCase().split(/\s+/);
let highlightedText = text;
queryWords.forEach(word => {
if (word.length > 2) { // Only highlight words longer than 2 characters
const regex = new RegExp(`\\b${word}\\b`, 'gi');
highlightedText = highlightedText.replace(regex, `**${word}**`);
}
});
return highlightedText;
}
estimateReadingTime(text) {
const wordsPerMinute = 250; // Average reading speed
const wordCount = text.split(/\s+/).length;
const readingTime = Math.ceil(wordCount / wordsPerMinute);
return {
minutes: readingTime,
wordCount: wordCount,
formattedTime: readingTime === 1 ? '1 minute' : `${readingTime} minutes`
};
}
extractKeyConcepts(text) {
// Simple concept extraction - in production, use NLP libraries
const concepts = [];
const words = text.toLowerCase().split(/\s+/);
// Technical terms and concepts (simplified approach)
const technicalTerms = [
'artificial intelligence', 'machine learning', 'deep learning', 'neural networks',
'data science', 'analytics', 'algorithm', 'optimization', 'automation',
'cloud computing', 'blockchain', 'cybersecurity', 'api', 'database'
];
technicalTerms.forEach(term => {
if (text.toLowerCase().includes(term)) {
concepts.push(term);
}
});
return concepts.slice(0, 5); // Return top 5 concepts
}
calculateConfidenceScore(result) {
// Multi-factor confidence calculation
let confidence = result.vectorSimilarityScore * 0.6; // Base similarity
// Content length factor
const contentLength = (result.content || result.contentPreview || '').length;
if (contentLength > 1000) confidence += 0.1;
if (contentLength > 3000) confidence += 0.1;
// Metadata completeness factor
if (result.category) confidence += 0.05;
if (result.tags && result.tags.length > 0) confidence += 0.05;
if (result.authorName) confidence += 0.05;
// Popularity factor
if (result.metrics.viewCount > 100) confidence += 0.05;
if (result.metrics.likeCount > 10) confidence += 0.05;
return Math.min(confidence, 1.0); // Cap at 1.0
}
generateAccessRecommendation(result, options) {
// Generate recommendations for how to use/access the content
const recommendations = [];
if (result.vectorSimilarityScore > 0.9) {
recommendations.push('Highly recommended - very relevant to your search');
}
if (result.metrics.viewCount > 1000) {
recommendations.push('Popular content - frequently viewed by users');
}
if (result.estimatedReadingTime && result.estimatedReadingTime.minutes <= 5) {
recommendations.push('Quick read - can be completed in a few minutes');
}
if (result.category === 'tutorial') {
recommendations.push('Step-by-step guidance available');
}
return recommendations;
}
async logSearchQuery(query, queryEmbedding, results, options) {
try {
const searchLog = {
query: query,
queryEmbedding: queryEmbedding,
userId: options.userId || null,
sessionId: options.sessionId || null,
// Search configuration
searchConfig: {
model: this.aiConfig.embeddingModel,
similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
limit: options.limit || 20,
enableHybridSearch: options.enableHybridSearch !== false,
enablePersonalization: this.aiConfig.enablePersonalization && !!options.userId
},
// Results metadata
resultsMetadata: {
totalResults: results.length,
averageSimilarity: results.length > 0 ?
results.reduce((sum, r) => sum + r.vectorSimilarityScore, 0) / results.length : 0,
topCategories: this.extractTopCategories(results),
searchTime: Date.now() - (options.startTime || Date.now())
},
// User context
userContext: {
ipAddress: options.ipAddress,
userAgent: options.userAgent,
referrer: options.referrer
},
timestamp: new Date()
};
await this.collections.searchLogs.insertOne(searchLog);
} catch (error) {
console.error('Error logging search query:', error);
// Don't throw - logging shouldn't break search
}
}
extractTopCategories(results) {
const categoryCount = {};
results.forEach(result => {
if (result.category) {
categoryCount[result.category] = (categoryCount[result.category] || 0) + 1;
}
});
return Object.entries(categoryCount)
.sort(([,a], [,b]) => b - a)
.slice(0, 5)
.map(([category, count]) => ({ category, count }));
}
// Additional utility methods for comprehensive vector search functionality
generateCacheKey(content) {
const crypto = require('crypto');
return crypto.createHash('sha256').update(content).digest('hex');
}
generateContentHash(content) {
const crypto = require('crypto');
return crypto.createHash('md5').update(content).digest('hex');
}
calculateVectorNorm(vector) {
return Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
}
assessContentQuality(document) {
let qualityScore = 0;
// Length factor
const contentLength = (document.content || '').length;
if (contentLength > 1000) qualityScore += 0.3;
if (contentLength > 5000) qualityScore += 0.2;
// Metadata completeness
if (document.title) qualityScore += 0.1;
if (document.summary) qualityScore += 0.1;
if (document.tags && document.tags.length > 0) qualityScore += 0.1;
if (document.category) qualityScore += 0.1;
// Structure indicators
if (document.content && document.content.includes('\n\n')) qualityScore += 0.1; // Paragraphs
return Math.min(qualityScore, 1.0);
}
}
// Benefits of MongoDB Vector Search for AI Applications:
// - Native vector similarity search with cosine similarity
// - Seamless integration with embedding models (OpenAI, Hugging Face, etc.)
// - High-performance vector indexing and retrieval at scale
// - Advanced filtering and hybrid search capabilities
// - Built-in support for multi-modal content (text, images, audio)
// - Personalization through user profile vector matching
// - Real-time search with low-latency vector operations
// - Comprehensive search analytics and query optimization
// - Integration with MongoDB's document model for rich metadata
// - Production-ready scalability with sharding and replication
module.exports = {
AdvancedVectorSearchEngine
};
Understanding MongoDB Vector Search Architecture
Advanced AI Integration Patterns and Semantic Search Optimization
Implement sophisticated vector search strategies for production AI applications:
// Production-ready MongoDB Vector Search with advanced AI integration and optimization patterns
class ProductionVectorSearchPlatform extends AdvancedVectorSearchEngine {
constructor(db, productionConfig) {
super(db, productionConfig);
this.productionConfig = {
...productionConfig,
multiModelSupport: true,
realtimeIndexing: true,
distributedEmbedding: true,
autoOptimization: true,
advancedAnalytics: true,
contentModeration: true
};
this.setupProductionOptimizations();
this.initializeAdvancedFeatures();
this.setupMonitoringAndAlerts();
}
async implementAdvancedSemanticCapabilities() {
console.log('Implementing advanced semantic capabilities...');
// Multi-model embedding strategy
const embeddingStrategy = {
textEmbeddings: {
primary: 'text-embedding-3-large',
fallback: 'text-embedding-ada-002',
specialized: {
code: 'code-search-babbage-code-001',
legal: 'text-similarity-curie-001',
medical: 'text-search-curie-doc-001'
}
},
multimodalEmbeddings: {
imageText: 'clip-vit-base-patch32',
audioText: 'wav2vec2-base-960h',
videoText: 'video-text-retrieval'
},
domainSpecific: {
scientific: 'scibert-scivocab-uncased',
financial: 'finbert-base-uncased',
biomedical: 'biobert-base-cased'
}
};
return await this.deployEmbeddingStrategy(embeddingStrategy);
}
async setupRealtimeSemanticIndexing() {
console.log('Setting up real-time semantic indexing...');
const indexingPipeline = {
// Change stream monitoring for real-time updates
changeStreams: [
{
collection: 'documents',
pipeline: [
{ $match: { 'operationType': { $in: ['insert', 'update'] } } }
],
handler: this.processDocumentChange.bind(this)
}
],
// Batch processing for bulk operations
batchProcessor: {
batchSize: 100,
maxWaitTime: 30000, // 30 seconds
retryLogic: true,
errorHandling: 'resilient'
},
// Quality assurance pipeline
qualityChecks: [
'contentValidation',
'languageDetection',
'duplicateDetection',
'contentModeration'
]
};
return await this.deployIndexingPipeline(indexingPipeline);
}
async implementAdvancedRecommendationEngine() {
console.log('Implementing advanced recommendation engine...');
const recommendationStrategies = {
// Collaborative filtering with vector embeddings
collaborative: {
userSimilarity: 'cosine',
itemSimilarity: 'cosine',
hybridWeighting: {
contentBased: 0.6,
collaborative: 0.4
}
},
// Content-based recommendations
contentBased: {
semanticSimilarity: true,
categoryWeighting: true,
temporalDecay: true,
diversityOptimization: true
},
// Deep learning recommendations
deepLearning: {
neuralCollaborativeFiltering: true,
sequentialRecommendations: true,
multiTaskLearning: true
}
};
return await this.deployRecommendationStrategies(recommendationStrategies);
}
async optimizeVectorSearchPerformance() {
console.log('Optimizing vector search performance...');
const optimizations = {
// Index optimization strategies
indexOptimization: {
approximateNearestNeighbor: true,
hierarchicalNavigableSmallWorld: true,
productQuantization: true,
localitySensitiveHashing: true
},
// Query optimization
queryOptimization: {
queryExpansion: true,
queryRewriting: true,
candidatePrefiltering: true,
adaptiveSimilarityThresholds: true
},
// Caching strategies
cachingStrategy: {
embeddingCache: '10GB',
resultCache: '5GB',
queryCache: '2GB',
indexCache: '20GB'
}
};
return await this.implementOptimizations(optimizations);
}
}
SQL-Style Vector Search Operations with QueryLeaf
QueryLeaf provides familiar SQL syntax for MongoDB Vector Search operations and AI-powered semantic queries:
-- QueryLeaf advanced vector search and AI operations with SQL-familiar syntax
-- Create vector search indexes for different content types and embedding models
CREATE VECTOR INDEX document_semantic_index
ON documents (
embedding VECTOR(3072) USING COSINE_SIMILARITY,
category,
language,
contentType,
accessLevel,
createdAt
)
WITH (
model = 'text-embedding-3-large',
auto_update = true,
optimization_level = 'performance',
-- Advanced index configuration
approximate_nn = true,
candidate_multiplier = 10,
ef_construction = 200,
m_connections = 16
);
CREATE VECTOR INDEX multimodal_content_index
ON documents (
multimodalEmbedding VECTOR(1536) USING COSINE_SIMILARITY,
mediaType,
contentFormat
)
WITH (
model = 'clip-vit-base-patch32',
multimodal = true
);
-- Advanced semantic search with vector similarity and hybrid scoring
WITH semantic_search AS (
SELECT
d.*,
-- Vector similarity search using embeddings
VECTOR_SEARCH(
d.embedding,
GENERATE_EMBEDDING(
'Find research papers about machine learning applications in healthcare diagnostics',
'text-embedding-3-large'
),
'COSINE'
) as vector_similarity,
-- Hybrid scoring combining vector and traditional text search
(
VECTOR_SEARCH(
d.embedding,
GENERATE_EMBEDDING(
'machine learning healthcare diagnostics medical AI',
'text-embedding-3-large'
),
'COSINE'
) * 0.7 +
MATCH_SCORE(d.title || ' ' || d.content, 'machine learning healthcare diagnostics') * 0.2 +
-- Recency boost
CASE
WHEN d.createdAt > CURRENT_DATE - INTERVAL '30 days' THEN 0.1
WHEN d.createdAt > CURRENT_DATE - INTERVAL '90 days' THEN 0.05
ELSE 0
END +
-- Quality and popularity boost
(LOG(d.metrics.citationCount + 1) * 0.02) +
(d.metrics.averageRating / 5.0 * 0.03)
) as hybrid_score
FROM documents d
WHERE
-- Vector similarity threshold
VECTOR_SEARCH(
d.embedding,
GENERATE_EMBEDDING(
'machine learning healthcare diagnostics',
'text-embedding-3-large'
),
'COSINE'
) > 0.75
-- Additional filters for precision
AND d.category IN ('research', 'academic', 'medical')
AND d.language = 'en'
AND d.accessLevel IN ('public', 'academic')
AND d.contentType = 'research_paper'
),
-- Enhanced search with semantic category classification and concept extraction
enriched_results AS (
SELECT
ss.*,
-- Semantic category classification using AI
AI_CLASSIFY_CATEGORY(
ss.title || ' ' || SUBSTRING(ss.content, 1, 1000),
['machine_learning', 'healthcare', 'diagnostics', 'medical_imaging', 'clinical_ai']
) as semantic_categories,
-- Key concept extraction
AI_EXTRACT_CONCEPTS(
ss.title || ' ' || ss.abstract,
10 -- top 10 concepts
) as key_concepts,
-- Content summary generation
AI_SUMMARIZE(
ss.content,
max_length => 200,
style => 'academic'
) as ai_summary,
-- Reading difficulty assessment
AI_ASSESS_DIFFICULTY(
ss.content,
domain => 'medical'
) as reading_difficulty,
-- Related research identification
FIND_SIMILAR_DOCUMENTS(
ss.embedding,
limit => 5,
exclude_ids => ARRAY[ss.document_id],
similarity_threshold => 0.8
) as related_research,
-- Citation and reference analysis
ANALYZE_CITATIONS(ss.content) as citation_analysis,
-- Author expertise scoring
u.expertise_score,
u.h_index,
u.research_domains,
-- Impact metrics
CALCULATE_IMPACT_SCORE(
ss.metrics.citationCount,
ss.metrics.downloadCount,
ss.metrics.viewCount,
ss.createdAt
) as impact_score
FROM semantic_search ss
JOIN users u ON ss.createdBy = u.user_id
WHERE ss.vector_similarity > 0.7
),
-- Personalized recommendations based on user research interests
personalized_recommendations AS (
SELECT
er.*,
-- User interest alignment scoring
VECTOR_SIMILARITY(
er.embedding,
(SELECT interest_embedding FROM user_profiles WHERE user_id = CURRENT_USER_ID()),
'COSINE'
) as interest_alignment,
-- Reading history similarity
CALCULATE_READING_HISTORY_SIMILARITY(
CURRENT_USER_ID(),
er.document_id,
window_days => 180
) as reading_history_similarity,
-- Collaborative filtering score
COLLABORATIVE_FILTERING_SCORE(
CURRENT_USER_ID(),
er.document_id,
algorithm => 'neural_collaborative_filtering'
) as collaborative_score,
-- Personalized relevance scoring
(
er.hybrid_score * 0.5 +
interest_alignment * 0.3 +
reading_history_similarity * 0.1 +
collaborative_score * 0.1
) as personalized_relevance
FROM enriched_results er
WHERE interest_alignment > 0.6
),
-- Advanced analytics and search insights
search_analytics AS (
SELECT
COUNT(*) as total_results,
AVG(pr.vector_similarity) as avg_similarity,
AVG(pr.hybrid_score) as avg_hybrid_score,
AVG(pr.personalized_relevance) as avg_personalized_relevance,
-- Category distribution analysis
JSON_OBJECT_AGG(
pr.category,
COUNT(*)
) as category_distribution,
-- Semantic category insights
FLATTEN_ARRAY(
ARRAY_AGG(pr.semantic_categories)
) as all_semantic_categories,
-- Concept frequency analysis
AI_ANALYZE_CONCEPT_TRENDS(
ARRAY_AGG(pr.key_concepts),
time_window => '30 days'
) as concept_trends,
-- Research domain coverage
CALCULATE_DOMAIN_COVERAGE(
ARRAY_AGG(pr.research_domains)
) as domain_coverage,
-- Quality distribution
JSON_OBJECT(
'high_impact', COUNT(*) FILTER (WHERE pr.impact_score > 80),
'medium_impact', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 50 AND 80),
'emerging', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 20 AND 50),
'new_research', COUNT(*) FILTER (WHERE pr.impact_score < 20)
) as quality_distribution
FROM personalized_recommendations pr
)
-- Final comprehensive search results with analytics and recommendations
SELECT
-- Document information
pr.document_id,
pr.title,
pr.ai_summary,
pr.category,
pr.semantic_categories,
pr.key_concepts,
pr.reading_difficulty,
pr.createdAt,
-- Author information
JSON_OBJECT(
'name', u.name,
'expertise_score', pr.expertise_score,
'h_index', pr.h_index,
'research_domains', pr.research_domains
) as author_info,
-- Relevance scoring
ROUND(pr.vector_similarity, 4) as semantic_similarity,
ROUND(pr.hybrid_score, 4) as hybrid_relevance,
ROUND(pr.personalized_relevance, 4) as personalized_score,
ROUND(pr.interest_alignment, 4) as interest_match,
-- Content characteristics
pr.reading_difficulty,
pr.impact_score,
pr.citation_analysis,
-- Related content
pr.related_research,
-- Access information
CASE pr.accessLevel
WHEN 'public' THEN 'Open Access'
WHEN 'academic' THEN 'Academic Access Required'
WHEN 'subscription' THEN 'Subscription Required'
ELSE 'Restricted Access'
END as access_type,
-- Download and interaction URLs
CONCAT('/api/documents/', pr.document_id, '/download') as download_url,
CONCAT('/api/documents/', pr.document_id, '/cite') as citation_url,
CONCAT('/api/documents/', pr.document_id, '/related') as related_url,
-- Recommendation metadata
JSON_OBJECT(
'recommendation_reason', CASE
WHEN pr.interest_alignment > 0.9 THEN 'Highly aligned with your research interests'
WHEN pr.collaborative_score > 0.8 THEN 'Recommended by researchers with similar interests'
WHEN pr.reading_history_similarity > 0.7 THEN 'Similar to your recent reading patterns'
ELSE 'Semantically relevant to your search'
END,
'confidence_level', CASE
WHEN pr.personalized_relevance > 0.9 THEN 'Very High'
WHEN pr.personalized_relevance > 0.8 THEN 'High'
WHEN pr.personalized_relevance > 0.7 THEN 'Medium'
ELSE 'Low'
END
) as recommendation_metadata,
-- Search analytics (same for all results)
(SELECT ROW_TO_JSON(sa.*) FROM search_analytics sa) as search_insights
FROM personalized_recommendations pr
JOIN users u ON pr.createdBy = u.user_id
WHERE pr.personalized_relevance > 0.6
ORDER BY pr.personalized_relevance DESC
LIMIT 20;
-- Advanced vector operations for content discovery and analysis
-- Find conceptually similar documents across different languages
WITH multilingual_search AS (
SELECT
d.document_id,
d.title,
d.language,
d.category,
-- Cross-language semantic similarity
VECTOR_SEARCH(
d.embedding,
GENERATE_MULTILINGUAL_EMBEDDING(
'intelligence artificielle apprentissage automatique', -- French query
source_language => 'fr',
target_embedding_language => 'en'
),
'COSINE'
) as cross_language_similarity
FROM documents d
WHERE d.language IN ('en', 'fr', 'de', 'es', 'zh')
AND VECTOR_SEARCH(
d.embedding,
GENERATE_MULTILINGUAL_EMBEDDING(
'intelligence artificielle apprentissage automatique',
source_language => 'fr',
target_embedding_language => 'en'
),
'COSINE'
) > 0.8
)
SELECT * FROM multilingual_search
ORDER BY cross_language_similarity DESC;
-- Content recommendation based on user behavior patterns
CREATE VIEW personalized_content_feed AS
WITH user_interaction_embedding AS (
SELECT
ui.user_id,
-- Generate user interest embedding from interaction history
AGGREGATE_EMBEDDINGS(
ARRAY_AGG(d.embedding),
weights => ARRAY_AGG(
CASE ui.interaction_type
WHEN 'download' THEN 1.0
WHEN 'like' THEN 0.8
WHEN 'share' THEN 0.9
WHEN 'view' THEN 0.3
ELSE 0.1
END *
-- Temporal decay
GREATEST(0.1, 1.0 - EXTRACT(DAYS FROM CURRENT_DATE - ui.interaction_timestamp) / 365.0)
),
aggregation_method => 'weighted_average'
) as interest_embedding
FROM user_interactions ui
JOIN documents d ON ui.document_id = d.document_id
WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '1 year'
GROUP BY ui.user_id
),
content_recommendations AS (
SELECT
uie.user_id,
d.document_id,
d.title,
d.category,
d.createdAt,
-- Interest-based similarity
VECTOR_SIMILARITY(
d.embedding,
uie.interest_embedding,
'COSINE'
) as interest_similarity,
-- Trending factor
CALCULATE_TRENDING_SCORE(
d.document_id,
time_window => '7 days'
) as trending_score,
-- Novelty factor (encourages discovery)
CALCULATE_NOVELTY_SCORE(
uie.user_id,
d.document_id,
d.category
) as novelty_score,
-- Combined recommendation score
(
VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') * 0.6 +
CALCULATE_TRENDING_SCORE(d.document_id, time_window => '7 days') * 0.2 +
CALCULATE_NOVELTY_SCORE(uie.user_id, d.document_id, d.category) * 0.2
) as recommendation_score
FROM user_interaction_embedding uie
CROSS JOIN documents d
WHERE NOT EXISTS (
-- Exclude already interacted content
SELECT 1 FROM user_interactions ui2
WHERE ui2.user_id = uie.user_id
AND ui2.document_id = d.document_id
)
AND VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') > 0.7
)
SELECT
user_id,
document_id,
title,
category,
ROUND(interest_similarity, 4) as interest_match,
ROUND(trending_score, 4) as trending_score,
ROUND(novelty_score, 4) as discovery_potential,
ROUND(recommendation_score, 4) as overall_score,
-- Recommendation explanation
CASE
WHEN interest_similarity > 0.9 THEN 'Perfect match for your interests'
WHEN trending_score > 0.8 THEN 'Trending content in your area'
WHEN novelty_score > 0.7 THEN 'New topic for you to explore'
ELSE 'Related to your reading patterns'
END as recommendation_reason
FROM content_recommendations
WHERE recommendation_score > 0.75
ORDER BY user_id, recommendation_score DESC;
-- Advanced analytics for content optimization and performance monitoring
WITH vector_search_analytics AS (
SELECT
-- Search performance metrics
sl.query,
COUNT(*) as search_frequency,
AVG(sl.resultsMetadata.totalResults) as avg_results_count,
AVG(sl.resultsMetadata.averageSimilarity) as avg_similarity_score,
AVG(sl.resultsMetadata.searchTime) as avg_search_time_ms,
-- Query characteristics
AI_ANALYZE_QUERY_INTENT(sl.query) as query_intent,
AI_EXTRACT_ENTITIES(sl.query) as query_entities,
LENGTH(sl.query) as query_length,
-- Result quality metrics
AVG(
(SELECT COUNT(*) FROM JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.topCategories))
) as category_diversity,
-- User engagement with results
COALESCE(
(
SELECT AVG(ui.rating)
FROM user_interactions ui
WHERE ui.session_id = sl.sessionId
AND ui.interaction_timestamp >= sl.timestamp
AND ui.interaction_timestamp <= sl.timestamp + INTERVAL '1 hour'
), 0
) as result_satisfaction_score
FROM search_logs sl
WHERE sl.timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY sl.query, AI_ANALYZE_QUERY_INTENT(sl.query), AI_EXTRACT_ENTITIES(sl.query), LENGTH(sl.query)
),
content_performance_analysis AS (
SELECT
d.document_id,
d.title,
d.category,
d.createdAt,
-- Discoverability metrics
COUNT(sl.query) as times_found_in_search,
AVG(sl.resultsMetadata.averageSimilarity) as avg_search_relevance,
-- Engagement metrics
COUNT(ui.interaction_id) as total_interactions,
COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view') as view_count,
COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download') as download_count,
AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,
-- Content optimization recommendations
CASE
WHEN COUNT(sl.query) < 5 THEN 'Low discoverability - consider SEO optimization'
WHEN AVG(sl.resultsMetadata.averageSimilarity) < 0.7 THEN 'Low relevance - review content structure'
WHEN COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download') /
NULLIF(COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view'), 0) < 0.1
THEN 'Low conversion - improve content value proposition'
ELSE 'Performance within normal parameters'
END as optimization_recommendation
FROM documents d
LEFT JOIN search_logs sl ON d.document_id = ANY(
SELECT JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.resultIds)::UUID
)
LEFT JOIN user_interactions ui ON d.document_id = ui.document_id
WHERE d.createdAt >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY d.document_id, d.title, d.category, d.createdAt
)
SELECT
-- Search analytics summary
vsa.query,
vsa.search_frequency,
vsa.query_intent,
vsa.query_entities,
ROUND(vsa.avg_similarity_score, 3) as avg_relevance,
ROUND(vsa.avg_search_time_ms, 1) as avg_response_time_ms,
ROUND(vsa.result_satisfaction_score, 2) as user_satisfaction,
-- Content performance insights
cpa.title as top_performing_content,
cpa.times_found_in_search,
cpa.total_interactions,
cpa.optimization_recommendation,
-- Improvement recommendations
CASE
WHEN vsa.avg_search_time_ms > 1000 THEN 'Consider index optimization'
WHEN vsa.avg_similarity_score < 0.7 THEN 'Review embedding model performance'
WHEN vsa.result_satisfaction_score < 3.0 THEN 'Improve result quality and relevance'
ELSE 'Search performance is optimal'
END as search_optimization_recommendation
FROM vector_search_analytics vsa
LEFT JOIN content_performance_analysis cpa ON true
WHERE vsa.search_frequency > 10 -- Focus on frequently searched queries
ORDER BY vsa.search_frequency DESC, vsa.result_satisfaction_score DESC
LIMIT 50;
-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. Native vector similarity search with advanced embedding models
-- 2. Hybrid scoring combining semantic and traditional text search
-- 3. Personalized recommendations based on user interest embeddings
-- 4. Multi-language semantic search with cross-language understanding
-- 5. Real-time content recommendations and discovery systems
-- 6. Advanced analytics for search optimization and content performance
-- 7. AI-powered content classification and concept extraction
-- 8. Production-ready vector indexing with performance optimization
-- 9. Comprehensive search logging and user behavior analysis
-- 10. SQL-familiar syntax for complex vector operations and AI workflows
Best Practices for Production Vector Search Implementation
Embedding Strategy and Model Selection
Essential principles for effective MongoDB Vector Search deployment:
- Model Selection: Choose appropriate embedding models based on content type, domain, and language requirements
- Embedding Quality: Implement comprehensive content preparation and preprocessing for optimal embedding generation
- Index Optimization: Configure vector indexes with appropriate similarity metrics and performance parameters
- Hybrid Approach: Combine vector similarity with traditional text search for comprehensive relevance scoring
- Personalization: Implement user profile embeddings for personalized search and recommendation experiences
- Performance Monitoring: Track search performance, result quality, and user satisfaction metrics continuously
Scalability and Performance Optimization
Optimize vector search deployments for production-scale requirements:
- Index Strategy: Design efficient vector indexes with appropriate dimensionality and similarity algorithms
- Caching Implementation: Implement multi-tier caching for embeddings, queries, and search results
- Batch Processing: Optimize embedding generation and indexing through intelligent batch processing
- Query Optimization: Implement query expansion, rewriting, and adaptive similarity thresholds
- Resource Management: Monitor and optimize computational resources for embedding generation and vector operations
- Distribution Strategy: Design sharding and replication strategies for large-scale vector collections
Conclusion
MongoDB Vector Search provides comprehensive AI-powered semantic search capabilities that enable natural language queries, intelligent content discovery, and sophisticated recommendation systems through high-dimensional vector embeddings and advanced similarity algorithms. The native MongoDB integration ensures that vector search benefits from the same scalability, performance, and operational features as traditional database operations.
Key MongoDB Vector Search benefits include:
- Semantic Understanding: AI-powered semantic search that understands meaning and context beyond keyword matching
- Advanced Similarity: Sophisticated vector similarity algorithms with cosine similarity and approximate nearest neighbor search
- Hybrid Capabilities: Seamless integration of vector similarity with traditional text search and metadata filtering
- Personalization: User profile embeddings for personalized search results and intelligent recommendations
- Multi-Modal Support: Vector search across text, images, audio, and multi-modal content with unified similarity operations
- Production Ready: High-performance vector indexing with automatic optimization and comprehensive analytics
Whether you're building AI-powered search applications, recommendation engines, content discovery platforms, or intelligent document retrieval systems, MongoDB Vector Search with QueryLeaf's familiar SQL interface provides the foundation for sophisticated semantic capabilities.
QueryLeaf Integration: QueryLeaf automatically manages MongoDB Vector Search operations while providing SQL-familiar syntax for vector similarity queries, embedding generation, and AI-powered content discovery. Advanced vector search patterns, personalization algorithms, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI capabilities accessible to SQL-oriented development teams.
The combination of MongoDB's robust vector search capabilities with SQL-style AI operations makes it an ideal platform for modern AI applications that require both advanced semantic understanding and familiar database management patterns, ensuring your AI-powered search solutions can scale efficiently while remaining maintainable and feature-rich.