MongoDB Vector Search for Semantic Applications: Building AI-Powered Search with SQL-Style Vector Operations

Modern applications increasingly require intelligent search capabilities that understand semantic meaning rather than just matching keywords. Traditional text-based approaches struggle to capture context, handle synonyms, and return relevant results for complex queries that call for conceptual understanding rather than exact text matches.

MongoDB Atlas Vector Search provides native vector database capabilities that enable semantic similarity search, recommendation systems, and retrieval-augmented generation (RAG) applications. Unlike standalone vector databases that require separate infrastructure, Atlas Vector Search integrates seamlessly with MongoDB's document model, allowing developers to combine traditional database operations with advanced AI-powered search in a single, unified platform.
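
As a rough illustration of that unified model, a single collection document can store its operational fields and the embedding used for semantic search side by side (the field names and 1536-dimension vector below are illustrative):

// Illustrative document shape: regular fields plus a vector embedding in one document
{
  _id: ObjectId("65a1f2c8e4b0a1b2c3d4e5f6"),
  title: "Introduction to Vector Search",
  category: "AI",
  tags: ["search", "embeddings"],
  publishedDate: ISODate("2024-01-15T00:00:00Z"),
  contentVector: [0.021, -0.187, 0.094, /* ...1536 floating-point dimensions in total... */ 0.042]
}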

The Limitations of Traditional Search

Conventional approaches to search and content discovery have significant limitations for modern intelligent applications:

-- Traditional relational search - limited semantic understanding

-- PostgreSQL full-text search with performance and relevance challenges
-- (the pg_trgm extension is required for the trigram similarity() calls below)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE documents (
  document_id SERIAL PRIMARY KEY,
  title VARCHAR(500) NOT NULL,
  content TEXT NOT NULL,
  category VARCHAR(100),
  tags TEXT[],
  author VARCHAR(200),
  created_at TIMESTAMP DEFAULT NOW(),

  -- Full-text search vector (keyword-based only)
  search_vector tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', content), 'B') ||
    setweight(to_tsvector('english', array_to_string(tags, ' ')), 'C')
  ) STORED
);

-- Create full-text search index
CREATE INDEX idx_documents_fts ON documents USING GIN(search_vector);

-- Additional indexes for filtering
CREATE INDEX idx_documents_category ON documents(category);
CREATE INDEX idx_documents_created_at ON documents(created_at DESC);
CREATE INDEX idx_documents_author ON documents(author);

-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
  SELECT 
    document_id,
    title,
    content,
    category,
    tags,
    author,
    created_at,

    -- Basic relevance scoring (keyword-based only)
    ts_rank_cd(search_vector, plainto_tsquery('english', 'machine learning algorithms')) as relevance_score,

    -- Highlight matching text
    ts_headline('english', content, plainto_tsquery('english', 'machine learning algorithms'), 
                'MaxWords=50, MinWords=20, ShortWord=3, HighlightAll=false') as highlighted_content,

    -- Basic similarity using trigram matching (very limited)
    similarity(title, 'machine learning algorithms') as title_similarity,

    -- Category boosting (manual relevance adjustment)
    CASE category 
      WHEN 'AI' THEN 1.5 
      WHEN 'Technology' THEN 1.2 
      ELSE 1.0 
    END as category_boost

  FROM documents
  WHERE search_vector @@ plainto_tsquery('english', 'machine learning algorithms')
     OR similarity(title, 'machine learning algorithms') > 0.1
),

ranked_results AS (
  SELECT 
    *,
    -- Combined relevance scoring (still keyword-dependent)
    (relevance_score * category_boost * 
     CASE WHEN title_similarity > 0.3 THEN 2.0 ELSE 1.0 END) as final_score,

    -- Manual semantic grouping (limited effectiveness)
    CASE 
      WHEN content ILIKE '%neural network%' OR content ILIKE '%deep learning%' THEN 'Deep Learning'
      WHEN content ILIKE '%statistics%' OR content ILIKE '%data science%' THEN 'Data Science' 
      WHEN content ILIKE '%algorithm%' OR content ILIKE '%optimization%' THEN 'Algorithms'
      ELSE 'General'
    END as semantic_category,

    -- Time decay factor
    CASE 
      WHEN created_at >= NOW() - INTERVAL '30 days' THEN 1.2
      WHEN created_at >= NOW() - INTERVAL '90 days' THEN 1.0
      WHEN created_at >= NOW() - INTERVAL '1 year' THEN 0.8
      ELSE 0.6
    END as recency_boost

  FROM search_query
  WHERE relevance_score > 0.01
),

related_documents AS (
  -- Attempt to find related documents (very basic approach)
  SELECT DISTINCT
    r1.document_id,
    r2.document_id as related_id,
    r2.title as related_title,

    -- Basic relatedness calculation
    (array_length(array(SELECT UNNEST(r1.tags) INTERSECT SELECT UNNEST(r2.tags)), 1) / 
     GREATEST(array_length(r1.tags, 1), array_length(r2.tags, 1))::numeric) as tag_similarity,

    CASE WHEN r1.category = r2.category THEN 0.3 ELSE 0 END as category_match,
    CASE WHEN r1.author = r2.author THEN 0.2 ELSE 0 END as author_match

  FROM ranked_results r1
  JOIN documents r2 ON r1.document_id != r2.document_id
  WHERE r1.final_score > 0.5
),

final_results AS (
  SELECT 
    r.document_id,
    r.title,
    LEFT(r.content, 200) || '...' as content_preview,
    r.highlighted_content,
    r.category,
    r.semantic_category,
    r.author,
    r.created_at,

    -- Final ranking with all factors
    ROUND((r.final_score * r.recency_boost)::numeric, 4) as final_relevance_score,

    -- Related documents (limited to the five strongest keyword/metadata overlaps)
    COALESCE(
      to_json(ARRAY(
        SELECT json_build_object(
          'id', rd.related_id,
          'title', rd.related_title,
          'similarity', ROUND((rd.tag_similarity + rd.category_match + rd.author_match)::numeric, 3)
        )
        FROM related_documents rd
        WHERE rd.document_id = r.document_id
          AND (rd.tag_similarity + rd.category_match + rd.author_match) > 0.1
        ORDER BY (rd.tag_similarity + rd.category_match + rd.author_match) DESC
        LIMIT 5
      )),
      '[]'::json
    ) as related_documents

  FROM ranked_results r
)

SELECT 
  document_id,
  title,
  content_preview,
  highlighted_content,
  category,
  semantic_category,
  author,
  final_relevance_score,
  related_documents,

  -- Search result metadata
  COUNT(*) OVER () as total_results,
  ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as result_rank

FROM final_results
ORDER BY final_relevance_score DESC, created_at DESC
LIMIT 20;

-- Problems with traditional keyword-based search:
-- 1. No understanding of semantic meaning or context
-- 2. Cannot handle synonyms, related concepts, or conceptual queries
-- 3. Limited relevance scoring based only on keyword frequency and position  
-- 4. Poor handling of multilingual content and cross-language search
-- 5. No support for similarity search across different content types
-- 6. Manual and error-prone relevance tuning with limited effectiveness
-- 7. Cannot understand user intent beyond explicit keyword matches
-- 8. Poor recommendation capabilities based only on metadata overlap
-- 9. Limited support for complex search patterns and AI-powered features
-- 10. No integration with modern machine learning and embedding models

-- MySQL approach (even more limited)
SELECT 
  document_id,
  title,
  content,
  category,

  -- Basic full-text search (MySQL limitations)
  MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE) as relevance,

  -- Simple keyword highlighting
  REPLACE(
    REPLACE(title, 'machine', '<mark>machine</mark>'), 
    'learning', '<mark>learning</mark>'
  ) as highlighted_title

FROM mysql_documents
WHERE MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC
LIMIT 10;

-- MySQL limitations:
-- - Very basic full-text search with limited relevance algorithms
-- - No semantic understanding or contextual matching
-- - Limited text processing and language support
-- - Basic relevance scoring without advanced ranking factors
-- - No support for vector embeddings or similarity search
-- - Limited customization of search behavior and ranking
-- - Poor performance with large text corpuses
-- - No integration with modern AI/ML search techniques

MongoDB Atlas Vector Search provides intelligent semantic search capabilities:

// MongoDB Atlas Vector Search - AI-powered semantic search and similarity matching
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb+srv://your-cluster.mongodb.net/');
const db = client.db('intelligent_search_platform');

// Advanced vector search and semantic similarity platform
class VectorSearchManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      documents: db.collection('documents'),
      vectorIndex: db.collection('vector_index_metadata'),
      searchAnalytics: db.collection('search_analytics'),
      userProfiles: db.collection('user_profiles'),
      recommendations: db.collection('recommendations')
    };

    // Vector search configuration
    this.vectorConfig = {
      dimensions: 1536, // OpenAI text-embedding-ada-002
      similarity: 'cosine',
      indexType: 'vectorSearch'
    };

    this.embeddingModel = 'text-embedding-ada-002'; // Can be configured for different models
  }

  async initializeVectorSearchIndexes() {
    console.log('Initializing Atlas Vector Search indexes...');

    // Create vector search index for document content
    const contentVectorIndex = {
      name: 'content_vector_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'filter',
            path: 'category'
          },
          {
            type: 'filter', 
            path: 'tags'
          },
          {
            type: 'filter',
            path: 'publishedDate'
          },
          {
            type: 'filter',
            path: 'author'
          },
          {
            type: 'filter',
            path: 'contentType'
          }
        ]
      }
    };

    // Create vector search index for title embeddings
    const titleVectorIndex = {
      name: 'title_vector_index', 
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'titleVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          }
        ]
      }
    };

    // Create hybrid search index combining vector and text search
    // (in current Atlas terms this is typically split into a separate Atlas Search
    // index for the text/autocomplete mappings alongside the vector search index;
    // the combined structure below is shown for illustration)
    const hybridSearchIndex = {
      name: 'hybrid_search_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'autocomplete',
            path: 'title',
            tokenization: 'edgeGram',
            minGrams: 2,
            maxGrams: 15
          },
          {
            type: 'text',
            path: 'content',
            analyzer: 'lucene.standard'
          },
          {
            type: 'text',
            path: 'tags',
            analyzer: 'lucene.keyword'
          }
        ]
      }
    };

    try {
      // Note: In practice, vector search indexes are created through MongoDB Atlas UI
      // or MongoDB CLI. This code shows the structure for reference.
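      //
      // Hedged sketch: with a recent Node.js driver (v6.x+) connected to Atlas, the
      // same definitions can also be applied programmatically, for example:
      //
      //   await this.collections.documents.createSearchIndex({
      //     name: contentVectorIndex.name,
      //     type: 'vectorSearch',
      //     definition: contentVectorIndex.definition
      //   });
      //
      // Search index builds are asynchronous; poll listSearchIndexes() until the
      // index reports a READY status before querying it.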
      console.log('Vector search indexes configured:');
      console.log('- Content Vector Index:', contentVectorIndex.name);
      console.log('- Title Vector Index:', titleVectorIndex.name); 
      console.log('- Hybrid Search Index:', hybridSearchIndex.name);

      // Store index metadata for application reference
      await this.collections.vectorIndex.insertMany([
        { ...contentVectorIndex, createdAt: new Date(), status: 'active' },
        { ...titleVectorIndex, createdAt: new Date(), status: 'active' },
        { ...hybridSearchIndex, createdAt: new Date(), status: 'active' }
      ]);

      return {
        contentVectorIndex: contentVectorIndex.name,
        titleVectorIndex: titleVectorIndex.name,
        hybridSearchIndex: hybridSearchIndex.name
      };

    } catch (error) {
      console.error('Vector index initialization failed:', error);
      throw error;
    }
  }

  async ingestDocumentsWithVectorization(documents) {
    console.log(`Processing ${documents.length} documents for vector search ingestion...`);

    const processedDocuments = [];
    const batchSize = 10;

    // Process documents in batches to manage API rate limits
    for (let i = 0; i < documents.length; i += batchSize) {
      const batch = documents.slice(i, i + batchSize);

      console.log(`Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(documents.length / batchSize)}`);

      const batchPromises = batch.map(async (doc) => {
        try {
          // Generate embeddings for title and content
          const [titleEmbedding, contentEmbedding] = await Promise.all([
            this.generateEmbedding(doc.title),
            this.generateEmbedding(doc.content)
          ]);

          // Extract key phrases and entities for enhanced searchability
          const extractedEntities = await this.extractEntities(doc.content);
          const keyPhrases = await this.extractKeyPhrases(doc.content);

          // Calculate content characteristics for better matching
          const contentCharacteristics = this.analyzeContentCharacteristics(doc.content);

          return {
            _id: doc._id || new ObjectId(),

            // Original document content
            title: doc.title,
            content: doc.content,
            summary: doc.summary || this.generateSummary(doc.content),

            // Document metadata
            category: doc.category,
            tags: doc.tags || [],
            author: doc.author,
            publishedDate: doc.publishedDate || new Date(),
            contentType: doc.contentType || 'article',
            language: doc.language || 'en',

            // Vector embeddings for semantic search
            titleVector: titleEmbedding,
            contentVector: contentEmbedding,

            // Enhanced searchability features
            entities: extractedEntities,
            keyPhrases: keyPhrases,
            contentCharacteristics: contentCharacteristics,

            // Search optimization metadata
            searchMetadata: {
              wordCount: doc.content.split(/\s+/).length,
              readingTime: Math.ceil(doc.content.split(/\s+/).length / 200), // minutes
              complexity: contentCharacteristics.complexity,
              topicDistribution: contentCharacteristics.topics,
              sentimentScore: contentCharacteristics.sentiment
            },

            // Document quality and authority signals
            qualitySignals: {
              authorityScore: doc.authorityScore || 0.5,
              freshnessScore: this.calculateFreshnessScore(doc.publishedDate || new Date()),
              engagementScore: doc.engagementScore || 0.5,
              accuracyScore: doc.accuracyScore || 0.8
            },

            // Indexing and processing metadata
            indexed: true,
            indexedAt: new Date(),
            vectorModelVersion: this.embeddingModel,
            processingVersion: '1.0'
          };

        } catch (error) {
          console.error(`Failed to process document ${doc._id}:`, error);
          return null;
        }
      });

      const batchResults = await Promise.all(batchPromises);
      const validResults = batchResults.filter(result => result !== null);
      processedDocuments.push(...validResults);

      // Rate limiting pause between batches
      if (i + batchSize < documents.length) {
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    // Bulk insert processed documents
    if (processedDocuments.length > 0) {
      const insertResult = await this.collections.documents.insertMany(processedDocuments, {
        ordered: false
      });

      console.log(`Successfully indexed ${insertResult.insertedCount} documents with vector embeddings`);

      return {
        totalProcessed: documents.length,
        successfullyIndexed: insertResult.insertedCount,
        failed: documents.length - processedDocuments.length,
        indexedDocuments: processedDocuments
      };
    }

    return {
      totalProcessed: documents.length,
      successfullyIndexed: 0,
      failed: documents.length,
      indexedDocuments: []
    };
  }

  async performSemanticSearch(query, options = {}) {
    console.log(`Performing semantic search for: "${query}"`);

    const {
      limit = 20,
      filters = {},
      includeScore = true,
      similarityThreshold = 0.7,
      searchType = 'semantic', // 'semantic' or 'hybrid' (hybrid adds the text search stage below)
      userContext = null
    } = options;

    try {
      // Generate query embedding for semantic search
      const queryEmbedding = await this.generateEmbedding(query);

      let pipeline = [];

      if (searchType === 'semantic' || searchType === 'hybrid') {
        // Vector similarity search stage
        pipeline.push({
          $vectorSearch: {
            index: 'content_vector_index',
            path: 'contentVector',
            queryVector: queryEmbedding,
            numCandidates: limit * 10, // Search more candidates for better results
            limit: limit * 2, // Get more results for reranking
            filter: this.buildFilterExpression(filters)
          }
        });

        // Add vector search score
        pipeline.push({
          $addFields: {
            vectorScore: { $meta: 'vectorSearchScore' },
            searchMethod: 'vector'
          }
        });
      }

      if (searchType === 'hybrid') {
        // Combine with text search for hybrid approach
        pipeline.push({
          $unionWith: {
            coll: 'documents',
            pipeline: [
              {
                $search: {
                  index: 'hybrid_search_index',
                  compound: {
                    should: [
                      {
                        text: {
                          query: query,
                          path: ['title', 'content'],
                          score: { boost: { value: 2.0 } }
                        }
                      },
                      {
                        autocomplete: {
                          query: query,
                          path: 'title',
                          score: { boost: { value: 1.5 } }
                        }
                      }
                    ],
                    filter: this.buildSearchFilterClauses(filters)
                  }
                }
              },
              {
                $addFields: {
                  textScore: { $meta: 'searchScore' },
                  searchMethod: 'text'
                }
              },
              { $limit: limit }
            ]
          }
        });
      }

      // Enhanced result processing and ranking
      pipeline.push({
        $addFields: {
          // Calculate comprehensive relevance score
          relevanceScore: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$searchMethod', 'vector'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$vectorScore', 0] },
                      { $add: [
                        { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.2] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.1] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.15] },
                        0.55 // Base score weight
                      ]}
                    ]
                  }
                },
                {
                  case: { $eq: ['$searchMethod', 'text'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$textScore', 0] },
                      0.8 // Weight text search lower than semantic
                    ]
                  }
                }
              ],
              default: 0
            }
          },

          // Extract relevant snippets
          contentSnippet: {
            $substrCP: [
              '$content', 
              0, 
              300
            ]
          },

          // Calculate query-document semantic similarity
          semanticRelevance: {
            $cond: {
              if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold] },
              then: 'high',
              else: {
                $cond: {
                  if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold * 0.8] },
                  then: 'medium',
                  else: 'low'
                }
              }
            }
          }
        }
      });

      // User personalization if context provided
      if (userContext) {
        pipeline.push({
          $addFields: {
            personalizedScore: {
              $multiply: [
                '$relevanceScore',
                {
                  $add: [
                    // Category preference boost
                    {
                      $cond: {
                        if: { $in: ['$category', userContext.preferredCategories || []] },
                        then: 0.2,
                        else: 0
                      }
                    },
                    // Author preference boost  
                    {
                      $cond: {
                        if: { $in: ['$author', userContext.followedAuthors || []] },
                        then: 0.15,
                        else: 0
                      }
                    },
                    // Language preference
                    {
                      $cond: {
                        if: { $eq: ['$language', userContext.preferredLanguage || 'en'] },
                        then: 0.1,
                        else: -0.05
                      }
                    },
                    1.0 // Base multiplier
                  ]
                }
              ]
            }
          }
        });
      }

      // Filter by similarity threshold and finalize results
      pipeline.push(
        {
          $match: {
            relevanceScore: { $gte: similarityThreshold * 0.5 }
          }
        },
        {
          $sort: {
            [userContext ? 'personalizedScore' : 'relevanceScore']: -1,
            publishedDate: -1
          }
        },
        {
          $limit: limit
        },
        {
          $project: {
            _id: 1,
            title: 1,
            contentSnippet: 1,
            category: 1,
            tags: 1,
            author: 1,
            publishedDate: 1,
            contentType: 1,
            language: 1,
            entities: 1,
            keyPhrases: 1,
            searchMetadata: 1,
            relevanceScore: includeScore ? 1 : 0,
            personalizedScore: (includeScore && userContext) ? 1 : 0,
            vectorScore: includeScore ? 1 : 0,
            textScore: includeScore ? 1 : 0,
            semanticRelevance: 1,
            searchMethod: 1
          }
        }
      );

      const searchStart = Date.now();
      const results = await this.collections.documents.aggregate(pipeline).toArray();
      const searchTime = Date.now() - searchStart;

      // Log search analytics
      await this.logSearchAnalytics({
        query: query,
        searchType: searchType,
        filters: filters,
        resultCount: results.length,
        searchTime: searchTime,
        userContext: userContext,
        timestamp: new Date()
      });

      console.log(`Semantic search completed in ${searchTime}ms, found ${results.length} results`);

      return {
        query: query,
        searchType: searchType,
        results: results,
        metadata: {
          totalResults: results.length,
          searchTime: searchTime,
          similarityThreshold: similarityThreshold,
          filtersApplied: Object.keys(filters).length > 0
        }
      };

    } catch (error) {
      console.error('Semantic search failed:', error);
      throw error;
    }
  }

  async findSimilarDocuments(documentId, options = {}) {
    console.log(`Finding documents similar to: ${documentId}`);

    const {
      limit = 10,
      similarityThreshold = 0.75,
      excludeCategories = [],
      includeScore = true
    } = options;

    // Get the source document and its vector
    const sourceDocument = await this.collections.documents.findOne(
      { _id: documentId },
      { projection: { contentVector: 1, title: 1, category: 1, tags: 1 } }
    );

    if (!sourceDocument || !sourceDocument.contentVector) {
      throw new Error('Source document not found or not vectorized');
    }

    // Find similar documents using vector search
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: sourceDocument.contentVector,
          numCandidates: limit * 20,
          limit: limit * 2,
          filter: {
            $and: [
              { _id: { $ne: documentId } }, // Exclude source document
              excludeCategories.length > 0 ?
                { category: { $nin: excludeCategories } } :
                {}
            ]
          }
        }
      },
      {
        $addFields: {
          similarityScore: { $meta: 'vectorSearchScore' },

          // Calculate additional similarity factors
          tagSimilarity: {
            $let: {
              vars: {
                commonTags: {
                  $size: {
                    $setIntersection: ['$tags', sourceDocument.tags || []]
                  }
                },
                totalTags: {
                  $add: [
                    { $size: { $ifNull: ['$tags', []] } },
                    { $size: { $ifNull: [sourceDocument.tags, []] } }
                  ]
                }
              },
              in: {
                $cond: {
                  if: { $gt: ['$$totalTags', 0] },
                  then: { $divide: ['$$commonTags', '$$totalTags'] },
                  else: 0
                }
              }
            }
          },

          categorySimilarity: {
            $cond: {
              if: { $eq: ['$category', sourceDocument.category] },
              then: 0.2,
              else: 0
            }
          }
        }
      },
      {
        $addFields: {
          combinedSimilarity: {
            $add: [
              { $multiply: ['$similarityScore', 0.7] },
              { $multiply: ['$tagSimilarity', 0.2] },
              '$categorySimilarity'
            ]
          }
        }
      },
      {
        $match: {
          combinedSimilarity: { $gte: similarityThreshold }
        }
      },
      {
        $sort: { combinedSimilarity: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          tags: 1,
          author: 1,
          publishedDate: 1,
          similarityScore: includeScore ? 1 : 0,
          combinedSimilarity: includeScore ? 1 : 0,
          searchMetadata: 1
        }
      }
    ];

    const similarDocuments = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      sourceDocumentId: documentId,
      sourceTitle: sourceDocument.title,
      similarDocuments: similarDocuments,
      metadata: {
        totalSimilar: similarDocuments.length,
        similarityThreshold: similarityThreshold,
        searchMethod: 'vector_similarity'
      }
    };
  }

  async generateRecommendations(userId, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    const {
      limit = 15,
      diversityFactor = 0.3,
      includeExplanations = true
    } = options;

    // Get user profile and interaction history
    const userProfile = await this.collections.userProfiles.findOne({ userId: userId });

    if (!userProfile) {
      console.log('User profile not found, using general recommendations');
      return this.generateGeneralRecommendations(limit);
    }

    // Build user preference vector from interaction history
    const userVector = await this.buildUserPreferenceVector(userProfile);

    if (!userVector) {
      return this.generateGeneralRecommendations(limit);
    }

    // Find documents matching user preferences
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: userVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            $and: [
              // Exclude already read documents
              { _id: { $nin: userProfile.readDocuments || [] } },

              // Include preferred categories
              userProfile.preferredCategories && userProfile.preferredCategories.length > 0 ?
                { category: { $in: userProfile.preferredCategories } } :
                {},

              // Fresh content preference
              {
                publishedDate: {
                  $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) // Last 90 days
                }
              }
            ]
          }
        }
      },
      {
        $addFields: {
          preferenceScore: { $meta: 'vectorSearchScore' },

          // Category affinity scoring
          categoryScore: {
            $switch: {
              branches: (userProfile.categoryAffinities || []).map(affinity => ({
                case: { $eq: ['$category', affinity.category] },
                then: affinity.score
              })),
              default: 0.5
            }
          },

          // Author following boost
          authorScore: {
            $cond: {
              if: { $in: ['$author', userProfile.followedAuthors || []] },
              then: 0.8,
              else: 0.4
            }
          },

          // Freshness scoring
          freshnessScore: {
            $divide: [
              { $subtract: [Date.now(), '$publishedDate'] },
              (30 * 24 * 60 * 60 * 1000) // 30 days in milliseconds
            ]
          }
        }
      },
      {
        $addFields: {
          recommendationScore: {
            $add: [
              { $multiply: ['$preferenceScore', 0.4] },
              { $multiply: ['$categoryScore', 0.25] },
              { $multiply: ['$authorScore', 0.2] },
              { $multiply: [{ $max: [0, { $subtract: [1, '$freshnessScore'] }] }, 0.15] }
            ]
          }
        }
      }
    ];

    // Apply diversity to avoid filter bubble
    if (diversityFactor > 0) {
      pipeline.push({
        $group: {
          _id: '$category',
          documents: {
            $push: {
              _id: '$_id',
              title: '$title',
              recommendationScore: '$recommendationScore',
              category: '$category',
              author: '$author',
              publishedDate: '$publishedDate',
              tags: '$tags'
            }
          },
          maxScore: { $max: '$recommendationScore' }
        }
      });

      pipeline.push({
        $sort: { maxScore: -1 }
      });

      // Select diverse recommendations
      pipeline.push({
        $project: {
          documents: {
            $slice: [
              { $sortArray: { input: '$documents', sortBy: { recommendationScore: -1 } } },
              Math.ceil(limit * diversityFactor)
            ]
          }
        }
      });

      pipeline.push({
        $unwind: '$documents'
      });

      pipeline.push({
        $replaceRoot: { newRoot: '$documents' }
      });
    }

    pipeline.push(
      {
        $sort: { recommendationScore: -1 }
      },
      {
        $limit: limit
      }
    );

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    // Generate explanations if requested
    if (includeExplanations) {
      for (const rec of recommendations) {
        rec.explanation = this.generateRecommendationExplanation(rec, userProfile);
      }
    }

    // Store recommendations for future analysis
    await this.collections.recommendations.insertOne({
      userId: userId,
      recommendations: recommendations.map(r => ({
        documentId: r._id,
        score: r.recommendationScore,
        explanation: r.explanation
      })),
      generatedAt: new Date(),
      algorithm: 'vector_preference_matching',
      diversityFactor: diversityFactor
    });

    return {
      userId: userId,
      recommendations: recommendations,
      metadata: {
        totalRecommendations: recommendations.length,
        algorithm: 'vector_preference_matching',
        diversityApplied: diversityFactor > 0,
        generatedAt: new Date()
      }
    };
  }

  // Helper methods for vector search operations

  async generateEmbedding(text) {
    // In production, this would call OpenAI API or other embedding service
    // For this example, we'll simulate embeddings
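    //
    // Hedged sketch using the OpenAI Node SDK (assumes an OPENAI_API_KEY env var
    // and that the 'openai' package is installed):
    //
    //   const OpenAI = require('openai');
    //   const openai = new OpenAI();
    //   const response = await openai.embeddings.create({
    //     model: this.embeddingModel, // 'text-embedding-ada-002'
    //     input: text
    //   });
    //   return response.data[0].embedding;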

    // Simulate API call delay
    await new Promise(resolve => setTimeout(resolve, 100));

    // Generate mock embedding vector (in production, use actual embedding API)
    const mockEmbedding = Array.from({ length: this.vectorConfig.dimensions }, () => 
      Math.random() * 2 - 1 // Values between -1 and 1
    );

    return mockEmbedding;
  }

  async extractEntities(text) {
    // Simulate entity extraction (in production, use NLP service)
    const entities = [];

    // Basic keyword extraction simulation
    const words = text.toLowerCase().split(/\W+/);
    const entityKeywords = ['mongodb', 'database', 'javascript', 'python', 'ai', 'machine learning'];

    entityKeywords.forEach(keyword => {
      if (words.includes(keyword) || words.includes(keyword.replace(' ', ''))) {
        entities.push({
          text: keyword,
          type: 'technology',
          confidence: 0.8
        });
      }
    });

    return entities;
  }

  async extractKeyPhrases(text) {
    // Simulate key phrase extraction
    const sentences = text.split(/[.!?]+/);
    const keyPhrases = [];

    sentences.forEach(sentence => {
      const words = sentence.trim().split(/\s+/);
      if (words.length >= 3 && words.length <= 8) {
        keyPhrases.push({
          phrase: sentence.trim(),
          relevance: Math.random()
        });
      }
    });

    return keyPhrases.sort((a, b) => b.relevance - a.relevance).slice(0, 10);
  }

  analyzeContentCharacteristics(content) {
    const wordCount = content.split(/\s+/).length;
    const sentenceCount = content.split(/[.!?]+/).length;
    const avgWordsPerSentence = wordCount / sentenceCount;

    return {
      complexity: avgWordsPerSentence > 20 ? 'high' : avgWordsPerSentence > 15 ? 'medium' : 'low',
      topics: ['general'], // Would use topic modeling in production
      sentiment: Math.random() * 2 - 1, // -1 to 1 scale
      readabilityScore: Math.max(0, Math.min(100, 100 - (avgWordsPerSentence * 2)))
    };
  }

  calculateFreshnessScore(publishedDate) {
    const ageInDays = (Date.now() - publishedDate.getTime()) / (24 * 60 * 60 * 1000);
    return Math.max(0, Math.min(1, 1 - (ageInDays / 365))); // Decay over 1 year
  }

  generateSummary(content) {
    // Simple summary generation (first 200 characters)
    return content.length > 200 ? content.substring(0, 197) + '...' : content;
  }

  buildFilterExpression(filters) {
    const filterExpression = { $and: [] };

    if (filters.category) {
      filterExpression.$and.push({ category: { $eq: filters.category } });
    }

    if (filters.author) {
      filterExpression.$and.push({ author: { $eq: filters.author } });
    }

    if (filters.tags && filters.tags.length > 0) {
      filterExpression.$and.push({ tags: { $in: filters.tags } });
    }

    if (filters.dateRange) {
      filterExpression.$and.push({ 
        publishedDate: {
          $gte: new Date(filters.dateRange.start),
          $lte: new Date(filters.dateRange.end)
        }
      });
    }

    return filterExpression.$and.length > 0 ? filterExpression : {};
  }

  buildSearchFilterClauses(filters) {
    const clauses = [];

    if (filters.category) {
      clauses.push({ equals: { path: 'category', value: filters.category } });
    }

    if (filters.tags && filters.tags.length > 0) {
      clauses.push({ in: { path: 'tags', value: filters.tags } });
    }

    return clauses;
  }

  async logSearchAnalytics(analyticsData) {
    try {
      await this.collections.searchAnalytics.insertOne({
        ...analyticsData,
        sessionId: analyticsData.userContext?.sessionId,
        userId: analyticsData.userContext?.userId
      });
    } catch (error) {
      console.warn('Failed to log search analytics:', error.message);
    }
  }

  async buildUserPreferenceVector(userProfile) {
    if (!userProfile.interactionHistory || userProfile.interactionHistory.length === 0) {
      return null;
    }

    // Get vectors for user's previously interacted documents
    const interactedDocuments = await this.collections.documents.find(
      { 
        _id: { $in: userProfile.interactionHistory.slice(-20).map(h => h.documentId) } 
      },
      { projection: { contentVector: 1 } }
    ).toArray();

    if (interactedDocuments.length === 0) {
      return null;
    }

    // Calculate weighted average vector based on interaction types
    const weightedVectors = interactedDocuments.map((doc, index) => {
      const interaction = userProfile.interactionHistory.find(h => 
        h.documentId.toString() === doc._id.toString()
      );

      const weight = this.getInteractionWeight(interaction?.type);
      return doc.contentVector.map(val => val * weight);
    });

    // Average the vectors
    const dimensions = weightedVectors[0].length;
    const avgVector = Array(dimensions).fill(0);

    weightedVectors.forEach(vector => {
      vector.forEach((val, i) => {
        avgVector[i] += val;
      });
    });

    return avgVector.map(val => val / weightedVectors.length);
  }

  getInteractionWeight(interactionType) {
    const weights = {
      'view': 0.1,
      'like': 0.3,
      'share': 0.5,
      'bookmark': 0.7,
      'comment': 0.8
    };
    return weights[interactionType] || 0.1;
  }

  generateRecommendationExplanation(recommendation, userProfile) {
    const explanations = [];

    if (userProfile.preferredCategories && userProfile.preferredCategories.includes(recommendation.category)) {
      explanations.push(`Matches your interest in ${recommendation.category}`);
    }

    if (userProfile.followedAuthors && userProfile.followedAuthors.includes(recommendation.author)) {
      explanations.push(`By ${recommendation.author}, an author you follow`);
    }

    if (recommendation.tags) {
      const matchingTags = recommendation.tags.filter(tag => 
        userProfile.interests && userProfile.interests.includes(tag)
      );
      if (matchingTags.length > 0) {
        explanations.push(`Related to ${matchingTags.slice(0, 2).join(' and ')}`);
      }
    }

    if (explanations.length === 0) {
      explanations.push('Similar to content you\'ve previously engaged with');
    }

    return explanations.join('; ');
  }

  async generateGeneralRecommendations(limit) {
    // Fallback recommendations based on popularity and quality
    const pipeline = [
      {
        $addFields: {
          popularityScore: {
            $add: [
              { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.4] },
              { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.3] },
              { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.3] }
            ]
          }
        }
      },
      {
        $sort: { popularityScore: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          author: 1,
          publishedDate: 1,
          popularityScore: 1
        }
      }
    ];

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      recommendations: recommendations,
      metadata: {
        algorithm: 'popularity_based',
        totalRecommendations: recommendations.length
      }
    };
  }
}

// Benefits of MongoDB Atlas Vector Search:
// - Native vector database capabilities within MongoDB Atlas infrastructure
// - Seamless integration with existing MongoDB documents and operations  
// - Support for multiple vector similarity algorithms (cosine, euclidean, dot product)
// - Hybrid search combining vector similarity with traditional text search
// - Scalable vector indexing with automatic optimization and maintenance
// - Built-in filtering capabilities for combining semantic search with metadata filters
// - Real-time vector search with sub-second response times at scale
// - Integration with popular embedding models (OpenAI, Cohere, Hugging Face)
// - Support for multiple vector dimensions and embedding types
// - Advanced ranking and personalization capabilities for AI-powered applications

module.exports = {
  VectorSearchManager
};
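
A minimal usage sketch for the VectorSearchManager class above (a hedged example: it assumes MONGODB_URI points at an Atlas cluster, that the vector search indexes already exist, and that the class is exported from a local module; the module path and sample documents are illustrative):

// Example usage of VectorSearchManager (hypothetical driver script)
const { MongoClient } = require('mongodb');
const { VectorSearchManager } = require('./vector-search-manager');

async function main() {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();

  const searchManager = new VectorSearchManager(client.db('intelligent_search_platform'));

  // Ingest documents; embeddings are generated during ingestion
  await searchManager.ingestDocumentsWithVectorization([
    { title: 'Intro to Vector Search', content: 'Vector search retrieves semantically similar documents...', category: 'AI', tags: ['search'] },
    { title: 'Scaling MongoDB', content: 'Sharding distributes data across nodes...', category: 'Technology', tags: ['operations'] }
  ]);

  // Run a semantic query with a metadata filter and lightweight personalization context
  const searchResults = await searchManager.performSemanticSearch('semantic similarity for documents', {
    limit: 5,
    filters: { category: 'AI' },
    searchType: 'semantic',
    userContext: { preferredCategories: ['AI'], preferredLanguage: 'en' }
  });

  console.log(searchResults.metadata);
  console.log(searchResults.results.map(r => r.title));

  await client.close();
}

main().catch(console.error);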

Understanding MongoDB Vector Search Architecture

Advanced Vector Search Patterns and Optimization

Implement sophisticated vector search optimization techniques for production applications:

// Advanced vector search optimization and performance tuning
class VectorSearchOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexStrategies = {
      exactSearch: { type: 'exactSearch', precision: 1.0, speed: 'slow' },
      approximateSearch: { type: 'approximateSearch', precision: 0.95, speed: 'fast' },
      hierarchicalSearch: { type: 'hierarchicalSearch', precision: 0.98, speed: 'medium' }
    };
  }

  async optimizeVectorIndexConfiguration(collectionName, vectorField, options = {}) {
    console.log(`Optimizing vector index configuration for ${collectionName}.${vectorField}`);

    const {
      dimensions = 1536,
      similarityMetric = 'cosine',
      numCandidates = 1000,
      performanceTarget = 'balanced' // 'speed', 'accuracy', 'balanced'
    } = options;

    // Analyze existing data distribution
    const dataAnalysis = await this.analyzeVectorDataDistribution(collectionName, vectorField);

    // Determine optimal index configuration
    const indexConfig = this.calculateOptimalIndexConfig(
      dataAnalysis, 
      performanceTarget, 
      dimensions
    );

    // Create optimized vector search index configuration
    const optimizedIndex = {
      name: `optimized_${vectorField}_index`,
      definition: {
        fields: [
          {
            type: 'vector',
            path: vectorField,
            numDimensions: dimensions,
            similarity: similarityMetric
          },
          // Add filter fields based on common query patterns
          ...this.generateFilterFieldsFromAnalysis(dataAnalysis)
        ]
      },
      configuration: {
        // Advanced tuning parameters
        numCandidates: this.calculateOptimalCandidates(dataAnalysis.documentCount),
        ef: indexConfig.ef, // Search accuracy parameter
        efConstruction: indexConfig.efConstruction, // Build-time parameter
        maxConnections: indexConfig.maxConnections, // Graph connectivity

        // Performance optimizations
        vectorCompression: indexConfig.compressionEnabled,
        quantization: indexConfig.quantizationLevel,
        cachingStrategy: indexConfig.cachingStrategy
      }
    };

    console.log('Optimized vector index configuration:', optimizedIndex);

    return optimizedIndex;
  }

  async performVectorSearchBenchmark(collectionName, testQueries, indexConfigurations) {
    console.log(`Benchmarking vector search performance with ${testQueries.length} test queries`);

    const benchmarkResults = [];
    const benchmarkStart = Date.now();

    for (const config of indexConfigurations) {
      console.log(`Testing configuration: ${config.name}`);

      const configResults = {
        configurationName: config.name,
        queryResults: [],
        performanceMetrics: {
          avgLatency: 0,
          p95Latency: 0,
          p99Latency: 0,
          throughput: 0,
          accuracy: 0
        }
      };

      const latencies = [];
      const accuracyScores = [];

      const startTime = Date.now();

      for (let i = 0; i < testQueries.length; i++) {
        const query = testQueries[i];

        const queryStart = Date.now();

        try {
          const results = await this.db.collection(collectionName).aggregate([
            {
              $vectorSearch: {
                index: config.indexName,
                path: config.vectorField,
                queryVector: query.vector,
                numCandidates: config.numCandidates || 100,
                limit: query.limit || 10
              }
            },
            {
              $addFields: {
                score: { $meta: 'vectorSearchScore' }
              }
            }
          ]).toArray();

          const queryLatency = Date.now() - queryStart;
          latencies.push(queryLatency);

          // Calculate accuracy if ground truth available
          if (query.expectedResults) {
            const accuracy = this.calculateSearchAccuracy(results, query.expectedResults);
            accuracyScores.push(accuracy);
          }

          configResults.queryResults.push({
            queryIndex: i,
            resultCount: results.length,
            latency: queryLatency,
            topScore: results[0]?.score || 0
          });

        } catch (error) {
          console.error(`Query ${i} failed:`, error.message);
          configResults.queryResults.push({
            queryIndex: i,
            error: error.message,
            latency: null
          });
        }
      }

      const totalTime = Date.now() - startTime;

      // Calculate performance metrics
      const validLatencies = latencies.filter(l => l !== null);
      if (validLatencies.length > 0) {
        configResults.performanceMetrics.avgLatency = 
          validLatencies.reduce((sum, l) => sum + l, 0) / validLatencies.length;

        const sortedLatencies = validLatencies.sort((a, b) => a - b);
        configResults.performanceMetrics.p95Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.95)];
        configResults.performanceMetrics.p99Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.99)];

        configResults.performanceMetrics.throughput = 
          (validLatencies.length / totalTime) * 1000; // queries per second
      }

      if (accuracyScores.length > 0) {
        configResults.performanceMetrics.accuracy = 
          accuracyScores.reduce((sum, a) => sum + a, 0) / accuracyScores.length;
      }

      benchmarkResults.push(configResults);
    }

    // Analyze and rank configurations
    const rankedConfigurations = this.rankConfigurationsByPerformance(benchmarkResults);

    return {
      benchmarkResults: benchmarkResults,
      recommendations: rankedConfigurations,
      testMetadata: {
        totalQueries: testQueries.length,
        configurationsTested: indexConfigurations.length,
        benchmarkDuration: Date.now() - benchmarkStart
      }
    };
  }

  async implementAdvancedVectorSearchPatterns(collectionName, searchPattern, options = {}) {
    console.log(`Implementing advanced vector search pattern: ${searchPattern}`);

    const patterns = {
      multiModalSearch: () => this.implementMultiModalSearch(collectionName, options),
      hierarchicalSearch: () => this.implementHierarchicalSearch(collectionName, options),
      temporalVectorSearch: () => this.implementTemporalVectorSearch(collectionName, options),
      facetedVectorSearch: () => this.implementFacetedVectorSearch(collectionName, options),
      clusterBasedSearch: () => this.implementClusterBasedSearch(collectionName, options)
    };

    if (!patterns[searchPattern]) {
      throw new Error(`Unknown search pattern: ${searchPattern}`);
    }

    return await patterns[searchPattern]();
  }

  async implementMultiModalSearch(collectionName, options) {
    // Multi-modal search combining text, image, and other vector embeddings
    const {
      textVector,
      imageVector,
      audioVector,
      weights = { text: 0.5, image: 0.3, audio: 0.2 },
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);

    // Combine multiple vector searches
    const pipeline = [
      {
        $vectorSearch: {
          index: 'multi_modal_index',
          path: 'textVector',
          queryVector: textVector,
          numCandidates: limit * 5,
          limit: limit * 2
        }
      },
      {
        $addFields: {
          textScore: { $meta: 'vectorSearchScore' }
        }
      }
    ];

    if (imageVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'image_vector_index',
                path: 'imageVector',
                queryVector: imageVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                imageScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    if (audioVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'audio_vector_index', 
                path: 'audioVector',
                queryVector: audioVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                audioScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    // Combine scores from different modalities
    pipeline.push({
      $group: {
        _id: '$_id',
        doc: { $first: '$$ROOT' },
        textScore: { $max: { $ifNull: ['$textScore', 0] } },
        imageScore: { $max: { $ifNull: ['$imageScore', 0] } },
        audioScore: { $max: { $ifNull: ['$audioScore', 0] } }
      }
    });

    pipeline.push({
      $addFields: {
        combinedScore: {
          $add: [
            { $multiply: ['$textScore', weights.text] },
            { $multiply: ['$imageScore', weights.image] },
            { $multiply: ['$audioScore', weights.audio] }
          ]
        }
      }
    });

    pipeline.push({
      $sort: { combinedScore: -1 }
    });

    pipeline.push({
      $limit: limit
    });

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'multi_modal',
      results: results,
      weights: weights,
      metadata: {
        modalities: Object.keys(weights).filter(k => options[k + 'Vector']),
        totalResults: results.length
      }
    };
  }

  async implementTemporalVectorSearch(collectionName, options) {
    // Time-aware vector search with temporal relevance
    const {
      queryVector,
      timeWindow = { days: 30 },
      temporalWeight = 0.3,
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);
    const cutoffDate = new Date(Date.now() - timeWindow.days * 24 * 60 * 60 * 1000);

    const pipeline = [
      {
        $vectorSearch: {
          index: 'temporal_vector_index',
          path: 'contentVector',
          queryVector: queryVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            publishedDate: { $gte: cutoffDate }
          }
        }
      },
      {
        $addFields: {
          vectorScore: { $meta: 'vectorSearchScore' },

          // Calculate temporal relevance
          temporalScore: {
            $divide: [
              { $subtract: ['$publishedDate', cutoffDate] },
              { $subtract: [new Date(), cutoffDate] }
            ]
          }
        }
      },
      {
        $addFields: {
          combinedScore: {
            $add: [
              { $multiply: ['$vectorScore', 1 - temporalWeight] },
              { $multiply: ['$temporalScore', temporalWeight] }
            ]
          }
        }
      },
      {
        $sort: { combinedScore: -1 }
      },
      {
        $limit: limit
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'temporal_vector',
      results: results,
      temporalWindow: timeWindow,
      temporalWeight: temporalWeight
    };
  }

  // Helper methods for vector search optimization

  async analyzeVectorDataDistribution(collectionName, vectorField) {
    const collection = this.db.collection(collectionName);

    // Sample documents to analyze distribution
    const sampleSize = 1000;
    const pipeline = [
      { $sample: { size: sampleSize } },
      {
        $project: {
          vectorLength: { $size: `$${vectorField}` },
          vectorMagnitude: {
            $sqrt: {
              $reduce: {
                input: `$${vectorField}`,
                initialValue: 0,
                in: { $add: ['$$value', { $multiply: ['$$this', '$$this'] }] }
              }
            }
          }
        }
      }
    ];

    const samples = await collection.aggregate(pipeline).toArray();

    const totalDocs = await collection.countDocuments();
    const avgMagnitude = samples.reduce((sum, doc) => sum + doc.vectorMagnitude, 0) / samples.length;

    return {
      documentCount: totalDocs,
      sampleSize: samples.length,
      avgVectorMagnitude: avgMagnitude,
      vectorDimensions: samples[0]?.vectorLength || 0,
      magnitudeDistribution: this.calculateDistributionStats(
        samples.map(s => s.vectorMagnitude)
      )
    };
  }

  calculateOptimalIndexConfig(dataAnalysis, performanceTarget, dimensions) {
    const baseConfig = {
      ef: 200,
      efConstruction: 400,
      maxConnections: 32,
      compressionEnabled: false,
      quantizationLevel: 'none',
      cachingStrategy: 'adaptive'
    };

    // Adjust based on data characteristics and performance target
    if (dataAnalysis.documentCount > 1000000) {
      baseConfig.compressionEnabled = true;
      baseConfig.quantizationLevel = 'int8';
    }

    switch (performanceTarget) {
      case 'speed':
        baseConfig.ef = 100;
        baseConfig.efConstruction = 200;
        baseConfig.quantizationLevel = 'int8';
        break;
      case 'accuracy':
        baseConfig.ef = 400;
        baseConfig.efConstruction = 800;
        baseConfig.maxConnections = 64;
        break;
      case 'balanced':
      default:
        // Use base configuration
        break;
    }

    return baseConfig;
  }

  generateFilterFieldsFromAnalysis(dataAnalysis) {
    // Generate common filter fields based on data analysis
    return [
      { type: 'filter', path: 'category' },
      { type: 'filter', path: 'publishedDate' },
      { type: 'filter', path: 'tags' }
    ];
  }

  calculateOptimalCandidates(documentCount) {
    // Calculate optimal numCandidates based on collection size
    if (documentCount < 10000) return Math.min(documentCount, 100);
    if (documentCount < 100000) return 200;
    if (documentCount < 1000000) return 500;
    return 1000;
  }

  calculateSearchAccuracy(results, expectedResults) {
    // Calculate precision@k accuracy metric
    const actualIds = new Set(results.map(r => r._id.toString()));
    const expectedIds = new Set(expectedResults.map(r => r._id.toString()));

    let matches = 0;
    for (const id of actualIds) {
      if (expectedIds.has(id)) matches++;
    }

    return matches / Math.min(results.length, expectedResults.length);
  }

  rankConfigurationsByPerformance(benchmarkResults) {
    // Rank configurations based on composite performance score
    return benchmarkResults
      .map(result => ({
        ...result,
        compositeScore: this.calculateCompositeScore(result.performanceMetrics)
      }))
      .sort((a, b) => b.compositeScore - a.compositeScore)
      .map((result, index) => ({
        rank: index + 1,
        configurationName: result.configurationName,
        compositeScore: result.compositeScore,
        metrics: result.performanceMetrics,
        recommendation: this.generateConfigurationRecommendation(result)
      }));
  }

  calculateCompositeScore(metrics) {
    // Weighted composite score combining latency, throughput, and accuracy
    const latencyScore = Number.isFinite(metrics.avgLatency)
      ? Math.max(0, 1 - (metrics.avgLatency / 1000))
      : 0;
    const throughputScore = Number.isFinite(metrics.throughput)
      ? Math.min(1, metrics.throughput / 100)
      : 0;
    const accuracyScore = Number.isFinite(metrics.accuracy) ? metrics.accuracy : 0.8;

    return (latencyScore * 0.4 + throughputScore * 0.3 + accuracyScore * 0.3);
  }

  generateConfigurationRecommendation(result) {
    const metrics = result.performanceMetrics;
    const recommendations = [];

    if (metrics.avgLatency > 500) {
      recommendations.push('Consider reducing numCandidates or enabling quantization for better latency');
    }

    if (metrics.accuracy < 0.8) {
      recommendations.push('Increase ef parameter or numCandidates to improve search accuracy');
    }

    if (metrics.throughput < 10) {
      recommendations.push('Optimize index configuration or consider horizontal scaling');
    }

    return recommendations.length > 0 ? recommendations : ['Configuration performs within acceptable parameters'];
  }

  calculateDistributionStats(values) {
    // Summary statistics (mean, median, min, max, population stddev) for a numeric sample
    if (!values || values.length === 0) return null;

    const sorted = values.slice().sort((a, b) => a - b);
    const mean = values.reduce((sum, val) => sum + val, 0) / values.length;

    return {
      mean: mean,
      median: sorted[Math.floor(sorted.length / 2)],
      min: sorted[0],
      max: sorted[sorted.length - 1],
      stddev: Math.sqrt(values.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / values.length)
    };
  }
}

SQL-Style Vector Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB vector search operations:

-- QueryLeaf vector search operations with SQL-familiar syntax

-- Create vector search index with SQL DDL
CREATE VECTOR INDEX content_embeddings_idx ON documents (
  content_vector VECTOR(1536) USING cosine_similarity
  WITH (
    num_candidates = 1000,
    index_type = 'hnsw',
    ef_construction = 400,
    max_connections = 32
  )
) 
INCLUDE (category, tags, published_date, author) AS filters;

-- Advanced semantic search with SQL-style vector operations
WITH semantic_query AS (
  -- Generate query embedding (integrated with embedding services)
  SELECT embed_text('machine learning algorithms for natural language processing') as query_vector
),

vector_search_results AS (
  SELECT 
    d.document_id,
    d.title,
    d.content,
    d.category,
    d.tags,
    d.author,
    d.published_date,

    -- Vector similarity search with cosine similarity
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') as similarity_score,

    -- Vector distance calculations
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'euclidean') as euclidean_distance,
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'manhattan') as manhattan_distance,

    -- Vector magnitude and normalization
    VECTOR_MAGNITUDE(d.content_vector) as vector_magnitude,
    VECTOR_NORMALIZE(d.content_vector) as normalized_vector

  FROM documents d
  CROSS JOIN semantic_query sq
  WHERE 
    -- Vector similarity threshold filtering
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') > 0.75

    -- Traditional filters combined with vector search
    AND d.category IN ('AI', 'Technology', 'Data Science')
    AND d.published_date >= CURRENT_DATE - INTERVAL '1 year'

    -- Vector search with K-nearest neighbors
    AND d.document_id IN (
      SELECT document_id 
      FROM VECTOR_KNN_SEARCH(
        table_name => 'documents',
        vector_column => 'content_vector', 
        query_vector => sq.query_vector,
        k => 50,
        distance_function => 'cosine'
      )
    )
),

enhanced_results AS (
  SELECT 
    vsr.*,

    -- Advanced similarity calculations
    VECTOR_DOT_PRODUCT(vsr.normalized_vector, sq.query_vector) as dot_product_similarity,

    -- Multi-vector comparison for hybrid matching
    GREATEST(
      VECTOR_SIMILARITY(d.title_vector, sq.query_vector, 'cosine'),
      vsr.similarity_score * 0.8
    ) as hybrid_similarity_score,

    -- Vector clustering and topic modeling (content_vector comes from the joined documents table)
    VECTOR_CLUSTER_ID(d.content_vector, 'kmeans', 10) as topic_cluster,
    VECTOR_TOPIC_PROBABILITY(d.content_vector, ARRAY['AI', 'ML', 'NLP', 'Data Science']) as topic_probabilities,

    -- Temporal vector decay for freshness (date subtraction yields the age in days)
    vsr.similarity_score * EXP(-0.1 * (CURRENT_DATE - vsr.published_date)) as time_decayed_similarity,

    -- Content quality boosting based on vector characteristics
    vsr.similarity_score * (1 + LOG(GREATEST(1, ARRAY_LENGTH(vsr.tags, 1)) / 10.0)) as quality_boosted_similarity,

    -- Personalization using user preference vectors
    COALESCE(
      VECTOR_SIMILARITY(d.content_vector, user_preference_vector('user_123'), 'cosine') * 0.3,
      0
    ) as personalization_boost

  FROM vector_search_results vsr
  CROSS JOIN semantic_query sq
  LEFT JOIN documents d ON vsr.document_id = d.document_id
  WHERE vsr.similarity_score > 0.70
),

final_ranked_results AS (
  SELECT 
    document_id,
    title,
    SUBSTRING(content, 1, 300) || '...' as content_preview,
    category,
    tags,
    author,
    published_date,

    -- Comprehensive relevance scoring
    ROUND((
      hybrid_similarity_score * 0.4 +
      time_decayed_similarity * 0.25 +
      quality_boosted_similarity * 0.2 +
      personalization_boost * 0.15
    )::numeric, 4) as final_relevance_score,

    -- Individual score components for analysis
    ROUND(similarity_score::numeric, 4) as base_similarity,
    ROUND(hybrid_similarity_score::numeric, 4) as hybrid_score,
    ROUND(time_decayed_similarity::numeric, 4) as freshness_score,
    ROUND(personalization_boost::numeric, 4) as personal_score,

    -- Vector metadata
    topic_cluster,
    topic_probabilities,
    vector_magnitude,

    -- Search result ranking (the alias can't be referenced inside the window, so the expression is repeated)
    ROW_NUMBER() OVER (
      ORDER BY (hybrid_similarity_score * 0.4 + time_decayed_similarity * 0.25 +
                quality_boosted_similarity * 0.2 + personalization_boost * 0.15) DESC
    ) as search_rank,
    COUNT(*) OVER () as total_results

  FROM enhanced_results
  WHERE (
    hybrid_similarity_score * 0.4 +
    time_decayed_similarity * 0.25 +
    quality_boosted_similarity * 0.2 +
    personalization_boost * 0.15
  ) > 0.6
)

SELECT 
  search_rank,
  document_id,
  title,
  content_preview,
  category,
  STRING_AGG(DISTINCT tag, ', ' ORDER BY tag) as tags_summary,
  author,
  published_date,
  final_relevance_score,

  -- Explanation of ranking factors
  JSON_BUILD_OBJECT(
    'base_similarity', base_similarity,
    'hybrid_boost', hybrid_score - base_similarity,
    'freshness_impact', freshness_score - base_similarity,
    'personalization_impact', personal_score,
    'topic_cluster', topic_cluster,
    'primary_topics', (
      SELECT ARRAY_AGG(topic ORDER BY probability DESC)
      FROM UNNEST(topic_probabilities) WITH ORDINALITY AS t(probability, topic)
      WHERE probability > 0.1
      LIMIT 3
    )
  ) as ranking_explanation

FROM final_ranked_results
CROSS JOIN UNNEST(tags) as tag
GROUP BY search_rank, document_id, title, content_preview, category, author, 
         published_date, final_relevance_score, base_similarity, hybrid_score, 
         freshness_score, personal_score, topic_cluster, topic_probabilities
ORDER BY final_relevance_score DESC
LIMIT 20;

-- Advanced vector aggregation and analytics
WITH vector_analysis AS (
  SELECT 
    category,
    author,
    DATE_TRUNC('month', published_date) as month_bucket,

    -- Vector aggregation functions
    VECTOR_AVG(content_vector) as category_centroid_vector,
    VECTOR_STDDEV(content_vector) as vector_spread,

    -- Vector clustering within groups
    VECTOR_KMEANS_CENTROIDS(content_vector, 5) as sub_clusters,

    -- Similarity analysis within categories
    AVG(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as avg_internal_similarity,
    MIN(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as min_internal_similarity,
    MAX(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as max_internal_similarity,

    -- Document count and metadata
    COUNT(*) as document_count,
    AVG(ARRAY_LENGTH(tags, 1)) as avg_tags_per_doc,
    AVG(LENGTH(content)) as avg_content_length,

    -- Vector quality metrics
    AVG(VECTOR_MAGNITUDE(content_vector)) as avg_vector_magnitude,
    STDDEV(VECTOR_MAGNITUDE(content_vector)) as vector_magnitude_stddev

  FROM documents
  WHERE published_date >= CURRENT_DATE - INTERVAL '2 years'
    AND content_vector IS NOT NULL
  GROUP BY category, author, DATE_TRUNC('month', published_date)
),

cross_category_analysis AS (
  SELECT 
    va1.category as category_a,
    va2.category as category_b,

    -- Cross-category vector similarity
    VECTOR_SIMILARITY(va1.category_centroid_vector, va2.category_centroid_vector, 'cosine') as category_similarity,

    -- Content overlap analysis
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'tags') as tag_overlap,
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'authors') as author_overlap,

    -- Temporal correlation
    CORRELATION(va1.document_count, va2.document_count) OVER (
      PARTITION BY va1.category, va2.category 
      ORDER BY va1.month_bucket
    ) as temporal_correlation

  FROM vector_analysis va1
  CROSS JOIN vector_analysis va2
  WHERE va1.category != va2.category
    AND va1.month_bucket = va2.month_bucket
    AND va1.document_count >= 5
    AND va2.document_count >= 5
),

semantic_recommendations AS (
  SELECT 
    category_a AS category,

    -- Find most similar categories for recommendation
    ARRAY_AGG(
      category_b ORDER BY category_similarity DESC
    ) FILTER (WHERE category_similarity > 0.7) as similar_categories,

    -- Trending analysis
    CASE 
      WHEN temporal_correlation > 0.8 THEN 'strongly_correlated'
      WHEN temporal_correlation > 0.5 THEN 'moderately_correlated' 
      WHEN temporal_correlation < -0.5 THEN 'inversely_correlated'
      ELSE 'independent'
    END as trend_relationship,

    -- Content strategy recommendations
    CASE
      WHEN AVG(category_similarity) > 0.8 THEN 'High content overlap - consider specialization'
      WHEN AVG(category_similarity) < 0.3 THEN 'Low overlap - good content differentiation'
      ELSE 'Moderate overlap - balanced content strategy'
    END as content_strategy_recommendation

  FROM cross_category_analysis
  GROUP BY category_a, temporal_correlation
)

SELECT 
  va.category,
  va.document_count,
  ROUND(va.avg_internal_similarity::numeric, 3) as content_consistency_score,
  ROUND(va.avg_vector_magnitude::numeric, 3) as content_richness_score,

  -- Vector-based content insights
  CASE 
    WHEN va.avg_internal_similarity > 0.8 THEN 'Highly consistent content'
    WHEN va.avg_internal_similarity > 0.6 THEN 'Moderately consistent content'
    ELSE 'Diverse content range'
  END as content_consistency_assessment,

  -- Similar categories for cross-promotion
  sr.similar_categories,
  sr.trend_relationship,
  sr.content_strategy_recommendation,

  -- Growth and engagement potential
  CASE
    WHEN va.document_count > LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket
    ) THEN 'Growing'
    WHEN va.document_count < LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket  
    ) THEN 'Declining'
    ELSE 'Stable'
  END as content_trend,

  -- Vector search optimization recommendations
  CASE
    WHEN va.vector_magnitude_stddev > 0.5 THEN 'Consider vector normalization for consistent search performance'
    WHEN va.avg_vector_magnitude < 0.1 THEN 'Low vector magnitudes may indicate embedding quality issues'
    ELSE 'Vector embeddings appear well-distributed'
  END as search_optimization_advice

FROM vector_analysis va
LEFT JOIN semantic_recommendations sr ON va.category = sr.category
WHERE va.month_bucket >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months')
ORDER BY va.document_count DESC, va.avg_internal_similarity DESC;

-- Real-time vector search performance monitoring
WITH search_performance_metrics AS (
  SELECT 
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,
    search_type,

    -- Query performance metrics
    COUNT(*) as total_searches,
    AVG(response_time_ms) as avg_response_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95_response_time,
    MAX(response_time_ms) as max_response_time,

    -- Result quality metrics
    AVG(result_count) as avg_results_returned,
    AVG(CASE WHEN result_count > 0 THEN top_similarity_score ELSE NULL END) as avg_top_similarity,
    AVG(user_satisfaction_score) as avg_user_satisfaction,

    -- Vector search specific metrics
    AVG(vector_candidates_examined) as avg_candidates_examined,
    AVG(vector_index_hit_ratio) as avg_index_hit_ratio,
    COUNT(*) FILTER (WHERE similarity_threshold_met = true) as threshold_met_count,

    -- Error and timeout analysis
    COUNT(*) FILTER (WHERE search_timeout = true) as timeout_count,
    COUNT(*) FILTER (WHERE search_error IS NOT NULL) as error_count,
    STRING_AGG(DISTINCT search_error, '; ') as error_types

  FROM vector_search_log
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),

performance_alerts AS (
  SELECT 
    hour_bucket,
    search_type,
    total_searches,
    avg_response_time,
    p95_response_time,
    avg_user_satisfaction,

    -- Performance alerting logic
    CASE 
      WHEN avg_response_time > 1000 THEN 'CRITICAL - High average latency'
      WHEN p95_response_time > 2000 THEN 'WARNING - High P95 latency'
      WHEN avg_user_satisfaction < 0.7 THEN 'WARNING - Low user satisfaction'
      WHEN timeout_count > total_searches * 0.05 THEN 'WARNING - High timeout rate'
      ELSE 'NORMAL'
    END as performance_status,

    -- Optimization recommendations
    CASE
      WHEN avg_candidates_examined > 10000 THEN 'Consider reducing numCandidates for better performance'
      WHEN avg_index_hit_ratio < 0.8 THEN 'Index may need rebuilding - low hit ratio detected'
      WHEN error_count > 0 THEN 'Investigate errors: ' || error_types
      ELSE 'Performance within normal parameters'
    END as optimization_recommendation,

    -- Trending analysis
    avg_response_time - LAG(avg_response_time) OVER (
      PARTITION BY search_type 
      ORDER BY hour_bucket
    ) as latency_trend,

    total_searches - LAG(total_searches) OVER (
      PARTITION BY search_type
      ORDER BY hour_bucket  
    ) as volume_trend

  FROM search_performance_metrics
)

SELECT 
  hour_bucket,
  search_type,
  total_searches,
  ROUND(avg_response_time::numeric, 1) as avg_latency_ms,
  ROUND(p95_response_time::numeric, 1) as p95_latency_ms,
  ROUND(avg_user_satisfaction::numeric, 2) as satisfaction_score,
  performance_status,
  optimization_recommendation,

  -- Trend indicators
  CASE 
    WHEN latency_trend > 200 THEN 'DEGRADING'
    WHEN latency_trend < -200 THEN 'IMPROVING' 
    ELSE 'STABLE'
  END as latency_trend_status,

  CASE
    WHEN volume_trend > total_searches * 0.2 THEN 'HIGH_GROWTH'
    WHEN volume_trend > total_searches * 0.1 THEN 'GROWING'
    WHEN volume_trend < -total_searches * 0.1 THEN 'DECLINING'
    ELSE 'STABLE'
  END as volume_trend_status

FROM performance_alerts
WHERE performance_status != 'NORMAL' OR hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
ORDER BY hour_bucket DESC, total_searches DESC;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. SQL-familiar vector operations with VECTOR_SIMILARITY, VECTOR_DISTANCE functions
-- 2. Advanced K-nearest neighbors search with customizable distance functions
-- 3. Hybrid search combining vector similarity with traditional text search
-- 4. Vector aggregation functions for analytics and clustering
-- 5. Real-time performance monitoring and optimization recommendations
-- 6. Multi-modal vector search across text, image, and audio embeddings
-- 7. Temporal vector search with time-aware relevance scoring
-- 8. Vector-based recommendation systems with personalization
-- 9. Integration with MongoDB's native vector search optimizations
-- 10. Familiar SQL patterns for complex vector analytics and reporting
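
To ground the mapping, here is a minimal sketch of the native MongoDB aggregation that a QueryLeaf semantic search like the one above could translate to. The database, collection, and index names (contentdb, documents, content_embeddings_idx) and the 1536-dimension query embedding are assumptions carried over from the earlier examples, not QueryLeaf's actual output.

// Hypothetical translation target: native $vectorSearch pipeline for the semantic search above
// Assumes an Atlas Vector Search index 'content_embeddings_idx' on 'content_vector' (1536 dimensions)
const { MongoClient } = require('mongodb');

async function semanticSearch(uri, queryVector) {
  const client = new MongoClient(uri);
  try {
    const collection = client.db('contentdb').collection('documents');

    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_embeddings_idx',   // vector index name (assumed)
          path: 'content_vector',            // embedding field
          queryVector: queryVector,          // precomputed 1536-dimension query embedding
          numCandidates: 200,                // candidates examined before final ranking
          limit: 20,                         // results returned to the client
          filter: {                          // metadata pre-filtering, mirroring the SQL WHERE clause
            category: { $in: ['AI', 'Technology', 'Data Science'] }
          }
        }
      },
      {
        $project: {
          title: 1,
          category: 1,
          author: 1,
          publishedDate: 1,
          similarityScore: { $meta: 'vectorSearchScore' }   // similarity score from the vector index
        }
      }
    ];

    return await collection.aggregate(pipeline).toArray();
  } finally {
    await client.close();
  }
}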

Best Practices for Vector Search Implementation

Vector Index Design Strategy

Essential principles for optimal MongoDB vector search design (an index definition sketch follows the list):

  1. Embedding Selection: Choose appropriate embedding models based on content type and use case requirements
  2. Index Configuration: Optimize vector index parameters for the balance of accuracy and performance needed
  3. Filtering Strategy: Design metadata filters to narrow search space before vector similarity calculations
  4. Dimensionality Management: Select optimal embedding dimensions based on content complexity and performance requirements
  5. Update Patterns: Plan for efficient vector updates and re-indexing as content changes
  6. Quality Assurance: Implement vector quality validation and monitoring for embedding consistency
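
For principles 2-4, the following minimal sketch creates an Atlas Vector Search index through the Node.js driver's createSearchIndex helper. The index name, dimension count, and filter fields are assumptions matching the earlier examples; adjust them to your embedding model and metadata.

// Minimal sketch: vector index definition with filter fields (names and dimensions are assumptions)
const { MongoClient } = require('mongodb');

async function createContentVectorIndex(uri) {
  const client = new MongoClient(uri);
  try {
    const collection = client.db('contentdb').collection('documents');

    await collection.createSearchIndex({
      name: 'content_embeddings_idx',
      type: 'vectorSearch',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'content_vector',
            numDimensions: 1536,        // must match the embedding model's output size
            similarity: 'cosine'        // 'cosine', 'euclidean', or 'dotProduct'
          },
          // Filter fields narrow the candidate set before similarity scoring
          { type: 'filter', path: 'category' },
          { type: 'filter', path: 'publishedDate' },
          { type: 'filter', path: 'tags' }
        ]
      }
    });
  } finally {
    await client.close();
  }
}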

Performance and Scalability

Optimize MongoDB vector search for production workloads (a query-tuning sketch follows the list):

  1. Index Optimization: Monitor and tune vector index parameters based on actual query patterns
  2. Hybrid Search: Combine vector and traditional search for optimal relevance and performance
  3. Caching Strategy: Implement intelligent caching for frequently accessed vectors and query results
  4. Resource Planning: Plan memory and compute resources for vector search operations at scale
  5. Monitoring Setup: Implement comprehensive vector search performance and quality monitoring
  6. Testing Strategy: Develop thorough testing for vector search accuracy and performance characteristics
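
For items 1 and 5, a lightweight way to ground tuning decisions is to run the same query at several candidate-pool sizes before changing index parameters. The sketch below reuses the collection and index names assumed earlier; larger numCandidates values generally improve recall at the cost of latency, so record both before committing to a configuration.

// Minimal sketch: compare query latency across numCandidates settings (index and field names are assumptions)
async function benchmarkNumCandidates(collection, queryVector) {
  const candidateSettings = [100, 200, 500, 1000];
  const measurements = [];

  for (const numCandidates of candidateSettings) {
    const start = Date.now();

    const docs = await collection.aggregate([
      {
        $vectorSearch: {
          index: 'content_embeddings_idx',
          path: 'content_vector',
          queryVector: queryVector,
          numCandidates: numCandidates,
          limit: 20
        }
      },
      { $project: { _id: 1, score: { $meta: 'vectorSearchScore' } } }
    ]).toArray();

    measurements.push({
      numCandidates: numCandidates,
      latencyMs: Date.now() - start,
      resultsReturned: docs.length,
      topScore: docs[0]?.score ?? null
    });
  }

  return measurements;   // feed into monitoring dashboards or the ranking helpers shown earlier
}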

Conclusion

MongoDB Atlas Vector Search provides native vector database capabilities that eliminate the complexity and infrastructure overhead of separate vector databases while enabling sophisticated semantic search and AI-powered applications. The seamless integration with MongoDB's document model allows developers to combine traditional database operations with advanced vector search in a unified platform.

Key MongoDB Vector Search benefits include:

  • Native Integration: Built-in vector search capabilities within MongoDB Atlas infrastructure
  • Semantic Understanding: Advanced similarity search that understands meaning and context
  • Hybrid Search: Combining vector similarity with traditional text search and metadata filtering
  • Scalable Performance: Production-ready vector indexing with sub-second response times
  • AI-Ready Platform: Direct integration with popular embedding models and AI frameworks
  • Familiar Operations: Vector search operations integrated with standard MongoDB query patterns

Whether you're building recommendation systems, semantic search applications, RAG implementations, or any application requiring intelligent content discovery, MongoDB Atlas Vector Search with QueryLeaf's familiar SQL interface provides the foundation for modern AI-powered applications.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB vector search operations while providing SQL-familiar vector query syntax, similarity functions, and performance optimization. Advanced vector search patterns, multi-modal search, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI-powered search both powerful and accessible to SQL-oriented development teams.

The integration of native vector search capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both intelligent semantic search and familiar database interaction patterns, ensuring your AI-powered applications remain both innovative and maintainable as they scale and evolve.