
MongoDB Vector Search and AI Applications: Building Semantic Search and Similarity Systems for Modern AI-Powered Applications

Modern artificial intelligence applications require search capabilities that understand semantic meaning rather than relying on keyword matching alone. Natural language queries, content recommendation systems, and intelligent document retrieval all depend on conceptual similarity rather than exact text matches. Traditional full-text search struggles with context, synonyms, and conceptual relationships, which limits its effectiveness for AI-powered applications that need to understand user intent and content meaning.

MongoDB Vector Search provides vector similarity capabilities that enable semantic search, recommendation engines, and AI-powered content discovery through high-dimensional embeddings. Rather than matching exact keywords, it uses machine learning embeddings to capture content semantics, so applications can find conceptually similar documents, answer natural language queries, and power intelligent recommendation systems.
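To make the idea concrete, here is a minimal sketch that embeds two short passages and compares them with cosine similarity. It assumes the openai Node.js package and an OPENAI_API_KEY environment variable (the same setup used in the implementation later in this article); the passages themselves are hypothetical. Semantically related passages score close to 1 even when they share few keywords.

// Minimal sketch: cosine similarity between two text embeddings
const { OpenAI } = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Cosine similarity = dot(a, b) / (|a| * |b|)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function compare() {
  // Hypothetical passages: few shared keywords, but closely related concepts
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: [
      'Neural networks learn hierarchical representations from data.',
      'Deep learning models extract layered features automatically.'
    ]
  });

  const [a, b] = response.data.map(d => d.embedding);
  console.log('Semantic similarity:', cosineSimilarity(a, b).toFixed(3));
}

compare().catch(console.error);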

The Traditional Search Limitation Challenge

Conventional text-based search approaches have significant limitations for modern AI applications:

-- Traditional PostgreSQL full-text search - limited semantic understanding and context awareness

-- Basic full-text search setup with limited semantic capabilities
CREATE TABLE documents (
    document_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    category VARCHAR(100),
    author VARCHAR(200),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Traditional metadata
    tags TEXT[],
    keywords VARCHAR(1000),
    summary TEXT,
    document_type VARCHAR(50),
    language VARCHAR(10) DEFAULT 'en',

    -- Basic search vectors (limited functionality)
    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(content, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(summary, '')), 'C') ||
        setweight(to_tsvector('english', array_to_string(coalesce(tags, '{}'), ' ')), 'D')
    ) STORED
);

-- Additional tables for recommendation attempts
CREATE TABLE user_interactions (
    interaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,
    document_id UUID NOT NULL REFERENCES documents(document_id),
    interaction_type VARCHAR(50) NOT NULL, -- 'view', 'like', 'share', 'download'
    interaction_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    duration_seconds INTEGER,
    rating INTEGER CHECK (rating BETWEEN 1 AND 5)
);

CREATE TABLE document_similarity (
    document_id_1 UUID NOT NULL REFERENCES documents(document_id),
    document_id_2 UUID NOT NULL REFERENCES documents(document_id),
    similarity_score DECIMAL(5,4) NOT NULL CHECK (similarity_score BETWEEN 0 AND 1),
    similarity_type VARCHAR(50) NOT NULL, -- 'keyword', 'category', 'manual'
    calculated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (document_id_1, document_id_2)
);

-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
    SELECT 
        'machine learning artificial intelligence neural networks deep learning' as query_text,
        to_tsquery('english', 
            'machine & learning & artificial & intelligence & neural & networks & deep & learning'
        ) as search_tsquery
),
basic_search AS (
    SELECT 
        d.document_id,
        d.title,
        d.content,
        d.category,
        d.author,
        d.created_at,
        d.tags,

        -- Basic relevance scoring (limited effectiveness)
        ts_rank(d.search_vector, sq.search_tsquery) as basic_relevance,
        ts_rank_cd(d.search_vector, sq.search_tsquery) as weighted_relevance,

        -- Simple keyword matching: count query terms that appear in the document text
        (
            SELECT COUNT(*)
            FROM unnest(string_to_array(lower(sq.query_text), ' ')) AS qt(term)
            WHERE lower(d.title || ' ' || d.content) LIKE '%' || qt.term || '%'
        ) as keyword_matches,

        -- Category-based scoring
        CASE d.category 
            WHEN 'AI/ML' THEN 2.0
            WHEN 'Technology' THEN 1.5 
            WHEN 'Science' THEN 1.2
            ELSE 1.0 
        END as category_boost,

        -- Recency scoring
        CASE 
            WHEN d.created_at > CURRENT_DATE - INTERVAL '30 days' THEN 1.5
            WHEN d.created_at > CURRENT_DATE - INTERVAL '90 days' THEN 1.2  
            WHEN d.created_at > CURRENT_DATE - INTERVAL '365 days' THEN 1.1
            ELSE 1.0
        END as recency_boost

    FROM documents d
    CROSS JOIN search_query sq
    WHERE d.search_vector @@ sq.search_tsquery
),
popularity_metrics AS (
    SELECT 
        ui.document_id,
        COUNT(*) as interaction_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'view') as view_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'like') as like_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'share') as share_count,
        AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,
        AVG(ui.duration_seconds) FILTER (WHERE ui.duration_seconds IS NOT NULL) as avg_duration
    FROM user_interactions ui
    WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '90 days'
    GROUP BY ui.document_id
),
similarity_expansion AS (
    -- Manual similarity relationships (limited and maintenance-heavy)
    SELECT
        ds.document_id_2 as document_id,
        MAX(ds.similarity_score) as max_similarity,
        COUNT(*) as similar_document_count
    FROM basic_search bs
    JOIN document_similarity ds ON bs.document_id = ds.document_id_1
    WHERE ds.similarity_score > 0.3
    GROUP BY ds.document_id_2
),
final_search_results AS (
    SELECT 
        bs.document_id,
        bs.title,
        SUBSTRING(bs.content, 1, 300) || '...' as content_preview,
        bs.category,
        bs.author,
        bs.created_at,
        bs.tags,

        -- Complex relevance calculation (limited effectiveness)
        (
            (bs.basic_relevance * 10) +
            (bs.weighted_relevance * 15) + 
            (COALESCE(bs.keyword_matches, 0) * 5) +
            (bs.category_boost * 3) +
            (bs.recency_boost * 2) +
            (COALESCE(pm.like_count, 0) * 0.1) +
            (COALESCE(pm.avg_rating, 0) * 2) +
            (COALESCE(se.max_similarity, 0) * 8)
        ) as final_relevance_score,

        -- Metrics for debugging
        bs.basic_relevance,
        bs.keyword_matches,
        pm.interaction_count,
        pm.like_count,
        pm.avg_rating,
        se.similar_document_count,

        -- Formatted information
        CASE 
            WHEN pm.interaction_count > 100 THEN 'Popular'
            WHEN pm.interaction_count > 50 THEN 'Moderately Popular'
            WHEN pm.interaction_count > 10 THEN 'Some Interest'
            ELSE 'New/Limited Interest'
        END as popularity_status

    FROM basic_search bs
    LEFT JOIN popularity_metrics pm ON bs.document_id = pm.document_id
    LEFT JOIN similarity_expansion se ON bs.document_id = se.document_id
)
SELECT 
    document_id,
    title,
    content_preview,
    category,
    author,
    created_at,
    tags,
    ROUND(final_relevance_score::numeric, 2) as relevance_score,
    popularity_status,

    -- Limited recommendation capability
    CASE 
        WHEN final_relevance_score > 25 THEN 'Highly Relevant'
        WHEN final_relevance_score > 15 THEN 'Relevant' 
        WHEN final_relevance_score > 8 THEN 'Potentially Relevant'
        ELSE 'Low Relevance'
    END as relevance_category

FROM final_search_results
WHERE final_relevance_score > 5  -- Filter low-relevance results
ORDER BY final_relevance_score DESC
LIMIT 20;

-- Problems with traditional search approaches:
-- 1. No semantic understanding - "ML" vs "machine learning" treated as completely different
-- 2. Limited context awareness - cannot understand conceptual relationships
-- 3. Poor synonym handling - requires manual synonym dictionaries
-- 4. No natural language query support - requires exact keyword matching
-- 5. Complex manual similarity calculations that don't scale
-- 6. No understanding of document embeddings or vector representations
-- 7. Limited recommendation capabilities based on simple collaborative filtering
-- 8. Poor handling of multilingual content and cross-language search
-- 9. No support for image, audio, or multi-modal content search
-- 10. Maintenance-heavy similarity relationships that become stale

-- Attempt at content-based recommendation (ineffective)
CREATE OR REPLACE FUNCTION calculate_basic_similarity(doc1_id UUID, doc2_id UUID)
RETURNS DECIMAL AS $$
DECLARE
    doc1_vector tsvector;
    doc2_vector tsvector;
    similarity_score DECIMAL;
BEGIN
    SELECT search_vector INTO doc1_vector FROM documents WHERE document_id = doc1_id;
    SELECT search_vector INTO doc2_vector FROM documents WHERE document_id = doc2_id;

    -- Extremely limited similarity calculation: rank doc1 against doc2's lexemes as a plain query
    SELECT ts_rank(doc1_vector, plainto_tsquery('english',
        regexp_replace(doc2_vector::text, '[^a-zA-Z\s]', ' ', 'g')
    )) INTO similarity_score;

    RETURN COALESCE(similarity_score, 0);
END;
$$ LANGUAGE plpgsql;

-- Manual batch similarity calculation (expensive and inaccurate)
INSERT INTO document_similarity (document_id_1, document_id_2, similarity_score, similarity_type)
SELECT 
    d1.document_id,
    d2.document_id,
    calculate_basic_similarity(d1.document_id, d2.document_id),
    'keyword'
FROM documents d1
CROSS JOIN documents d2
WHERE d1.document_id != d2.document_id
  AND d1.category = d2.category  -- Only calculate within same category
  AND NOT EXISTS (
    SELECT 1 FROM document_similarity ds 
    WHERE ds.document_id_1 = d1.document_id 
    AND ds.document_id_2 = d2.document_id
  )
LIMIT 10000; -- Batch processing required due to computational cost

-- These workarounds only compound the limitations listed above: keyword overlap is a weak
-- proxy for semantic meaning, the similarity table demands constant manual maintenance and
-- scales poorly, and nothing in this design leads to natural language queries, multilingual
-- search, modern embedding models, or multi-modal content.

MongoDB Vector Search provides sophisticated AI-powered semantic capabilities:

// MongoDB Vector Search - advanced AI-powered semantic search with comprehensive embedding management
const { MongoClient } = require('mongodb');
const { OpenAI } = require('openai');
const tf = require('@tensorflow/tfjs-node');

const client = new MongoClient('mongodb+srv://username:password@cluster.mongodb.net');
const db = client.db('advanced_ai_search_platform');

// Advanced AI-powered search and recommendation engine
class AdvancedVectorSearchEngine {
  constructor(db, aiConfig = {}) {
    this.db = db;
    this.collections = {
      documents: db.collection('documents'),
      embeddings: db.collection('document_embeddings'),
      userProfiles: db.collection('user_profiles'), 
      searchLogs: db.collection('search_logs'),
      recommendations: db.collection('recommendations'),
      modelMetadata: db.collection('model_metadata')
    };

    // AI model configuration
    this.aiConfig = {
      embeddingModel: aiConfig.embeddingModel || 'text-embedding-3-large',
      embeddingDimensions: aiConfig.embeddingDimensions || 3072,
      maxTokens: aiConfig.maxTokens || 8191,
      batchSize: aiConfig.batchSize || 50,
      similarityThreshold: aiConfig.similarityThreshold || 0.7,

      // Advanced AI configurations
      useMultimodalEmbeddings: aiConfig.useMultimodalEmbeddings || false,
      enableSemanticCaching: aiConfig.enableSemanticCaching ?? true,
      enableQueryExpansion: aiConfig.enableQueryExpansion ?? true,
      enablePersonalization: aiConfig.enablePersonalization ?? true,

      // Model providers
      openaiApiKey: aiConfig.openaiApiKey || process.env.OPENAI_API_KEY,
      huggingFaceApiKey: aiConfig.huggingFaceApiKey || process.env.HUGGINGFACE_API_KEY,
      cohereApiKey: aiConfig.cohereApiKey || process.env.COHERE_API_KEY
    };

    // Initialize AI clients
    this.openai = new OpenAI({ apiKey: this.aiConfig.openaiApiKey });
    this.embeddingCache = new Map();
    this.searchCache = new Map();

    this.setupVectorSearchIndexes();
    this.initializeEmbeddingModels();
  }

  async setupVectorSearchIndexes() {
    console.log('Setting up MongoDB Vector Search indexes...');

    try {
      // Primary document embedding index
      await this.collections.documents.createSearchIndex({
        name: 'document_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'embedding',
              numDimensions: this.aiConfig.embeddingDimensions,
              similarity: 'cosine'
            },
            {
              type: 'filter',
              path: 'category'
            },
            {
              type: 'filter', 
              path: 'language'
            },
            {
              type: 'filter',
              path: 'contentType'
            },
            {
              type: 'filter',
              path: 'accessLevel'
            },
            {
              type: 'filter',
              path: 'createdAt'
            }
          ]
        }
      });

      // Multi-modal content index for images and multimedia
      await this.collections.documents.createSearchIndex({
        name: 'multimodal_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'multimodalEmbedding',
              numDimensions: 1536, // Different dimension for multi-modal models
              similarity: 'cosine'
            },
            {
              type: 'filter',
              path: 'mediaType'
            }
          ]
        }
      });

      // User profile vector index for personalization
      await this.collections.userProfiles.createSearchIndex({
        name: 'user_profile_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'interestEmbedding',
              numDimensions: this.aiConfig.embeddingDimensions,
              similarity: 'cosine'
            }
          ]
        }
      });

      console.log('Vector Search indexes created successfully');
    } catch (error) {
      console.error('Error setting up Vector Search indexes:', error);
      throw error;
    }
  }

  async generateDocumentEmbedding(document, options = {}) {
    console.log(`Generating embeddings for document: ${document.title}`);

    try {
      // Prepare content for embedding generation
      const embeddingContent = this.prepareContentForEmbedding(document, options);

      // Check cache first
      const cacheKey = this.generateCacheKey(embeddingContent);
      if (this.embeddingCache.has(cacheKey) && this.aiConfig.enableSemanticCaching) {
        console.log('Using cached embedding');
        return this.embeddingCache.get(cacheKey);
      }

      // Generate embedding using OpenAI
      const embeddingResponse = await this.openai.embeddings.create({
        model: this.aiConfig.embeddingModel,
        input: embeddingContent,
        dimensions: this.aiConfig.embeddingDimensions
      });

      const embedding = embeddingResponse.data[0].embedding;

      // Cache the embedding
      if (this.aiConfig.enableSemanticCaching) {
        this.embeddingCache.set(cacheKey, embedding);
      }

      // Store embedding with comprehensive metadata
      const embeddingDocument = {
        documentId: document._id,
        embedding: embedding,

        // Embedding metadata
        model: this.aiConfig.embeddingModel,
        dimensions: this.aiConfig.embeddingDimensions,
        contentLength: embeddingContent.length,
        tokensUsed: embeddingResponse.usage?.total_tokens || 0,

        // Content characteristics
        contentType: document.contentType || 'text',
        language: document.language || 'en',
        category: document.category,

        // Processing metadata
        generatedAt: new Date(),
        modelVersion: embeddingResponse.model,
        processingTime: Date.now() - (options.startTime || Date.now()),

        // Quality metrics
        contentQuality: this.assessContentQuality(document),
        embeddingNorm: this.calculateVectorNorm(embedding),

        // Optimization metadata
        batchProcessed: options.batchProcessed || false,
        cacheHit: false
      };

      // Store in embedding collection for tracking
      await this.collections.embeddings.insertOne(embeddingDocument);

      // Update main document with embedding
      await this.collections.documents.updateOne(
        { _id: document._id },
        {
          $set: {
            embedding: embedding,
            embeddingMetadata: {
              model: this.aiConfig.embeddingModel,
              generatedAt: new Date(),
              dimensions: this.aiConfig.embeddingDimensions,
              contentHash: this.generateContentHash(embeddingContent)
            }
          }
        }
      );

      return embedding;

    } catch (error) {
      console.error(`Error generating embedding for document ${document._id}:`, error);
      throw error;
    }
  }

  prepareContentForEmbedding(document, options = {}) {
    // Intelligent content preparation for optimal embedding generation
    let content = '';

    // Title with higher weight
    if (document.title) {
      content += `Title: ${document.title}\n\n`;
    }

    // Summary if available
    if (document.summary) {
      content += `Summary: ${document.summary}\n\n`;
    }

    // Main content with intelligent truncation
    if (document.content) {
      const maxContentLength = this.aiConfig.maxTokens * 0.7; // Rough character budget (maxTokens used as a conservative proxy), reserving space for title/metadata
      let mainContent = document.content;

      if (mainContent.length > maxContentLength) {
        // Intelligent content truncation - keep beginning and key sections
        const beginningChunk = mainContent.substring(0, maxContentLength * 0.6);
        const endingChunk = mainContent.substring(mainContent.length - maxContentLength * 0.2);

        mainContent = beginningChunk + '\n...\n' + endingChunk;
      }

      content += `Content: ${mainContent}\n\n`;
    }

    // Metadata context
    if (document.category) {
      content += `Category: ${document.category}\n`;
    }

    if (document.tags && document.tags.length > 0) {
      content += `Tags: ${document.tags.join(', ')}\n`;
    }

    if (document.keywords) {
      content += `Keywords: ${document.keywords}\n`;
    }

    return content.trim();
  }

  async performSemanticSearch(query, options = {}) {
    console.log(`Performing semantic search for: "${query}"`);
    const startTime = Date.now();

    try {
      // Generate query embedding
      const queryEmbedding = await this.generateQueryEmbedding(query, options);

      // Build comprehensive search pipeline
      const searchPipeline = await this.buildSemanticSearchPipeline(queryEmbedding, query, options);

      // Execute vector search with MongoDB Atlas Vector Search
      const searchResults = await this.collections.documents.aggregate(searchPipeline).toArray();

      // Post-process and enhance results
      const enhancedResults = await this.enhanceSearchResults(searchResults, query, options);

      // Log search for analytics and improvement
      await this.logSearchQuery(query, queryEmbedding, enhancedResults, options);

      // Generate personalized recommendations if user context available
      let personalizedRecommendations = [];
      if (options.userId && this.aiConfig.enablePersonalization) {
        personalizedRecommendations = await this.generatePersonalizedRecommendations(
          options.userId, 
          enhancedResults.slice(0, 5),
          options
        );
      }

      return {
        query: query,
        results: enhancedResults,
        personalizedRecommendations: personalizedRecommendations,

        // Search metadata
        metadata: {
          totalResults: enhancedResults.length,
          searchTime: Date.now() - startTime,
          queryEmbeddingDimensions: queryEmbedding.length,
          embeddingModel: this.aiConfig.embeddingModel,
          similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
          filtersApplied: this.extractAppliedFilters(options),
          personalizationEnabled: this.aiConfig.enablePersonalization && !!options.userId
        },

        // Query insights
        insights: {
          queryComplexity: this.assessQueryComplexity(query),
          semanticCategories: this.identifySemanticCategories(enhancedResults),
          resultDiversity: this.calculateResultDiversity(enhancedResults),
          averageSimilarity: this.calculateAverageSimilarity(enhancedResults)
        },

        // Related queries and suggestions
        relatedQueries: await this.generateRelatedQueries(query, enhancedResults),
        searchSuggestions: await this.generateSearchSuggestions(query, options)
      };

    } catch (error) {
      console.error(`Semantic search error for query "${query}":`, error);
      throw error;
    }
  }

  async buildSemanticSearchPipeline(queryEmbedding, query, options = {}) {
    const pipeline = [];

    // Stage 1: Vector similarity search
    pipeline.push({
      $vectorSearch: {
        index: options.multimodal ? 'multimodal_vector_index' : 'document_vector_index',
        path: options.multimodal ? 'multimodalEmbedding' : 'embedding',
        queryVector: queryEmbedding,
        numCandidates: options.numCandidates || 1000,
        limit: options.vectorSearchLimit || 100,

        // Apply filters for performance and relevance
        filter: this.buildSearchFilters(options)
      }
    });

    // Stage 2: Add similarity score and metadata
    pipeline.push({
      $addFields: {
        vectorSimilarityScore: { $meta: 'vectorSearchScore' },
        searchMetadata: {
          searchTime: new Date(),
          searchQuery: query,
          searchModel: this.aiConfig.embeddingModel
        }
      }
    });

    // Stage 3: Hybrid scoring combining vector similarity with text relevance
    if (options.enableHybridSearch !== false) {
      pipeline.push({
        $addFields: {
          // Text match scoring for hybrid approach
          textMatchScore: {
            $cond: {
              if: { $regexMatch: { input: '$title', regex: query, options: 'i' } },
              then: 0.3,
              else: {
                $cond: {
                  if: { $regexMatch: { input: '$content', regex: query, options: 'i' } },
                  then: 0.2,
                  else: 0
                }
              }
            }
          },

          // Recency scoring
          recencyScore: {
            $switch: {
              branches: [
                {
                  case: { $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] },
                  then: 0.1
                },
                {
                  case: { $gte: ['$createdAt', new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)] },
                  then: 0.05
                }
              ],
              default: 0
            }
          },

          // Popularity scoring based on user interactions
          popularityScore: {
            $multiply: [
              { $log10: { $add: [{ $ifNull: ['$metrics.viewCount', 0] }, 1] } },
              0.05
            ]
          },

          // Content quality scoring
          qualityScore: {
            $multiply: [
              { $divide: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 10000] },
              0.02
            ]
          }
        }
      });

      // Combined hybrid score
      pipeline.push({
        $addFields: {
          hybridScore: {
            $add: [
              { $multiply: ['$vectorSimilarityScore', 0.7] }, // Vector similarity weight
              '$textMatchScore',
              '$recencyScore', 
              '$popularityScore',
              '$qualityScore'
            ]
          }
        }
      });
    }

    // Stage 4: Apply similarity threshold filtering
    pipeline.push({
      $match: {
        vectorSimilarityScore: { 
          $gte: options.similarityThreshold || this.aiConfig.similarityThreshold 
        }
      }
    });

    // Stage 5: Lookup related collections for rich context
    pipeline.push({
      $lookup: {
        from: 'users',
        localField: 'createdBy',
        foreignField: '_id',
        as: 'authorInfo',
        pipeline: [
          { $project: { name: 1, avatar: 1, expertise: 1, reputation: 1 } }
        ]
      }
    });

    // Stage 6: Add computed fields for result enhancement
    pipeline.push({
      $addFields: {
        // Content preview generation
        contentPreview: {
          $cond: {
            if: { $gt: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 300] },
            then: { $concat: [{ $substr: ['$content', 0, 300] }, '...'] },
            else: '$content'
          }
        },

        // Relevance category
        relevanceCategory: {
          $switch: {
            branches: [
              { case: { $gte: ['$vectorSimilarityScore', 0.9] }, then: 'Highly Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.8] }, then: 'Very Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.7] }, then: 'Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.6] }, then: 'Moderately Relevant' }
            ],
            default: 'Potentially Relevant'
          }
        },

        // Author information
        authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
        authorExpertise: { $arrayElemAt: ['$authorInfo.expertise', 0] },

        // Formatted metadata
        formattedCreatedAt: {
          $dateToString: {
            format: '%Y-%m-%d',
            date: '$createdAt'
          }
        }
      }
    });

    // Stage 7: Final projection for clean output
    pipeline.push({
      $project: {
        _id: 1,
        title: 1,
        contentPreview: 1,
        category: 1,
        tags: 1,
        language: 1,
        contentType: 1,
        createdAt: 1,
        formattedCreatedAt: 1,

        // Scoring information
        vectorSimilarityScore: { $round: ['$vectorSimilarityScore', 4] },
        hybridScore: { $round: [{ $ifNull: ['$hybridScore', '$vectorSimilarityScore'] }, 4] },
        relevanceCategory: 1,

        // Author information
        authorName: 1,
        authorExpertise: 1,

        // Access and metadata
        accessLevel: 1,
        downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] },

        // Analytics metadata
        metrics: {
          viewCount: { $ifNull: ['$metrics.viewCount', 0] },
          likeCount: { $ifNull: ['$metrics.likeCount', 0] },
          shareCount: { $ifNull: ['$metrics.shareCount', 0] }
        },

        searchMetadata: 1
      }
    });

    // Stage 8: Sort by hybrid score or vector similarity
    const sortField = options.enableHybridSearch !== false ? 'hybridScore' : 'vectorSimilarityScore';
    pipeline.push({ $sort: { [sortField]: -1 } });

    // Stage 9: Apply final limit
    pipeline.push({ $limit: options.limit || 20 });

    return pipeline;
  }

  buildSearchFilters(options) {
    const filters = {};

    // Category filtering
    if (options.category) {
      filters.category = { $eq: options.category };
    }

    // Language filtering
    if (options.language) {
      filters.language = { $eq: options.language };
    }

    // Content type filtering
    if (options.contentType) {
      filters.contentType = { $eq: options.contentType };
    }

    // Access level filtering
    if (options.accessLevel) {
      filters.accessLevel = { $eq: options.accessLevel };
    }

    // Date range filtering
    if (options.dateFrom || options.dateTo) {
      filters.createdAt = {};
      if (options.dateFrom) filters.createdAt.$gte = new Date(options.dateFrom);
      if (options.dateTo) filters.createdAt.$lte = new Date(options.dateTo);
    }

    // Author filtering
    if (options.authorId) {
      filters.createdBy = { $eq: options.authorId };
    }

    // Tags filtering
    if (options.tags && options.tags.length > 0) {
      filters.tags = { $in: options.tags };
    }

    return filters;
  }

  async generateQueryEmbedding(query, options = {}) {
    console.log(`Generating query embedding for: "${query}"`);

    try {
      // Enhance query with expansion if enabled
      let enhancedQuery = query;

      if (this.aiConfig.enableQueryExpansion && options.expandQuery !== false) {
        enhancedQuery = await this.expandQuery(query, options);
      }

      // Generate embedding
      const embeddingResponse = await this.openai.embeddings.create({
        model: this.aiConfig.embeddingModel,
        input: enhancedQuery,
        dimensions: this.aiConfig.embeddingDimensions
      });

      return embeddingResponse.data[0].embedding;

    } catch (error) {
      console.error(`Error generating query embedding for "${query}":`, error);
      throw error;
    }
  }

  async expandQuery(query, options = {}) {
    console.log(`Expanding query: "${query}"`);

    try {
      // Use GPT to expand the query with related terms and concepts
      const expansionPrompt = `
        Given the search query: "${query}"

        Generate an expanded version that includes:
        1. Synonyms and related terms
        2. Alternative phrasings
        3. Conceptually related topics
        4. Common variations and abbreviations

        Keep the expansion focused and relevant. Return only the expanded query text.

        Original query: ${query}
        Expanded query:`;

      const completion = await this.openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{ role: 'user', content: expansionPrompt }],
        max_tokens: 150,
        temperature: 0.3
      });

      const expandedQuery = completion.choices[0].message.content.trim();
      console.log(`Query expanded to: "${expandedQuery}"`);

      return expandedQuery;

    } catch (error) {
      console.error(`Error expanding query "${query}":`, error);
      return query; // Fall back to original query
    }
  }

  async generatePersonalizedRecommendations(userId, searchResults, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    try {
      // Get user profile and interaction history
      const userProfile = await this.collections.userProfiles.findOne({ userId: userId });
      if (!userProfile) {
        console.log('No user profile found, returning general recommendations');
        return this.generateGeneralRecommendations(searchResults, options);
      }

      // Generate personalized recommendations based on user interests
      const recommendationPipeline = [
        {
          $vectorSearch: {
            index: 'document_vector_index',
            path: 'embedding', 
            queryVector: userProfile.interestEmbedding,
            numCandidates: 500,
            limit: 50,
            filter: {
              _id: { $nin: searchResults.map(r => r._id) }, // Exclude current results
              accessLevel: { $in: ['public', 'user'] }
            }
          }
        },
        {
          $addFields: {
            personalizedScore: { $meta: 'vectorSearchScore' },
            recommendationReason: 'Based on your interests and reading history'
          }
        },
        {
          $lookup: {
            from: 'users',
            localField: 'createdBy',
            foreignField: '_id',
            as: 'authorInfo',
            pipeline: [{ $project: { name: 1, expertise: 1 } }]
          }
        },
        {
          $project: {
            _id: 1,
            title: 1,
            category: 1,
            tags: 1,
            createdAt: 1,
            personalizedScore: { $round: ['$personalizedScore', 4] },
            recommendationReason: 1,
            authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
            downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] }
          }
        },
        { $sort: { personalizedScore: -1 } },
        { $limit: options.recommendationLimit || 10 }
      ];

      const recommendations = await this.collections.documents.aggregate(recommendationPipeline).toArray();

      return recommendations;

    } catch (error) {
      console.error(`Error generating personalized recommendations for user ${userId}:`, error);
      return [];
    }
  }

  async enhanceSearchResults(results, query, options = {}) {
    console.log(`Enhancing ${results.length} search results`);

    try {
      // Add result enhancements
      const enhancedResults = await Promise.all(results.map(async (result, index) => {
        // Calculate additional metadata
        const enhancedResult = {
          ...result,

          // Result ranking
          rank: index + 1,

          // Enhanced content preview with query highlighting
          highlightedPreview: this.highlightQueryInText(result.contentPreview || '', query),

          // Semantic category classification
          semanticCategory: await this.classifyContentSemantics(result),

          // Reading time estimation
          estimatedReadingTime: this.estimateReadingTime(result.content || result.contentPreview || ''),

          // Related concepts extraction
          extractedConcepts: this.extractKeyConcepts(result.title + ' ' + (result.contentPreview || '')),

          // Confidence scoring
          confidenceScore: this.calculateConfidenceScore(result),

          // Access recommendations
          accessRecommendation: this.generateAccessRecommendation(result, options)
        };

        return enhancedResult;
      }));

      return enhancedResults;

    } catch (error) {
      console.error('Error enhancing search results:', error);
      return results; // Return original results if enhancement fails
    }
  }

  highlightQueryInText(text, query) {
    if (!text || !query) return text;

    // Simple highlighting - in production, use more sophisticated highlighting
    const queryWords = query.toLowerCase().split(/\s+/);
    let highlightedText = text;

    queryWords.forEach(word => {
      if (word.length > 2) { // Only highlight words longer than 2 characters
        const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // Escape regex metacharacters in user input
        const regex = new RegExp(`\\b${escaped}\\b`, 'gi');
        highlightedText = highlightedText.replace(regex, `**${word}**`);
      }
    });

    return highlightedText;
  }

  estimateReadingTime(text) {
    const wordsPerMinute = 250; // Average reading speed
    const wordCount = text.split(/\s+/).length;
    const readingTime = Math.ceil(wordCount / wordsPerMinute);

    return {
      minutes: readingTime,
      wordCount: wordCount,
      formattedTime: readingTime === 1 ? '1 minute' : `${readingTime} minutes`
    };
  }

  extractKeyConcepts(text) {
    // Simple concept extraction - in production, use NLP libraries
    const concepts = [];
    const words = text.toLowerCase().split(/\s+/);

    // Technical terms and concepts (simplified approach)
    const technicalTerms = [
      'artificial intelligence', 'machine learning', 'deep learning', 'neural networks',
      'data science', 'analytics', 'algorithm', 'optimization', 'automation',
      'cloud computing', 'blockchain', 'cybersecurity', 'api', 'database'
    ];

    technicalTerms.forEach(term => {
      if (text.toLowerCase().includes(term)) {
        concepts.push(term);
      }
    });

    return concepts.slice(0, 5); // Return top 5 concepts
  }

  calculateConfidenceScore(result) {
    // Multi-factor confidence calculation
    let confidence = result.vectorSimilarityScore * 0.6; // Base similarity

    // Content length factor
    const contentLength = (result.content || result.contentPreview || '').length;
    if (contentLength > 1000) confidence += 0.1;
    if (contentLength > 3000) confidence += 0.1;

    // Metadata completeness factor
    if (result.category) confidence += 0.05;
    if (result.tags && result.tags.length > 0) confidence += 0.05;
    if (result.authorName) confidence += 0.05;

    // Popularity factor
    if (result.metrics.viewCount > 100) confidence += 0.05;
    if (result.metrics.likeCount > 10) confidence += 0.05;

    return Math.min(confidence, 1.0); // Cap at 1.0
  }

  generateAccessRecommendation(result, options) {
    // Generate recommendations for how to use/access the content
    const recommendations = [];

    if (result.vectorSimilarityScore > 0.9) {
      recommendations.push('Highly recommended - very relevant to your search');
    }

    if (result.metrics.viewCount > 1000) {
      recommendations.push('Popular content - frequently viewed by users');
    }

    if (result.estimatedReadingTime && result.estimatedReadingTime.minutes <= 5) {
      recommendations.push('Quick read - can be completed in a few minutes');
    }

    if (result.category === 'tutorial') {
      recommendations.push('Step-by-step guidance available');
    }

    return recommendations;
  }

  async logSearchQuery(query, queryEmbedding, results, options) {
    try {
      const searchLog = {
        query: query,
        queryEmbedding: queryEmbedding,
        userId: options.userId || null,
        sessionId: options.sessionId || null,

        // Search configuration
        searchConfig: {
          model: this.aiConfig.embeddingModel,
          similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
          limit: options.limit || 20,
          enableHybridSearch: options.enableHybridSearch !== false,
          enablePersonalization: this.aiConfig.enablePersonalization && !!options.userId
        },

        // Results metadata
        resultsMetadata: {
          totalResults: results.length,
          averageSimilarity: results.length > 0 ? 
            results.reduce((sum, r) => sum + r.vectorSimilarityScore, 0) / results.length : 0,
          topCategories: this.extractTopCategories(results),
          searchTime: Date.now() - (options.startTime || Date.now())
        },

        // User context
        userContext: {
          ipAddress: options.ipAddress,
          userAgent: options.userAgent,
          referrer: options.referrer
        },

        timestamp: new Date()
      };

      await this.collections.searchLogs.insertOne(searchLog);

    } catch (error) {
      console.error('Error logging search query:', error);
      // Don't throw - logging shouldn't break search
    }
  }

  extractTopCategories(results) {
    const categoryCount = {};
    results.forEach(result => {
      if (result.category) {
        categoryCount[result.category] = (categoryCount[result.category] || 0) + 1;
      }
    });

    return Object.entries(categoryCount)
      .sort(([,a], [,b]) => b - a)
      .slice(0, 5)
      .map(([category, count]) => ({ category, count }));
  }

  // Additional utility methods for comprehensive vector search functionality

  generateCacheKey(content) {
    const crypto = require('crypto');
    return crypto.createHash('sha256').update(content).digest('hex');
  }

  generateContentHash(content) {
    const crypto = require('crypto');
    return crypto.createHash('md5').update(content).digest('hex');
  }

  calculateVectorNorm(vector) {
    return Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
  }

  assessContentQuality(document) {
    let qualityScore = 0;

    // Length factor
    const contentLength = (document.content || '').length;
    if (contentLength > 1000) qualityScore += 0.3;
    if (contentLength > 5000) qualityScore += 0.2;

    // Metadata completeness
    if (document.title) qualityScore += 0.1;
    if (document.summary) qualityScore += 0.1;
    if (document.tags && document.tags.length > 0) qualityScore += 0.1;
    if (document.category) qualityScore += 0.1;

    // Structure indicators
    if (document.content && document.content.includes('\n\n')) qualityScore += 0.1; // Paragraphs

    return Math.min(qualityScore, 1.0);
  }
}

// Benefits of MongoDB Vector Search for AI Applications:
// - Native vector similarity search with cosine similarity
// - Seamless integration with embedding models (OpenAI, Hugging Face, etc.)
// - High-performance vector indexing and retrieval at scale
// - Advanced filtering and hybrid search capabilities
// - Built-in support for multi-modal content (text, images, audio)
// - Personalization through user profile vector matching
// - Real-time search with low-latency vector operations
// - Comprehensive search analytics and query optimization
// - Integration with MongoDB's document model for rich metadata
// - Production-ready scalability with sharding and replication

module.exports = {
  AdvancedVectorSearchEngine
};
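A minimal usage sketch for the engine above follows. The module path, connection URI, and document title are placeholders, and it assumes the helper methods referenced but not shown in the excerpt (for example initializeEmbeddingModels and the result-enhancement helpers) are implemented; treat it as an illustration of the call flow rather than a drop-in script.

// Usage sketch (hypothetical module path, URI, and document title)
const { MongoClient } = require('mongodb');
const { AdvancedVectorSearchEngine } = require('./vector-search-engine'); // hypothetical path to the module above

async function main() {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();

  const engine = new AdvancedVectorSearchEngine(client.db('advanced_ai_search_platform'), {
    embeddingModel: 'text-embedding-3-large',
    similarityThreshold: 0.72
  });

  // Embed one document, then run a semantic search against the collection
  const doc = await engine.collections.documents.findOne({ title: 'Intro to Transformers' }); // hypothetical document
  if (doc) {
    await engine.generateDocumentEmbedding(doc);
  }

  const results = await engine.performSemanticSearch(
    'how do attention mechanisms work in large language models',
    { limit: 10, category: 'AI/ML', enableHybridSearch: true }
  );

  console.log(`Found ${results.metadata.totalResults} results in ${results.metadata.searchTime}ms`);
  results.results.forEach(r => console.log(r.rank, r.title, r.vectorSimilarityScore));

  await client.close();
}

main().catch(console.error);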

Understanding MongoDB Vector Search Architecture

Advanced AI Integration Patterns and Semantic Search Optimization

Implement sophisticated vector search strategies for production AI applications:

// Production-ready MongoDB Vector Search with advanced AI integration and optimization patterns
class ProductionVectorSearchPlatform extends AdvancedVectorSearchEngine {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      multiModelSupport: true,
      realtimeIndexing: true,
      distributedEmbedding: true,
      autoOptimization: true,
      advancedAnalytics: true,
      contentModeration: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedFeatures();
    this.setupMonitoringAndAlerts();
  }

  async implementAdvancedSemanticCapabilities() {
    console.log('Implementing advanced semantic capabilities...');

    // Multi-model embedding strategy
    const embeddingStrategy = {
      textEmbeddings: {
        primary: 'text-embedding-3-large',
        fallback: 'text-embedding-ada-002',
        specialized: {
          code: 'code-search-babbage-code-001',
          legal: 'text-similarity-curie-001',
          medical: 'text-search-curie-doc-001'
        }
      },

      multimodalEmbeddings: {
        imageText: 'clip-vit-base-patch32',
        audioText: 'wav2vec2-base-960h', 
        videoText: 'video-text-retrieval'
      },

      domainSpecific: {
        scientific: 'scibert-scivocab-uncased',
        financial: 'finbert-base-uncased',
        biomedical: 'biobert-base-cased'
      }
    };

    return await this.deployEmbeddingStrategy(embeddingStrategy);
  }

  async setupRealtimeSemanticIndexing() {
    console.log('Setting up real-time semantic indexing...');

    const indexingPipeline = {
      // Change stream monitoring for real-time updates
      changeStreams: [
        {
          collection: 'documents',
          pipeline: [
            { $match: { 'operationType': { $in: ['insert', 'update'] } } }
          ],
          handler: this.processDocumentChange.bind(this)
        }
      ],

      // Batch processing for bulk operations
      batchProcessor: {
        batchSize: 100,
        maxWaitTime: 30000, // 30 seconds
        retryLogic: true,
        errorHandling: 'resilient'
      },

      // Quality assurance pipeline
      qualityChecks: [
        'contentValidation',
        'languageDetection', 
        'duplicateDetection',
        'contentModeration'
      ]
    };

    return await this.deployIndexingPipeline(indexingPipeline);
  }

  async implementAdvancedRecommendationEngine() {
    console.log('Implementing advanced recommendation engine...');

    const recommendationStrategies = {
      // Collaborative filtering with vector embeddings
      collaborative: {
        userSimilarity: 'cosine',
        itemSimilarity: 'cosine',
        hybridWeighting: {
          contentBased: 0.6,
          collaborative: 0.4
        }
      },

      // Content-based recommendations
      contentBased: {
        semanticSimilarity: true,
        categoryWeighting: true,
        temporalDecay: true,
        diversityOptimization: true
      },

      // Deep learning recommendations
      deepLearning: {
        neuralCollaborativeFiltering: true,
        sequentialRecommendations: true,
        multiTaskLearning: true
      }
    };

    return await this.deployRecommendationStrategies(recommendationStrategies);
  }

  async optimizeVectorSearchPerformance() {
    console.log('Optimizing vector search performance...');

    const optimizations = {
      // Index optimization strategies
      indexOptimization: {
        approximateNearestNeighbor: true,
        hierarchicalNavigableSmallWorld: true,
        productQuantization: true,
        localitySensitiveHashing: true
      },

      // Query optimization
      queryOptimization: {
        queryExpansion: true,
        queryRewriting: true,
        candidatePrefiltering: true,
        adaptiveSimilarityThresholds: true
      },

      // Caching strategies
      cachingStrategy: {
        embeddingCache: '10GB',
        resultCache: '5GB',
        queryCache: '2GB',
        indexCache: '20GB'
      }
    };

    return await this.implementOptimizations(optimizations);
  }
}
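The real-time indexing pipeline above registers a processDocumentChange handler but does not show it. A minimal sketch of such a handler built on MongoDB change streams might look like the following; the field list used to detect relevant updates and the error-handling strategy are assumptions, while the collection and method names follow the engine defined earlier.

// Sketch: re-embed documents as they change, using a MongoDB change stream
async function watchDocumentChanges(engine) {
  const changeStream = engine.collections.documents.watch(
    [{ $match: { operationType: { $in: ['insert', 'update', 'replace'] } } }],
    { fullDocument: 'updateLookup' } // deliver the full document for update events
  );

  changeStream.on('change', async (change) => {
    const doc = change.fullDocument;
    if (!doc || !doc.content) return; // nothing to embed

    // Skip no-op updates: only re-embed when embedded fields actually changed (assumed field list)
    const updatedFields = change.updateDescription?.updatedFields || {};
    const relevantChange =
      change.operationType !== 'update' ||
      ['title', 'content', 'summary', 'tags'].some(f => f in updatedFields);
    if (!relevantChange) return;

    try {
      await engine.generateDocumentEmbedding(doc, { startTime: Date.now() });
      console.log(`Re-embedded document ${doc._id}`);
    } catch (error) {
      // Log and continue; a production system would queue the document for retry
      console.error(`Failed to re-embed document ${doc._id}:`, error);
    }
  });

  return changeStream;
}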

SQL-Style Vector Search Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Vector Search operations and AI-powered semantic queries:

-- QueryLeaf advanced vector search and AI operations with SQL-familiar syntax

-- Create vector search indexes for different content types and embedding models
CREATE VECTOR INDEX document_semantic_index 
ON documents (
  embedding VECTOR(3072) USING COSINE_SIMILARITY,
  category,
  language,
  contentType,
  accessLevel,
  createdAt
)
WITH (
  model = 'text-embedding-3-large',
  auto_update = true,
  optimization_level = 'performance',

  -- Advanced index configuration
  approximate_nn = true,
  candidate_multiplier = 10,
  ef_construction = 200,
  m_connections = 16
);

CREATE VECTOR INDEX multimodal_content_index
ON documents (
  multimodalEmbedding VECTOR(1536) USING COSINE_SIMILARITY,
  mediaType,
  contentFormat
)
WITH (
  model = 'clip-vit-base-patch32',
  multimodal = true
);

-- Advanced semantic search with vector similarity and hybrid scoring
WITH semantic_search AS (
  SELECT 
    d.*,
    -- Vector similarity search using embeddings
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_EMBEDDING(
        'Find research papers about machine learning applications in healthcare diagnostics',
        'text-embedding-3-large'
      ),
      'COSINE'
    ) as vector_similarity,

    -- Hybrid scoring combining vector and traditional text search
    (
      VECTOR_SEARCH(
        d.embedding,
        GENERATE_EMBEDDING(
          'machine learning healthcare diagnostics medical AI',
          'text-embedding-3-large'
        ),
        'COSINE'
      ) * 0.7 +

      MATCH_SCORE(d.title || ' ' || d.content, 'machine learning healthcare diagnostics') * 0.2 +

      -- Recency boost
      CASE 
        WHEN d.createdAt > CURRENT_DATE - INTERVAL '30 days' THEN 0.1
        WHEN d.createdAt > CURRENT_DATE - INTERVAL '90 days' THEN 0.05
        ELSE 0
      END +

      -- Quality and popularity boost
      (LOG(d.metrics.citationCount + 1) * 0.02) +
      (d.metrics.averageRating / 5.0 * 0.03)

    ) as hybrid_score

  FROM documents d
  WHERE 
    -- Vector similarity threshold
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_EMBEDDING(
        'machine learning healthcare diagnostics',
        'text-embedding-3-large'
      ),
      'COSINE'
    ) > 0.75

    -- Additional filters for precision
    AND d.category IN ('research', 'academic', 'medical')
    AND d.language = 'en'
    AND d.accessLevel IN ('public', 'academic')
    AND d.contentType = 'research_paper'
),

-- Enhanced search with semantic category classification and concept extraction
enriched_results AS (
  SELECT 
    ss.*,

    -- Semantic category classification using AI
    AI_CLASSIFY_CATEGORY(
      ss.title || ' ' || SUBSTRING(ss.content, 1, 1000),
      ['machine_learning', 'healthcare', 'diagnostics', 'medical_imaging', 'clinical_ai']
    ) as semantic_categories,

    -- Key concept extraction
    AI_EXTRACT_CONCEPTS(
      ss.title || ' ' || ss.abstract,
      10 -- top 10 concepts
    ) as key_concepts,

    -- Content summary generation
    AI_SUMMARIZE(
      ss.content,
      max_length => 200,
      style => 'academic'
    ) as ai_summary,

    -- Reading difficulty assessment
    AI_ASSESS_DIFFICULTY(
      ss.content,
      domain => 'medical'
    ) as reading_difficulty,

    -- Related research identification
    FIND_SIMILAR_DOCUMENTS(
      ss.embedding,
      limit => 5,
      exclude_ids => ARRAY[ss.document_id],
      similarity_threshold => 0.8
    ) as related_research,

    -- Citation and reference analysis
    ANALYZE_CITATIONS(ss.content) as citation_analysis,

    -- Author expertise scoring
    u.expertise_score,
    u.h_index,
    u.research_domains,

    -- Impact metrics
    CALCULATE_IMPACT_SCORE(
      ss.metrics.citationCount,
      ss.metrics.downloadCount,
      ss.metrics.viewCount,
      ss.createdAt
    ) as impact_score

  FROM semantic_search ss
  JOIN users u ON ss.createdBy = u.user_id
  WHERE ss.vector_similarity > 0.7
),

-- Personalized recommendations based on user research interests
personalized_recommendations AS (
  SELECT 
    er.*,

    -- User interest alignment scoring
    VECTOR_SIMILARITY(
      er.embedding,
      (SELECT interest_embedding FROM user_profiles WHERE user_id = CURRENT_USER_ID()),
      'COSINE'
    ) as interest_alignment,

    -- Reading history similarity
    CALCULATE_READING_HISTORY_SIMILARITY(
      CURRENT_USER_ID(),
      er.document_id,
      window_days => 180
    ) as reading_history_similarity,

    -- Collaborative filtering score
    COLLABORATIVE_FILTERING_SCORE(
      CURRENT_USER_ID(),
      er.document_id,
      algorithm => 'neural_collaborative_filtering'
    ) as collaborative_score,

    -- Personalized relevance scoring
    (
      er.hybrid_score * 0.5 +
      interest_alignment * 0.3 +
      reading_history_similarity * 0.1 +
      collaborative_score * 0.1
    ) as personalized_relevance

  FROM enriched_results er
  WHERE interest_alignment > 0.6
),

-- Advanced analytics and search insights
search_analytics AS (
  SELECT 
    COUNT(*) as total_results,
    AVG(pr.vector_similarity) as avg_similarity,
    AVG(pr.hybrid_score) as avg_hybrid_score,
    AVG(pr.personalized_relevance) as avg_personalized_relevance,

    -- Category distribution analysis
    JSON_OBJECT_AGG(
      pr.category,
      COUNT(*)
    ) as category_distribution,

    -- Semantic category insights
    FLATTEN_ARRAY(
      ARRAY_AGG(pr.semantic_categories)
    ) as all_semantic_categories,

    -- Concept frequency analysis
    AI_ANALYZE_CONCEPT_TRENDS(
      ARRAY_AGG(pr.key_concepts),
      time_window => '30 days'
    ) as concept_trends,

    -- Research domain coverage
    CALCULATE_DOMAIN_COVERAGE(
      ARRAY_AGG(pr.research_domains)
    ) as domain_coverage,

    -- Quality distribution
    JSON_OBJECT(
      'high_impact', COUNT(*) FILTER (WHERE pr.impact_score > 80),
      'medium_impact', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 50 AND 80),
      'emerging', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 20 AND 50),
      'new_research', COUNT(*) FILTER (WHERE pr.impact_score < 20)
    ) as quality_distribution

  FROM personalized_recommendations pr
)

-- Final comprehensive search results with analytics and recommendations
SELECT 
  -- Document information
  pr.document_id,
  pr.title,
  pr.ai_summary,
  pr.category,
  pr.semantic_categories,
  pr.key_concepts,
  pr.reading_difficulty,
  pr.createdAt,

  -- Author information
  JSON_OBJECT(
    'name', u.name,
    'expertise_score', pr.expertise_score,
    'h_index', pr.h_index,
    'research_domains', pr.research_domains
  ) as author_info,

  -- Relevance scoring
  ROUND(pr.vector_similarity, 4) as semantic_similarity,
  ROUND(pr.hybrid_score, 4) as hybrid_relevance,
  ROUND(pr.personalized_relevance, 4) as personalized_score,
  ROUND(pr.interest_alignment, 4) as interest_match,

  -- Content characteristics
  pr.reading_difficulty,
  pr.impact_score,
  pr.citation_analysis,

  -- Related content
  pr.related_research,

  -- Access information
  CASE pr.accessLevel
    WHEN 'public' THEN 'Open Access'
    WHEN 'academic' THEN 'Academic Access Required'
    WHEN 'subscription' THEN 'Subscription Required'
    ELSE 'Restricted Access'
  END as access_type,

  -- Download and interaction URLs
  CONCAT('/api/documents/', pr.document_id, '/download') as download_url,
  CONCAT('/api/documents/', pr.document_id, '/cite') as citation_url,
  CONCAT('/api/documents/', pr.document_id, '/related') as related_url,

  -- Recommendation metadata
  JSON_OBJECT(
    'recommendation_reason', CASE 
      WHEN pr.interest_alignment > 0.9 THEN 'Highly aligned with your research interests'
      WHEN pr.collaborative_score > 0.8 THEN 'Recommended by researchers with similar interests'
      WHEN pr.reading_history_similarity > 0.7 THEN 'Similar to your recent reading patterns'
      ELSE 'Semantically relevant to your search'
    END,
    'confidence_level', CASE
      WHEN pr.personalized_relevance > 0.9 THEN 'Very High'
      WHEN pr.personalized_relevance > 0.8 THEN 'High'
      WHEN pr.personalized_relevance > 0.7 THEN 'Medium'
      ELSE 'Low'
    END
  ) as recommendation_metadata,

  -- Search analytics (same for all results)
  (SELECT ROW_TO_JSON(sa.*) FROM search_analytics sa) as search_insights

FROM personalized_recommendations pr
JOIN users u ON pr.createdBy = u.user_id
WHERE pr.personalized_relevance > 0.6
ORDER BY pr.personalized_relevance DESC
LIMIT 20;

-- Advanced vector operations for content discovery and analysis

-- Find conceptually similar documents across different languages
WITH multilingual_search AS (
  SELECT 
    d.document_id,
    d.title,
    d.language,
    d.category,

    -- Cross-language semantic similarity
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_MULTILINGUAL_EMBEDDING(
        'intelligence artificielle apprentissage automatique', -- French query
        source_language => 'fr',
        target_embedding_language => 'en'
      ),
      'COSINE'
    ) as cross_language_similarity

  FROM documents d
  WHERE d.language IN ('en', 'fr', 'de', 'es', 'zh')
    AND VECTOR_SEARCH(
      d.embedding,
      GENERATE_MULTILINGUAL_EMBEDDING(
        'intelligence artificielle apprentissage automatique',
        source_language => 'fr',
        target_embedding_language => 'en'
      ),
      'COSINE'
    ) > 0.8
)
SELECT * FROM multilingual_search
ORDER BY cross_language_similarity DESC;

-- Content recommendation based on user behavior patterns
CREATE VIEW personalized_content_feed AS
WITH user_interaction_embedding AS (
  SELECT 
    ui.user_id,

    -- Generate user interest embedding from interaction history
    AGGREGATE_EMBEDDINGS(
      ARRAY_AGG(d.embedding),
      weights => ARRAY_AGG(
        CASE ui.interaction_type
          WHEN 'download' THEN 1.0
          WHEN 'like' THEN 0.8
          WHEN 'share' THEN 0.9
          WHEN 'view' THEN 0.3
          ELSE 0.1
        END * 
        -- Temporal decay
        GREATEST(0.1, 1.0 - EXTRACT(DAYS FROM CURRENT_DATE - ui.interaction_timestamp) / 365.0)
      ),
      aggregation_method => 'weighted_average'
    ) as interest_embedding

  FROM user_interactions ui
  JOIN documents d ON ui.document_id = d.document_id
  WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '1 year'
  GROUP BY ui.user_id
),
content_recommendations AS (
  SELECT 
    uie.user_id,
    d.document_id,
    d.title,
    d.category,
    d.createdAt,

    -- Interest-based similarity
    VECTOR_SIMILARITY(
      d.embedding,
      uie.interest_embedding,
      'COSINE'
    ) as interest_similarity,

    -- Trending factor
    CALCULATE_TRENDING_SCORE(
      d.document_id,
      time_window => '7 days'
    ) as trending_score,

    -- Novelty factor (encourages discovery)
    CALCULATE_NOVELTY_SCORE(
      uie.user_id,
      d.document_id,
      d.category
    ) as novelty_score,

    -- Combined recommendation score
    (
      VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') * 0.6 +
      CALCULATE_TRENDING_SCORE(d.document_id, time_window => '7 days') * 0.2 +
      CALCULATE_NOVELTY_SCORE(uie.user_id, d.document_id, d.category) * 0.2
    ) as recommendation_score

  FROM user_interaction_embedding uie
  CROSS JOIN documents d
  WHERE NOT EXISTS (
    -- Exclude already interacted content
    SELECT 1 FROM user_interactions ui2 
    WHERE ui2.user_id = uie.user_id 
    AND ui2.document_id = d.document_id
  )
  AND VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') > 0.7
)
SELECT 
  user_id,
  document_id,
  title,
  category,
  ROUND(interest_similarity, 4) as interest_match,
  ROUND(trending_score, 4) as trending_score,
  ROUND(novelty_score, 4) as discovery_potential,
  ROUND(recommendation_score, 4) as overall_score,

  -- Recommendation explanation
  CASE 
    WHEN interest_similarity > 0.9 THEN 'Perfect match for your interests'
    WHEN trending_score > 0.8 THEN 'Trending content in your area'
    WHEN novelty_score > 0.7 THEN 'New topic for you to explore'
    ELSE 'Related to your reading patterns'
  END as recommendation_reason

FROM content_recommendations
WHERE recommendation_score > 0.75
ORDER BY user_id, recommendation_score DESC;

-- Advanced analytics for content optimization and performance monitoring
WITH vector_search_analytics AS (
  SELECT 
    -- Search performance metrics
    sl.query,
    COUNT(*) as search_frequency,
    AVG(sl.resultsMetadata.totalResults) as avg_results_count,
    AVG(sl.resultsMetadata.averageSimilarity) as avg_similarity_score,
    AVG(sl.resultsMetadata.searchTime) as avg_search_time_ms,

    -- Query characteristics
    AI_ANALYZE_QUERY_INTENT(sl.query) as query_intent,
    AI_EXTRACT_ENTITIES(sl.query) as query_entities,
    LENGTH(sl.query) as query_length,

    -- Result quality metrics
    AVG(
      (SELECT COUNT(*) FROM JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.topCategories))
    ) as category_diversity,

    -- User engagement with results
    COALESCE(
      (
        SELECT AVG(ui.rating) 
        FROM user_interactions ui
        WHERE ui.session_id = sl.sessionId
        AND ui.interaction_timestamp >= sl.timestamp
        AND ui.interaction_timestamp <= sl.timestamp + INTERVAL '1 hour'
      ), 0
    ) as result_satisfaction_score

  FROM search_logs sl
  WHERE sl.timestamp >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY sl.query, AI_ANALYZE_QUERY_INTENT(sl.query), AI_EXTRACT_ENTITIES(sl.query), LENGTH(sl.query)
),
content_performance_analysis AS (
  SELECT 
    d.document_id,
    d.title,
    d.category,
    d.createdAt,

    -- Discoverability metrics
    COUNT(sl.query) as times_found_in_search,
    AVG(sl.resultsMetadata.averageSimilarity) as avg_search_relevance,

    -- Engagement metrics
    COUNT(ui.interaction_id) as total_interactions,
    COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view') as view_count,
    COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download') as download_count,
    AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,

    -- Content optimization recommendations
    CASE 
      WHEN COUNT(sl.query) < 5 THEN 'Low discoverability - consider SEO optimization'
      WHEN AVG(sl.resultsMetadata.averageSimilarity) < 0.7 THEN 'Low relevance - review content structure'
      WHEN COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download')::DECIMAL /
           NULLIF(COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view'), 0) < 0.1
        THEN 'Low conversion - improve content value proposition'
      ELSE 'Performance within normal parameters'
    END as optimization_recommendation

  FROM documents d
  LEFT JOIN search_logs sl ON d.document_id = ANY(
    SELECT JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.resultIds)::UUID
  )
  LEFT JOIN user_interactions ui ON d.document_id = ui.document_id
  WHERE d.createdAt >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY d.document_id, d.title, d.category, d.createdAt
)
SELECT 
  -- Search analytics summary
  vsa.query,
  vsa.search_frequency,
  vsa.query_intent,
  vsa.query_entities,
  ROUND(vsa.avg_similarity_score, 3) as avg_relevance,
  ROUND(vsa.avg_search_time_ms, 1) as avg_response_time_ms,
  ROUND(vsa.result_satisfaction_score, 2) as user_satisfaction,

  -- Content performance insights
  cpa.title as top_performing_content,
  cpa.times_found_in_search,
  cpa.total_interactions,
  cpa.optimization_recommendation,

  -- Improvement recommendations
  CASE 
    WHEN vsa.avg_search_time_ms > 1000 THEN 'Consider index optimization'
    WHEN vsa.avg_similarity_score < 0.7 THEN 'Review embedding model performance'
    WHEN vsa.result_satisfaction_score < 3.0 THEN 'Improve result quality and relevance'
    ELSE 'Search performance is optimal'
  END as search_optimization_recommendation

FROM vector_search_analytics vsa
-- Pair each query with the single most-engaged document from the analysis window
LEFT JOIN LATERAL (
  SELECT * FROM content_performance_analysis
  ORDER BY total_interactions DESC
  LIMIT 1
) cpa ON true
WHERE vsa.search_frequency > 10  -- Focus on frequently searched queries
ORDER BY vsa.search_frequency DESC, vsa.result_satisfaction_score DESC
LIMIT 50;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. Native vector similarity search with advanced embedding models
-- 2. Hybrid scoring combining semantic and traditional text search
-- 3. Personalized recommendations based on user interest embeddings  
-- 4. Multi-language semantic search with cross-language understanding
-- 5. Real-time content recommendations and discovery systems
-- 6. Advanced analytics for search optimization and content performance
-- 7. AI-powered content classification and concept extraction
-- 8. Production-ready vector indexing with performance optimization
-- 9. Comprehensive search logging and user behavior analysis
-- 10. SQL-familiar syntax for complex vector operations and AI workflows
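
For readers who want to see what this kind of SQL ultimately runs against, the following is a minimal pymongo sketch of a MongoDB Atlas $vectorSearch aggregation with a metadata pre-filter. The database and collection names, the "document_vector_index" index name, and the embed() helper are illustrative assumptions rather than QueryLeaf output.

# Minimal sketch of an Atlas Vector Search aggregation in pymongo.
# Assumes an Atlas cluster with a vector index named "document_vector_index"
# on documents.embedding, and an embed() helper backed by your embedding model.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")
documents = client["knowledge_base"]["documents"]

def embed(text):
    # Placeholder for your embedding provider (OpenAI, Voyage AI, a local model, ...)
    raise NotImplementedError

query_vector = embed("machine learning model deployment strategies")

pipeline = [
    {
        "$vectorSearch": {
            "index": "document_vector_index",  # Atlas Vector Search index name
            "path": "embedding",               # field storing the document vectors
            "queryVector": query_vector,
            "numCandidates": 200,              # ANN candidate pool (recall vs. latency)
            "limit": 20,
            "filter": {"category": {"$in": ["ai", "data-engineering"]}}
        }
    },
    {
        "$project": {
            "title": 1,
            "category": 1,
            "similarity": {"$meta": "vectorSearchScore"}  # surface the relevance score
        }
    }
]

for doc in documents.aggregate(pipeline):
    print(doc["title"], round(doc["similarity"], 3))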

Best Practices for Production Vector Search Implementation

Embedding Strategy and Model Selection

Essential principles for effective MongoDB Vector Search deployment:

  1. Model Selection: Choose appropriate embedding models based on content type, domain, and language requirements
  2. Embedding Quality: Implement comprehensive content preparation and preprocessing for optimal embedding generation
  3. Index Optimization: Configure vector indexes with appropriate similarity metrics and performance parameters (see the index-definition sketch after this list)
  4. Hybrid Approach: Combine vector similarity with traditional text search for comprehensive relevance scoring
  5. Personalization: Implement user profile embeddings for personalized search and recommendation experiences
  6. Performance Monitoring: Track search performance, result quality, and user satisfaction metrics continuously
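
As a concrete starting point for the index configuration in item 3, the sketch below defines an Atlas Vector Search index with cosine similarity and filter fields through pymongo's SearchIndexModel. It assumes pymongo 4.7 or newer and an embedding model that produces 1536-dimensional vectors; adjust numDimensions to match whatever model you select in item 1.

# Sketch: creating an Atlas Vector Search index from application code.
# Assumes pymongo >= 4.7 (SearchIndexModel supports type="vectorSearch") and that
# documents.embedding holds 1536-dimensional vectors from the chosen model.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

collection = MongoClient("mongodb+srv://<cluster-uri>")["knowledge_base"]["documents"]

index_model = SearchIndexModel(
    name="document_vector_index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1536,   # must match the embedding model's output size
                "similarity": "cosine"   # alternatives: "euclidean", "dotProduct"
            },
            {"type": "filter", "path": "category"},  # enables pre-filtering in $vectorSearch
            {"type": "filter", "path": "language"}
        ]
    }
)

collection.create_search_index(model=index_model)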

Scalability and Performance Optimization

Optimize vector search deployments for production-scale requirements:

  1. Index Strategy: Design efficient vector indexes with appropriate dimensionality and similarity algorithms
  2. Caching Implementation: Implement multi-tier caching for embeddings, queries, and search results
  3. Batch Processing: Optimize embedding generation and indexing through intelligent batch processing (a batching and caching sketch follows this list)
  4. Query Optimization: Implement query expansion, rewriting, and adaptive similarity thresholds
  5. Resource Management: Monitor and optimize computational resources for embedding generation and vector operations
  6. Distribution Strategy: Design sharding and replication strategies for large-scale vector collections
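
To make the caching and batching items concrete, here is a minimal sketch of a batched embedding backfill using pymongo bulk writes, plus a small in-process cache for repeated query embeddings. The embed_batch() helper, batch size, and field names are illustrative assumptions.

# Sketch: batched embedding backfill and a simple query-embedding cache.
# embed_batch() is a placeholder for your embedding provider's batch endpoint.
from functools import lru_cache
from pymongo import MongoClient, UpdateOne

documents = MongoClient("mongodb+srv://<cluster-uri>")["knowledge_base"]["documents"]

BATCH_SIZE = 128  # tune against provider rate limits and average document size

def embed_batch(texts):
    # Placeholder: call the embedding model once per batch, not once per document
    raise NotImplementedError

def flush(batch):
    vectors = embed_batch([f"{d.get('title', '')}\n{d.get('content', '')}" for d in batch])
    documents.bulk_write(
        [UpdateOne({"_id": d["_id"]}, {"$set": {"embedding": v}}) for d, v in zip(batch, vectors)],
        ordered=False,  # independent updates; keep going past individual failures
    )

def backfill_embeddings():
    # Only touch documents that do not yet have an embedding
    cursor = documents.find(
        {"embedding": {"$exists": False}},
        {"title": 1, "content": 1},
    ).batch_size(BATCH_SIZE)

    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

@lru_cache(maxsize=10_000)
def cached_query_embedding(query_text):
    # Repeated queries (autocomplete, pagination, dashboards) reuse the same vector
    return tuple(embed_batch([query_text])[0])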

Conclusion

MongoDB Vector Search provides comprehensive AI-powered semantic search capabilities that enable natural language queries, intelligent content discovery, and sophisticated recommendation systems through high-dimensional vector embeddings and advanced similarity algorithms. The native MongoDB integration ensures that vector search benefits from the same scalability, performance, and operational features as traditional database operations.

Key MongoDB Vector Search benefits include:

  • Semantic Understanding: AI-powered semantic search that understands meaning and context beyond keyword matching
  • Advanced Similarity: Sophisticated vector similarity algorithms with cosine similarity and approximate nearest neighbor search
  • Hybrid Capabilities: Seamless integration of vector similarity with traditional text search and metadata filtering
  • Personalization: User profile embeddings for personalized search results and intelligent recommendations
  • Multi-Modal Support: Vector search across text, images, audio, and multi-modal content with unified similarity operations
  • Production Ready: High-performance vector indexing with automatic optimization and comprehensive analytics

Whether you're building AI-powered search applications, recommendation engines, content discovery platforms, or intelligent document retrieval systems, MongoDB Vector Search with QueryLeaf's familiar SQL interface provides the foundation for sophisticated semantic capabilities.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Vector Search operations while providing SQL-familiar syntax for vector similarity queries, embedding generation, and AI-powered content discovery. Advanced vector search patterns, personalization algorithms, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI capabilities accessible to SQL-oriented development teams.

The combination of MongoDB's robust vector search capabilities with SQL-style AI operations makes it an ideal platform for modern AI applications that require both advanced semantic understanding and familiar database management patterns. The result is AI-powered search that scales efficiently while remaining maintainable and feature-rich.