2025

November 28, 2025
22 min read

MongoDB Database Administration and Monitoring: Enterprise Operations Management and Performance Optimization

Enterprise MongoDB deployments require comprehensive database administration and monitoring strategies to ensure optimal performance, reliability, and operational excellence. Traditional relational database administration approaches often fall short when managing MongoDB's distributed architecture, flexible schema design, and unique operational characteristics that require specialized administrative expertise and tooling.

MongoDB database administration encompasses performance monitoring, capacity planning, security management, backup operations, and operational maintenance through sophisticated tooling and administrative frameworks. Unlike traditional SQL databases that rely on rigid administrative procedures, MongoDB administration requires understanding of document-based operations, replica set management, sharding strategies, and distributed system maintenance patterns.

The Traditional Database Administration Challenge

Conventional relational database administration approaches face significant limitations when applied to MongoDB environments:

-- Traditional PostgreSQL database administration - rigid procedures with limited MongoDB applicability

-- Database performance monitoring with limited visibility
CREATE TABLE db_performance_metrics (
    metric_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    server_name VARCHAR(100) NOT NULL,
    metric_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Connection metrics
    active_connections INTEGER,
    max_connections INTEGER,
    connection_utilization DECIMAL(5,2),
    idle_connections INTEGER,

    -- Query performance metrics
    slow_query_count INTEGER,
    average_query_time DECIMAL(10,4),
    queries_per_second DECIMAL(10,2),
    cache_hit_ratio DECIMAL(5,2),

    -- System resource utilization
    cpu_usage DECIMAL(5,2),
    memory_usage DECIMAL(5,2),
    disk_usage DECIMAL(5,2),
    io_wait DECIMAL(5,2),

    -- Database-specific metrics
    database_size BIGINT,
    index_size BIGINT,
    table_count INTEGER,
    index_count INTEGER,

    -- Lock contention
    lock_waits INTEGER,
    deadlock_count INTEGER,
    blocked_queries INTEGER,

    -- Transaction metrics
    transactions_per_second DECIMAL(10,2),
    transaction_rollback_rate DECIMAL(5,2)
);

-- Complex monitoring infrastructure with limited MongoDB compatibility
CREATE TABLE slow_query_log (
    log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    server_name VARCHAR(100) NOT NULL,
    database_name VARCHAR(100) NOT NULL,
    query_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Query identification
    query_hash VARCHAR(64) NOT NULL,
    query_text TEXT NOT NULL,
    normalized_query TEXT,

    -- Performance metrics
    execution_time DECIMAL(10,4) NOT NULL,
    rows_examined BIGINT,
    rows_returned BIGINT,
    index_usage TEXT,

    -- Resource consumption
    cpu_time DECIMAL(10,4),
    io_reads INTEGER,
    io_writes INTEGER,
    memory_peak BIGINT,

    -- User context
    user_name VARCHAR(100),
    application_name VARCHAR(200),
    connection_id BIGINT,
    session_id VARCHAR(100),

    -- Query categorization
    query_type VARCHAR(20), -- SELECT, INSERT, UPDATE, DELETE
    complexity_score INTEGER,
    optimization_suggestions TEXT
);

-- Maintenance scheduling with manual coordination
CREATE TABLE maintenance_schedules (
    schedule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    server_name VARCHAR(100) NOT NULL,
    maintenance_type VARCHAR(50) NOT NULL,
    scheduled_start TIMESTAMP NOT NULL,
    estimated_duration INTERVAL NOT NULL,

    -- Maintenance details
    maintenance_description TEXT,
    affected_databases TEXT[],
    expected_downtime INTERVAL,
    backup_required BOOLEAN DEFAULT TRUE,

    -- Approval workflow
    requested_by VARCHAR(100) NOT NULL,
    approved_by VARCHAR(100),
    approval_timestamp TIMESTAMP,
    approval_status VARCHAR(20) DEFAULT 'pending',

    -- Execution tracking
    actual_start TIMESTAMP,
    actual_end TIMESTAMP,
    execution_status VARCHAR(20) DEFAULT 'scheduled',
    execution_notes TEXT,

    -- Impact assessment
    business_impact VARCHAR(20), -- low, medium, high, critical
    user_notification_required BOOLEAN DEFAULT TRUE,
    rollback_plan TEXT
);

-- Limited backup and recovery management
CREATE TABLE backup_operations (
    backup_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    server_name VARCHAR(100) NOT NULL,
    database_name VARCHAR(100),
    backup_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Backup configuration
    backup_type VARCHAR(20) NOT NULL, -- full, incremental, differential
    backup_method VARCHAR(20) NOT NULL, -- dump, filesystem, streaming
    compression_enabled BOOLEAN DEFAULT TRUE,
    encryption_enabled BOOLEAN DEFAULT FALSE,

    -- Backup metrics
    backup_size BIGINT,
    compressed_size BIGINT,
    backup_duration INTERVAL,

    -- Storage information
    backup_location VARCHAR(500) NOT NULL,
    storage_type VARCHAR(50), -- local, s3, nfs, tape
    retention_period INTERVAL,
    deletion_scheduled TIMESTAMP,

    -- Validation and integrity
    integrity_check_performed BOOLEAN DEFAULT FALSE,
    integrity_check_passed BOOLEAN,
    checksum VARCHAR(128),

    -- Recovery testing
    recovery_test_performed BOOLEAN DEFAULT FALSE,
    recovery_test_passed BOOLEAN,
    recovery_test_notes TEXT
);

-- Complex user access and security management
CREATE TABLE user_access_audit (
    audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    server_name VARCHAR(100) NOT NULL,
    audit_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- User information
    username VARCHAR(100) NOT NULL,
    user_type VARCHAR(20), -- admin, application, readonly
    authentication_method VARCHAR(50),
    source_ip INET,

    -- Access details
    operation_type VARCHAR(20) NOT NULL, -- login, logout, query, admin
    database_accessed VARCHAR(100),
    table_accessed VARCHAR(100),
    operation_details TEXT,

    -- Security context
    session_id VARCHAR(100),
    connection_duration INTERVAL,
    privilege_level VARCHAR(20),
    elevated_privileges BOOLEAN DEFAULT FALSE,

    -- Result tracking
    operation_success BOOLEAN NOT NULL,
    error_message TEXT,
    security_violation BOOLEAN DEFAULT FALSE,

    -- Risk assessment
    risk_level VARCHAR(20) DEFAULT 'low',
    anomaly_detected BOOLEAN DEFAULT FALSE,
    follow_up_required BOOLEAN DEFAULT FALSE
);

-- Manual index management with limited optimization capabilities
WITH index_analysis AS (
    SELECT 
        schemaname,
        tablename,
        indexname,
        idx_size,
        idx_tup_read,
        idx_tup_fetch,

        -- Index usage calculation
        CASE 
            WHEN idx_tup_read > 0 THEN 
                ROUND((idx_tup_fetch * 100.0 / idx_tup_read), 2)
            ELSE 0 
        END as index_efficiency,

        -- Index maintenance needs
        CASE 
            WHEN idx_tup_read < 1000 AND pg_size_pretty(idx_size::bigint) > '10 MB' THEN 'Consider removal'
            WHEN idx_tup_fetch < idx_tup_read * 0.1 THEN 'Poor selectivity'
            ELSE 'Optimal'
        END as optimization_recommendation

    FROM pg_stat_user_indexes 
    JOIN pg_indexes USING (schemaname, tablename, indexname)
),

maintenance_recommendations AS (
    SELECT 
        ia.schemaname,
        ia.tablename,
        COUNT(*) as total_indexes,
        COUNT(*) FILTER (WHERE ia.optimization_recommendation != 'Optimal') as problematic_indexes,
        array_agg(ia.indexname) FILTER (WHERE ia.optimization_recommendation = 'Consider removal') as removable_indexes,
        SUM(ia.idx_size) as total_index_size,
        AVG(ia.index_efficiency) as avg_efficiency

    FROM index_analysis ia
    GROUP BY ia.schemaname, ia.tablename
)

-- Generate maintenance recommendations
SELECT 
    mr.schemaname,
    mr.tablename,
    mr.total_indexes,
    mr.problematic_indexes,
    mr.removable_indexes,
    pg_size_pretty(mr.total_index_size::bigint) as total_size,
    ROUND(mr.avg_efficiency, 2) as avg_efficiency_percent,

    -- Maintenance priority
    CASE 
        WHEN mr.problematic_indexes > mr.total_indexes * 0.5 THEN 'High Priority'
        WHEN mr.problematic_indexes > 0 THEN 'Medium Priority'
        ELSE 'Low Priority'
    END as maintenance_priority,

    -- Specific recommendations
    CASE 
        WHEN array_length(mr.removable_indexes, 1) > 0 THEN 
            'Remove unused indexes: ' || array_to_string(mr.removable_indexes, ', ')
        WHEN mr.avg_efficiency < 50 THEN
            'Review index selectivity and query patterns'
        ELSE 
            'Continue current index strategy'
    END as detailed_recommendations

FROM maintenance_recommendations mr
WHERE mr.problematic_indexes > 0
ORDER BY mr.problematic_indexes DESC, mr.total_index_size DESC;

-- Problems with traditional database administration:
-- 1. Limited visibility into MongoDB-specific operations and performance characteristics
-- 2. Manual monitoring processes that don't scale with distributed MongoDB deployments  
-- 3. Rigid maintenance procedures that don't account for replica sets and sharding
-- 4. Backup strategies that don't leverage MongoDB's native backup capabilities
-- 5. Security management that doesn't integrate with MongoDB's role-based access control
-- 6. Performance tuning approaches that ignore MongoDB's unique optimization patterns
-- 7. Index management that doesn't understand MongoDB's compound index strategies
-- 8. Monitoring tools that lack MongoDB-specific metrics and operational insights
-- 9. Maintenance scheduling that doesn't coordinate across MongoDB cluster topology
-- 10. Recovery procedures that don't leverage MongoDB's built-in replication and failover

MongoDB provides comprehensive database administration capabilities with integrated monitoring and management tools:

// MongoDB Advanced Database Administration - Enterprise monitoring and operations management
const { MongoClient, ObjectId, GridFSBucket } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_operations');

// Comprehensive MongoDB Database Administration Manager
class MongoDBAdministrationManager {
  constructor(client, config = {}) {
    this.client = client;
    this.db = client.db(config.database || 'enterprise_operations');

    this.collections = {
      performanceMetrics: this.db.collection('performance_metrics'),
      slowOperations: this.db.collection('slow_operations'),
      indexAnalysis: this.db.collection('index_analysis'),
      maintenanceSchedule: this.db.collection('maintenance_schedule'),
      backupOperations: this.db.collection('backup_operations'),
      userActivity: this.db.collection('user_activity'),
      systemAlerts: this.db.collection('system_alerts'),
      capacityPlanning: this.db.collection('capacity_planning')
    };

    // Administration configuration
    this.config = {
      // Monitoring configuration
      metricsCollectionInterval: config.metricsCollectionInterval || 60000, // 1 minute
      slowOperationThreshold: config.slowOperationThreshold || 100, // 100ms
      enableDetailedProfiling: config.enableDetailedProfiling !== false,
      enableRealtimeAlerts: config.enableRealtimeAlerts !== false,

      // Performance thresholds
      cpuThreshold: config.cpuThreshold || 80,
      memoryThreshold: config.memoryThreshold || 85,
      connectionThreshold: config.connectionThreshold || 80,
      diskThreshold: config.diskThreshold || 90,

      // Maintenance configuration
      enableAutomaticMaintenance: config.enableAutomaticMaintenance || false,
      maintenanceWindow: config.maintenanceWindow || { start: '02:00', end: '04:00' },
      enableMaintenanceNotifications: config.enableMaintenanceNotifications !== false,

      // Backup configuration
      enableAutomaticBackups: config.enableAutomaticBackups !== false,
      backupRetentionDays: config.backupRetentionDays || 30,
      backupSchedule: config.backupSchedule || '0 2 * * *', // 2 AM daily

      // Security and compliance
      enableSecurityAuditing: config.enableSecurityAuditing !== false,
      enableComplianceTracking: config.enableComplianceTracking || false,
      enableAccessLogging: config.enableAccessLogging !== false
    };

    // Performance monitoring state
    this.performanceMonitors = new Map();
    this.alertSubscriptions = new Map();
    this.maintenanceQueue = [];

    this.initializeAdministrationSystem();
  }

  async initializeAdministrationSystem() {
    console.log('Initializing MongoDB administration and monitoring system...');

    try {
      // Setup monitoring infrastructure
      await this.setupMonitoringInfrastructure();

      // Initialize performance tracking
      await this.initializePerformanceMonitoring();

      // Setup automated maintenance
      if (this.config.enableAutomaticMaintenance) {
        await this.initializeMaintenanceAutomation();
      }

      // Initialize backup management
      if (this.config.enableAutomaticBackups) {
        await this.initializeBackupManagement();
      }

      // Setup security auditing
      if (this.config.enableSecurityAuditing) {
        await this.initializeSecurityAuditing();
      }

      // Start monitoring processes
      await this.startMonitoringProcesses();

      console.log('MongoDB administration system initialized successfully');

    } catch (error) {
      console.error('Error initializing administration system:', error);
      throw error;
    }
  }

  async setupMonitoringInfrastructure() {
    console.log('Setting up comprehensive monitoring infrastructure...');

    try {
      // Create optimized indexes for monitoring collections
      await this.collections.performanceMetrics.createIndexes([
        { key: { timestamp: -1 }, background: true },
        { key: { serverName: 1, timestamp: -1 }, background: true },
        { key: { 'metrics.alertLevel': 1, timestamp: -1 }, background: true },
        { key: { 'metrics.cpuUsage': 1 }, background: true, sparse: true },
        { key: { 'metrics.memoryUsage': 1 }, background: true, sparse: true }
      ]);

      // Slow operations monitoring indexes
      await this.collections.slowOperations.createIndexes([
        { key: { timestamp: -1 }, background: true },
        { key: { operation: 1, timestamp: -1 }, background: true },
        { key: { duration: -1 }, background: true },
        { key: { 'context.collection': 1, timestamp: -1 }, background: true },
        { key: { 'optimization.needsAttention': 1 }, background: true, sparse: true }
      ]);

      // Index analysis tracking
      await this.collections.indexAnalysis.createIndexes([
        { key: { collection: 1, timestamp: -1 }, background: true },
        { key: { 'analysis.efficiency': 1 }, background: true },
        { key: { 'recommendations.priority': 1, timestamp: -1 }, background: true }
      ]);

      // System alerts indexing
      await this.collections.systemAlerts.createIndexes([
        { key: { timestamp: -1 }, background: true },
        { key: { severity: 1, timestamp: -1 }, background: true },
        { key: { resolved: 1, timestamp: -1 }, background: true },
        { key: { category: 1, timestamp: -1 }, background: true }
      ]);

      console.log('Monitoring infrastructure setup completed');

    } catch (error) {
      console.error('Error setting up monitoring infrastructure:', error);
      throw error;
    }
  }

  async collectComprehensiveMetrics() {
    console.log('Collecting comprehensive performance metrics...');
    const startTime = Date.now();

    try {
      // Get server status information
      const serverStatus = await this.db.admin().serverStatus();
      const dbStats = await this.db.stats();
      const currentOperations = await this.db.admin().currentOp();

      // Collect performance metrics
      const performanceMetrics = {
        _id: new ObjectId(),
        timestamp: new Date(),
        serverName: serverStatus.host,

        // Connection metrics
        connections: {
          current: serverStatus.connections.current,
          available: serverStatus.connections.available,
          totalCreated: serverStatus.connections.totalCreated,
          utilization: Math.round((serverStatus.connections.current / 
                                  (serverStatus.connections.current + serverStatus.connections.available)) * 100)
        },

        // Operation metrics
        operations: {
          insert: serverStatus.opcounters.insert,
          query: serverStatus.opcounters.query,
          update: serverStatus.opcounters.update,
          delete: serverStatus.opcounters.delete,
          getmore: serverStatus.opcounters.getmore,
          command: serverStatus.opcounters.command,

          // Operations per second calculations
          insertRate: this.calculateOperationRate('insert', serverStatus.opcounters.insert),
          queryRate: this.calculateOperationRate('query', serverStatus.opcounters.query),
          updateRate: this.calculateOperationRate('update', serverStatus.opcounters.update),
          deleteRate: this.calculateOperationRate('delete', serverStatus.opcounters.delete)
        },

        // Memory metrics
        memory: {
          resident: serverStatus.mem.resident,
          virtual: serverStatus.mem.virtual,
          mapped: serverStatus.mem.mapped || 0,
          mappedWithJournal: serverStatus.mem.mappedWithJournal || 0,

          // WiredTiger cache metrics (if available)
          cacheSizeGB: serverStatus.wiredTiger?.cache?.['maximum bytes configured'] ? 
                      Math.round(serverStatus.wiredTiger.cache['maximum bytes configured'] / 1024 / 1024 / 1024) : 0,
          cacheUsedGB: serverStatus.wiredTiger?.cache?.['bytes currently in the cache'] ? 
                      Math.round(serverStatus.wiredTiger.cache['bytes currently in the cache'] / 1024 / 1024 / 1024) : 0,
          cacheUtilization: this.calculateCacheUtilization(serverStatus)
        },

        // Database metrics
        database: {
          collections: dbStats.collections,
          objects: dbStats.objects,
          dataSize: dbStats.dataSize,
          storageSize: dbStats.storageSize,
          indexes: dbStats.indexes,
          indexSize: dbStats.indexSize,

          // Growth metrics
          avgObjSize: dbStats.avgObjSize,
          scaleFactor: dbStats.scaleFactor || 1
        },

        // Network metrics
        network: {
          bytesIn: serverStatus.network.bytesIn,
          bytesOut: serverStatus.network.bytesOut,
          numRequests: serverStatus.network.numRequests,

          // Network rates
          bytesInRate: this.calculateNetworkRate('bytesIn', serverStatus.network.bytesIn),
          bytesOutRate: this.calculateNetworkRate('bytesOut', serverStatus.network.bytesOut),
          requestRate: this.calculateNetworkRate('numRequests', serverStatus.network.numRequests)
        },

        // Current operations analysis
        activeOperations: {
          total: currentOperations.inprog.length,
          reads: currentOperations.inprog.filter(op => op.op === 'query' || op.op === 'getmore').length,
          writes: currentOperations.inprog.filter(op => op.op === 'insert' || op.op === 'update' || op.op === 'remove').length,
          commands: currentOperations.inprog.filter(op => op.op === 'command').length,

          // Long running operations
          longRunning: currentOperations.inprog.filter(op => op.microsecs_running > 1000000).length, // > 1 second
          blocked: currentOperations.inprog.filter(op => op.waitingForLock).length
        },

        // Lock metrics
        locks: this.analyzeLockMetrics(serverStatus),

        // Replication metrics (if applicable)
        replication: await this.collectReplicationMetrics(),

        // Sharding metrics (if applicable)
        sharding: await this.collectShardingMetrics(),

        // Alert evaluation
        alerts: await this.evaluatePerformanceAlerts(serverStatus, dbStats),

        // Collection metadata
        collectionTime: Date.now() - startTime,
        metricsVersion: '2.0'
      };

      // Store metrics
      await this.collections.performanceMetrics.insertOne(performanceMetrics);

      // Process alerts if any
      if (performanceMetrics.alerts.length > 0) {
        await this.processPerformanceAlerts(performanceMetrics.alerts, performanceMetrics);
      }

      console.log(`Performance metrics collected successfully: ${performanceMetrics.alerts.length} alerts generated`);

      return performanceMetrics;

    } catch (error) {
      console.error('Error collecting performance metrics:', error);
      throw error;
    }
  }

  async analyzeSlowOperations() {
    console.log('Analyzing slow operations and query performance...');

    try {
      // Get current profiling level
      const profilingLevel = await this.db.runCommand({ profile: -1 });

      // Enable profiling if not already enabled
      if (profilingLevel.was < 1) {
        await this.db.runCommand({ 
          profile: 2, // Profile all operations
          slowms: this.config.slowOperationThreshold 
        });
      }

      // Retrieve slow operations from profiler collection
      const profileData = await this.db.collection('system.profile')
        .find({
          ts: { $gte: new Date(Date.now() - 300000) }, // Last 5 minutes
          millis: { $gte: this.config.slowOperationThreshold }
        })
        .sort({ ts: -1 })
        .limit(1000)
        .toArray();

      // Analyze operations
      for (const operation of profileData) {
        const analysis = await this.analyzeOperation(operation);

        const slowOpDocument = {
          _id: new ObjectId(),
          timestamp: operation.ts,
          operation: operation.op,
          namespace: operation.ns,
          duration: operation.millis,

          // Command details
          command: operation.command,
          planSummary: operation.planSummary,
          executionStats: operation.execStats,

          // Performance analysis
          analysis: analysis,

          // Optimization recommendations
          recommendations: await this.generateOptimizationRecommendations(operation, analysis),

          // Context information
          context: {
            client: operation.client,
            user: operation.user,
            collection: this.extractCollectionName(operation.ns),
            database: this.extractDatabaseName(operation.ns)
          },

          // Impact assessment
          impact: this.assessOperationImpact(operation, analysis),

          // Tracking metadata
          analyzed: true,
          reviewRequired: analysis.complexity === 'high' || operation.millis > 5000,
          processed: new Date()
        };

        await this.collections.slowOperations.insertOne(slowOpDocument);
      }

      console.log(`Analyzed ${profileData.length} slow operations`);

      return profileData.length;

    } catch (error) {
      console.error('Error analyzing slow operations:', error);
      throw error;
    }
  }

  async performIndexAnalysis() {
    console.log('Performing comprehensive index analysis...');

    try {
      const databases = await this.client.db().admin().listDatabases();

      for (const dbInfo of databases.databases) {
        if (['admin', 'local', 'config'].includes(dbInfo.name)) continue;

        const database = this.client.db(dbInfo.name);
        const collections = await database.listCollections().toArray();

        for (const collInfo of collections) {
          const collection = database.collection(collInfo.name);

          // Get index information
          const indexes = await collection.listIndexes().toArray();
          const indexStats = await collection.aggregate([
            { $indexStats: {} }
          ]).toArray();

          // Get collection statistics
          const collStats = await collection.stats();

          // Analyze each index
          for (const index of indexes) {
            const indexStat = indexStats.find(stat => stat.name === index.name);
            const analysis = await this.analyzeIndex(index, indexStat, collStats, collection);

            const indexAnalysisDocument = {
              _id: new ObjectId(),
              timestamp: new Date(),
              database: dbInfo.name,
              collection: collInfo.name,
              indexName: index.name,

              // Index configuration
              indexSpec: index.key,
              indexOptions: {
                unique: index.unique || false,
                sparse: index.sparse || false,
                background: index.background || false,
                partialFilterExpression: index.partialFilterExpression
              },

              // Index statistics
              statistics: {
                accessCount: indexStat?.accesses?.ops || 0,
                lastAccessed: indexStat?.accesses?.since || null,
                indexSize: index.indexSizes ? index.indexSizes[index.name] : 0
              },

              // Analysis results
              analysis: analysis,

              // Performance recommendations
              recommendations: await this.generateIndexRecommendations(analysis, index, collStats),

              // Usage patterns
              usagePatterns: await this.analyzeIndexUsagePatterns(index, collection),

              // Maintenance requirements
              maintenance: {
                needsRebuild: analysis.fragmentation > 30,
                needsOptimization: analysis.efficiency < 70,
                canBeDropped: analysis.usage === 'unused' && !index.unique && index.name !== '_id_',
                priority: this.calculateMaintenancePriority(analysis)
              }
            };

            await this.collections.indexAnalysis.insertOne(indexAnalysisDocument);
          }
        }
      }

      console.log('Index analysis completed successfully');

    } catch (error) {
      console.error('Error performing index analysis:', error);
      throw error;
    }
  }

  async scheduleMaintenanceOperation(maintenanceRequest) {
    console.log('Scheduling maintenance operation...');

    try {
      const maintenanceOperation = {
        _id: new ObjectId(),
        type: maintenanceRequest.type,
        scheduledTime: maintenanceRequest.scheduledTime || new Date(),
        estimatedDuration: maintenanceRequest.estimatedDuration,

        // Operation details
        description: maintenanceRequest.description,
        targetDatabases: maintenanceRequest.databases || [],
        targetCollections: maintenanceRequest.collections || [],

        // Impact assessment
        impactLevel: maintenanceRequest.impactLevel || 'medium',
        downTimeRequired: maintenanceRequest.downTimeRequired || false,
        userNotificationRequired: maintenanceRequest.userNotificationRequired !== false,

        // Approval workflow
        requestedBy: maintenanceRequest.requestedBy,
        approvalRequired: this.determineApprovalRequirement(maintenanceRequest),
        approvalStatus: 'pending',

        // Execution planning
        executionPlan: await this.generateExecutionPlan(maintenanceRequest),
        rollbackPlan: await this.generateRollbackPlan(maintenanceRequest),
        preChecks: await this.generatePreChecks(maintenanceRequest),
        postChecks: await this.generatePostChecks(maintenanceRequest),

        // Status tracking
        status: 'scheduled',
        createdAt: new Date(),
        executedAt: null,
        completedAt: null,

        // Results tracking
        executionResults: null,
        success: null,
        notes: []
      };

      // Validate maintenance window
      const validationResult = await this.validateMaintenanceWindow(maintenanceOperation);
      if (!validationResult.valid) {
        throw new Error(`Maintenance scheduling validation failed: ${validationResult.reason}`);
      }

      // Check for conflicts
      const conflicts = await this.checkMaintenanceConflicts(maintenanceOperation);
      if (conflicts.length > 0) {
        maintenanceOperation.conflicts = conflicts;
        maintenanceOperation.status = 'conflict_detected';
      }

      // Store maintenance operation
      await this.collections.maintenanceSchedule.insertOne(maintenanceOperation);

      // Send notifications if required
      if (maintenanceOperation.userNotificationRequired) {
        await this.sendMaintenanceNotification(maintenanceOperation);
      }

      console.log(`Maintenance operation scheduled: ${maintenanceOperation._id}`, {
        type: maintenanceOperation.type,
        scheduledTime: maintenanceOperation.scheduledTime,
        impactLevel: maintenanceOperation.impactLevel
      });

      return maintenanceOperation;

    } catch (error) {
      console.error('Error scheduling maintenance operation:', error);
      throw error;
    }
  }

  async executeBackupOperation(backupConfig) {
    console.log('Executing comprehensive backup operation...');

    try {
      const backupId = new ObjectId();
      const startTime = new Date();

      const backupOperation = {
        _id: backupId,
        type: backupConfig.type || 'full',
        startTime: startTime,

        // Backup configuration
        databases: backupConfig.databases || ['all'],
        compression: backupConfig.compression !== false,
        encryption: backupConfig.encryption || false,

        // Storage configuration
        storageLocation: backupConfig.storageLocation,
        storageType: backupConfig.storageType || 'local',
        retentionDays: backupConfig.retentionDays || this.config.backupRetentionDays,

        // Backup metadata
        triggeredBy: backupConfig.triggeredBy || 'system',
        backupReason: backupConfig.reason || 'scheduled_backup',

        // Status tracking
        status: 'in_progress',
        progress: 0,
        currentDatabase: null,

        // Results placeholder
        endTime: null,
        duration: null,
        backupSize: null,
        compressedSize: null,
        success: null,
        errorMessage: null
      };

      // Store backup operation record
      await this.collections.backupOperations.insertOne(backupOperation);

      // Execute backup based on type
      let backupResults;
      switch (backupConfig.type) {
        case 'full':
          backupResults = await this.performFullBackup(backupId, backupConfig);
          break;
        case 'incremental':
          backupResults = await this.performIncrementalBackup(backupId, backupConfig);
          break;
        case 'differential':
          backupResults = await this.performDifferentialBackup(backupId, backupConfig);
          break;
        default:
          throw new Error(`Unsupported backup type: ${backupConfig.type}`);
      }

      // Update backup operation with results
      const endTime = new Date();
      await this.collections.backupOperations.updateOne(
        { _id: backupId },
        {
          $set: {
            status: backupResults.success ? 'completed' : 'failed',
            endTime: endTime,
            duration: endTime - startTime,
            backupSize: backupResults.backupSize,
            compressedSize: backupResults.compressedSize,
            success: backupResults.success,
            errorMessage: backupResults.errorMessage,
            progress: 100,

            // Backup verification
            verificationResults: backupResults.verification,
            checksumVerified: backupResults.checksumVerified,

            // Storage information
            backupFiles: backupResults.files,
            storageLocation: backupResults.storageLocation
          }
        }
      );

      // Schedule cleanup of old backups
      await this.scheduleBackupCleanup(backupId, backupConfig);

      console.log(`Backup operation completed: ${backupId}`, {
        success: backupResults.success,
        duration: endTime - startTime,
        backupSize: backupResults.backupSize
      });

      return backupResults;

    } catch (error) {
      console.error('Error executing backup operation:', error);
      throw error;
    }
  }

  // Utility methods for MongoDB administration

  calculateOperationRate(operationType, currentValue) {
    const previousMetrics = this.performanceMonitors.get(`${operationType}_previous`);
    const previousTime = this.performanceMonitors.get(`${operationType}_time`);

    if (previousMetrics && previousTime) {
      const timeDiff = (Date.now() - previousTime) / 1000; // seconds
      const valueDiff = currentValue - previousMetrics;
      const rate = Math.round(valueDiff / timeDiff);

      // Update previous values
      this.performanceMonitors.set(`${operationType}_previous`, currentValue);
      this.performanceMonitors.set(`${operationType}_time`, Date.now());

      return rate;
    } else {
      // Initialize tracking
      this.performanceMonitors.set(`${operationType}_previous`, currentValue);
      this.performanceMonitors.set(`${operationType}_time`, Date.now());
      return 0;
    }
  }

  calculateNetworkRate(metric, currentValue) {
    return this.calculateOperationRate(`network_${metric}`, currentValue);
  }

  calculateCacheUtilization(serverStatus) {
    if (serverStatus.wiredTiger?.cache) {
      const maxBytes = serverStatus.wiredTiger.cache['maximum bytes configured'];
      const currentBytes = serverStatus.wiredTiger.cache['bytes currently in the cache'];

      if (maxBytes && currentBytes) {
        return Math.round((currentBytes / maxBytes) * 100);
      }
    }
    return 0;
  }

  analyzeLockMetrics(serverStatus) {
    if (serverStatus.locks) {
      return {
        globalLock: serverStatus.locks.Global || {},
        databaseLock: serverStatus.locks.Database || {},
        collectionLock: serverStatus.locks.Collection || {},

        // Lock contention analysis
        lockContention: this.calculateLockContention(serverStatus.locks),
        lockEfficiency: this.calculateLockEfficiency(serverStatus.locks)
      };
    }
    return {};
  }

  async collectReplicationMetrics() {
    try {
      const replStatus = await this.db.admin().replSetGetStatus();

      if (replStatus.ok) {
        return {
          setName: replStatus.set,
          members: replStatus.members.length,
          primary: replStatus.members.find(m => m.stateStr === 'PRIMARY'),
          secondaries: replStatus.members.filter(m => m.stateStr === 'SECONDARY'),
          replicationLag: this.calculateReplicationLag(replStatus.members)
        };
      }
    } catch (error) {
      // Not a replica set
      return null;
    }
  }

  async collectShardingMetrics() {
    try {
      const shardingStatus = await this.db.admin().runCommand({ listShards: 1 });

      if (shardingStatus.ok) {
        return {
          shards: shardingStatus.shards.length,
          shardsInfo: shardingStatus.shards,
          balancerActive: await this.checkBalancerStatus()
        };
      }
    } catch (error) {
      // Not a sharded cluster
      return null;
    }
  }

  async evaluatePerformanceAlerts(serverStatus, dbStats) {
    const alerts = [];

    // CPU usage alert
    if (serverStatus.extra_info?.page_faults > 1000) {
      alerts.push({
        type: 'high_page_faults',
        severity: 'warning',
        message: 'High page fault rate detected',
        value: serverStatus.extra_info.page_faults,
        threshold: 1000
      });
    }

    // Connection usage alert
    const connectionUtilization = (serverStatus.connections.current / 
                                 (serverStatus.connections.current + serverStatus.connections.available)) * 100;
    if (connectionUtilization > this.config.connectionThreshold) {
      alerts.push({
        type: 'high_connection_usage',
        severity: connectionUtilization > 95 ? 'critical' : 'warning',
        message: 'High connection utilization',
        value: Math.round(connectionUtilization),
        threshold: this.config.connectionThreshold
      });
    }

    // Memory usage alerts
    if (serverStatus.mem.resident > 8000) { // > 8GB
      alerts.push({
        type: 'high_memory_usage',
        severity: 'warning',
        message: 'High resident memory usage',
        value: serverStatus.mem.resident,
        threshold: 8000
      });
    }

    return alerts;
  }

  async analyzeOperation(operation) {
    return {
      complexity: this.assessQueryComplexity(operation),
      indexUsage: this.analyzeIndexUsage(operation),
      efficiency: this.calculateQueryEfficiency(operation),
      optimization: this.identifyOptimizationOpportunities(operation)
    };
  }

  assessQueryComplexity(operation) {
    let complexity = 'low';

    if (operation.execStats?.totalDocsExamined > 10000) complexity = 'medium';
    if (operation.execStats?.totalDocsExamined > 100000) complexity = 'high';
    if (operation.planSummary?.includes('COLLSCAN')) complexity = 'high';
    if (operation.millis > 5000) complexity = 'high';

    return complexity;
  }

  analyzeIndexUsage(operation) {
    if (operation.planSummary) {
      if (operation.planSummary.includes('IXSCAN')) return 'efficient';
      if (operation.planSummary.includes('COLLSCAN')) return 'full_scan';
    }
    return 'unknown';
  }

  calculateQueryEfficiency(operation) {
    if (operation.execStats?.totalDocsExamined && operation.execStats?.totalDocsReturned) {
      const efficiency = (operation.execStats.totalDocsReturned / operation.execStats.totalDocsExamined) * 100;
      return Math.round(efficiency);
    }
    return 0;
  }
}

// Benefits of MongoDB Advanced Database Administration:
// - Comprehensive performance monitoring with real-time metrics collection and analysis
// - Automated slow operation detection and optimization recommendation generation
// - Intelligent index analysis with usage patterns and maintenance prioritization
// - Automated maintenance scheduling with conflict detection and approval workflows
// - Enterprise backup management with compression, encryption, and retention policies
// - Integrated security auditing and access control monitoring
// - Real-time alerting with configurable thresholds and escalation procedures
// - Capacity planning with growth trend analysis and resource optimization
// - High availability monitoring for replica sets and sharded clusters
// - SQL-compatible administration operations through QueryLeaf integration

module.exports = {
  MongoDBAdministrationManager
};

Understanding MongoDB Database Administration Architecture

Enterprise Performance Monitoring and Optimization

Implement comprehensive monitoring for production MongoDB environments:

// Production-ready MongoDB monitoring with advanced analytics and alerting
class ProductionMonitoringManager extends MongoDBAdministrationManager {
  constructor(client, productionConfig) {
    super(client, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableAdvancedAnalytics: true,
      enablePredictiveAlerts: true,
      enableCapacityPlanning: true,
      enableComplianceMonitoring: true,
      enableForensicLogging: true,
      enablePerformanceBaselines: true
    };

    this.setupProductionMonitoring();
    this.initializeAdvancedAnalytics();
    this.setupCapacityPlanning();
  }

  async implementAdvancedMonitoring(monitoringProfile) {
    console.log('Implementing advanced production monitoring...');

    const monitoringFramework = {
      // Real-time performance monitoring
      realTimeMetrics: {
        operationLatency: true,
        throughputAnalysis: true,
        resourceUtilization: true,
        connectionPoolAnalysis: true,
        lockContentionTracking: true
      },

      // Predictive analytics
      predictiveAnalytics: {
        capacityForecast: true,
        performanceTrendAnalysis: true,
        anomalyDetection: true,
        failurePrediction: true,
        maintenanceScheduling: true
      },

      // Advanced alerting
      intelligentAlerting: {
        dynamicThresholds: true,
        alertCorrelation: true,
        escalationManagement: true,
        suppressionRules: true,
        businessImpactAssessment: true
      }
    };

    return await this.deployMonitoringFramework(monitoringFramework, monitoringProfile);
  }

  async setupCapacityPlanningFramework() {
    console.log('Setting up comprehensive capacity planning...');

    const capacityFramework = {
      // Growth analysis
      growthAnalysis: {
        dataGrowthTrends: true,
        operationalGrowthPatterns: true,
        resourceConsumptionForecasting: true,
        scalingRecommendations: true
      },

      // Performance baselines
      performanceBaselines: {
        responseTimeBaselines: true,
        throughputBaselines: true,
        resourceUtilizationBaselines: true,
        operationalBaselines: true
      },

      // Optimization recommendations
      optimizationFramework: {
        indexingOptimization: true,
        queryOptimization: true,
        schemaOptimization: true,
        infrastructureOptimization: true
      }
    };

    return await this.deployCapacityFramework(capacityFramework);
  }

  async implementComplianceMonitoring(complianceRequirements) {
    console.log('Implementing comprehensive compliance monitoring...');

    const complianceFramework = {
      // Audit trail monitoring
      auditCompliance: {
        accessLogging: true,
        changeTracking: true,
        privilegedOperationsTracking: true,
        complianceReporting: true
      },

      // Security monitoring
      securityCompliance: {
        authenticationMonitoring: true,
        authorizationTracking: true,
        securityViolationDetection: true,
        threatDetection: true
      }
    };

    return await this.deployComplianceFramework(complianceFramework, complianceRequirements);
  }
}

SQL-Style Database Administration with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB database administration and monitoring operations:

-- QueryLeaf advanced database administration with SQL-familiar syntax for MongoDB

-- Configure monitoring and administration settings
SET monitoring_interval = 60; -- seconds
SET slow_operation_threshold = 100; -- milliseconds
SET enable_performance_profiling = true;
SET enable_automatic_maintenance = true;
SET enable_capacity_planning = true;

-- Comprehensive performance monitoring with SQL-familiar administration
WITH monitoring_configuration AS (
  SELECT 
    -- Monitoring settings
    60 as metrics_collection_interval_seconds,
    100 as slow_operation_threshold_ms,
    true as enable_detailed_profiling,
    true as enable_realtime_alerts,

    -- Performance thresholds
    80 as cpu_threshold_percent,
    85 as memory_threshold_percent,
    80 as connection_threshold_percent,
    90 as disk_threshold_percent,

    -- Alert configuration
    'critical' as high_severity_level,
    'warning' as medium_severity_level,
    'info' as low_severity_level,

    -- Maintenance configuration
    true as enable_automatic_maintenance,
    '02:00:00'::time as maintenance_window_start,
    '04:00:00'::time as maintenance_window_end,

    -- Backup configuration
    true as enable_automatic_backups,
    30 as backup_retention_days,
    'full' as default_backup_type
),

performance_metrics_collection AS (
  -- Collect comprehensive performance metrics
  SELECT 
    CURRENT_TIMESTAMP as metric_timestamp,
    EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) as metric_epoch,

    -- Server identification
    'mongodb-primary-01' as server_name,
    'production' as environment,
    '4.4.15' as mongodb_version,

    -- Connection metrics
    JSON_BUILD_OBJECT(
      'current_connections', 245,
      'available_connections', 755,
      'total_created', 15420,
      'utilization_percent', ROUND((245.0 / (245 + 755)) * 100, 2)
    ) as connection_metrics,

    -- Operation metrics
    JSON_BUILD_OBJECT(
      'insert_ops', 1245670,
      'query_ops', 8901234,
      'update_ops', 567890,
      'delete_ops', 123456,
      'command_ops', 2345678,

      -- Operations per second (calculated)
      'insert_ops_per_sec', 12.5,
      'query_ops_per_sec', 156.7,
      'update_ops_per_sec', 8.9,
      'delete_ops_per_sec', 2.1,
      'command_ops_per_sec', 45.6
    ) as operation_metrics,

    -- Memory metrics
    JSON_BUILD_OBJECT(
      'resident_mb', 2048,
      'virtual_mb', 4096,
      'mapped_mb', 1024,
      'cache_size_gb', 8,
      'cache_used_gb', 6.5,
      'cache_utilization_percent', 81.25
    ) as memory_metrics,

    -- Database metrics
    JSON_BUILD_OBJECT(
      'total_collections', 45,
      'total_documents', 25678901,
      'data_size_gb', 12.5,
      'storage_size_gb', 15.2,
      'total_indexes', 123,
      'index_size_gb', 2.8,
      'avg_document_size_bytes', 512
    ) as database_metrics,

    -- Lock metrics
    JSON_BUILD_OBJECT(
      'global_lock_ratio', 0.02,
      'database_lock_ratio', 0.01,
      'collection_lock_ratio', 0.005,
      'lock_contention_percent', 1.2,
      'blocked_operations', 3
    ) as lock_metrics,

    -- Replication metrics (if replica set)
    JSON_BUILD_OBJECT(
      'replica_set_name', 'rs-production',
      'is_primary', true,
      'secondary_count', 2,
      'max_replication_lag_seconds', 0.5,
      'oplog_size_gb', 2.0,
      'oplog_used_percent', 15.3
    ) as replication_metrics,

    -- Alert evaluation
    ARRAY[
      CASE WHEN 245.0 / (245 + 755) > 0.8 THEN
        JSON_BUILD_OBJECT(
          'type', 'high_connection_usage',
          'severity', 'warning',
          'value', ROUND((245.0 / (245 + 755)) * 100, 2),
          'threshold', 80,
          'message', 'Connection utilization above threshold'
        )
      END,
      CASE WHEN 81.25 > 85 THEN
        JSON_BUILD_OBJECT(
          'type', 'high_cache_usage',
          'severity', 'warning',
          'value', 81.25,
          'threshold', 85,
          'message', 'Cache utilization above threshold'
        )
      END
    ] as performance_alerts

  FROM monitoring_configuration mc
),

slow_operations_analysis AS (
  -- Analyze slow operations and performance bottlenecks
  SELECT 
    operation_id,
    operation_timestamp,
    operation_type,
    database_name,
    collection_name,
    duration_ms,

    -- Operation details
    operation_command,
    plan_summary,
    execution_stats,

    -- Performance analysis
    CASE 
      WHEN duration_ms > 5000 THEN 'critical'
      WHEN duration_ms > 1000 THEN 'high'
      WHEN duration_ms > 500 THEN 'medium'
      ELSE 'low'
    END as performance_impact,

    -- Query complexity assessment
    CASE 
      WHEN plan_summary LIKE '%COLLSCAN%' THEN 'high'
      WHEN execution_stats->>'totalDocsExamined' > '100000' THEN 'high'
      WHEN execution_stats->>'totalDocsExamined' > '10000' THEN 'medium'
      ELSE 'low'
    END as query_complexity,

    -- Index usage analysis
    CASE 
      WHEN plan_summary LIKE '%IXSCAN%' THEN 'efficient'
      WHEN plan_summary LIKE '%COLLSCAN%' THEN 'full_scan'
      ELSE 'unknown'
    END as index_usage,

    -- Query efficiency calculation
    CASE 
      WHEN execution_stats->>'totalDocsExamined' > '0' AND execution_stats->>'totalDocsReturned' > '0' THEN
        ROUND(
          (execution_stats->>'totalDocsReturned')::decimal / 
          (execution_stats->>'totalDocsExamined')::decimal * 100, 2
        )
      ELSE 0
    END as query_efficiency_percent,

    -- Optimization recommendations
    CASE 
      WHEN plan_summary LIKE '%COLLSCAN%' THEN 'Create appropriate indexes'
      WHEN query_efficiency_percent < 10 THEN 'Optimize query selectivity'
      WHEN duration_ms > 1000 THEN 'Review query structure'
      ELSE 'Performance acceptable'
    END as optimization_recommendation

  FROM slow_operation_log sol
  WHERE sol.operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND sol.duration_ms >= 100 -- From monitoring configuration
),

index_analysis_comprehensive AS (
  -- Comprehensive index analysis and optimization
  SELECT 
    database_name,
    collection_name,
    index_name,
    index_specification,

    -- Index statistics
    index_size_mb,
    access_count,
    last_accessed_timestamp,

    -- Index efficiency analysis
    CASE 
      WHEN access_count > 1000 AND index_size_mb < 50 THEN 'highly_efficient'
      WHEN access_count > 100 AND index_size_mb < 100 THEN 'efficient'
      WHEN access_count > 10 THEN 'moderately_efficient'
      WHEN access_count = 0 AND index_name != '_id_' THEN 'unused'
      ELSE 'inefficient'
    END as efficiency_rating,

    -- Usage pattern analysis
    CASE 
      WHEN last_accessed_timestamp < CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'rarely_used'
      WHEN last_accessed_timestamp < CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'occasionally_used'
      WHEN access_count > 100 THEN 'frequently_used'
      ELSE 'moderately_used'
    END as usage_pattern,

    -- Maintenance recommendations
    CASE 
      WHEN access_count = 0 AND index_name != '_id_' THEN 'Consider removal'
      WHEN index_size_mb > 1000 AND efficiency_rating = 'inefficient' THEN 'Rebuild index'
      WHEN efficiency_rating = 'unused' THEN 'Review necessity'
      WHEN usage_pattern = 'frequently_used' AND index_size_mb > 500 THEN 'Monitor fragmentation'
      ELSE 'Maintain current strategy'
    END as maintenance_recommendation,

    -- Priority scoring
    CASE 
      WHEN efficiency_rating = 'unused' THEN 10
      WHEN efficiency_rating = 'inefficient' AND index_size_mb > 100 THEN 8
      WHEN maintenance_recommendation LIKE '%Rebuild%' THEN 7
      WHEN usage_pattern = 'rarely_used' AND index_size_mb > 50 THEN 5
      ELSE 1
    END as maintenance_priority_score

  FROM index_statistics_view isv
),

maintenance_scheduling AS (
  -- Automated maintenance scheduling with priority management
  SELECT 
    GENERATE_UUID() as maintenance_id,
    maintenance_type,
    target_database,
    target_collection,
    target_index,

    -- Scheduling information
    CASE maintenance_type
      WHEN 'index_rebuild' THEN CURRENT_DATE + INTERVAL '1 week'
      WHEN 'index_removal' THEN CURRENT_DATE + INTERVAL '2 weeks'
      WHEN 'collection_compaction' THEN CURRENT_DATE + INTERVAL '1 month'
      WHEN 'statistics_refresh' THEN CURRENT_DATE + INTERVAL '1 day'
      ELSE CURRENT_DATE + INTERVAL '1 month'
    END as scheduled_date,

    -- Impact assessment
    CASE maintenance_type
      WHEN 'index_rebuild' THEN 'medium'
      WHEN 'index_removal' THEN 'low'
      WHEN 'collection_compaction' THEN 'high'
      WHEN 'statistics_refresh' THEN 'low'
      ELSE 'medium'
    END as impact_level,

    -- Duration estimation
    CASE maintenance_type
      WHEN 'index_rebuild' THEN INTERVAL '30 minutes'
      WHEN 'index_removal' THEN INTERVAL '5 minutes'
      WHEN 'collection_compaction' THEN INTERVAL '2 hours'
      WHEN 'statistics_refresh' THEN INTERVAL '10 minutes'
      ELSE INTERVAL '1 hour'
    END as estimated_duration,

    -- Approval requirements
    CASE impact_level
      WHEN 'high' THEN 'director_approval'
      WHEN 'medium' THEN 'manager_approval'
      ELSE 'automatic_approval'
    END as approval_requirement,

    -- Maintenance window validation
    CASE 
      WHEN EXTRACT(DOW FROM scheduled_date) IN (0, 6) THEN true -- Weekends preferred
      WHEN EXTRACT(HOUR FROM mc.maintenance_window_start) BETWEEN 2 AND 4 THEN true
      ELSE false
    END as within_maintenance_window,

    -- Business justification
    CASE maintenance_type
      WHEN 'index_rebuild' THEN 'Improve query performance and reduce storage overhead'
      WHEN 'index_removal' THEN 'Eliminate unused indexes to improve write performance'
      WHEN 'collection_compaction' THEN 'Reclaim disk space and optimize storage'
      WHEN 'statistics_refresh' THEN 'Ensure query optimizer has current statistics'
      ELSE 'General database maintenance and optimization'
    END as business_justification,

    CURRENT_TIMESTAMP as maintenance_created

  FROM (
    -- Generate maintenance tasks based on analysis results
    SELECT 'index_rebuild' as maintenance_type, database_name as target_database, 
           collection_name as target_collection, index_name as target_index
    FROM index_analysis_comprehensive
    WHERE maintenance_recommendation LIKE '%Rebuild%'

    UNION ALL

    SELECT 'index_removal' as maintenance_type, database_name, collection_name, index_name
    FROM index_analysis_comprehensive  
    WHERE maintenance_recommendation LIKE '%removal%'

    UNION ALL

    SELECT 'statistics_refresh' as maintenance_type, database_name, collection_name, NULL
    FROM slow_operations_analysis
    WHERE query_complexity = 'high'
    GROUP BY database_name, collection_name
  ) maintenance_tasks
  CROSS JOIN monitoring_configuration mc
),

backup_operations_management AS (
  -- Comprehensive backup operations and scheduling
  SELECT 
    GENERATE_UUID() as backup_id,
    backup_type,
    target_databases,

    -- Backup scheduling
    CASE backup_type
      WHEN 'full' THEN DATE_TRUNC('day', CURRENT_TIMESTAMP) + INTERVAL '2 hours'
      WHEN 'incremental' THEN DATE_TRUNC('hour', CURRENT_TIMESTAMP) + INTERVAL '4 hours'
      WHEN 'differential' THEN DATE_TRUNC('day', CURRENT_TIMESTAMP) + INTERVAL '8 hours'
      ELSE CURRENT_TIMESTAMP + INTERVAL '1 hour'
    END as scheduled_backup_time,

    -- Backup configuration
    JSON_BUILD_OBJECT(
      'compression_enabled', true,
      'encryption_enabled', true,
      'verify_backup', true,
      'test_restore', backup_type = 'full',
      'storage_location', '/backup/mongodb/' || TO_CHAR(CURRENT_DATE, 'YYYY/MM/DD'),
      'retention_days', mc.backup_retention_days
    ) as backup_configuration,

    -- Estimated backup metrics
    CASE backup_type
      WHEN 'full' THEN JSON_BUILD_OBJECT(
        'estimated_size_gb', 25.5,
        'estimated_duration_minutes', 45,
        'estimated_compressed_size_gb', 12.8
      )
      WHEN 'incremental' THEN JSON_BUILD_OBJECT(
        'estimated_size_gb', 2.5,
        'estimated_duration_minutes', 8,
        'estimated_compressed_size_gb', 1.2
      )
      WHEN 'differential' THEN JSON_BUILD_OBJECT(
        'estimated_size_gb', 8.5,
        'estimated_duration_minutes', 15,
        'estimated_compressed_size_gb', 4.2
      )
    END as backup_estimates,

    -- Backup priority and impact
    CASE backup_type
      WHEN 'full' THEN 'high'
      WHEN 'differential' THEN 'medium'
      ELSE 'low'
    END as backup_priority,

    -- Validation requirements
    JSON_BUILD_OBJECT(
      'integrity_check_required', true,
      'restore_test_required', backup_type = 'full',
      'checksum_verification', true,
      'backup_verification', true
    ) as validation_requirements

  FROM (
    SELECT 'full' as backup_type, ARRAY['production_db'] as target_databases
    UNION ALL
    SELECT 'incremental' as backup_type, ARRAY['production_db', 'analytics_db'] as target_databases
    UNION ALL  
    SELECT 'differential' as backup_type, ARRAY['production_db'] as target_databases
  ) backup_schedule
  CROSS JOIN monitoring_configuration mc
),

administration_dashboard AS (
  -- Comprehensive administration dashboard and reporting
  SELECT 
    pmc.metric_timestamp,
    pmc.server_name,
    pmc.environment,

    -- Connection summary
    (pmc.connection_metrics->>'current_connections')::int as current_connections,
    (pmc.connection_metrics->>'utilization_percent')::decimal as connection_utilization,

    -- Performance summary
    (pmc.operation_metrics->>'query_ops_per_sec')::decimal as queries_per_second,
    (pmc.memory_metrics->>'cache_utilization_percent')::decimal as cache_utilization,
    (pmc.database_metrics->>'data_size_gb')::decimal as data_size_gb,

    -- Slow operations summary
    COUNT(soa.operation_id) as slow_operations_count,
    COUNT(soa.operation_id) FILTER (WHERE soa.performance_impact = 'critical') as critical_slow_ops,
    COUNT(soa.operation_id) FILTER (WHERE soa.index_usage = 'full_scan') as full_scan_operations,
    AVG(soa.query_efficiency_percent) as avg_query_efficiency,

    -- Index maintenance summary
    COUNT(iac.index_name) as total_indexes_analyzed,
    COUNT(iac.index_name) FILTER (WHERE iac.efficiency_rating = 'unused') as unused_indexes,
    COUNT(iac.index_name) FILTER (WHERE iac.maintenance_recommendation != 'Maintain current strategy') as indexes_needing_maintenance,
    SUM(iac.maintenance_priority_score) as total_maintenance_priority,

    -- Maintenance scheduling summary
    COUNT(ms.maintenance_id) as scheduled_maintenance_tasks,
    COUNT(ms.maintenance_id) FILTER (WHERE ms.impact_level = 'high') as high_impact_maintenance,
    COUNT(ms.maintenance_id) FILTER (WHERE ms.approval_requirement = 'director_approval') as director_approval_required,

    -- Backup operations summary
    COUNT(bom.backup_id) as scheduled_backups,
    COUNT(bom.backup_id) FILTER (WHERE bom.backup_type = 'full') as full_backups_scheduled,
    SUM((bom.backup_estimates->>'estimated_size_gb')::decimal) as total_estimated_backup_size_gb,

    -- Alert summary
    array_length(pmc.performance_alerts, 1) as active_alerts_count,

    -- Overall health score
    CASE 
      WHEN connection_utilization < 60 AND cache_utilization < 80 AND critical_slow_ops = 0 THEN 'excellent'
      WHEN connection_utilization < 75 AND cache_utilization < 85 AND critical_slow_ops <= 2 THEN 'good'
      WHEN connection_utilization < 85 AND cache_utilization < 90 AND critical_slow_ops <= 5 THEN 'fair'
      ELSE 'needs_attention'
    END as overall_health_status,

    -- Recommendations summary
    CASE 
      WHEN unused_indexes > 5 THEN 'Review and remove unused indexes for improved performance'
      WHEN full_scan_operations > 10 THEN 'Create missing indexes to optimize query performance'
      WHEN avg_query_efficiency < 50 THEN 'Optimize queries for better selectivity'
      WHEN high_impact_maintenance > 3 THEN 'Schedule maintenance window for optimization tasks'
      ELSE 'Database performance is within acceptable parameters'
    END as primary_recommendation

  FROM performance_metrics_collection pmc
  LEFT JOIN slow_operations_analysis soa ON true
  LEFT JOIN index_analysis_comprehensive iac ON true
  LEFT JOIN maintenance_scheduling ms ON true
  LEFT JOIN backup_operations_management bom ON true
  GROUP BY pmc.metric_timestamp, pmc.server_name, pmc.environment, pmc.connection_metrics,
           pmc.operation_metrics, pmc.memory_metrics, pmc.database_metrics, pmc.performance_alerts
)

-- Generate comprehensive administration report
SELECT 
  ad.metric_timestamp,
  ad.server_name,
  ad.environment,

  -- Current status
  ad.current_connections,
  ad.connection_utilization,
  ad.queries_per_second,
  ad.cache_utilization,
  ad.data_size_gb,

  -- Performance analysis
  ad.slow_operations_count,
  ad.critical_slow_ops,
  ad.full_scan_operations,
  ROUND(ad.avg_query_efficiency, 1) as avg_query_efficiency_percent,

  -- Index management
  ad.total_indexes_analyzed,
  ad.unused_indexes,
  ad.indexes_needing_maintenance,
  ad.total_maintenance_priority,

  -- Operational management
  ad.scheduled_maintenance_tasks,
  ad.high_impact_maintenance,
  ad.director_approval_required,

  -- Backup management
  ad.scheduled_backups,
  ad.full_backups_scheduled,
  ROUND(ad.total_estimated_backup_size_gb, 1) as total_backup_size_gb,

  -- Health assessment
  ad.active_alerts_count,
  ad.overall_health_status,
  ad.primary_recommendation,

  -- Operational metrics
  CASE ad.overall_health_status
    WHEN 'excellent' THEN 'Continue current operations and monitoring'
    WHEN 'good' THEN 'Monitor performance trends and plan preventive maintenance'
    WHEN 'fair' THEN 'Schedule performance review and optimization planning'
    ELSE 'Immediate attention required - review alerts and implement optimizations'
  END as operational_guidance,

  -- Next actions
  ARRAY[
    CASE WHEN ad.critical_slow_ops > 0 THEN 'Investigate critical slow operations' END,
    CASE WHEN ad.unused_indexes > 3 THEN 'Schedule unused index removal' END,
    CASE WHEN ad.connection_utilization > 80 THEN 'Review connection pooling configuration' END,
    CASE WHEN ad.cache_utilization > 85 THEN 'Consider memory allocation optimization' END,
    CASE WHEN ad.full_scan_operations > 5 THEN 'Create missing indexes for better performance' END
  ] as immediate_action_items

FROM administration_dashboard ad
ORDER BY ad.metric_timestamp DESC;

-- QueryLeaf provides comprehensive MongoDB administration capabilities:
-- 1. Real-time performance monitoring with configurable thresholds and alerting
-- 2. Automated slow operation analysis and query optimization recommendations
-- 3. Intelligent index analysis with maintenance prioritization and scheduling
-- 4. Comprehensive backup management with scheduling and retention policies
-- 5. Maintenance planning with approval workflows and impact assessment
-- 6. Security auditing and compliance monitoring for enterprise requirements
-- 7. Capacity planning with growth trend analysis and resource optimization
-- 8. Integration with MongoDB native administration tools and enterprise monitoring
-- 9. SQL-familiar syntax for complex administration operations and reporting
-- 10. Automated administration workflows with intelligent decision-making capabilities

Best Practices for Production MongoDB Administration

Enterprise Operations Management and Monitoring

Essential principles for effective MongoDB database administration:

Comprehensive Monitoring: Implement real-time performance monitoring with configurable thresholds and intelligent alerting
Proactive Maintenance: Schedule automated maintenance operations with approval workflows and impact assessment
Performance Optimization: Continuously analyze slow operations and optimize query performance with index recommendations
Capacity Planning: Monitor growth trends and plan resource allocation for optimal performance and cost efficiency
Security Management: Implement comprehensive security auditing and access control monitoring for compliance
Backup Strategy: Maintain robust backup operations with automated scheduling, verification, and retention management

Scalability and Production Deployment

Optimize MongoDB administration for enterprise-scale requirements:

Monitoring Infrastructure: Deploy scalable monitoring solutions that handle high-volume metrics collection and analysis
Automated Operations: Implement intelligent automation for routine maintenance and optimization tasks
Performance Baselines: Establish and maintain performance baselines for proactive issue detection
Disaster Recovery: Design comprehensive backup and recovery procedures with regular testing and validation
Compliance Integration: Integrate administration workflows with enterprise compliance and governance frameworks
Operational Excellence: Create standardized procedures for MongoDB operations with documentation and training

Conclusion

MongoDB database administration provides comprehensive operations management capabilities that enable enterprise-grade performance monitoring, maintenance automation, and operational excellence through sophisticated administration frameworks and intelligent automation. The combination of real-time monitoring, automated maintenance, and proactive optimization ensures optimal MongoDB performance and reliability.

Key MongoDB Database Administration benefits include:

Real-time Monitoring: Comprehensive performance metrics collection with intelligent alerting and threshold management
Automated Maintenance: Intelligent scheduling and execution of maintenance operations with approval workflows
Performance Optimization: Continuous analysis and optimization of query performance with actionable recommendations
Enterprise Security: Comprehensive security auditing and access control monitoring for compliance requirements
Capacity Management: Proactive capacity planning with growth trend analysis and resource optimization
SQL Accessibility: Familiar SQL-style administration operations through QueryLeaf for accessible database management

Whether you're managing production MongoDB deployments, optimizing database performance, implementing compliance monitoring, or planning capacity growth, MongoDB database administration with QueryLeaf's familiar SQL interface provides the foundation for robust, scalable database operations management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB database administration while providing SQL-familiar syntax for performance monitoring, maintenance scheduling, and operational management. Advanced administration patterns, automated optimization workflows, and enterprise monitoring capabilities are seamlessly handled through familiar SQL constructs, making sophisticated database operations accessible to SQL-oriented administration teams.

The combination of MongoDB's comprehensive administration capabilities with SQL-style operations management makes it an ideal platform for applications requiring both sophisticated database operations and familiar administration patterns, ensuring your MongoDB deployments can operate efficiently while maintaining optimal performance and operational excellence.

November 26, 2025
23 min read

MongoDB Document Versioning and Audit Trails: Enterprise-Grade Data History Management and Compliance Tracking

Enterprise applications require comprehensive data history tracking, audit trails, and compliance management to meet regulatory requirements, support forensic analysis, and maintain data integrity across complex business operations. Traditional approaches to document versioning often struggle with storage efficiency, query performance, and the complexity of managing historical data alongside current records.

MongoDB document versioning provides sophisticated data history management capabilities that enable audit trails, compliance tracking, and temporal data analysis through flexible schema design and optimized storage strategies. Unlike rigid relational audit tables that require complex joins and separate storage, MongoDB's document model allows for efficient embedded versioning, reference-based history tracking, and hybrid approaches that balance performance with storage requirements.

The Traditional Audit Trail Challenge

Conventional relational approaches to audit trails and versioning face significant limitations:

-- Traditional PostgreSQL audit trail approach - complex table management and poor performance

-- Main business entity table
CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(500) NOT NULL,
    category VARCHAR(100),
    price DECIMAL(10,2) NOT NULL,
    description TEXT,
    supplier_id UUID,
    status VARCHAR(50) DEFAULT 'active',

    -- Versioning metadata
    version INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(100),
    updated_by VARCHAR(100),

    -- Soft delete support
    deleted BOOLEAN DEFAULT FALSE,
    deleted_at TIMESTAMP,
    deleted_by VARCHAR(100)
);

-- Separate audit trail table with complex structure
CREATE TABLE product_audit (
    audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id UUID NOT NULL REFERENCES products(product_id),
    operation_type VARCHAR(20) NOT NULL, -- INSERT, UPDATE, DELETE, UNDELETE

    -- Version tracking
    version_from INTEGER,
    version_to INTEGER NOT NULL,

    -- Complete historical snapshot (storage intensive)
    historical_data JSONB NOT NULL,
    changed_fields JSONB,

    -- Change metadata
    change_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    changed_by VARCHAR(100) NOT NULL,
    change_reason TEXT,
    change_source VARCHAR(100), -- 'api', 'admin_panel', 'batch_process', etc.

    -- Audit context
    session_id VARCHAR(100),
    request_id VARCHAR(100),
    ip_address INET,
    user_agent TEXT,

    -- Compliance metadata
    retention_until TIMESTAMP,
    compliance_flags JSONB DEFAULT '{}'::jsonb,
    regulatory_context VARCHAR(200)
);

-- Complex trigger system for automatic audit trail generation
CREATE OR REPLACE FUNCTION create_product_audit()
RETURNS TRIGGER AS $$
DECLARE
    changed_fields jsonb := '{}'::jsonb;
    field_name text;
    audit_operation text;
    user_context jsonb;
BEGIN
    -- Determine operation type
    IF TG_OP = 'INSERT' THEN
        audit_operation := 'INSERT';
        changed_fields := to_jsonb(NEW) - 'created_at' - 'updated_at';

        INSERT INTO product_audit (
            product_id, operation_type, version_to, 
            historical_data, changed_fields, changed_by, change_source
        ) VALUES (
            NEW.product_id, audit_operation, NEW.version,
            to_jsonb(NEW), changed_fields, 
            NEW.created_by, COALESCE(current_setting('app.change_source', true), 'system')
        );

        RETURN NEW;

    ELSIF TG_OP = 'UPDATE' THEN
        -- Detect soft delete
        IF OLD.deleted = FALSE AND NEW.deleted = TRUE THEN
            audit_operation := 'DELETE';
        ELSIF OLD.deleted = TRUE AND NEW.deleted = FALSE THEN
            audit_operation := 'UNDELETE';
        ELSE
            audit_operation := 'UPDATE';
        END IF;

        -- Complex field-by-field change detection
        FOR field_name IN SELECT key FROM jsonb_each(to_jsonb(NEW)) LOOP
            IF to_jsonb(NEW)->>field_name IS DISTINCT FROM to_jsonb(OLD)->>field_name THEN
                changed_fields := changed_fields || jsonb_build_object(
                    field_name, jsonb_build_object(
                        'old_value', to_jsonb(OLD)->>field_name,
                        'new_value', to_jsonb(NEW)->>field_name
                    )
                );
            END IF;
        END LOOP;

        -- Only create audit record if there are meaningful changes
        IF changed_fields != '{}'::jsonb THEN
            -- Increment version
            NEW.version := OLD.version + 1;
            NEW.updated_at := CURRENT_TIMESTAMP;

            INSERT INTO product_audit (
                product_id, operation_type, version_from, version_to,
                historical_data, changed_fields, changed_by, 
                change_source, change_reason
            ) VALUES (
                NEW.product_id, audit_operation, OLD.version, NEW.version,
                to_jsonb(OLD), changed_fields, 
                COALESCE(NEW.updated_by, OLD.updated_by),
                COALESCE(current_setting('app.change_source', true), 'system'),
                COALESCE(current_setting('app.change_reason', true), 'System update')
            );
        END IF;

        RETURN NEW;

    ELSIF TG_OP = 'DELETE' THEN
        -- Handle hard delete (rarely used in enterprise applications)
        INSERT INTO product_audit (
            product_id, operation_type, version_from,
            historical_data, changed_by, change_source
        ) VALUES (
            OLD.product_id, 'HARD_DELETE', OLD.version,
            to_jsonb(OLD), 
            COALESCE(current_setting('app.user_id', true), 'system'),
            COALESCE(current_setting('app.change_source', true), 'system')
        );

        RETURN OLD;
    END IF;

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Apply audit trigger (adds significant overhead)
CREATE TRIGGER product_audit_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON products
    FOR EACH ROW EXECUTE FUNCTION create_product_audit();

-- Complex audit trail queries with poor performance
WITH audit_trail_analysis AS (
    SELECT 
        pa.product_id,
        pa.audit_id,
        pa.operation_type,
        pa.version_from,
        pa.version_to,
        pa.change_timestamp,
        pa.changed_by,
        pa.changed_fields,

        -- Extract specific field changes (complex JSON processing)
        CASE 
            WHEN pa.changed_fields ? 'price' THEN 
                jsonb_build_object(
                    'old_price', (pa.changed_fields->'price'->>'old_value')::decimal,
                    'new_price', (pa.changed_fields->'price'->>'new_value')::decimal,
                    'price_change_percent', 
                        CASE WHEN (pa.changed_fields->'price'->>'old_value')::decimal > 0 THEN
                            (((pa.changed_fields->'price'->>'new_value')::decimal - 
                              (pa.changed_fields->'price'->>'old_value')::decimal) /
                             (pa.changed_fields->'price'->>'old_value')::decimal) * 100
                        ELSE 0 END
                )
            ELSE NULL
        END as price_change_analysis,

        -- Calculate time between changes
        LAG(pa.change_timestamp) OVER (
            PARTITION BY pa.product_id 
            ORDER BY pa.change_timestamp
        ) as previous_change_timestamp,

        -- Identify frequent changers
        COUNT(*) OVER (
            PARTITION BY pa.product_id 
            ORDER BY pa.change_timestamp 
            RANGE BETWEEN INTERVAL '1 day' PRECEDING AND CURRENT ROW
        ) as changes_in_last_day,

        -- User activity analysis
        COUNT(DISTINCT pa.changed_by) OVER (
            PARTITION BY pa.product_id
        ) as unique_users_modified,

        -- Compliance tracking
        CASE 
            WHEN pa.retention_until IS NOT NULL AND pa.retention_until < CURRENT_TIMESTAMP THEN 'expired'
            WHEN pa.compliance_flags ? 'gdpr_subject' THEN 'gdpr_protected'
            WHEN pa.compliance_flags ? 'financial_record' THEN 'sox_compliant'
            ELSE 'standard'
        END as compliance_status

    FROM product_audit pa
    WHERE pa.change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
),

audit_summary AS (
    -- Aggregate audit information (expensive operations)
    SELECT 
        ata.product_id,

        -- Change frequency analysis
        COUNT(*) as total_changes,
        COUNT(DISTINCT ata.changed_by) as unique_modifiers,
        COUNT(*) FILTER (WHERE ata.operation_type = 'UPDATE') as update_count,
        COUNT(*) FILTER (WHERE ata.operation_type = 'DELETE') as delete_count,

        -- Time-based analysis
        MAX(ata.change_timestamp) as last_modified,
        MIN(ata.change_timestamp) as first_modified_in_period,
        AVG(EXTRACT(EPOCH FROM (ata.change_timestamp - ata.previous_change_timestamp))) as avg_time_between_changes,

        -- Field change analysis
        COUNT(*) FILTER (WHERE ata.changed_fields ? 'price') as price_changes,
        COUNT(*) FILTER (WHERE ata.changed_fields ? 'status') as status_changes,
        COUNT(*) FILTER (WHERE ata.changed_fields ? 'supplier_id') as supplier_changes,

        -- Compliance summary
        array_agg(DISTINCT ata.compliance_status) as compliance_statuses,

        -- Most active user
        MODE() WITHIN GROUP (ORDER BY ata.changed_by) as most_active_modifier,

        -- Recent activity indicators
        MAX(ata.changes_in_last_day) as max_daily_changes,
        CASE WHEN MAX(ata.changes_in_last_day) > 10 THEN 'high_activity' 
             WHEN MAX(ata.changes_in_last_day) > 3 THEN 'moderate_activity'
             ELSE 'low_activity' END as activity_level

    FROM audit_trail_analysis ata
    GROUP BY ata.product_id
)

-- Generate audit report with complex joins
SELECT 
    p.product_id,
    p.name as current_name,
    p.price as current_price,
    p.status as current_status,
    p.version as current_version,

    -- Audit summary information
    aus.total_changes,
    aus.unique_modifiers,
    aus.last_modified,
    aus.activity_level,
    aus.most_active_modifier,

    -- Change breakdown
    aus.update_count,
    aus.delete_count,
    aus.price_changes,
    aus.status_changes,
    aus.supplier_changes,

    -- Compliance information
    aus.compliance_statuses,

    -- Performance metrics
    ROUND(aus.avg_time_between_changes / 3600, 2) as avg_hours_between_changes,

    -- Recent changes (expensive subquery)
    (
        SELECT jsonb_agg(
            jsonb_build_object(
                'timestamp', pa.change_timestamp,
                'operation', pa.operation_type,
                'changed_by', pa.changed_by,
                'changed_fields', jsonb_object_keys(pa.changed_fields)
            ) ORDER BY pa.change_timestamp DESC
        )
        FROM product_audit pa 
        WHERE pa.product_id = p.product_id 
        AND pa.change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
        LIMIT 10
    ) as recent_changes,

    -- Historical versions (very expensive)
    (
        SELECT jsonb_agg(
            jsonb_build_object(
                'version', pa.version_to,
                'timestamp', pa.change_timestamp,
                'data_snapshot', pa.historical_data
            ) ORDER BY pa.version_to DESC
        )
        FROM product_audit pa 
        WHERE pa.product_id = p.product_id
    ) as version_history

FROM products p
LEFT JOIN audit_summary aus ON p.product_id = aus.product_id
WHERE p.deleted = FALSE
ORDER BY aus.total_changes DESC NULLS LAST, p.updated_at DESC;

-- Problems with traditional audit trail approaches:
-- 1. Complex trigger systems that add significant overhead to every database operation
-- 2. Separate audit tables requiring expensive joins for historical analysis
-- 3. Storage inefficiency with complete document snapshots for every change
-- 4. Poor query performance for audit trail analysis and reporting
-- 5. Complex field-level change detection and comparison logic
-- 6. Difficult maintenance of audit table schemas as business entities evolve
-- 7. Limited flexibility for different versioning strategies per entity type
-- 8. Complex compliance and retention management across multiple tables
-- 9. Difficult integration with modern event-driven architectures
-- 10. Poor scalability with high-frequency change environments

-- Compliance reporting challenges (complex multi-table queries)
WITH gdpr_audit_compliance AS (
    SELECT 
        pa.product_id,
        pa.changed_by,
        pa.change_timestamp,
        pa.changed_fields,

        -- GDPR compliance analysis
        CASE 
            WHEN pa.compliance_flags ? 'gdpr_subject' THEN
                jsonb_build_object(
                    'requires_anonymization', true,
                    'retention_period', pa.retention_until,
                    'lawful_basis', pa.compliance_flags->'gdpr_lawful_basis',
                    'data_subject_rights', ARRAY['access', 'rectification', 'erasure', 'portability']
                )
            ELSE NULL
        END as gdpr_metadata,

        -- SOX compliance for financial data
        CASE 
            WHEN pa.compliance_flags ? 'financial_record' THEN
                jsonb_build_object(
                    'sox_compliant', true,
                    'immutable_record', true,
                    'retention_required', '7 years',
                    'audit_trail_complete', true
                )
            ELSE NULL
        END as sox_metadata,

        -- Change approval workflow
        CASE 
            WHEN pa.changed_fields ? 'price' AND (pa.changed_fields->'price'->>'new_value')::decimal > 1000 THEN
                'requires_manager_approval'
            WHEN pa.operation_type = 'DELETE' THEN
                'requires_director_approval'
            ELSE 'standard_approval'
        END as approval_requirement

    FROM product_audit pa
    WHERE pa.compliance_flags IS NOT NULL
    AND pa.change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 year'
)

SELECT 
    COUNT(*) as total_compliance_events,
    COUNT(*) FILTER (WHERE gdpr_metadata IS NOT NULL) as gdpr_events,
    COUNT(*) FILTER (WHERE sox_metadata IS NOT NULL) as sox_events,
    COUNT(*) FILTER (WHERE approval_requirement != 'standard_approval') as approval_required_events,

    -- Compliance summary
    jsonb_object_agg(
        approval_requirement,
        COUNT(*)
    ) as approval_breakdown,

    array_agg(DISTINCT changed_by) as users_with_compliance_changes

FROM gdpr_audit_compliance;

-- Traditional approach limitations:
-- 1. Performance degradation with large audit tables and complex queries
-- 2. Storage overhead from complete document snapshots and redundant data
-- 3. Maintenance complexity for evolving audit schemas and compliance requirements
-- 4. Limited flexibility for different versioning strategies per business context
-- 5. Complex reporting and analytics across multiple related audit tables
-- 6. Difficult implementation of retention policies and data lifecycle management
-- 7. Poor integration with modern microservices and event-driven architectures
-- 8. Limited support for distributed audit trails across multiple systems
-- 9. Complex user access control and audit log security management
-- 10. Difficult compliance reporting across multiple regulatory frameworks

MongoDB provides comprehensive document versioning capabilities with flexible storage strategies:

// MongoDB Advanced Document Versioning - Enterprise-grade audit trails with flexible versioning strategies
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_audit_system');

// Comprehensive MongoDB Document Versioning Manager
class AdvancedDocumentVersioningManager {
  constructor(db, config = {}) {
    this.db = db;
    this.collections = {
      products: db.collection('products'),
      productVersions: db.collection('product_versions'),
      auditTrail: db.collection('audit_trail'),
      complianceTracking: db.collection('compliance_tracking'),
      retentionPolicies: db.collection('retention_policies'),
      userSessions: db.collection('user_sessions')
    };

    // Advanced versioning configuration
    this.config = {
      // Versioning strategy
      versioningStrategy: config.versioningStrategy || 'hybrid', // 'embedded', 'referenced', 'hybrid'
      maxEmbeddedVersions: config.maxEmbeddedVersions || 5,
      compressionEnabled: config.compressionEnabled !== false,
      enableFieldLevelVersioning: config.enableFieldLevelVersioning !== false,

      // Audit configuration
      enableAuditTrail: config.enableAuditTrail !== false,
      auditLevel: config.auditLevel || 'comprehensive', // 'basic', 'detailed', 'comprehensive'
      enableUserTracking: config.enableUserTracking !== false,
      enableSessionTracking: config.enableSessionTracking !== false,

      // Compliance configuration
      enableComplianceTracking: config.enableComplianceTracking !== false,
      gdprCompliance: config.gdprCompliance !== false,
      soxCompliance: config.soxCompliance || false,
      hipaCompliance: config.hipaCompliance || false,
      customComplianceRules: config.customComplianceRules || [],

      // Performance optimization
      enableIndexOptimization: config.enableIndexOptimization !== false,
      enableBackgroundArchiving: config.enableBackgroundArchiving || false,
      retentionPeriod: config.retentionPeriod || 365 * 7, // 7 years default
      enableChangeStreamIntegration: config.enableChangeStreamIntegration || false
    };

    // Version tracking
    this.versionCounters = new Map();
    this.sessionContext = new Map();
    this.complianceRules = new Map();

    this.initializeVersioningSystem();
  }

  async initializeVersioningSystem() {
    console.log('Initializing advanced document versioning system...');

    try {
      // Setup versioning infrastructure
      await this.setupVersioningInfrastructure();

      // Initialize compliance tracking
      if (this.config.enableComplianceTracking) {
        await this.initializeComplianceTracking();
      }

      // Setup audit trail processing
      if (this.config.enableAuditTrail) {
        await this.setupAuditTrailProcessing();
      }

      // Initialize background processes
      await this.initializeBackgroundProcesses();

      console.log('Document versioning system initialized successfully');

    } catch (error) {
      console.error('Error initializing versioning system:', error);
      throw error;
    }
  }

  async setupVersioningInfrastructure() {
    console.log('Setting up versioning infrastructure...');

    try {
      // Create optimized indexes for versioning
      await this.collections.products.createIndexes([
        { key: { _id: 1, version: -1 }, background: true },
        { key: { 'metadata.lastModified': -1 }, background: true },
        { key: { 'metadata.modifiedBy': 1, 'metadata.lastModified': -1 }, background: true },
        { key: { 'compliance.retentionUntil': 1 }, background: true, sparse: true }
      ]);

      // Version collection indexes
      await this.collections.productVersions.createIndexes([
        { key: { documentId: 1, version: -1 }, background: true },
        { key: { 'metadata.timestamp': -1 }, background: true },
        { key: { 'metadata.operationType': 1, 'metadata.timestamp': -1 }, background: true },
        { key: { 'compliance.retentionUntil': 1 }, background: true, sparse: true }
      ]);

      // Audit trail indexes
      await this.collections.auditTrail.createIndexes([
        { key: { documentId: 1, timestamp: -1 }, background: true },
        { key: { userId: 1, timestamp: -1 }, background: true },
        { key: { operationType: 1, timestamp: -1 }, background: true },
        { key: { 'compliance.requiresRetention': 1, timestamp: -1 }, background: true }
      ]);

      console.log('Versioning infrastructure setup completed');

    } catch (error) {
      console.error('Error setting up versioning infrastructure:', error);
      throw error;
    }
  }

  async createVersionedDocument(documentData, userContext = {}) {
    console.log('Creating new versioned document...');
    const startTime = Date.now();

    try {
      // Generate document metadata
      const documentId = new ObjectId();
      const currentTimestamp = new Date();

      // Prepare versioned document
      const versionedDocument = {
        _id: documentId,
        ...documentData,

        // Version metadata
        version: 1,
        metadata: {
          createdAt: currentTimestamp,
          lastModified: currentTimestamp,
          createdBy: userContext.userId || 'system',
          modifiedBy: userContext.userId || 'system',

          // Change tracking
          changeHistory: [],
          totalChanges: 0,

          // Session context
          sessionId: userContext.sessionId || new ObjectId().toString(),
          requestId: userContext.requestId || new ObjectId().toString(),
          ipAddress: userContext.ipAddress,
          userAgent: userContext.userAgent,

          // Version strategy metadata
          versioningStrategy: this.config.versioningStrategy,
          embeddedVersions: []
        },

        // Compliance tracking
        compliance: await this.generateComplianceMetadata(documentData, userContext, 'create'),

        // Audit context
        auditContext: {
          operationType: 'create',
          operationTimestamp: currentTimestamp,
          businessContext: userContext.businessContext || {},
          regulatoryContext: userContext.regulatoryContext || {}
        }
      };

      // Insert document with session for consistency
      const session = client.startSession();

      try {
        await session.withTransaction(async () => {
          // Create main document
          const insertResult = await this.collections.products.insertOne(versionedDocument, { session });

          // Create initial audit trail entry
          if (this.config.enableAuditTrail) {
            await this.createAuditTrailEntry({
              documentId: documentId,
              operationType: 'create',
              version: 1,
              documentData: versionedDocument,
              userContext: userContext,
              timestamp: currentTimestamp
            }, { session });
          }

          // Initialize compliance tracking
          if (this.config.enableComplianceTracking) {
            await this.initializeDocumentCompliance(documentId, versionedDocument, userContext, { session });
          }

          return insertResult;
        });

      } finally {
        await session.endSession();
      }

      const processingTime = Date.now() - startTime;

      console.log(`Document created successfully: ${documentId}`, {
        version: 1,
        processingTime: processingTime,
        complianceEnabled: this.config.enableComplianceTracking
      });

      return {
        documentId: documentId,
        version: 1,
        created: true,
        processingTime: processingTime,
        complianceMetadata: versionedDocument.compliance
      };

    } catch (error) {
      console.error('Error creating versioned document:', error);
      throw error;
    }
  }

  async updateVersionedDocument(documentId, updateData, userContext = {}) {
    console.log(`Updating versioned document: ${documentId}...`);
    const startTime = Date.now();

    try {
      const session = client.startSession();
      let updateResult;

      try {
        await session.withTransaction(async () => {
          // Retrieve current document
          const currentDocument = await this.collections.products.findOne(
            { _id: new ObjectId(documentId) },
            { session }
          );

          if (!currentDocument) {
            throw new Error(`Document not found: ${documentId}`);
          }

          // Analyze changes
          const changeAnalysis = await this.analyzeDocumentChanges(currentDocument, updateData, userContext);

          // Determine versioning strategy based on change significance
          const versioningStrategy = this.determineVersioningStrategy(changeAnalysis, currentDocument);

          // Create version backup based on strategy
          if (versioningStrategy.createVersionBackup) {
            await this.createVersionBackup(currentDocument, changeAnalysis, userContext, { session });
          }

          // Prepare updated document
          const updatedDocument = await this.prepareUpdatedDocument(
            currentDocument,
            updateData,
            changeAnalysis,
            userContext
          );

          // Update main document
          const updateOperation = await this.collections.products.replaceOne(
            { _id: new ObjectId(documentId) },
            updatedDocument,
            { session }
          );

          // Create audit trail entry
          if (this.config.enableAuditTrail) {
            await this.createAuditTrailEntry({
              documentId: new ObjectId(documentId),
              operationType: 'update',
              version: updatedDocument.version,
              previousVersion: currentDocument.version,
              changeAnalysis: changeAnalysis,
              documentData: updatedDocument,
              previousDocumentData: currentDocument,
              userContext: userContext,
              timestamp: new Date()
            }, { session });
          }

          // Update compliance tracking
          if (this.config.enableComplianceTracking) {
            await this.updateComplianceTracking(
              new ObjectId(documentId),
              changeAnalysis,
              userContext,
              { session }
            );
          }

          updateResult = {
            documentId: documentId,
            version: updatedDocument.version,
            previousVersion: currentDocument.version,
            changeAnalysis: changeAnalysis,
            versioningStrategy: versioningStrategy
          };
        });

      } finally {
        await session.endSession();
      }

      const processingTime = Date.now() - startTime;

      console.log(`Document updated successfully: ${documentId}`, {
        newVersion: updateResult.version,
        changesDetected: Object.keys(updateResult.changeAnalysis.changedFields).length,
        processingTime: processingTime
      });

      return {
        ...updateResult,
        updated: true,
        processingTime: processingTime
      };

    } catch (error) {
      console.error('Error updating versioned document:', error);
      throw error;
    }
  }

  async analyzeDocumentChanges(currentDocument, updateData, userContext) {
    console.log('Analyzing document changes...');

    const changeAnalysis = {
      changedFields: {},
      addedFields: {},
      removedFields: {},
      significantChanges: [],
      minorChanges: [],
      businessImpact: 'low',
      complianceImpact: 'none',
      approvalRequired: false,
      changeReason: userContext.changeReason || 'User update',
      changeCategory: 'standard'
    };

    // Perform deep comparison
    for (const [fieldPath, newValue] of Object.entries(updateData)) {
      const currentValue = this.getNestedValue(currentDocument, fieldPath);

      if (this.isDifferentValue(currentValue, newValue)) {
        const changeInfo = {
          field: fieldPath,
          oldValue: currentValue,
          newValue: newValue,
          changeType: this.determineChangeType(currentValue, newValue),
          timestamp: new Date()
        };

        changeAnalysis.changedFields[fieldPath] = changeInfo;

        // Categorize change significance
        if (this.isSignificantChange(fieldPath, currentValue, newValue, currentDocument)) {
          changeAnalysis.significantChanges.push(changeInfo);

          // Update business impact
          const fieldBusinessImpact = this.assessBusinessImpact(fieldPath, currentValue, newValue, currentDocument);
          if (this.compareImpactLevels(fieldBusinessImpact, changeAnalysis.businessImpact) > 0) {
            changeAnalysis.businessImpact = fieldBusinessImpact;
          }

          // Check compliance impact
          const fieldComplianceImpact = await this.assessComplianceImpact(fieldPath, currentValue, newValue, currentDocument);
          if (fieldComplianceImpact !== 'none') {
            changeAnalysis.complianceImpact = fieldComplianceImpact;
          }

        } else {
          changeAnalysis.minorChanges.push(changeInfo);
        }
      }
    }

    // Determine if approval is required
    changeAnalysis.approvalRequired = await this.requiresApproval(changeAnalysis, currentDocument, userContext);

    // Categorize change
    changeAnalysis.changeCategory = this.categorizeChange(changeAnalysis, currentDocument);

    return changeAnalysis;
  }

  async createVersionBackup(currentDocument, changeAnalysis, userContext, transactionOptions = {}) {
    console.log(`Creating version backup for document: ${currentDocument._id}`);

    try {
      // Determine backup strategy based on versioning configuration
      const backupStrategy = this.determineBackupStrategy(currentDocument, changeAnalysis);

      if (backupStrategy.strategy === 'embedded') {
        // Add version to embedded history
        await this.addEmbeddedVersion(currentDocument, changeAnalysis, userContext, transactionOptions);
      } else if (backupStrategy.strategy === 'referenced') {
        // Create separate version document
        await this.createReferencedVersion(currentDocument, changeAnalysis, userContext, transactionOptions);
      } else if (backupStrategy.strategy === 'hybrid') {
        // Use hybrid approach based on change significance
        if (changeAnalysis.businessImpact === 'high' || changeAnalysis.complianceImpact !== 'none') {
          await this.createReferencedVersion(currentDocument, changeAnalysis, userContext, transactionOptions);
        } else {
          await this.addEmbeddedVersion(currentDocument, changeAnalysis, userContext, transactionOptions);
        }
      }

    } catch (error) {
      console.error('Error creating version backup:', error);
      throw error;
    }
  }

  async addEmbeddedVersion(currentDocument, changeAnalysis, userContext, transactionOptions = {}) {
    console.log('Adding embedded version to document history...');

    const versionSnapshot = {
      version: currentDocument.version,
      timestamp: new Date(),
      data: this.createVersionSnapshot(currentDocument),
      metadata: {
        changeReason: changeAnalysis.changeReason,
        changedBy: userContext.userId || 'system',
        sessionId: userContext.sessionId,
        changeCategory: changeAnalysis.changeCategory,
        businessImpact: changeAnalysis.businessImpact,
        complianceImpact: changeAnalysis.complianceImpact
      }
    };

    // Add to embedded versions (with size limit)
    const updateOperation = {
      $push: {
        'metadata.embeddedVersions': {
          $each: [versionSnapshot],
          $slice: -this.config.maxEmbeddedVersions // Keep only recent versions
        }
      },
      $inc: {
        'metadata.totalVersions': 1
      }
    };

    await this.collections.products.updateOne(
      { _id: currentDocument._id },
      updateOperation,
      transactionOptions
    );
  }

  async createReferencedVersion(currentDocument, changeAnalysis, userContext, transactionOptions = {}) {
    console.log('Creating referenced version document...');

    const versionDocument = {
      _id: new ObjectId(),
      documentId: currentDocument._id,
      version: currentDocument.version,
      timestamp: new Date(),

      // Complete document snapshot
      documentSnapshot: this.createVersionSnapshot(currentDocument),

      // Change metadata
      changeMetadata: {
        changeReason: changeAnalysis.changeReason,
        changedBy: userContext.userId || 'system',
        sessionId: userContext.sessionId,
        requestId: userContext.requestId,
        changeCategory: changeAnalysis.changeCategory,
        businessImpact: changeAnalysis.businessImpact,
        complianceImpact: changeAnalysis.complianceImpact,
        changedFields: Object.keys(changeAnalysis.changedFields),
        significantChanges: changeAnalysis.significantChanges.length
      },

      // Compliance metadata
      compliance: await this.generateVersionComplianceMetadata(currentDocument, changeAnalysis, userContext),

      // Storage metadata
      storageMetadata: {
        compressionEnabled: this.config.compressionEnabled,
        storageSize: JSON.stringify(currentDocument).length,
        createdAt: new Date()
      }
    };

    await this.collections.productVersions.insertOne(versionDocument, transactionOptions);
  }

  async prepareUpdatedDocument(currentDocument, updateData, changeAnalysis, userContext) {
    console.log('Preparing updated document with versioning metadata...');

    // Create updated document
    const updatedDocument = {
      ...currentDocument,
      ...updateData,

      // Update version information
      version: currentDocument.version + 1,

      // Update metadata
      metadata: {
        ...currentDocument.metadata,
        lastModified: new Date(),
        modifiedBy: userContext.userId || 'system',
        totalChanges: currentDocument.metadata.totalChanges + 1,

        // Add change to history
        changeHistory: [
          ...currentDocument.metadata.changeHistory.slice(-9), // Keep recent 10 changes
          {
            version: currentDocument.version + 1,
            timestamp: new Date(),
            changedBy: userContext.userId || 'system',
            changeReason: changeAnalysis.changeReason,
            changedFields: Object.keys(changeAnalysis.changedFields),
            businessImpact: changeAnalysis.businessImpact
          }
        ],

        // Update session context
        sessionId: userContext.sessionId || currentDocument.metadata.sessionId,
        requestId: userContext.requestId || new ObjectId().toString(),
        ipAddress: userContext.ipAddress,
        userAgent: userContext.userAgent
      },

      // Update compliance metadata
      compliance: await this.updateComplianceMetadata(currentDocument.compliance, changeAnalysis, userContext),

      // Update audit context
      auditContext: {
        ...currentDocument.auditContext,
        lastOperation: {
          operationType: 'update',
          operationTimestamp: new Date(),
          changeAnalysis: {
            businessImpact: changeAnalysis.businessImpact,
            complianceImpact: changeAnalysis.complianceImpact,
            changeCategory: changeAnalysis.changeCategory,
            significantChanges: changeAnalysis.significantChanges.length
          },
          userContext: {
            userId: userContext.userId,
            sessionId: userContext.sessionId,
            businessContext: userContext.businessContext
          }
        }
      }
    };

    return updatedDocument;
  }

  async createAuditTrailEntry(auditData, transactionOptions = {}) {
    console.log('Creating comprehensive audit trail entry...');

    const auditEntry = {
      _id: new ObjectId(),
      documentId: auditData.documentId,
      operationType: auditData.operationType,
      version: auditData.version,
      previousVersion: auditData.previousVersion,
      timestamp: auditData.timestamp,

      // User and session context
      userId: auditData.userContext.userId || 'system',
      sessionId: auditData.userContext.sessionId,
      requestId: auditData.userContext.requestId,
      ipAddress: auditData.userContext.ipAddress,
      userAgent: auditData.userContext.userAgent,

      // Change details
      changeDetails: auditData.changeAnalysis ? {
        changedFieldsCount: Object.keys(auditData.changeAnalysis.changedFields).length,
        changedFields: Object.keys(auditData.changeAnalysis.changedFields),
        significantChangesCount: auditData.changeAnalysis.significantChanges.length,
        businessImpact: auditData.changeAnalysis.businessImpact,
        complianceImpact: auditData.changeAnalysis.complianceImpact,
        changeReason: auditData.changeAnalysis.changeReason,
        changeCategory: auditData.changeAnalysis.changeCategory,
        approvalRequired: auditData.changeAnalysis.approvalRequired
      } : null,

      // Document snapshots (based on audit level)
      documentSnapshots: this.createAuditSnapshots(auditData),

      // Business context
      businessContext: auditData.userContext.businessContext || {},
      regulatoryContext: auditData.userContext.regulatoryContext || {},

      // Compliance metadata
      compliance: {
        requiresRetention: await this.requiresComplianceRetention(auditData),
        retentionUntil: await this.calculateRetentionDate(auditData),
        complianceFlags: await this.generateComplianceFlags(auditData),
        regulatoryRequirements: await this.getApplicableRegulations(auditData)
      },

      // Technical metadata
      technicalMetadata: {
        auditLevel: this.config.auditLevel,
        processingTimestamp: new Date(),
        auditVersion: '2.0',
        dataClassification: await this.classifyAuditData(auditData)
      }
    };

    await this.collections.auditTrail.insertOne(auditEntry, transactionOptions);

    return auditEntry._id;
  }

  createAuditSnapshots(auditData) {
    const snapshots = {};

    switch (this.config.auditLevel) {
      case 'basic':
        // Only capture essential identifiers
        snapshots.documentId = auditData.documentId;
        snapshots.version = auditData.version;
        break;

      case 'detailed':
        // Capture changed fields and metadata
        snapshots.documentId = auditData.documentId;
        snapshots.version = auditData.version;
        if (auditData.changeAnalysis) {
          snapshots.changedFields = auditData.changeAnalysis.changedFields;
        }
        break;

      case 'comprehensive':
        // Capture complete document states
        snapshots.documentId = auditData.documentId;
        snapshots.version = auditData.version;
        if (auditData.documentData) {
          snapshots.currentState = this.createVersionSnapshot(auditData.documentData);
        }
        if (auditData.previousDocumentData) {
          snapshots.previousState = this.createVersionSnapshot(auditData.previousDocumentData);
        }
        if (auditData.changeAnalysis) {
          snapshots.changeAnalysis = auditData.changeAnalysis;
        }
        break;

      default:
        snapshots.documentId = auditData.documentId;
        snapshots.version = auditData.version;
    }

    return snapshots;
  }

  // Utility methods for comprehensive document versioning

  createVersionSnapshot(document) {
    // Create a clean snapshot without internal metadata
    const snapshot = { ...document };

    // Remove MongoDB internal fields
    delete snapshot._id;
    delete snapshot.metadata;
    delete snapshot.auditContext;

    // Apply compression if enabled
    if (this.config.compressionEnabled) {
      return this.compressSnapshot(snapshot);
    }

    return snapshot;
  }

  compressSnapshot(snapshot) {
    // Implement snapshot compression logic
    // This would typically use a compression algorithm like gzip
    return {
      compressed: true,
      data: snapshot, // In production, this would be compressed
      originalSize: JSON.stringify(snapshot).length,
      compressionRatio: 0.7 // Simulated compression ratio
    };
  }

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current?.[key], obj);
  }

  isDifferentValue(oldValue, newValue) {
    // Handle different types and deep comparison
    if (oldValue === newValue) return false;

    if (typeof oldValue !== typeof newValue) return true;

    if (oldValue === null || newValue === null) return oldValue !== newValue;

    if (typeof oldValue === 'object') {
      return JSON.stringify(oldValue) !== JSON.stringify(newValue);
    }

    return oldValue !== newValue;
  }

  determineChangeType(oldValue, newValue) {
    if (oldValue === undefined || oldValue === null) return 'addition';
    if (newValue === undefined || newValue === null) return 'removal';
    if (typeof oldValue !== typeof newValue) return 'type_change';
    return 'modification';
  }

  isSignificantChange(fieldPath, oldValue, newValue, document) {
    // Define business-specific significant fields
    const significantFields = ['price', 'status', 'category', 'supplier_id', 'compliance_status'];

    if (significantFields.includes(fieldPath)) return true;

    // Check for percentage-based changes for numeric fields
    if (fieldPath === 'price' && typeof oldValue === 'number' && typeof newValue === 'number') {
      const changePercentage = Math.abs((newValue - oldValue) / oldValue);
      return changePercentage > 0.05; // 5% threshold
    }

    return false;
  }

  assessBusinessImpact(fieldPath, oldValue, newValue, document) {
    // Business impact assessment logic
    const highImpactFields = ['status', 'compliance_status', 'legal_status'];
    const mediumImpactFields = ['price', 'category', 'supplier_id'];

    if (highImpactFields.includes(fieldPath)) return 'high';
    if (mediumImpactFields.includes(fieldPath)) return 'medium';
    return 'low';
  }

  async assessComplianceImpact(fieldPath, oldValue, newValue, document) {
    // Compliance impact assessment
    if (document.compliance?.gdprSubject && ['name', 'email', 'phone'].includes(fieldPath)) {
      return 'gdpr_personal_data';
    }

    if (document.compliance?.financialRecord && ['price', 'cost', 'revenue'].includes(fieldPath)) {
      return 'sox_financial_data';
    }

    return 'none';
  }

  compareImpactLevels(level1, level2) {
    const levels = { 'low': 1, 'medium': 2, 'high': 3 };
    return levels[level1] - levels[level2];
  }

  async requiresApproval(changeAnalysis, document, userContext) {
    // Approval requirement logic
    if (changeAnalysis.businessImpact === 'high') return true;
    if (changeAnalysis.complianceImpact !== 'none') return true;

    // Check for high-value changes
    if (changeAnalysis.changedFields.price) {
      const priceChange = changeAnalysis.changedFields.price;
      if (priceChange.newValue > 10000 || Math.abs(priceChange.newValue - priceChange.oldValue) > 1000) {
        return true;
      }
    }

    return false;
  }

  categorizeChange(changeAnalysis, document) {
    if (changeAnalysis.businessImpact === 'high') return 'critical_business_change';
    if (changeAnalysis.complianceImpact !== 'none') return 'compliance_change';
    if (changeAnalysis.approvalRequired) return 'approval_required_change';
    if (changeAnalysis.significantChanges.length > 3) return 'major_change';
    return 'standard_change';
  }

  determineVersioningStrategy(changeAnalysis, document) {
    return {
      createVersionBackup: true,
      strategy: this.config.versioningStrategy,
      reason: changeAnalysis.businessImpact === 'high' ? 'high_impact_change' : 'standard_versioning'
    };
  }

  determineBackupStrategy(document, changeAnalysis) {
    if (this.config.versioningStrategy === 'hybrid') {
      // Use referenced storage for high-impact changes
      if (changeAnalysis.businessImpact === 'high' || changeAnalysis.complianceImpact !== 'none') {
        return { strategy: 'referenced', reason: 'high_impact_or_compliance' };
      }

      // Use embedded for standard changes if under limit
      if (document.metadata.embeddedVersions && document.metadata.embeddedVersions.length < this.config.maxEmbeddedVersions) {
        return { strategy: 'embedded', reason: 'under_embedded_limit' };
      }

      return { strategy: 'referenced', reason: 'embedded_limit_exceeded' };
    }

    return { strategy: this.config.versioningStrategy, reason: 'configured_strategy' };
  }

  async generateComplianceMetadata(documentData, userContext, operationType) {
    const complianceMetadata = {
      gdprSubject: false,
      financialRecord: false,
      healthRecord: false,
      customClassifications: [],
      dataRetentionRequired: true,
      retentionPeriod: this.config.retentionPeriod,
      retentionUntil: null
    };

    // GDPR classification
    if (this.config.gdprCompliance && this.containsPersonalData(documentData)) {
      complianceMetadata.gdprSubject = true;
      complianceMetadata.gdprLawfulBasis = userContext.gdprLawfulBasis || 'legitimate_interest';
      complianceMetadata.dataSubjectRights = ['access', 'rectification', 'erasure', 'portability'];
    }

    // SOX compliance
    if (this.config.soxCompliance && this.containsFinancialData(documentData)) {
      complianceMetadata.financialRecord = true;
      complianceMetadata.soxRetentionPeriod = 7 * 365; // 7 years
      complianceMetadata.immutableRecord = true;
    }

    // Calculate retention date
    if (complianceMetadata.dataRetentionRequired) {
      const retentionDays = complianceMetadata.soxRetentionPeriod || complianceMetadata.retentionPeriod;
      complianceMetadata.retentionUntil = new Date(Date.now() + (retentionDays * 24 * 60 * 60 * 1000));
    }

    return complianceMetadata;
  }

  containsPersonalData(documentData) {
    const personalDataFields = ['email', 'phone', 'address', 'ssn', 'name', 'dob'];
    return personalDataFields.some(field => documentData[field] !== undefined);
  }

  containsFinancialData(documentData) {
    const financialDataFields = ['price', 'cost', 'revenue', 'profit', 'tax', 'salary'];
    return financialDataFields.some(field => documentData[field] !== undefined);
  }
}

// Benefits of MongoDB Advanced Document Versioning:
// - Flexible versioning strategies (embedded, referenced, hybrid) for optimal performance
// - Comprehensive audit trails with configurable detail levels  
// - Built-in compliance tracking for GDPR, SOX, HIPAA, and custom regulations
// - Intelligent change analysis and business impact assessment
// - Automatic retention management and data lifecycle policies
// - High-performance versioning with optimized storage and indexing
// - Session and user context tracking for complete audit visibility
// - Change approval workflows for sensitive data modifications
// - Seamless integration with MongoDB's native features and operations
// - SQL-compatible versioning operations through QueryLeaf integration

module.exports = {
  AdvancedDocumentVersioningManager
};

Understanding MongoDB Document Versioning Architecture

Enterprise Compliance and Audit Trail Strategies

Implement sophisticated versioning for production compliance requirements:

// Production-ready MongoDB Document Versioning with comprehensive compliance and audit capabilities
class ProductionComplianceManager extends AdvancedDocumentVersioningManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableRegulatoryCompliance: true,
      enableAutomaticArchiving: true,
      enableComplianceReporting: true,
      enableDataGovernance: true,
      enablePrivacyProtection: true,
      enableForensicAnalysis: true
    };

    this.setupProductionCompliance();
    this.initializeRegulatoryFrameworks();
    this.setupDataGovernance();
  }

  async implementRegulatoryCompliance(documentData, regulatoryFramework) {
    console.log('Implementing comprehensive regulatory compliance...');

    const complianceFramework = {
      // GDPR compliance implementation
      gdpr: {
        dataMinimization: true,
        consentManagement: true,
        rightToRectification: true,
        rightToErasure: true,
        dataPortability: true,
        privacyByDesign: true
      },

      // SOX compliance implementation
      sox: {
        financialDataIntegrity: true,
        immutableAuditTrails: true,
        executiveApprovalWorkflows: true,
        quarterlyComplianceReporting: true,
        internalControlsTesting: true
      },

      // HIPAA compliance implementation
      hipaa: {
        phiProtection: true,
        accessControlEnforcement: true,
        encryptionAtRest: true,
        auditLogProtection: true,
        businessAssociateCompliance: true
      }
    };

    return await this.deployComplianceFramework(complianceFramework, regulatoryFramework);
  }

  async setupDataGovernanceFramework() {
    console.log('Setting up comprehensive data governance framework...');

    const governanceFramework = {
      // Data classification and cataloging
      dataClassification: {
        sensitivityLevels: ['public', 'internal', 'confidential', 'restricted'],
        dataCategories: ['personal', 'financial', 'operational', 'strategic'],
        automaticClassification: true,
        classificationWorkflows: true
      },

      // Access control and security
      accessControl: {
        roleBasedAccess: true,
        attributeBasedAccess: true,
        dynamicPermissions: true,
        privilegedAccessMonitoring: true
      },

      // Data quality management
      dataQuality: {
        validationRules: true,
        qualityMetrics: true,
        dataProfileling: true,
        qualityReporting: true
      }
    };

    return await this.deployGovernanceFramework(governanceFramework);
  }

  async implementForensicAnalysis(investigationContext) {
    console.log('Implementing forensic analysis capabilities...');

    const forensicCapabilities = {
      // Digital forensics support
      forensicAnalysis: {
        chainOfCustody: true,
        evidencePreservation: true,
        forensicReporting: true,
        expertWitnessSupport: true
      },

      // Investigation workflows
      investigationSupport: {
        timelineReconstruction: true,
        activityCorrelation: true,
        anomalyDetection: true,
        reportGeneration: true
      }
    };

    return await this.deployForensicFramework(forensicCapabilities, investigationContext);
  }
}

SQL-Style Document Versioning with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB document versioning and audit trails:

-- QueryLeaf advanced document versioning with SQL-familiar syntax for MongoDB

-- Configure document versioning settings
SET versioning_strategy = 'hybrid';
SET max_embedded_versions = 5;
SET audit_level = 'comprehensive';
SET compliance_tracking = true;
SET retention_period = 2557; -- 7 years in days

-- Advanced document versioning with comprehensive audit trail creation
WITH versioning_configuration AS (
  SELECT 
    -- Versioning strategy configuration
    'hybrid' as versioning_strategy,
    5 as max_embedded_versions,
    true as enable_compression,
    true as enable_compliance_tracking,

    -- Audit configuration
    'comprehensive' as audit_level,
    true as enable_field_level_auditing,
    true as enable_user_context_tracking,
    true as enable_session_tracking,

    -- Compliance configuration
    true as gdpr_compliance,
    false as sox_compliance,
    false as hipaa_compliance,
    ARRAY['financial_data', 'personal_data', 'health_data'] as sensitive_data_types,

    -- Retention policies
    2557 as default_retention_days, -- 7 years
    365 as gdpr_retention_days, -- 1 year for GDPR
    2557 as sox_retention_days  -- 7 years for SOX
),

document_change_analysis AS (
  -- Analyze changes and determine versioning strategy
  SELECT 
    doc_id,
    previous_version,
    new_version,
    operation_type,

    -- Field-level change analysis
    changed_fields,
    added_fields,
    removed_fields,

    -- Change significance assessment
    CASE 
      WHEN changed_fields ? 'price' AND 
           ABS((new_data->>'price')::decimal - (old_data->>'price')::decimal) > 1000 THEN 'high'
      WHEN changed_fields ? 'status' OR changed_fields ? 'compliance_status' THEN 'high'
      WHEN array_length(array(SELECT jsonb_object_keys(changed_fields)), 1) > 5 THEN 'medium'
      ELSE 'low'
    END as business_impact,

    -- Compliance impact assessment
    CASE 
      WHEN changed_fields ?& ARRAY['email', 'phone', 'address', 'name'] AND 
           old_data->>'gdpr_subject' = 'true' THEN 'gdpr_personal_data'
      WHEN changed_fields ?& ARRAY['price', 'cost', 'revenue'] AND 
           old_data->>'financial_record' = 'true' THEN 'sox_financial_data'
      WHEN changed_fields ?& ARRAY['medical_info', 'health_data'] AND 
           old_data->>'health_record' = 'true' THEN 'hipaa_health_data'
      ELSE 'none'
    END as compliance_impact,

    -- Approval requirement determination
    CASE 
      WHEN changed_fields ? 'price' AND (new_data->>'price')::decimal > 10000 THEN true
      WHEN operation_type = 'DELETE' THEN true
      WHEN changed_fields ? 'compliance_status' THEN true
      ELSE false
    END as requires_approval,

    -- Change categorization
    CASE 
      WHEN operation_type = 'DELETE' THEN 'deletion_operation'
      WHEN changed_fields ? 'price' THEN 'pricing_change'
      WHEN changed_fields ? 'status' THEN 'status_change'  
      WHEN changed_fields ? 'supplier_id' THEN 'supplier_change'
      WHEN array_length(array(SELECT jsonb_object_keys(changed_fields)), 1) > 3 THEN 'major_change'
      ELSE 'standard_change'
    END as change_category,

    -- User and session context
    user_id,
    session_id,
    request_id,
    ip_address,
    user_agent,
    change_reason,
    business_context,

    -- Timestamps
    change_timestamp,
    processing_timestamp

  FROM document_changes dc
  JOIN versioning_configuration vc ON true
  WHERE dc.change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
),

version_backup_strategy AS (
  -- Determine optimal backup strategy for each change
  SELECT 
    dca.*,

    -- Version backup strategy determination
    CASE 
      WHEN dca.business_impact = 'high' OR dca.compliance_impact != 'none' THEN 'referenced'
      WHEN dca.change_category IN ('deletion_operation', 'major_change') THEN 'referenced'
      ELSE 'embedded'
    END as backup_strategy,

    -- Storage optimization
    CASE 
      WHEN business_impact = 'high' THEN false  -- No compression for high-impact changes
      WHEN LENGTH(old_data::text) > 100000 THEN true  -- Compress large documents
      ELSE vc.enable_compression
    END as enable_compression,

    -- Retention policy determination
    CASE dca.compliance_impact
      WHEN 'gdpr_personal_data' THEN vc.gdpr_retention_days
      WHEN 'sox_financial_data' THEN vc.sox_retention_days
      WHEN 'hipaa_health_data' THEN vc.sox_retention_days -- Use same retention as SOX
      ELSE vc.default_retention_days
    END as retention_days,

    -- Compliance metadata generation
    JSON_BUILD_OBJECT(
      'gdpr_subject', (dca.old_data->>'gdpr_subject')::boolean,
      'financial_record', (dca.old_data->>'financial_record')::boolean,
      'health_record', (dca.old_data->>'health_record')::boolean,
      'data_classification', dca.old_data->>'data_classification',
      'sensitivity_level', dca.old_data->>'sensitivity_level',
      'retention_required', true,
      'retention_until', CURRENT_TIMESTAMP + (
        CASE dca.compliance_impact
          WHEN 'gdpr_personal_data' THEN vc.gdpr_retention_days
          WHEN 'sox_financial_data' THEN vc.sox_retention_days
          ELSE vc.default_retention_days
        END || ' days'
      )::interval,
      'compliance_flags', JSON_BUILD_OBJECT(
        'requires_encryption', dca.compliance_impact != 'none',
        'requires_audit_trail', true,
        'requires_approval', dca.requires_approval,
        'immutable_record', dca.compliance_impact = 'sox_financial_data'
      )
    ) as compliance_metadata

  FROM document_change_analysis dca
  CROSS JOIN versioning_configuration vc
),

audit_trail_creation AS (
  -- Create comprehensive audit trail entries
  SELECT 
    vbs.doc_id,
    vbs.operation_type,
    vbs.previous_version,
    vbs.new_version,
    vbs.change_timestamp,

    -- Audit entry data
    JSON_BUILD_OBJECT(
      'audit_id', GENERATE_UUID(),
      'document_id', vbs.doc_id,
      'operation_type', vbs.operation_type,
      'version_from', vbs.previous_version,
      'version_to', vbs.new_version,
      'timestamp', vbs.change_timestamp,

      -- User context
      'user_context', JSON_BUILD_OBJECT(
        'user_id', vbs.user_id,
        'session_id', vbs.session_id,
        'request_id', vbs.request_id,
        'ip_address', vbs.ip_address,
        'user_agent', vbs.user_agent
      ),

      -- Change analysis
      'change_analysis', JSON_BUILD_OBJECT(
        'changed_fields', array(SELECT jsonb_object_keys(vbs.changed_fields)),
        'added_fields', array(SELECT jsonb_object_keys(vbs.added_fields)),
        'removed_fields', array(SELECT jsonb_object_keys(vbs.removed_fields)),
        'business_impact', vbs.business_impact,
        'compliance_impact', vbs.compliance_impact,
        'change_category', vbs.change_category,
        'change_reason', vbs.change_reason,
        'requires_approval', vbs.requires_approval
      ),

      -- Document snapshots (based on audit level)
      'document_snapshots', CASE vc.audit_level
        WHEN 'basic' THEN JSON_BUILD_OBJECT(
          'document_id', vbs.doc_id,
          'version', vbs.new_version
        )
        WHEN 'detailed' THEN JSON_BUILD_OBJECT(
          'document_id', vbs.doc_id,
          'version', vbs.new_version,
          'changed_fields', vbs.changed_fields
        )
        WHEN 'comprehensive' THEN JSON_BUILD_OBJECT(
          'document_id', vbs.doc_id,
          'version', vbs.new_version,
          'previous_state', vbs.old_data,
          'current_state', vbs.new_data,
          'field_changes', vbs.changed_fields
        )
        ELSE JSON_BUILD_OBJECT('document_id', vbs.doc_id)
      END,

      -- Business context
      'business_context', vbs.business_context,

      -- Compliance metadata
      'compliance', vbs.compliance_metadata,

      -- Technical metadata
      'technical_metadata', JSON_BUILD_OBJECT(
        'audit_level', vc.audit_level,
        'versioning_strategy', vbs.backup_strategy,
        'compression_enabled', vbs.enable_compression,
        'retention_days', vbs.retention_days,
        'processing_timestamp', vbs.processing_timestamp,
        'audit_version', '2.0'
      )

    ) as audit_entry_data

  FROM version_backup_strategy vbs
  CROSS JOIN versioning_configuration vc
),

version_storage_operations AS (
  -- Execute version backup operations based on strategy
  SELECT 
    vbs.doc_id,
    vbs.backup_strategy,
    vbs.previous_version,
    vbs.new_version,

    -- Embedded version data (for embedded strategy)
    CASE WHEN vbs.backup_strategy = 'embedded' THEN
      JSON_BUILD_OBJECT(
        'version', vbs.previous_version,
        'timestamp', vbs.change_timestamp,
        'data', CASE WHEN vbs.enable_compression THEN
          JSON_BUILD_OBJECT(
            'compressed', true,
            'data', vbs.old_data,
            'compression_ratio', 0.7
          )
          ELSE vbs.old_data
        END,
        'metadata', JSON_BUILD_OBJECT(
          'change_reason', vbs.change_reason,
          'changed_by', vbs.user_id,
          'session_id', vbs.session_id,
          'change_category', vbs.change_category,
          'business_impact', vbs.business_impact,
          'compliance_impact', vbs.compliance_impact
        )
      )
      ELSE NULL
    END as embedded_version_data,

    -- Referenced version document (for referenced strategy)
    CASE WHEN vbs.backup_strategy = 'referenced' THEN
      JSON_BUILD_OBJECT(
        'version_id', GENERATE_UUID(),
        'document_id', vbs.doc_id,
        'version', vbs.previous_version,
        'timestamp', vbs.change_timestamp,
        'document_snapshot', CASE WHEN vbs.enable_compression THEN
          JSON_BUILD_OBJECT(
            'compressed', true,
            'data', vbs.old_data,
            'original_size', LENGTH(vbs.old_data::text),
            'compression_ratio', 0.7
          )
          ELSE vbs.old_data
        END,
        'change_metadata', JSON_BUILD_OBJECT(
          'change_reason', vbs.change_reason,
          'changed_by', vbs.user_id,
          'session_id', vbs.session_id,
          'request_id', vbs.request_id,
          'change_category', vbs.change_category,
          'business_impact', vbs.business_impact,
          'compliance_impact', vbs.compliance_impact,
          'changed_fields', array(SELECT jsonb_object_keys(vbs.changed_fields)),
          'significant_changes', array_length(array(SELECT jsonb_object_keys(vbs.changed_fields)), 1)
        ),
        'compliance', vbs.compliance_metadata,
        'storage_metadata', JSON_BUILD_OBJECT(
          'compression_enabled', vbs.enable_compression,
          'storage_size', LENGTH(vbs.old_data::text),
          'created_at', vbs.processing_timestamp,
          'retention_until', CURRENT_TIMESTAMP + (vbs.retention_days || ' days')::interval
        )
      )
      ELSE NULL
    END as referenced_version_data

  FROM version_backup_strategy vbs
)

-- Execute versioning operations
INSERT INTO document_versions (
  document_id,
  version_strategy,
  version_data,
  audit_trail_id,
  compliance_metadata,
  created_at
)
SELECT 
  vso.doc_id,
  vso.backup_strategy,
  COALESCE(vso.embedded_version_data, vso.referenced_version_data),
  atc.audit_entry_data->>'audit_id',
  (atc.audit_entry_data->'compliance'),
  CURRENT_TIMESTAMP
FROM version_storage_operations vso
JOIN audit_trail_creation atc ON vso.doc_id = atc.doc_id;

-- Comprehensive versioning analytics and compliance reporting
WITH versioning_analytics AS (
  SELECT 
    DATE_TRUNC('day', created_at) as date_bucket,
    version_strategy,

    -- Volume metrics
    COUNT(*) as total_versions_created,
    COUNT(DISTINCT document_id) as unique_documents_versioned,

    -- Strategy distribution
    COUNT(*) FILTER (WHERE version_strategy = 'embedded') as embedded_versions,
    COUNT(*) FILTER (WHERE version_strategy = 'referenced') as referenced_versions,

    -- Compliance metrics
    COUNT(*) FILTER (WHERE compliance_metadata->>'gdpr_subject' = 'true') as gdpr_versions,
    COUNT(*) FILTER (WHERE compliance_metadata->>'financial_record' = 'true') as sox_versions,
    COUNT(*) FILTER (WHERE compliance_metadata->>'health_record' = 'true') as hipaa_versions,

    -- Business impact analysis
    COUNT(*) FILTER (WHERE (version_data->'change_metadata'->>'business_impact') = 'high') as high_impact_changes,
    COUNT(*) FILTER (WHERE (version_data->'change_metadata'->>'business_impact') = 'medium') as medium_impact_changes,
    COUNT(*) FILTER (WHERE (version_data->'change_metadata'->>'business_impact') = 'low') as low_impact_changes,

    -- Storage utilization
    AVG(LENGTH((version_data->'document_snapshot')::text)) as avg_version_size,
    SUM(LENGTH((version_data->'document_snapshot')::text)) as total_storage_used,
    AVG(CASE WHEN version_data->'storage_metadata'->>'compression_enabled' = 'true' 
             THEN (version_data->'storage_metadata'->>'compression_ratio')::decimal 
             ELSE 1.0 END) as avg_compression_ratio,

    -- User activity analysis
    COUNT(DISTINCT (version_data->'change_metadata'->>'changed_by')) as unique_users,
    MODE() WITHIN GROUP (ORDER BY (version_data->'change_metadata'->>'changed_by')) as most_active_user,

    -- Change pattern analysis
    COUNT(*) FILTER (WHERE (version_data->'change_metadata'->>'requires_approval') = 'true') as approval_required_changes,
    MODE() WITHIN GROUP (ORDER BY (version_data->'change_metadata'->>'change_category')) as most_common_change_type

  FROM document_versions
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY DATE_TRUNC('day', created_at), version_strategy
),

compliance_reporting AS (
  SELECT 
    va.*,

    -- Compliance percentage calculations
    ROUND((gdpr_versions * 100.0 / NULLIF(total_versions_created, 0)), 2) as gdpr_compliance_percent,
    ROUND((sox_versions * 100.0 / NULLIF(total_versions_created, 0)), 2) as sox_compliance_percent,
    ROUND((hipaa_versions * 100.0 / NULLIF(total_versions_created, 0)), 2) as hipaa_compliance_percent,

    -- Storage efficiency metrics
    ROUND((total_storage_used / 1024.0 / 1024.0), 2) as storage_used_mb,
    ROUND((avg_compression_ratio * 100), 1) as avg_compression_percent,
    ROUND((total_storage_used * (1 - avg_compression_ratio) / 1024.0 / 1024.0), 2) as storage_saved_mb,

    -- Change approval metrics
    ROUND((approval_required_changes * 100.0 / NULLIF(total_versions_created, 0)), 2) as approval_rate_percent,

    -- Risk assessment
    CASE 
      WHEN high_impact_changes > total_versions_created * 0.1 THEN 'high_risk'
      WHEN high_impact_changes > total_versions_created * 0.05 THEN 'medium_risk'
      ELSE 'low_risk'
    END as change_risk_level,

    -- Optimization recommendations
    CASE 
      WHEN avg_compression_ratio < 0.5 THEN 'review_compression_settings'
      WHEN referenced_versions > embedded_versions * 2 THEN 'optimize_versioning_strategy'
      WHEN approval_required_changes > total_versions_created * 0.2 THEN 'review_approval_thresholds'
      WHEN unique_users < 5 THEN 'review_user_access_patterns'
      ELSE 'performance_optimal'
    END as optimization_recommendation

  FROM versioning_analytics va
)

SELECT 
  date_bucket,
  version_strategy,

  -- Volume summary
  total_versions_created,
  unique_documents_versioned,
  unique_users,
  most_active_user,

  -- Strategy breakdown
  embedded_versions,
  referenced_versions,
  ROUND((embedded_versions * 100.0 / NULLIF(total_versions_created, 0)), 1) as embedded_strategy_percent,
  ROUND((referenced_versions * 100.0 / NULLIF(total_versions_created, 0)), 1) as referenced_strategy_percent,

  -- Impact analysis
  high_impact_changes,
  medium_impact_changes, 
  low_impact_changes,
  most_common_change_type,

  -- Compliance summary
  gdpr_versions,
  sox_versions,
  hipaa_versions,
  gdpr_compliance_percent,
  sox_compliance_percent,
  hipaa_compliance_percent,

  -- Storage metrics
  ROUND(avg_version_size / 1024.0, 1) as avg_version_size_kb,
  storage_used_mb,
  avg_compression_percent,
  storage_saved_mb,

  -- Approval workflow metrics
  approval_required_changes,
  approval_rate_percent,

  -- Risk and optimization
  change_risk_level,
  optimization_recommendation,

  -- Detailed recommendations
  CASE optimization_recommendation
    WHEN 'review_compression_settings' THEN 'Enable compression for better storage efficiency'
    WHEN 'optimize_versioning_strategy' THEN 'Consider increasing embedded version limits'
    WHEN 'review_approval_thresholds' THEN 'Adjust approval requirements for better workflow efficiency'
    WHEN 'review_user_access_patterns' THEN 'Evaluate user permissions and training needs'
    ELSE 'Continue current versioning configuration - performance is optimal'
  END as detailed_recommendation

FROM compliance_reporting
ORDER BY date_bucket DESC, version_strategy;

-- QueryLeaf provides comprehensive document versioning capabilities:
-- 1. Flexible versioning strategies with embedded, referenced, and hybrid approaches
-- 2. Advanced audit trails with configurable detail levels and compliance tracking
-- 3. Comprehensive change analysis and business impact assessment
-- 4. Built-in compliance support for GDPR, SOX, HIPAA, and custom regulations
-- 5. Intelligent storage optimization with compression and retention management
-- 6. User context tracking and session management for complete audit visibility
-- 7. Change approval workflows for sensitive data and high-impact modifications
-- 8. Performance monitoring and optimization recommendations for versioning strategies
-- 9. SQL-familiar syntax for complex versioning operations and compliance reporting
-- 10. Integration with MongoDB's native document features and indexing optimizations

Best Practices for Production Document Versioning

Enterprise Compliance and Audit Trail Management

Essential principles for effective MongoDB document versioning deployment:

Versioning Strategy Selection: Choose appropriate versioning strategies based on change frequency, document size, and business requirements
Compliance Integration: Implement comprehensive compliance tracking for regulatory frameworks like GDPR, SOX, and HIPAA
Audit Trail Design: Create detailed audit trails with configurable granularity for different business contexts
Storage Optimization: Balance version history completeness with storage efficiency through compression and retention policies
User Context Tracking: Capture complete user, session, and business context for forensic analysis capabilities
Change Approval Workflows: Implement automated approval workflows for high-impact changes and sensitive data modifications

Scalability and Production Deployment

Optimize document versioning for enterprise-scale requirements:

Performance Optimization: Design efficient indexing strategies for versioning and audit collections
Storage Management: Implement automated archiving and retention policies for historical data lifecycle management
Compliance Reporting: Create comprehensive reporting capabilities for regulatory audits and compliance verification
Data Governance: Integrate versioning with enterprise data governance frameworks and security policies
Forensic Readiness: Ensure versioning systems support digital forensics and legal discovery requirements
Integration Architecture: Design versioning systems that integrate seamlessly with existing enterprise applications and workflows

Conclusion

MongoDB document versioning provides comprehensive data history management capabilities that enable enterprise-grade audit trails, regulatory compliance, and forensic analysis through flexible storage strategies and configurable compliance frameworks. The combination of embedded, referenced, and hybrid versioning approaches ensures optimal performance while maintaining complete change visibility.

Key MongoDB Document Versioning benefits include:

Flexible Versioning Strategies: Multiple approaches to balance storage efficiency with audit completeness requirements
Comprehensive Compliance Support: Built-in support for GDPR, SOX, HIPAA, and custom regulatory frameworks
Detailed Audit Trails: Configurable audit granularity from basic change tracking to complete document snapshots
Intelligent Storage Management: Automated compression, retention, and archiving for optimized storage utilization
Complete Context Tracking: User, session, and business context capture for forensic analysis and compliance reporting
SQL Accessibility: Familiar SQL-style versioning operations through QueryLeaf for accessible audit trail management

Whether you're managing regulatory compliance, supporting forensic investigations, implementing data governance policies, or maintaining comprehensive change history for business operations, MongoDB document versioning with QueryLeaf's familiar SQL interface provides the foundation for robust, scalable audit trail management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB document versioning while providing SQL-familiar syntax for audit trail creation, compliance tracking, and historical analysis. Advanced versioning patterns, regulatory compliance workflows, and forensic analysis capabilities are seamlessly handled through familiar SQL constructs, making enterprise-grade audit trail management accessible to SQL-oriented development teams.

The combination of MongoDB's flexible document versioning capabilities with SQL-style audit operations makes it an ideal platform for applications requiring both comprehensive change tracking and familiar database management patterns, ensuring your audit trails can scale efficiently while maintaining regulatory compliance and operational transparency.

November 25, 2025
21 min read

MongoDB Bulk Operations for High-Performance Batch Processing: Enterprise Data Ingestion and Mass Data Processing with SQL-Style Bulk Operations

Enterprise applications frequently require processing large volumes of data through bulk operations that can efficiently insert, update, or delete thousands or millions of records while maintaining data consistency and optimal performance. Traditional approaches to mass data processing often struggle with network round-trips, transaction overhead, and resource utilization when handling high-volume batch operations common in ETL pipelines, data migrations, and real-time ingestion systems.

MongoDB's bulk operations provide native support for high-performance batch processing with automatic optimization, intelligent batching, and comprehensive error handling designed specifically for enterprise-scale data operations. Unlike traditional row-by-row processing that creates excessive network overhead and transaction costs, MongoDB bulk operations combine multiple operations into optimized batches while maintaining ACID guarantees and providing detailed execution feedback for complex data processing workflows.

The Traditional Batch Processing Challenge

Relational databases face significant performance limitations when processing large data volumes:

-- Traditional PostgreSQL bulk processing - inefficient row-by-row operations with poor performance

-- Large-scale product catalog management with individual operations
CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    sku VARCHAR(100) UNIQUE NOT NULL,
    name VARCHAR(500) NOT NULL,
    description TEXT,
    category_id UUID,
    brand_id UUID,

    -- Pricing information
    base_price DECIMAL(12,2) NOT NULL,
    sale_price DECIMAL(12,2),
    cost_price DECIMAL(12,2),
    currency VARCHAR(3) DEFAULT 'USD',

    -- Inventory tracking
    stock_quantity INTEGER DEFAULT 0,
    reserved_quantity INTEGER DEFAULT 0,
    available_quantity INTEGER GENERATED ALWAYS AS (stock_quantity - reserved_quantity) STORED,
    low_stock_threshold INTEGER DEFAULT 10,
    reorder_point INTEGER DEFAULT 5,

    -- Product attributes
    weight_kg DECIMAL(8,3),
    dimensions_cm VARCHAR(50), -- Length x Width x Height
    color VARCHAR(50),
    size VARCHAR(50),
    material VARCHAR(100),

    -- Status and lifecycle
    status VARCHAR(20) DEFAULT 'active' CHECK (status IN ('active', 'inactive', 'discontinued')),
    launch_date DATE,
    discontinue_date DATE,

    -- SEO and marketing
    seo_title VARCHAR(200),
    seo_description TEXT,
    tags TEXT[],

    -- Supplier information
    supplier_id UUID,
    supplier_sku VARCHAR(100),
    lead_time_days INTEGER DEFAULT 14,

    -- Tracking timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    last_inventory_update TIMESTAMP WITH TIME ZONE,

    -- Audit fields
    created_by UUID,
    updated_by UUID,
    version INTEGER DEFAULT 1,

    FOREIGN KEY (category_id) REFERENCES categories(category_id),
    FOREIGN KEY (brand_id) REFERENCES brands(brand_id),
    FOREIGN KEY (supplier_id) REFERENCES suppliers(supplier_id)
);

-- Product images table for multiple images per product
CREATE TABLE product_images (
    image_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id UUID NOT NULL,
    image_url VARCHAR(500) NOT NULL,
    image_type VARCHAR(20) DEFAULT 'product' CHECK (image_type IN ('product', 'thumbnail', 'gallery', 'variant')),
    alt_text VARCHAR(200),
    sort_order INTEGER DEFAULT 0,
    is_primary BOOLEAN DEFAULT false,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE CASCADE
);

-- Product variants for size/color combinations
CREATE TABLE product_variants (
    variant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id UUID NOT NULL,
    variant_sku VARCHAR(100) UNIQUE NOT NULL,
    variant_name VARCHAR(200),

    -- Variant-specific attributes
    size VARCHAR(50),
    color VARCHAR(50),
    material VARCHAR(100),

    -- Variant pricing and inventory
    price_adjustment DECIMAL(12,2) DEFAULT 0,
    stock_quantity INTEGER DEFAULT 0,
    weight_adjustment_kg DECIMAL(8,3) DEFAULT 0,

    -- Status
    is_active BOOLEAN DEFAULT true,

    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE CASCADE
);

-- Inefficient individual insert approach for bulk data loading
DO $$
DECLARE
    batch_size INTEGER := 1000;
    total_records INTEGER := 50000;
    current_batch INTEGER := 0;
    start_time TIMESTAMP;
    end_time TIMESTAMP;
    record_count INTEGER := 0;
    error_count INTEGER := 0;

    -- Sample product data
    categories UUID[] := ARRAY[
        '550e8400-e29b-41d4-a716-446655440001',
        '550e8400-e29b-41d4-a716-446655440002',
        '550e8400-e29b-41d4-a716-446655440003'
    ];

    brands UUID[] := ARRAY[
        '660e8400-e29b-41d4-a716-446655440001',
        '660e8400-e29b-41d4-a716-446655440002',
        '660e8400-e29b-41d4-a716-446655440003'
    ];

    suppliers UUID[] := ARRAY[
        '770e8400-e29b-41d4-a716-446655440001',
        '770e8400-e29b-41d4-a716-446655440002'
    ];

BEGIN
    start_time := clock_timestamp();

    RAISE NOTICE 'Starting bulk product insertion of % records...', total_records;

    -- Inefficient approach - individual INSERT statements with poor performance
    FOR i IN 1..total_records LOOP
        BEGIN
            -- Single row insert - creates network round-trip and transaction overhead for each record
            INSERT INTO products (
                sku, name, description, category_id, brand_id,
                base_price, cost_price, stock_quantity, weight_kg,
                status, supplier_id, created_by
            ) VALUES (
                'SKU-' || LPAD(i::text, 8, '0'),
                'Product ' || i,
                'Description for product ' || i || ' with detailed specifications and features.',
                categories[((i-1) % array_length(categories, 1)) + 1],
                brands[((i-1) % array_length(brands, 1)) + 1],
                ROUND((RANDOM() * 1000 + 10)::numeric, 2),
                ROUND((RANDOM() * 500 + 5)::numeric, 2),
                FLOOR(RANDOM() * 100),
                ROUND((RANDOM() * 10 + 0.1)::numeric, 3),
                CASE WHEN RANDOM() < 0.9 THEN 'active' ELSE 'inactive' END,
                suppliers[((i-1) % array_length(suppliers, 1)) + 1],
                '880e8400-e29b-41d4-a716-446655440001'
            );

            record_count := record_count + 1;

            -- Progress reporting every batch
            IF record_count % batch_size = 0 THEN
                current_batch := current_batch + 1;
                RAISE NOTICE 'Inserted batch %: % records (% total)', current_batch, batch_size, record_count;
            END IF;

        EXCEPTION 
            WHEN OTHERS THEN
                error_count := error_count + 1;
                RAISE NOTICE 'Error inserting record %: %', i, SQLERRM;
        END;
    END LOOP;

    end_time := clock_timestamp();

    RAISE NOTICE 'Bulk insertion completed:';
    RAISE NOTICE '  Total time: %', end_time - start_time;
    RAISE NOTICE '  Records inserted: %', record_count;
    RAISE NOTICE '  Errors encountered: %', error_count;
    RAISE NOTICE '  Average rate: % records/second', ROUND(record_count / EXTRACT(EPOCH FROM end_time - start_time));

    -- Performance issues with individual inserts:
    -- 1. Each INSERT creates a separate network round-trip
    -- 2. Individual transaction overhead for each operation
    -- 3. No batch optimization or bulk loading capabilities
    -- 4. Poor resource utilization and high latency
    -- 5. Difficulty in handling partial failures and rollbacks
    -- 6. Limited parallelization options
    -- 7. Inefficient index maintenance for each individual operation
END $$;

-- Equally inefficient bulk update approach
DO $$
DECLARE
    update_batch_size INTEGER := 500;
    products_cursor CURSOR FOR 
        SELECT product_id, base_price, stock_quantity 
        FROM products 
        WHERE status = 'active';

    product_record RECORD;
    updated_count INTEGER := 0;
    batch_count INTEGER := 0;
    start_time TIMESTAMP := clock_timestamp();

BEGIN
    RAISE NOTICE 'Starting bulk price and inventory update...';

    -- Individual UPDATE statements - highly inefficient for bulk operations
    FOR product_record IN products_cursor LOOP
        BEGIN
            -- Single row update with complex business logic
            UPDATE products 
            SET 
                base_price = CASE 
                    WHEN product_record.base_price < 50 THEN product_record.base_price * 1.15
                    WHEN product_record.base_price < 200 THEN product_record.base_price * 1.10
                    ELSE product_record.base_price * 1.05
                END,

                sale_price = CASE 
                    WHEN RANDOM() < 0.3 THEN base_price * 0.85  -- 30% chance of sale
                    ELSE NULL
                END,

                stock_quantity = CASE 
                    WHEN product_record.stock_quantity < 5 THEN product_record.stock_quantity + 50
                    WHEN product_record.stock_quantity < 20 THEN product_record.stock_quantity + 25
                    ELSE product_record.stock_quantity
                END,

                updated_at = CURRENT_TIMESTAMP,
                version = version + 1,
                updated_by = '880e8400-e29b-41d4-a716-446655440002'

            WHERE product_id = product_record.product_id;

            updated_count := updated_count + 1;

            -- Batch progress reporting
            IF updated_count % update_batch_size = 0 THEN
                batch_count := batch_count + 1;
                RAISE NOTICE 'Updated batch %: % products', batch_count, update_batch_size;
                COMMIT; -- Frequent commits for progress tracking
            END IF;

        EXCEPTION 
            WHEN OTHERS THEN
                RAISE NOTICE 'Error updating product %: %', product_record.product_id, SQLERRM;
        END;
    END LOOP;

    RAISE NOTICE 'Bulk update completed: % products updated in %', 
        updated_count, clock_timestamp() - start_time;
END $$;

-- Traditional batch delete with poor performance characteristics
DELETE FROM products 
WHERE status = 'discontinued' 
  AND discontinue_date < CURRENT_DATE - INTERVAL '2 years'
  AND product_id IN (
    -- Subquery creates additional overhead and complexity
    SELECT p.product_id 
    FROM products p
    LEFT JOIN product_variants pv ON p.product_id = pv.product_id
    LEFT JOIN order_items oi ON p.product_id = oi.product_id
    WHERE pv.variant_id IS NULL  -- No variants
      AND oi.item_id IS NULL     -- No order history
      AND p.stock_quantity = 0   -- No inventory
  );

-- Problems with traditional bulk processing:
-- 1. Row-by-row processing creates excessive network round-trips and latency
-- 2. Individual transaction overhead significantly impacts performance
-- 3. Poor resource utilization and limited parallelization capabilities  
-- 4. Complex error handling for partial failures and rollback scenarios
-- 5. Inefficient index maintenance with frequent individual operations
-- 6. Limited batch optimization and bulk loading strategies
-- 7. Difficulty in monitoring progress and performance of bulk operations
-- 8. Poor integration with modern data pipeline and ETL tools
-- 9. Scalability limitations with very large datasets and concurrent operations
-- 10. Lack of automatic retry and recovery mechanisms for failed operations

-- MySQL bulk operations (even more limited capabilities)
CREATE TABLE mysql_products (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    sku VARCHAR(100) UNIQUE,
    name VARCHAR(500),
    price DECIMAL(10,2),
    quantity INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- MySQL limited bulk insert (no advanced error handling)
INSERT INTO mysql_products (sku, name, price, quantity) VALUES
('SKU001', 'Product 1', 99.99, 10),
('SKU002', 'Product 2', 149.99, 5),
('SKU003', 'Product 3', 199.99, 15);

-- Basic bulk update (limited conditional logic)
UPDATE mysql_products 
SET price = price * 1.1 
WHERE quantity > 0;

-- Simple bulk delete (no complex conditions)
DELETE FROM mysql_products 
WHERE quantity = 0 AND created_at < DATE_SUB(NOW(), INTERVAL 30 DAY);

-- MySQL limitations for bulk operations:
-- - Basic INSERT VALUES limited by max_allowed_packet size
-- - No advanced bulk operation APIs or optimization
-- - Poor error handling and partial failure recovery
-- - Limited conditional logic in bulk updates
-- - Basic performance monitoring and progress tracking
-- - No automatic batching or optimization strategies

MongoDB's bulk operations provide comprehensive high-performance batch processing:

// MongoDB Bulk Operations - enterprise-scale high-performance batch processing
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_data_platform');

// Advanced bulk operations manager for enterprise data processing
class AdvancedBulkOperationsManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Performance configuration
      batchSize: config.batchSize || 1000,
      maxConcurrentOperations: config.maxConcurrentOperations || 10,
      retryAttempts: config.retryAttempts || 3,
      retryDelayMs: config.retryDelayMs || 1000,

      // Memory management
      maxMemoryUsageMB: config.maxMemoryUsageMB || 500,
      memoryMonitoringInterval: config.memoryMonitoringInterval || 1000,

      // Progress tracking
      enableProgressTracking: config.enableProgressTracking || true,
      progressUpdateInterval: config.progressUpdateInterval || 1000,
      enableDetailedLogging: config.enableDetailedLogging || true,

      // Error handling
      continueOnError: config.continueOnError || false,
      errorLoggingLevel: config.errorLoggingLevel || 'detailed',

      // Optimization features
      enableIndexOptimization: config.enableIndexOptimization || true,
      enableCompressionOptimization: config.enableCompressionOptimization || true,
      enableShardingOptimization: config.enableShardingOptimization || true,

      // Monitoring and analytics
      enablePerformanceAnalytics: config.enablePerformanceAnalytics || true,
      enableResourceMonitoring: config.enableResourceMonitoring || true
    };

    this.operationStats = {
      totalOperations: 0,
      successfulOperations: 0,
      failedOperations: 0,
      bytesProcessed: 0,
      operationsPerSecond: 0,
      averageLatency: 0
    };

    this.setupMonitoring();
  }

  async performBulkInsert(collectionName, documents, options = {}) {
    console.log(`Starting bulk insert of ${documents.length} documents into ${collectionName}...`);
    const startTime = Date.now();

    try {
      const collection = this.db.collection(collectionName);
      const operationOptions = {
        ordered: options.ordered !== false, // Default to ordered operations
        bypassDocumentValidation: options.bypassDocumentValidation || false,
        ...options
      };

      // Prepare documents with validation and enrichment
      const preparedDocuments = await this.prepareDocumentsForInsertion(documents, options);

      // Create bulk operation
      const bulkOperations = [];
      const results = {
        insertedIds: [],
        insertedCount: 0,
        errors: [],
        processingTime: 0,
        throughput: 0
      };

      // Process in optimized batches
      const batches = this.createOptimizedBatches(preparedDocuments, this.config.batchSize);

      console.log(`Processing ${batches.length} batches of up to ${this.config.batchSize} documents each...`);

      for (let batchIndex = 0; batchIndex < batches.length; batchIndex++) {
        const batch = batches[batchIndex];
        const batchStartTime = Date.now();

        try {
          // Create bulk write operations for current batch
          const batchOperations = batch.map(doc => ({
            insertOne: {
              document: {
                ...doc,
                _createdAt: new Date(),
                _batchId: options.batchId || new ObjectId(),
                _batchIndex: batchIndex,
                _processingMetadata: {
                  insertedAt: new Date(),
                  batchSize: batch.length,
                  totalBatches: batches.length
                }
              }
            }
          }));

          // Execute bulk write with comprehensive error handling
          const batchResult = await collection.bulkWrite(batchOperations, operationOptions);

          // Track successful operations
          results.insertedCount += batchResult.insertedCount;
          results.insertedIds.push(...Object.values(batchResult.insertedIds));

          // Update statistics
          this.operationStats.totalOperations += batch.length;
          this.operationStats.successfulOperations += batchResult.insertedCount;

          const batchTime = Date.now() - batchStartTime;
          const batchThroughput = Math.round(batch.length / (batchTime / 1000));

          if (this.config.enableProgressTracking) {
            const progress = Math.round(((batchIndex + 1) / batches.length) * 100);
            console.log(`Batch ${batchIndex + 1}/${batches.length} completed: ${batchResult.insertedCount} inserted (${batchThroughput} docs/sec, ${progress}% complete)`);
          }

        } catch (batchError) {
          console.error(`Batch ${batchIndex + 1} failed:`, batchError.message);

          if (!this.config.continueOnError) {
            throw batchError;
          }

          results.errors.push({
            batchIndex,
            batchSize: batch.length,
            error: batchError.message,
            timestamp: new Date()
          });

          this.operationStats.failedOperations += batch.length;
        }
      }

      // Calculate final statistics
      const totalTime = Date.now() - startTime;
      results.processingTime = totalTime;
      results.throughput = Math.round(results.insertedCount / (totalTime / 1000));

      // Update global statistics
      this.operationStats.operationsPerSecond = results.throughput;
      this.operationStats.averageLatency = totalTime / batches.length;

      console.log(`Bulk insert completed: ${results.insertedCount} documents inserted in ${totalTime}ms (${results.throughput} docs/sec)`);

      if (results.errors.length > 0) {
        console.warn(`${results.errors.length} batch errors encountered during insertion`);
      }

      return results;

    } catch (error) {
      console.error(`Bulk insert operation failed:`, error);
      throw error;
    }
  }

  async performBulkUpdate(collectionName, updateOperations, options = {}) {
    console.log(`Starting bulk update of ${updateOperations.length} operations on ${collectionName}...`);
    const startTime = Date.now();

    try {
      const collection = this.db.collection(collectionName);
      const results = {
        matchedCount: 0,
        modifiedCount: 0,
        upsertedCount: 0,
        upsertedIds: [],
        errors: [],
        processingTime: 0,
        throughput: 0
      };

      // Process update operations in optimized batches
      const batches = this.createOptimizedBatches(updateOperations, this.config.batchSize);

      console.log(`Processing ${batches.length} update batches...`);

      for (let batchIndex = 0; batchIndex < batches.length; batchIndex++) {
        const batch = batches[batchIndex];
        const batchStartTime = Date.now();

        try {
          // Create bulk write operations for updates
          const batchOperations = batch.map(operation => {
            const updateDoc = {
              ...operation.update,
              $set: {
                ...operation.update.$set,
                _updatedAt: new Date(),
                _batchId: options.batchId || new ObjectId(),
                _batchIndex: batchIndex,
                _updateMetadata: {
                  updatedAt: new Date(),
                  batchSize: batch.length,
                  operationType: 'bulkUpdate'
                }
              }
            };

            if (operation.upsert) {
              return {
                updateOne: {
                  filter: operation.filter,
                  update: updateDoc,
                  upsert: true
                }
              };
            } else {
              return {
                updateMany: {
                  filter: operation.filter,
                  update: updateDoc
                }
              };
            }
          });

          // Execute bulk write
          const batchResult = await collection.bulkWrite(batchOperations, {
            ordered: options.ordered !== false,
            ...options
          });

          // Aggregate results
          results.matchedCount += batchResult.matchedCount;
          results.modifiedCount += batchResult.modifiedCount;
          results.upsertedCount += batchResult.upsertedCount;
          results.upsertedIds.push(...Object.values(batchResult.upsertedIds || {}));

          const batchTime = Date.now() - batchStartTime;
          const batchThroughput = Math.round(batch.length / (batchTime / 1000));

          if (this.config.enableProgressTracking) {
            const progress = Math.round(((batchIndex + 1) / batches.length) * 100);
            console.log(`Update batch ${batchIndex + 1}/${batches.length}: ${batchResult.modifiedCount} modified (${batchThroughput} ops/sec, ${progress}% complete)`);
          }

        } catch (batchError) {
          console.error(`Update batch ${batchIndex + 1} failed:`, batchError.message);

          if (!this.config.continueOnError) {
            throw batchError;
          }

          results.errors.push({
            batchIndex,
            batchSize: batch.length,
            error: batchError.message,
            timestamp: new Date()
          });
        }
      }

      // Calculate final statistics
      const totalTime = Date.now() - startTime;
      results.processingTime = totalTime;
      results.throughput = Math.round(updateOperations.length / (totalTime / 1000));

      console.log(`Bulk update completed: ${results.modifiedCount} documents modified in ${totalTime}ms (${results.throughput} ops/sec)`);

      return results;

    } catch (error) {
      console.error(`Bulk update operation failed:`, error);
      throw error;
    }
  }

  async performBulkDelete(collectionName, deleteFilters, options = {}) {
    console.log(`Starting bulk delete of ${deleteFilters.length} operations on ${collectionName}...`);
    const startTime = Date.now();

    try {
      const collection = this.db.collection(collectionName);
      const results = {
        deletedCount: 0,
        errors: [],
        processingTime: 0,
        throughput: 0
      };

      // Archive documents before deletion if required
      if (options.archiveBeforeDelete) {
        console.log('Archiving documents before deletion...');
        await this.archiveDocumentsBeforeDelete(collection, deleteFilters, options);
      }

      // Process delete operations in batches
      const batches = this.createOptimizedBatches(deleteFilters, this.config.batchSize);

      console.log(`Processing ${batches.length} delete batches...`);

      for (let batchIndex = 0; batchIndex < batches.length; batchIndex++) {
        const batch = batches[batchIndex];
        const batchStartTime = Date.now();

        try {
          // Create bulk write operations for deletes
          const batchOperations = batch.map(filter => ({
            deleteMany: { filter }
          }));

          // Execute bulk write
          const batchResult = await collection.bulkWrite(batchOperations, {
            ordered: options.ordered !== false,
            ...options
          });

          results.deletedCount += batchResult.deletedCount;

          const batchTime = Date.now() - batchStartTime;
          const batchThroughput = Math.round(batch.length / (batchTime / 1000));

          if (this.config.enableProgressTracking) {
            const progress = Math.round(((batchIndex + 1) / batches.length) * 100);
            console.log(`Delete batch ${batchIndex + 1}/${batches.length}: ${batchResult.deletedCount} deleted (${batchThroughput} ops/sec, ${progress}% complete)`);
          }

        } catch (batchError) {
          console.error(`Delete batch ${batchIndex + 1} failed:`, batchError.message);

          if (!this.config.continueOnError) {
            throw batchError;
          }

          results.errors.push({
            batchIndex,
            batchSize: batch.length,
            error: batchError.message,
            timestamp: new Date()
          });
        }
      }

      // Calculate final statistics
      const totalTime = Date.now() - startTime;
      results.processingTime = totalTime;
      results.throughput = Math.round(deleteFilters.length / (totalTime / 1000));

      console.log(`Bulk delete completed: ${results.deletedCount} documents deleted in ${totalTime}ms (${results.throughput} ops/sec)`);

      return results;

    } catch (error) {
      console.error(`Bulk delete operation failed:`, error);
      throw error;
    }
  }

  async performMixedBulkOperations(collectionName, operations, options = {}) {
    console.log(`Starting mixed bulk operations (${operations.length} operations) on ${collectionName}...`);
    const startTime = Date.now();

    try {
      const collection = this.db.collection(collectionName);
      const results = {
        insertedCount: 0,
        matchedCount: 0,
        modifiedCount: 0,
        deletedCount: 0,
        upsertedCount: 0,
        upsertedIds: [],
        errors: [],
        processingTime: 0,
        throughput: 0
      };

      // Process mixed operations in optimized batches
      const batches = this.createOptimizedBatches(operations, this.config.batchSize);

      console.log(`Processing ${batches.length} mixed operation batches...`);

      for (let batchIndex = 0; batchIndex < batches.length; batchIndex++) {
        const batch = batches[batchIndex];
        const batchStartTime = Date.now();

        try {
          // Transform operations into MongoDB bulk write format
          const batchOperations = batch.map(op => {
            switch (op.type) {
              case 'insert':
                return {
                  insertOne: {
                    document: {
                      ...op.document,
                      _createdAt: new Date(),
                      _batchId: options.batchId || new ObjectId()
                    }
                  }
                };

              case 'update':
                return {
                  updateMany: {
                    filter: op.filter,
                    update: {
                      ...op.update,
                      $set: {
                        ...op.update.$set,
                        _updatedAt: new Date(),
                        _batchId: options.batchId || new ObjectId()
                      }
                    }
                  }
                };

              case 'delete':
                return {
                  deleteMany: {
                    filter: op.filter
                  }
                };

              case 'upsert':
                return {
                  replaceOne: {
                    filter: op.filter,
                    replacement: {
                      ...op.document,
                      _updatedAt: new Date(),
                      _batchId: options.batchId || new ObjectId()
                    },
                    upsert: true
                  }
                };

              default:
                throw new Error(`Unsupported operation type: ${op.type}`);
            }
          });

          // Execute mixed bulk operations
          const batchResult = await collection.bulkWrite(batchOperations, {
            ordered: options.ordered !== false,
            ...options
          });

          // Aggregate results
          results.insertedCount += batchResult.insertedCount || 0;
          results.matchedCount += batchResult.matchedCount || 0;
          results.modifiedCount += batchResult.modifiedCount || 0;
          results.deletedCount += batchResult.deletedCount || 0;
          results.upsertedCount += batchResult.upsertedCount || 0;
          if (batchResult.upsertedIds) {
            results.upsertedIds.push(...Object.values(batchResult.upsertedIds));
          }

          const batchTime = Date.now() - batchStartTime;
          const batchThroughput = Math.round(batch.length / (batchTime / 1000));

          if (this.config.enableProgressTracking) {
            const progress = Math.round(((batchIndex + 1) / batches.length) * 100);
            console.log(`Mixed batch ${batchIndex + 1}/${batches.length}: completed (${batchThroughput} ops/sec, ${progress}% complete)`);
          }

        } catch (batchError) {
          console.error(`Mixed batch ${batchIndex + 1} failed:`, batchError.message);

          if (!this.config.continueOnError) {
            throw batchError;
          }

          results.errors.push({
            batchIndex,
            batchSize: batch.length,
            error: batchError.message,
            timestamp: new Date()
          });
        }
      }

      // Calculate final statistics
      const totalTime = Date.now() - startTime;
      results.processingTime = totalTime;
      results.throughput = Math.round(operations.length / (totalTime / 1000));

      console.log(`Mixed bulk operations completed in ${totalTime}ms (${results.throughput} ops/sec)`);
      console.log(`Results: ${results.insertedCount} inserted, ${results.modifiedCount} modified, ${results.deletedCount} deleted`);

      return results;

    } catch (error) {
      console.error(`Mixed bulk operations failed:`, error);
      throw error;
    }
  }

  async prepareDocumentsForInsertion(documents, options) {
    return documents.map((doc, index) => ({
      ...doc,
      _id: doc._id || new ObjectId(),
      _documentIndex: index,
      _validationStatus: 'validated',
      _preparationTimestamp: new Date()
    }));
  }

  createOptimizedBatches(items, batchSize) {
    const batches = [];
    for (let i = 0; i < items.length; i += batchSize) {
      batches.push(items.slice(i, i + batchSize));
    }
    return batches;
  }

  async archiveDocumentsBeforeDelete(collection, filters, options) {
    const archiveCollection = this.db.collection(`${collection.collectionName}_archive`);

    for (const filter of filters) {
      const documentsToArchive = await collection.find(filter).toArray();
      if (documentsToArchive.length > 0) {
        const archiveDocuments = documentsToArchive.map(doc => ({
          ...doc,
          _archivedAt: new Date(),
          _originalId: doc._id,
          _archiveReason: options.archiveReason || 'bulk_delete'
        }));

        await archiveCollection.insertMany(archiveDocuments);
      }
    }
  }

  setupMonitoring() {
    if (this.config.enablePerformanceAnalytics) {
      console.log('Performance analytics enabled');
    }
  }

  async getOperationStatistics() {
    return {
      ...this.operationStats,
      timestamp: new Date()
    };
  }
}

// Example usage for enterprise-scale bulk operations
async function demonstrateAdvancedBulkOperations() {
  const bulkManager = new AdvancedBulkOperationsManager(db, {
    batchSize: 2000,
    maxConcurrentOperations: 15,
    enableProgressTracking: true,
    enableDetailedLogging: true,
    continueOnError: true
  });

  try {
    // Bulk insert: Load 100,000 product records
    const productDocuments = Array.from({ length: 100000 }, (_, index) => ({
      sku: `BULK-PRODUCT-${String(index + 1).padStart(8, '0')}`,
      name: `Enterprise Product ${index + 1}`,
      description: `Comprehensive product description for bulk-loaded item ${index + 1}`,
      category: ['electronics', 'clothing', 'books', 'home'][index % 4],
      brand: ['BrandA', 'BrandB', 'BrandC', 'BrandD'][index % 4],
      price: Math.round((Math.random() * 1000 + 10) * 100) / 100,
      costPrice: Math.round((Math.random() * 500 + 5) * 100) / 100,
      stockQuantity: Math.floor(Math.random() * 100),
      weight: Math.round((Math.random() * 10 + 0.1) * 1000) / 1000,
      status: Math.random() < 0.9 ? 'active' : 'inactive',
      tags: [`tag-${index % 10}`, `category-${index % 5}`],
      metadata: {
        supplier: `supplier-${index % 20}`,
        leadTime: Math.floor(Math.random() * 30) + 1,
        quality: Math.random() < 0.8 ? 'high' : 'medium'
      }
    }));

    console.log('Starting bulk product insertion...');
    const insertResults = await bulkManager.performBulkInsert('products', productDocuments, {
      batchId: new ObjectId(),
      ordered: true
    });

    // Bulk update: Update pricing for all active products
    const priceUpdateOperations = [
      {
        filter: { status: 'active', price: { $lt: 100 } },
        update: {
          $mul: { price: 1.15 },
          $set: { priceUpdatedReason: 'low_price_adjustment' }
        }
      },
      {
        filter: { status: 'active', price: { $gte: 100, $lt: 500 } },
        update: {
          $mul: { price: 1.10 },
          $set: { priceUpdatedReason: 'medium_price_adjustment' }
        }
      },
      {
        filter: { status: 'active', price: { $gte: 500 } },
        update: {
          $mul: { price: 1.05 },
          $set: { priceUpdatedReason: 'premium_price_adjustment' }
        }
      }
    ];

    console.log('Starting bulk price updates...');
    const updateResults = await bulkManager.performBulkUpdate('products', priceUpdateOperations, {
      batchId: new ObjectId()
    });

    // Mixed operations: Complex business logic
    const mixedOperations = [
      // Insert new seasonal products
      ...Array.from({ length: 1000 }, (_, index) => ({
        type: 'insert',
        document: {
          sku: `SEASONAL-${String(index + 1).padStart(6, '0')}`,
          name: `Seasonal Product ${index + 1}`,
          category: 'seasonal',
          price: Math.round((Math.random() * 200 + 20) * 100) / 100,
          status: 'active',
          seasonal: true,
          season: 'winter'
        }
      })),

      // Update low-stock products
      {
        type: 'update',
        filter: { stockQuantity: { $lt: 10 }, status: 'active' },
        update: {
          $set: { 
            lowStockAlert: true,
            restockPriority: 'high',
            alertTriggeredAt: new Date()
          }
        }
      },

      // Delete discontinued products with no inventory
      {
        type: 'delete',
        filter: { 
          status: 'discontinued', 
          stockQuantity: 0,
          lastOrderDate: { $lt: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
        }
      }
    ];

    console.log('Starting mixed bulk operations...');
    const mixedResults = await bulkManager.performMixedBulkOperations('products', mixedOperations);

    // Get final statistics
    const finalStats = await bulkManager.getOperationStatistics();
    console.log('Final Operation Statistics:', finalStats);

    return {
      insertResults,
      updateResults,
      mixedResults,
      finalStats
    };

  } catch (error) {
    console.error('Bulk operations demonstration failed:', error);
    throw error;
  }
}

// Benefits of MongoDB Bulk Operations:
// - Native batch optimization with intelligent operation grouping and network efficiency
// - Comprehensive error handling with partial failure recovery and detailed error reporting
// - Flexible operation mixing with support for complex business logic in single batches
// - Advanced progress tracking and performance monitoring for enterprise operations
// - Memory-efficient processing with automatic resource management and optimization
// - Production-ready scalability with sharding optimization and distributed processing
// - Integrated retry mechanisms with configurable backoff strategies and failure handling
// - Rich analytics and monitoring capabilities for operational insight and optimization

module.exports = {
  AdvancedBulkOperationsManager,
  demonstrateAdvancedBulkOperations
};

Understanding MongoDB Bulk Operations Architecture

High-Performance Batch Processing Patterns

Implement sophisticated bulk operation strategies for enterprise data processing:

// Production-scale bulk operations implementation with advanced monitoring and optimization
class ProductionBulkProcessingPlatform extends AdvancedBulkOperationsManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      distributedProcessing: true,
      realTimeMonitoring: true,
      advancedOptimization: true,
      enterpriseErrorHandling: true,
      automaticRecovery: true,
      performanceAnalytics: true
    };

    this.setupProductionOptimizations();
    this.initializeDistributedProcessing();
    this.setupAdvancedMonitoring();
  }

  async implementDistributedBulkProcessing() {
    console.log('Setting up distributed bulk processing architecture...');

    const distributedStrategy = {
      // Parallel processing
      parallelization: {
        maxConcurrentBatches: 20,
        adaptiveBatchSizing: true,
        loadBalancing: true,
        resourceOptimization: true
      },

      // Distributed coordination
      coordination: {
        shardAwareness: true,
        crossShardOptimization: true,
        distributedTransactions: true,
        consistencyGuarantees: 'majority'
      },

      // Performance optimization
      performance: {
        networkOptimization: true,
        compressionEnabled: true,
        pipelineOptimization: true,
        indexOptimization: true
      }
    };

    return await this.deployDistributedStrategy(distributedStrategy);
  }

  async implementAdvancedErrorHandling() {
    console.log('Implementing enterprise-grade error handling...');

    const errorHandlingStrategy = {
      // Recovery mechanisms
      recovery: {
        automaticRetry: true,
        exponentialBackoff: true,
        circuitBreaker: true,
        fallbackStrategies: true
      },

      // Error classification
      classification: {
        transientErrors: 'retry',
        permanentErrors: 'skip_and_log',
        partialFailures: 'continue_processing',
        criticalErrors: 'immediate_stop'
      },

      // Monitoring and alerting
      monitoring: {
        realTimeAlerts: true,
        errorTrend: true,
        performanceImpact: true,
        operationalDashboard: true
      }
    };

    return await this.deployErrorHandlingStrategy(errorHandlingStrategy);
  }

  async implementPerformanceOptimization() {
    console.log('Implementing advanced performance optimization...');

    const optimizationStrategy = {
      // Batch optimization
      batchOptimization: {
        dynamicBatchSizing: true,
        contentAwareBatching: true,
        resourceBasedAdjustment: true,
        latencyOptimization: true
      },

      // Memory management
      memoryManagement: {
        streamingProcessing: true,
        memoryPooling: true,
        garbageCollectionOptimization: true,
        resourceMonitoring: true
      },

      // Network optimization
      networkOptimization: {
        connectionPooling: true,
        compressionEnabled: true,
        pipeliningEnabled: true,
        latencyReduction: true
      }
    };

    return await this.deployOptimizationStrategy(optimizationStrategy);
  }
}

SQL-Style Bulk Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB bulk operations and batch processing:

-- QueryLeaf bulk operations with SQL-familiar batch processing syntax

-- Bulk insert with advanced batch configuration
BULK INSERT INTO products (sku, name, category, price, stock_quantity)
VALUES 
  -- Batch of 1000 products with optimized performance settings
  ('BULK-001', 'Enterprise Product 1', 'electronics', 299.99, 50),
  ('BULK-002', 'Enterprise Product 2', 'electronics', 399.99, 75),
  ('BULK-003', 'Enterprise Product 3', 'clothing', 99.99, 25),
  -- ... continuing for large datasets
WITH (
  batch_size = 2000,
  ordered = true,
  continue_on_error = false,
  enable_progress_tracking = true,

  -- Performance optimization
  bypass_document_validation = false,
  compression_enabled = true,
  parallel_processing = true,
  max_concurrent_batches = 10,

  -- Error handling
  retry_attempts = 3,
  retry_delay_ms = 1000,
  detailed_error_logging = true,

  -- Monitoring
  enable_analytics = true,
  progress_update_interval = 1000,
  performance_monitoring = true
);

-- Bulk insert from SELECT with data transformation
BULK INSERT INTO product_summaries (category, product_count, avg_price, total_value, last_updated)
SELECT 
  category,
  COUNT(*) as product_count,
  ROUND(AVG(price), 2) as avg_price,
  ROUND(SUM(price * stock_quantity), 2) as total_value,
  CURRENT_TIMESTAMP as last_updated
FROM products
WHERE status = 'active'
GROUP BY category
HAVING COUNT(*) > 10
WITH (
  batch_size = 500,
  upsert_on_conflict = true,
  conflict_resolution = 'replace'
);

-- Advanced bulk update with complex conditions and business logic
BULK UPDATE products
SET 
  price = CASE 
    WHEN category = 'electronics' AND price < 100 THEN price * 1.20
    WHEN category = 'electronics' AND price >= 100 THEN price * 1.15
    WHEN category = 'clothing' AND price < 50 THEN price * 1.25
    WHEN category = 'clothing' AND price >= 50 THEN price * 1.15
    WHEN category = 'books' THEN price * 1.10
    ELSE price * 1.05
  END,

  sale_price = CASE 
    WHEN RANDOM() < 0.3 THEN price * 0.85  -- 30% chance of sale
    ELSE NULL
  END,

  stock_status = CASE 
    WHEN stock_quantity = 0 THEN 'out_of_stock'
    WHEN stock_quantity < 10 THEN 'low_stock'
    WHEN stock_quantity < 5 THEN 'critical_stock'
    ELSE 'in_stock'
  END,

  priority_reorder = CASE 
    WHEN stock_quantity < reorder_point THEN true
    ELSE false
  END,

  last_price_update = CURRENT_TIMESTAMP,
  price_update_reason = 'bulk_adjustment_2025',
  updated_by = CURRENT_USER_ID(),
  version = version + 1

WHERE status = 'active'
  AND created_at >= CURRENT_DATE - INTERVAL '2 years'
WITH (
  batch_size = 1500,
  ordered = false,  -- Allow parallel processing
  continue_on_error = true,

  -- Update strategy
  multi_document_updates = true,
  atomic_operations = true,

  -- Performance tuning
  index_optimization = true,
  parallel_processing = true,
  max_concurrent_operations = 15,

  -- Progress tracking
  enable_progress_tracking = true,
  progress_callback = 'product_update_progress_handler',
  estimated_total_operations = 50000
);

-- Bulk upsert operations with conflict resolution
BULK UPSERT INTO inventory_snapshots (product_id, snapshot_date, stock_quantity, reserved_quantity, available_quantity)
SELECT 
  p.product_id,
  CURRENT_DATE as snapshot_date,
  p.stock_quantity,
  COALESCE(r.reserved_quantity, 0) as reserved_quantity,
  p.stock_quantity - COALESCE(r.reserved_quantity, 0) as available_quantity
FROM products p
LEFT JOIN (
  SELECT 
    product_id,
    SUM(quantity) as reserved_quantity
  FROM order_items oi
  JOIN orders o ON oi.order_id = o.order_id
  WHERE o.status IN ('pending', 'processing')
  GROUP BY product_id
) r ON p.product_id = r.product_id
WHERE p.status = 'active'
WITH (
  batch_size = 1000,
  conflict_resolution = 'update',

  -- Upsert configuration
  upsert_conditions = JSON_OBJECT(
    'match_fields', JSON_ARRAY('product_id', 'snapshot_date'),
    'update_strategy', 'replace_all',
    'preserve_audit_fields', true
  ),

  -- Performance optimization
  enable_bulk_optimization = true,
  parallel_upserts = true,
  transaction_batching = true
);

-- Complex bulk delete with archiving and cascade handling
BULK DELETE FROM products
WHERE status = 'discontinued'
  AND discontinue_date < CURRENT_DATE - INTERVAL '2 years'
  AND stock_quantity = 0
  AND product_id NOT IN (
    -- Exclude products with recent orders
    SELECT DISTINCT product_id 
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    WHERE o.order_date > CURRENT_DATE - INTERVAL '1 year'
  )
  AND product_id NOT IN (
    -- Exclude products with active variants
    SELECT DISTINCT product_id 
    FROM product_variants 
    WHERE is_active = true
  )
WITH (
  batch_size = 800,
  ordered = true,

  -- Archive configuration
  archive_before_delete = true,
  archive_collection = 'products_archive',
  archive_metadata = JSON_OBJECT(
    'deletion_reason', 'bulk_cleanup_2025',
    'deleted_by', CURRENT_USER_ID(),
    'deletion_date', CURRENT_TIMESTAMP,
    'cleanup_batch_id', 'batch_2025_001'
  ),

  -- Cascade handling
  handle_cascades = true,
  cascade_operations = JSON_ARRAY(
    JSON_OBJECT('collection', 'product_images', 'action', 'delete'),
    JSON_OBJECT('collection', 'product_reviews', 'action', 'archive'),
    JSON_OBJECT('collection', 'product_analytics', 'action', 'delete')
  ),

  -- Safety measures
  require_confirmation = true,
  max_delete_limit = 10000,
  enable_rollback = true,
  rollback_timeout_minutes = 30
);

-- Mixed bulk operations for complex business processes
BEGIN BULK TRANSACTION 'product_lifecycle_update_2025';

-- Step 1: Insert new seasonal products
BULK INSERT INTO products (sku, name, category, price, stock_quantity, status, seasonal_info)
SELECT 
  'SEASONAL-' || LPAD(ROW_NUMBER() OVER (ORDER BY t.name), 6, '0') as sku,
  t.name,
  t.category,
  t.price,
  t.initial_stock,
  'active' as status,
  JSON_OBJECT(
    'is_seasonal', true,
    'season', 'winter_2025',
    'availability_start', '2025-12-01',
    'availability_end', '2026-02-28'
  ) as seasonal_info
FROM temp_seasonal_products t
WITH (batch_size = 1000);

-- Step 2: Update existing product categories
BULK UPDATE products 
SET 
  category = CASE 
    WHEN category = 'winter_clothing' THEN 'clothing'
    WHEN category = 'holiday_electronics' THEN 'electronics'
    WHEN category = 'seasonal_books' THEN 'books'
    ELSE category
  END,

  tags = CASE 
    WHEN category LIKE '%seasonal%' THEN 
      ARRAY_APPEND(tags, 'seasonal_clearance')
    ELSE tags
  END,

  clearance_eligible = CASE 
    WHEN category LIKE '%seasonal%' AND stock_quantity > 50 THEN true
    ELSE clearance_eligible
  END

WHERE category LIKE '%seasonal%'
   OR category LIKE '%holiday%'
WITH (batch_size = 1200);

-- Step 3: Archive old product versions
BULK MOVE products TO products_archive
WHERE version < 5
  AND last_updated < CURRENT_DATE - INTERVAL '1 year'
  AND status = 'inactive'
WITH (
  batch_size = 500,
  preserve_relationships = false,
  archive_metadata = JSON_OBJECT(
    'archive_reason', 'version_cleanup',
    'archive_date', CURRENT_TIMESTAMP
  )
);

-- Step 4: Clean up orphaned related records
BULK DELETE FROM product_images 
WHERE product_id NOT IN (
  SELECT product_id FROM products
  UNION 
  SELECT original_product_id FROM products_archive
)
WITH (batch_size = 1000);

COMMIT BULK TRANSACTION;

-- Monitoring and analytics for bulk operations
WITH bulk_operation_analytics AS (
  SELECT 
    operation_type,
    collection_name,
    batch_id,
    operation_start_time,
    operation_end_time,
    EXTRACT(EPOCH FROM operation_end_time - operation_start_time) as duration_seconds,

    -- Operation counts
    total_operations_attempted,
    successful_operations,
    failed_operations,

    -- Performance metrics
    operations_per_second,
    average_batch_size,
    peak_memory_usage_mb,
    network_bytes_transmitted,

    -- Error analysis
    error_rate,
    most_common_error_type,
    retry_count,

    -- Resource utilization
    cpu_usage_percent,
    memory_usage_percent,
    disk_io_operations,

    -- Business impact
    data_volume_processed_mb,
    estimated_cost_savings,
    processing_efficiency_score

  FROM bulk_operation_logs
  WHERE operation_date >= CURRENT_DATE - INTERVAL '30 days'
),

performance_summary AS (
  SELECT 
    operation_type,
    COUNT(*) as total_operations,

    -- Performance statistics
    ROUND(AVG(duration_seconds), 2) as avg_duration_seconds,
    ROUND(AVG(operations_per_second), 0) as avg_throughput,
    ROUND(AVG(processing_efficiency_score), 2) as avg_efficiency,

    -- Error analysis
    ROUND(AVG(error_rate) * 100, 2) as avg_error_rate_percent,
    SUM(failed_operations) as total_failures,

    -- Resource consumption
    ROUND(AVG(peak_memory_usage_mb), 0) as avg_peak_memory_mb,
    ROUND(SUM(data_volume_processed_mb), 0) as total_data_processed_mb,

    -- Performance trends
    ROUND(
      (AVG(operations_per_second) FILTER (WHERE operation_start_time >= CURRENT_DATE - INTERVAL '7 days') - 
       AVG(operations_per_second) FILTER (WHERE operation_start_time < CURRENT_DATE - INTERVAL '7 days')) /
      AVG(operations_per_second) FILTER (WHERE operation_start_time < CURRENT_DATE - INTERVAL '7 days') * 100,
      1
    ) as performance_trend_percent,

    -- Cost analysis
    ROUND(SUM(estimated_cost_savings), 2) as total_cost_savings

  FROM bulk_operation_analytics
  GROUP BY operation_type
),

efficiency_recommendations AS (
  SELECT 
    ps.operation_type,
    ps.total_operations,
    ps.avg_throughput,
    ps.avg_efficiency,

    -- Performance assessment
    CASE 
      WHEN ps.avg_efficiency >= 0.9 THEN 'Excellent'
      WHEN ps.avg_efficiency >= 0.8 THEN 'Good'  
      WHEN ps.avg_efficiency >= 0.7 THEN 'Fair'
      ELSE 'Needs Improvement'
    END as performance_rating,

    -- Optimization recommendations
    CASE 
      WHEN ps.avg_error_rate_percent > 5 THEN 'Focus on error handling and data validation'
      WHEN ps.avg_peak_memory_mb > 1000 THEN 'Optimize memory usage and batch sizing'
      WHEN ps.avg_throughput < 100 THEN 'Increase parallelization and batch optimization'
      WHEN ps.performance_trend_percent < -10 THEN 'Investigate performance degradation'
      ELSE 'Performance is optimized'
    END as primary_recommendation,

    -- Capacity planning
    CASE 
      WHEN ps.total_data_processed_mb > 10000 THEN 'Consider distributed processing'
      WHEN ps.total_operations > 1000 THEN 'Implement advanced caching strategies'  
      ELSE 'Current capacity is sufficient'
    END as capacity_recommendation,

    -- Cost optimization
    ps.total_cost_savings,
    CASE 
      WHEN ps.total_cost_savings < 100 THEN 'Minimal cost impact'
      WHEN ps.total_cost_savings < 1000 THEN 'Moderate cost savings'
      ELSE 'Significant cost optimization achieved'
    END as cost_impact

  FROM performance_summary ps
)

-- Comprehensive bulk operations dashboard
SELECT 
  er.operation_type,
  er.total_operations,
  er.performance_rating,
  er.avg_throughput || ' ops/sec' as throughput,
  er.avg_efficiency * 100 || '%' as efficiency_percentage,

  -- Recommendations
  er.primary_recommendation,
  er.capacity_recommendation,
  er.cost_impact,

  -- Detailed metrics
  JSON_OBJECT(
    'avg_duration', ps.avg_duration_seconds || ' seconds',
    'error_rate', ps.avg_error_rate_percent || '%',
    'memory_usage', ps.avg_peak_memory_mb || ' MB',
    'data_processed', ps.total_data_processed_mb || ' MB',
    'performance_trend', ps.performance_trend_percent || '% change',
    'total_failures', ps.total_failures,
    'cost_savings', '$' || ps.total_cost_savings
  ) as detailed_metrics,

  -- Next actions
  CASE 
    WHEN er.performance_rating = 'Needs Improvement' THEN 
      JSON_ARRAY(
        'Review batch sizing configuration',
        'Analyze error patterns and root causes',
        'Consider infrastructure scaling',
        'Implement performance monitoring alerts'
      )
    WHEN er.performance_rating = 'Fair' THEN 
      JSON_ARRAY(
        'Fine-tune batch optimization parameters',
        'Implement advanced error handling',
        'Consider parallel processing improvements'
      )
    ELSE 
      JSON_ARRAY('Continue monitoring performance trends', 'Maintain current optimization level')
  END as recommended_actions

FROM efficiency_recommendations er
JOIN performance_summary ps ON er.operation_type = ps.operation_type
ORDER BY er.avg_throughput DESC;

-- QueryLeaf provides comprehensive bulk operation capabilities:
-- 1. SQL-familiar bulk insert, update, and delete operations with advanced configuration
-- 2. Sophisticated batch processing with intelligent optimization and error handling
-- 3. Mixed operation transactions with complex business logic and cascade handling
-- 4. Comprehensive monitoring and analytics for operational insight and optimization
-- 5. Production-ready performance tuning with parallel processing and resource management
-- 6. Advanced error handling with retry mechanisms and partial failure recovery
-- 7. Seamless integration with MongoDB's native bulk operation APIs and optimizations
-- 8. Enterprise-scale processing capabilities with distributed operation support
-- 9. Intelligent batch sizing and resource optimization for maximum throughput
-- 10. SQL-style syntax for complex bulk operation workflows and data transformations

Best Practices for Production Bulk Operations Implementation

Performance Architecture and Optimization Strategies

Essential principles for scalable MongoDB bulk operations deployment:

Batch Size Optimization: Configure optimal batch sizes based on document size, operation type, and available resources
Error Handling Strategy: Design comprehensive error handling with retry mechanisms and partial failure recovery
Resource Management: Implement intelligent resource monitoring and automatic optimization for memory and CPU usage
Progress Tracking: Provide detailed progress monitoring and analytics for long-running bulk operations
Transaction Management: Design appropriate transaction boundaries and consistency guarantees for bulk operations
Performance Monitoring: Implement comprehensive monitoring for throughput, latency, and resource utilization

Scalability and Operational Excellence

Optimize bulk operations for enterprise-scale requirements:

Distributed Processing: Design sharding-aware bulk operations that optimize cross-shard performance
Parallel Execution: Implement intelligent parallelization strategies that maximize throughput without overwhelming resources
Memory Optimization: Use streaming processing and memory pooling to handle large datasets efficiently
Network Efficiency: Minimize network overhead through compression and intelligent batching strategies
Error Recovery: Implement robust error recovery and rollback mechanisms for complex bulk operation workflows
Operational Monitoring: Provide comprehensive dashboards and alerting for production bulk operation management

Conclusion

MongoDB bulk operations provide comprehensive high-performance batch processing capabilities that enable efficient handling of large-scale data operations with automatic optimization, intelligent batching, and robust error handling designed specifically for enterprise applications. The native MongoDB integration ensures bulk operations benefit from the same scalability, consistency, and operational features as individual operations while providing significant performance improvements.

Key MongoDB bulk operations benefits include:

High Performance: Optimized batch processing with intelligent operation grouping and minimal network overhead
Comprehensive Error Handling: Robust error management with partial failure recovery and detailed reporting
Flexible Operations: Support for mixed operation types with complex business logic in single batch transactions
Resource Efficiency: Memory-efficient processing with automatic resource management and optimization
Production Scalability: Enterprise-ready performance with distributed processing and advanced monitoring
Operational Excellence: Integrated analytics, progress tracking, and performance optimization tools

Whether you're building ETL pipelines, data migration systems, real-time ingestion platforms, or any application requiring high-volume data processing, MongoDB bulk operations with QueryLeaf's familiar SQL interface provide the foundation for scalable and maintainable batch processing solutions.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB bulk operations while providing SQL-familiar syntax for complex batch processing workflows. Advanced bulk operation patterns, error handling strategies, and performance optimization are seamlessly handled through familiar SQL constructs, making sophisticated batch processing capabilities accessible to SQL-oriented development teams.

The combination of MongoDB's robust bulk operation capabilities with SQL-style batch processing makes it an ideal platform for modern applications that require both powerful data processing and familiar database management patterns, ensuring your bulk operation solutions scale efficiently while remaining maintainable and feature-rich.

November 24, 2025
27 min read

MongoDB GridFS and Binary Data Management: Advanced File Storage Solutions for Large-Scale Applications with SQL-Style File Operations

Modern applications require robust file storage solutions that can handle large files, multimedia content, and binary data at scale while providing efficient streaming, versioning, and metadata management capabilities. Traditional file storage approaches struggle with managing large files, handling concurrent access, providing atomic operations, and integrating seamlessly with database transactions and application logic.

MongoDB GridFS provides comprehensive large file storage capabilities that enable efficient handling of binary data, multimedia content, and large documents with automatic chunking, streaming support, and integrated metadata management. Unlike traditional file systems that separate file storage from database operations, GridFS integrates file storage directly into MongoDB, enabling atomic operations, transactions, and unified query capabilities across both structured data and file content.

The Traditional File Storage Challenge

Conventional approaches to large file storage and binary data management have significant limitations for modern applications:

-- Traditional PostgreSQL large object storage - complex and limited integration

-- Basic large object table structure with limited capabilities
CREATE TABLE document_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_size BIGINT NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    content_hash VARCHAR(64) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- PostgreSQL large object reference
    content_oid OID NOT NULL,

    -- File metadata
    original_filename VARCHAR(500),
    upload_session_id UUID,
    uploader_user_id UUID,

    -- File properties
    is_public BOOLEAN DEFAULT FALSE,
    download_count INTEGER DEFAULT 0,
    last_accessed TIMESTAMP,

    -- Content analysis
    file_extension VARCHAR(20),
    encoding VARCHAR(50),
    language VARCHAR(10),

    -- Storage metadata
    storage_location VARCHAR(200),
    backup_status VARCHAR(50) DEFAULT 'pending',
    compression_enabled BOOLEAN DEFAULT FALSE
);

-- Image-specific metadata table
CREATE TABLE image_files (
    file_id UUID PRIMARY KEY REFERENCES document_files(file_id),
    width INTEGER,
    height INTEGER,
    color_depth INTEGER,
    has_transparency BOOLEAN,
    image_format VARCHAR(20),
    resolution_dpi INTEGER,
    color_profile VARCHAR(100),

    -- Image processing metadata
    thumbnail_generated BOOLEAN DEFAULT FALSE,
    processed_versions JSONB,
    exif_data JSONB
);

-- Video-specific metadata table  
CREATE TABLE video_files (
    file_id UUID PRIMARY KEY REFERENCES document_files(file_id),
    duration_seconds INTEGER,
    width INTEGER,
    height INTEGER,
    frame_rate DECIMAL(5,2),
    video_codec VARCHAR(50),
    audio_codec VARCHAR(50),
    bitrate INTEGER,
    container_format VARCHAR(20),

    -- Video processing metadata
    thumbnails_generated BOOLEAN DEFAULT FALSE,
    preview_clips JSONB,
    processing_status VARCHAR(50) DEFAULT 'pending'
);

-- Audio file metadata table
CREATE TABLE audio_files (
    file_id UUID PRIMARY KEY REFERENCES document_files(file_id),
    duration_seconds INTEGER,
    sample_rate INTEGER,
    channels INTEGER,
    bitrate INTEGER,
    audio_codec VARCHAR(50),
    container_format VARCHAR(20),

    -- Audio metadata
    title VARCHAR(200),
    artist VARCHAR(200),
    album VARCHAR(200),
    genre VARCHAR(100),
    year INTEGER,

    -- Processing metadata
    waveform_generated BOOLEAN DEFAULT FALSE,
    transcription_status VARCHAR(50)
);

-- Complex file chunk management for large files
CREATE TABLE file_chunks (
    chunk_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES document_files(file_id),
    chunk_number INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,
    chunk_hash VARCHAR(64) NOT NULL,
    content_oid OID NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE(file_id, chunk_number)
);

-- Index for chunk retrieval performance
CREATE INDEX idx_file_chunks_file_id_number ON file_chunks (file_id, chunk_number);
CREATE INDEX idx_document_files_hash ON document_files (content_hash);
CREATE INDEX idx_document_files_mime_type ON document_files (mime_type);
CREATE INDEX idx_document_files_created ON document_files (created_at);

-- Complex file upload and streaming implementation
CREATE OR REPLACE FUNCTION upload_large_file(
    p_filename TEXT,
    p_file_content BYTEA,
    p_mime_type TEXT DEFAULT 'application/octet-stream',
    p_user_id UUID DEFAULT NULL,
    p_chunk_size INTEGER DEFAULT 1048576  -- 1MB chunks
) RETURNS UUID
LANGUAGE plpgsql
AS $$
DECLARE
    v_file_id UUID;
    v_content_oid OID;
    v_file_size BIGINT;
    v_content_hash TEXT;
    v_chunk_count INTEGER;
    v_chunk_start INTEGER;
    v_chunk_end INTEGER;
    v_chunk_content BYTEA;
    v_chunk_oid OID;
    i INTEGER;
BEGIN
    -- Calculate file properties
    v_file_size := length(p_file_content);
    v_content_hash := encode(digest(p_file_content, 'sha256'), 'hex');
    v_chunk_count := CEIL(v_file_size::DECIMAL / p_chunk_size);

    -- Check for duplicate content
    SELECT file_id INTO v_file_id 
    FROM document_files 
    WHERE content_hash = v_content_hash;

    IF v_file_id IS NOT NULL THEN
        -- Update access count for existing file
        UPDATE document_files 
        SET download_count = download_count + 1,
            last_accessed = CURRENT_TIMESTAMP
        WHERE file_id = v_file_id;
        RETURN v_file_id;
    END IF;

    -- Generate new file ID
    v_file_id := gen_random_uuid();

    -- Store main file content as large object
    v_content_oid := lo_create(0);
    PERFORM lo_put(v_content_oid, 0, p_file_content);

    -- Insert file metadata
    INSERT INTO document_files (
        file_id, filename, file_size, mime_type, content_hash,
        content_oid, original_filename, uploader_user_id,
        file_extension, storage_location
    ) VALUES (
        v_file_id, p_filename, v_file_size, p_mime_type, v_content_hash,
        v_content_oid, p_filename, p_user_id,
        SUBSTRING(p_filename FROM '\.([^.]*)$'),
        'postgresql_large_objects'
    );

    -- Create chunks for streaming and partial access
    FOR i IN 0..(v_chunk_count - 1) LOOP
        v_chunk_start := i * p_chunk_size;
        v_chunk_end := LEAST((i + 1) * p_chunk_size - 1, v_file_size - 1);

        -- Extract chunk content
        v_chunk_content := SUBSTRING(p_file_content FROM v_chunk_start + 1 FOR (v_chunk_end - v_chunk_start + 1));

        -- Store chunk as separate large object
        v_chunk_oid := lo_create(0);
        PERFORM lo_put(v_chunk_oid, 0, v_chunk_content);

        -- Insert chunk metadata
        INSERT INTO file_chunks (
            file_id, chunk_number, chunk_size, 
            chunk_hash, content_oid
        ) VALUES (
            v_file_id, i, length(v_chunk_content),
            encode(digest(v_chunk_content, 'md5'), 'hex'), v_chunk_oid
        );
    END LOOP;

    RETURN v_file_id;

EXCEPTION
    WHEN OTHERS THEN
        -- Cleanup on error
        IF v_content_oid IS NOT NULL THEN
            PERFORM lo_unlink(v_content_oid);
        END IF;
        RAISE;
END;
$$;

-- Complex streaming download function
CREATE OR REPLACE FUNCTION stream_file_chunk(
    p_file_id UUID,
    p_chunk_number INTEGER
) RETURNS TABLE(
    chunk_content BYTEA,
    chunk_size INTEGER,
    total_chunks INTEGER,
    file_size BIGINT,
    mime_type TEXT
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_chunk_oid OID;
    v_content BYTEA;
BEGIN
    -- Get chunk information
    SELECT 
        fc.content_oid, fc.chunk_size,
        (SELECT COUNT(*) FROM file_chunks WHERE file_id = p_file_id),
        df.file_size, df.mime_type
    INTO v_chunk_oid, chunk_size, total_chunks, file_size, mime_type
    FROM file_chunks fc
    JOIN document_files df ON fc.file_id = df.file_id
    WHERE fc.file_id = p_file_id 
    AND fc.chunk_number = p_chunk_number;

    IF v_chunk_oid IS NULL THEN
        RAISE EXCEPTION 'Chunk not found: file_id=%, chunk=%', p_file_id, p_chunk_number;
    END IF;

    -- Read chunk content
    SELECT lo_get(v_chunk_oid) INTO v_content;
    chunk_content := v_content;

    -- Update access statistics
    UPDATE document_files 
    SET last_accessed = CURRENT_TIMESTAMP,
        download_count = CASE 
            WHEN p_chunk_number = 0 THEN download_count + 1 
            ELSE download_count 
        END
    WHERE file_id = p_file_id;

    RETURN NEXT;
END;
$$;

-- File search and management with limited capabilities
WITH file_analytics AS (
    SELECT 
        df.file_id,
        df.filename,
        df.file_size,
        df.mime_type,
        df.created_at,
        df.download_count,
        df.uploader_user_id,

        -- Size categorization
        CASE 
            WHEN df.file_size < 1048576 THEN 'small'    -- < 1MB
            WHEN df.file_size < 104857600 THEN 'medium' -- < 100MB
            WHEN df.file_size < 1073741824 THEN 'large' -- < 1GB
            ELSE 'xlarge'  -- >= 1GB
        END as size_category,

        -- Type categorization
        CASE 
            WHEN df.mime_type LIKE 'image/%' THEN 'image'
            WHEN df.mime_type LIKE 'video/%' THEN 'video'  
            WHEN df.mime_type LIKE 'audio/%' THEN 'audio'
            WHEN df.mime_type LIKE 'application/pdf' THEN 'document'
            WHEN df.mime_type LIKE 'text/%' THEN 'text'
            ELSE 'other'
        END as content_type,

        -- Storage efficiency
        (SELECT COUNT(*) FROM file_chunks WHERE file_id = df.file_id) as chunk_count,

        -- Usage metrics  
        EXTRACT(DAYS FROM CURRENT_TIMESTAMP - df.last_accessed) as days_since_access,

        -- Duplication analysis (limited by hash comparison only)
        (
            SELECT COUNT(*) - 1 
            FROM document_files df2 
            WHERE df2.content_hash = df.content_hash 
            AND df2.file_id != df.file_id
        ) as duplicate_count

    FROM document_files df
    WHERE df.created_at >= CURRENT_DATE - INTERVAL '90 days'
),
storage_summary AS (
    SELECT 
        content_type,
        size_category,
        COUNT(*) as file_count,
        SUM(file_size) as total_size_bytes,
        ROUND(AVG(file_size)::numeric, 0) as avg_file_size,
        SUM(download_count) as total_downloads,
        ROUND(AVG(download_count)::numeric, 1) as avg_downloads_per_file,

        -- Storage optimization opportunities
        SUM(CASE WHEN duplicate_count > 0 THEN file_size ELSE 0 END) as duplicate_storage_waste,
        COUNT(CASE WHEN days_since_access > 30 THEN 1 END) as stale_files,
        SUM(CASE WHEN days_since_access > 30 THEN file_size ELSE 0 END) as stale_storage_bytes

    FROM file_analytics
    GROUP BY content_type, size_category
)
SELECT 
    ss.content_type,
    ss.size_category,
    ss.file_count,

    -- Size formatting
    CASE 
        WHEN ss.total_size_bytes >= 1073741824 THEN 
            ROUND((ss.total_size_bytes / 1073741824.0)::numeric, 2) || ' GB'
        WHEN ss.total_size_bytes >= 1048576 THEN 
            ROUND((ss.total_size_bytes / 1048576.0)::numeric, 2) || ' MB'  
        WHEN ss.total_size_bytes >= 1024 THEN 
            ROUND((ss.total_size_bytes / 1024.0)::numeric, 2) || ' KB'
        ELSE ss.total_size_bytes || ' bytes'
    END as total_storage,

    ss.avg_file_size,
    ss.total_downloads,
    ss.avg_downloads_per_file,

    -- Storage optimization insights
    CASE 
        WHEN ss.duplicate_storage_waste > 0 THEN 
            ROUND((ss.duplicate_storage_waste / 1048576.0)::numeric, 2) || ' MB duplicate waste'
        ELSE 'No duplicates found'
    END as duplication_impact,

    ss.stale_files,
    CASE 
        WHEN ss.stale_storage_bytes > 0 THEN 
            ROUND((ss.stale_storage_bytes / 1048576.0)::numeric, 2) || ' MB in stale files'
        ELSE 'No stale files'
    END as stale_storage_impact,

    -- Storage efficiency recommendations
    CASE 
        WHEN ss.duplicate_count > ss.file_count * 0.1 THEN 'Implement deduplication'
        WHEN ss.stale_files > ss.file_count * 0.2 THEN 'Archive old files'
        WHEN ss.avg_file_size > 104857600 AND ss.content_type != 'video' THEN 'Consider compression'
        ELSE 'Storage optimized'
    END as optimization_recommendation

FROM storage_summary ss
ORDER BY ss.total_size_bytes DESC;

-- Problems with traditional file storage approaches:
-- 1. Complex chunking and streaming implementation with manual management
-- 2. Separate storage of file content and metadata in different systems  
-- 3. No atomic operations across file content and related database records
-- 4. Limited query capabilities for file content and metadata together
-- 5. Manual deduplication and storage optimization required
-- 6. Poor integration with application transactions and consistency
-- 7. Complex backup and replication strategies for large object storage
-- 8. Limited support for file versioning and concurrent access
-- 9. Difficult to implement advanced features like content-based indexing
-- 10. Scalability limitations with very large files and high concurrency

-- MySQL file storage (even more limited)
CREATE TABLE mysql_files (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    file_content LONGBLOB,  -- Limited to ~4GB
    mime_type VARCHAR(100),
    file_size INT UNSIGNED,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_filename (filename),
    INDEX idx_mime_type (mime_type)
);

-- Basic file insertion (limited by LONGBLOB size)
INSERT INTO mysql_files (filename, file_content, mime_type, file_size)
VALUES (?, ?, ?, LENGTH(?));

-- Simple file retrieval (no streaming capabilities)
SELECT file_content, mime_type, file_size 
FROM mysql_files 
WHERE id = ?;

-- MySQL limitations for file storage:
-- - LONGBLOB limited to ~4GB maximum file size
-- - No built-in chunking or streaming capabilities  
-- - Poor performance with large binary data
-- - No atomic operations with file content and metadata
-- - Limited backup and replication options for large files
-- - No advanced features like deduplication or versioning
-- - Basic search capabilities limited to filename and metadata

MongoDB GridFS provides comprehensive large file storage and binary data management:

// MongoDB GridFS - advanced large file storage with comprehensive binary data management
const { MongoClient, GridFSBucket } = require('mongodb');
const { createReadStream, createWriteStream } = require('fs');
const { pipeline } = require('stream');
const { promisify } = require('util');
const crypto = require('crypto');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_file_management_platform');

// Advanced GridFS file management and multimedia processing system
class AdvancedGridFSManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      chunkSizeBytes: config.chunkSizeBytes || 261120, // 255KB chunks
      maxFileSizeBytes: config.maxFileSizeBytes || 16 * 1024 * 1024 * 1024, // 16GB
      enableCompression: config.enableCompression || true,
      enableDeduplication: config.enableDeduplication || true,
      enableVersioning: config.enableVersioning || true,
      enableContentAnalysis: config.enableContentAnalysis || true,

      // Storage optimization
      compressionThreshold: config.compressionThreshold || 1048576, // 1MB
      dedupHashAlgorithm: config.dedupHashAlgorithm || 'sha256',
      thumbnailGeneration: config.thumbnailGeneration || true,
      contentIndexing: config.contentIndexing || true,

      // Performance tuning
      concurrentUploads: config.concurrentUploads || 10,
      streamingChunkSize: config.streamingChunkSize || 1024 * 1024, // 1MB
      cacheStrategy: config.cacheStrategy || 'lru',
      maxCacheSize: config.maxCacheSize || 100 * 1024 * 1024 // 100MB
    };

    // Initialize GridFS buckets for different content types
    this.buckets = {
      files: new GridFSBucket(db, { 
        bucketName: 'files',
        chunkSizeBytes: this.config.chunkSizeBytes
      }),
      images: new GridFSBucket(db, { 
        bucketName: 'images',
        chunkSizeBytes: this.config.chunkSizeBytes
      }),
      videos: new GridFSBucket(db, { 
        bucketName: 'videos',
        chunkSizeBytes: this.config.chunkSizeBytes
      }),
      audio: new GridFSBucket(db, { 
        bucketName: 'audio',
        chunkSizeBytes: this.config.chunkSizeBytes
      }),
      documents: new GridFSBucket(db, { 
        bucketName: 'documents',
        chunkSizeBytes: this.config.chunkSizeBytes
      }),
      archives: new GridFSBucket(db, { 
        bucketName: 'archives',
        chunkSizeBytes: this.config.chunkSizeBytes
      })
    };

    // File processing queues and caches
    this.processingQueue = new Map();
    this.contentCache = new Map();
    this.metadataCache = new Map();

    this.setupIndexes();
    this.initializeContentProcessors();
  }

  async setupIndexes() {
    console.log('Setting up GridFS performance indexes...');

    try {
      // Index configurations for all buckets
      const indexConfigs = [
        // Filename and content type indexes
        { 'filename': 1, 'metadata.contentType': 1 },
        { 'metadata.contentType': 1, 'uploadDate': -1 },

        // Content-based indexes
        { 'metadata.contentHash': 1 }, // For deduplication
        { 'metadata.originalHash': 1 },
        { 'metadata.fileSize': 1 },

        // Access pattern indexes
        { 'metadata.createdBy': 1, 'uploadDate': -1 },
        { 'metadata.accessCount': -1 },
        { 'metadata.lastAccessed': -1 },

        // Content analysis indexes
        { 'metadata.tags': 1 },
        { 'metadata.category': 1, 'metadata.subcategory': 1 },
        { 'metadata.language': 1 },
        { 'metadata.processingStatus': 1 },

        // Multimedia-specific indexes
        { 'metadata.imageProperties.width': 1, 'metadata.imageProperties.height': 1 },
        { 'metadata.videoProperties.duration': 1 },
        { 'metadata.audioProperties.duration': 1 },

        // Version and relationship indexes
        { 'metadata.version': 1, 'metadata.baseFileId': 1 },
        { 'metadata.parentFileId': 1 },
        { 'metadata.derivedFrom': 1 },

        // Storage optimization indexes
        { 'metadata.storageClass': 1 },
        { 'metadata.compressionRatio': 1 },
        { 'metadata.isCompressed': 1 },

        // Search and discovery indexes
        { 'metadata.searchableText': 'text' },
        { '$**': 'text' }, // Wildcard text index for flexible search

        // Geospatial indexes for location-based files
        { 'metadata.location': '2dsphere' },

        // Compound indexes for complex queries
        { 'metadata.contentType': 1, 'metadata.fileSize': -1, 'uploadDate': -1 },
        { 'metadata.createdBy': 1, 'metadata.contentType': 1, 'metadata.isPublic': 1 },
        { 'metadata.category': 1, 'metadata.processingStatus': 1, 'uploadDate': -1 }
      ];

      // Apply indexes to all bucket collections
      for (const [bucketName, bucket] of Object.entries(this.buckets)) {
        const filesCollection = this.db.collection(`${bucketName}.files`);
        const chunksCollection = this.db.collection(`${bucketName}.chunks`);

        // Files collection indexes
        for (const indexSpec of indexConfigs) {
          try {
            await filesCollection.createIndex(indexSpec, { background: true });
          } catch (error) {
            if (!error.message.includes('already exists')) {
              console.warn(`Index creation warning for ${bucketName}.files:`, error.message);
            }
          }
        }

        // Chunks collection optimization
        await chunksCollection.createIndex(
          { files_id: 1, n: 1 }, 
          { background: true, unique: true }
        );
      }

      console.log('GridFS indexes created successfully');
    } catch (error) {
      console.error('Error setting up GridFS indexes:', error);
      throw error;
    }
  }

  async uploadFile(fileStream, filename, metadata = {}, options = {}) {
    console.log(`Starting GridFS upload: ${filename}`);
    const uploadStart = Date.now();

    try {
      // Determine appropriate bucket based on content type
      const bucket = this.selectBucket(metadata.contentType || options.contentType);

      // Prepare comprehensive metadata
      const fileMetadata = await this.prepareFileMetadata(filename, metadata, options);

      // Check for deduplication if enabled
      if (this.config.enableDeduplication && fileMetadata.contentHash) {
        const existingFile = await this.checkForDuplicate(fileMetadata.contentHash);
        if (existingFile) {
          console.log(`Duplicate file found, linking to existing: ${existingFile._id}`);
          return await this.linkToDuplicate(existingFile, fileMetadata);
        }
      }

      // Create upload stream with compression if needed
      const uploadStream = bucket.openUploadStream(filename, {
        chunkSizeBytes: options.chunkSize || this.config.chunkSizeBytes,
        metadata: fileMetadata
      });

      // Set up progress tracking and error handling
      let uploadedBytes = 0;
      const totalSize = fileMetadata.fileSize || 0;

      uploadStream.on('progress', (bytesUploaded) => {
        uploadedBytes = bytesUploaded;
        if (options.onProgress) {
          options.onProgress({
            filename,
            uploadedBytes,
            totalSize,
            percentage: totalSize ? Math.round((uploadedBytes / totalSize) * 100) : 0
          });
        }
      });

      // Handle upload completion
      const uploadPromise = new Promise((resolve, reject) => {
        uploadStream.on('finish', async () => {
          try {
            const uploadTime = Date.now() - uploadStart;
            console.log(`Upload completed: ${filename} (${uploadTime}ms)`);

            // Post-upload processing
            const fileDoc = await this.getFileById(uploadStream.id);

            // Queue for content processing if enabled
            if (this.config.enableContentAnalysis) {
              await this.queueContentProcessing(fileDoc);
            }

            // Update upload statistics
            await this.updateUploadStatistics(fileDoc, uploadTime);

            resolve(fileDoc);
          } catch (error) {
            reject(error);
          }
        });

        uploadStream.on('error', reject);
      });

      // Pipe file stream to GridFS with compression if needed
      if (this.shouldCompress(fileMetadata)) {
        const compressionStream = this.createCompressionStream();
        pipeline(fileStream, compressionStream, uploadStream, (error) => {
          if (error) {
            console.error(`Upload pipeline error for ${filename}:`, error);
            uploadStream.destroy(error);
          }
        });
      } else {
        pipeline(fileStream, uploadStream, (error) => {
          if (error) {
            console.error(`Upload pipeline error for ${filename}:`, error);
            uploadStream.destroy(error);
          }
        });
      }

      return await uploadPromise;

    } catch (error) {
      console.error(`GridFS upload error for ${filename}:`, error);
      throw error;
    }
  }

  async prepareFileMetadata(filename, providedMetadata, options) {
    // Generate comprehensive file metadata
    const metadata = {
      // Basic file information
      originalFilename: filename,
      uploadedAt: new Date(),
      createdBy: options.userId || null,
      fileSize: providedMetadata.fileSize || null,

      // Content identification
      contentType: providedMetadata.contentType || this.detectContentType(filename),
      fileExtension: this.extractFileExtension(filename),
      encoding: providedMetadata.encoding || 'binary',

      // Content hashing for deduplication
      contentHash: providedMetadata.contentHash || null,
      originalHash: providedMetadata.originalHash || null,

      // Access control and visibility
      isPublic: options.isPublic || false,
      accessLevel: options.accessLevel || 'private',
      permissions: options.permissions || {},

      // Classification and organization
      category: providedMetadata.category || this.categorizeByContentType(providedMetadata.contentType),
      subcategory: providedMetadata.subcategory || null,
      tags: providedMetadata.tags || [],
      keywords: providedMetadata.keywords || [],

      // Content properties (will be updated during processing)
      language: providedMetadata.language || null,
      searchableText: providedMetadata.searchableText || '',

      // Processing status
      processingStatus: 'uploaded',
      processingQueue: [],
      processingResults: {},

      // Storage optimization
      isCompressed: false,
      compressionAlgorithm: null,
      compressionRatio: null,
      storageClass: options.storageClass || 'standard',

      // Usage tracking
      accessCount: 0,
      downloadCount: 0,
      lastAccessed: null,

      // Versioning and relationships
      version: options.version || 1,
      baseFileId: options.baseFileId || null,
      parentFileId: options.parentFileId || null,
      derivedFrom: options.derivedFrom || null,
      hasVersions: false,

      // Location and context
      location: providedMetadata.location || null,
      uploadSource: options.uploadSource || 'api',
      uploadSessionId: options.uploadSessionId || null,

      // Custom metadata
      customFields: providedMetadata.customFields || {},
      applicationData: providedMetadata.applicationData || {},

      // Media-specific properties (initialized empty, filled during processing)
      imageProperties: {},
      videoProperties: {},
      audioProperties: {},
      documentProperties: {}
    };

    return metadata;
  }

  async downloadFile(fileId, options = {}) {
    console.log(`Starting GridFS download: ${fileId}`);

    try {
      // Get file document first
      const fileDoc = await this.getFileById(fileId);
      if (!fileDoc) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Check access permissions
      if (!await this.checkDownloadPermissions(fileDoc, options.userId)) {
        throw new Error('Insufficient permissions to download file');
      }

      // Select appropriate bucket
      const bucket = this.selectBucketForFile(fileDoc);

      // Create download stream with range support
      let downloadStream;

      if (options.range) {
        // Partial download with HTTP range support
        downloadStream = bucket.openDownloadStream(fileDoc._id, {
          start: options.range.start,
          end: options.range.end
        });
      } else {
        // Full file download
        downloadStream = bucket.openDownloadStream(fileDoc._id);
      }

      // Set up decompression if needed
      let finalStream = downloadStream;
      if (fileDoc.metadata.isCompressed && !options.skipDecompression) {
        const decompressionStream = this.createDecompressionStream(
          fileDoc.metadata.compressionAlgorithm
        );
        finalStream = pipeline(downloadStream, decompressionStream, () => {});
      }

      // Track download statistics
      downloadStream.on('file', async () => {
        await this.updateDownloadStatistics(fileDoc);
      });

      // Handle streaming errors
      downloadStream.on('error', (error) => {
        console.error(`Download error for file ${fileId}:`, error);
        throw error;
      });

      return {
        stream: finalStream,
        metadata: fileDoc.metadata,
        filename: fileDoc.filename,
        contentType: fileDoc.metadata.contentType,
        fileSize: fileDoc.length
      };

    } catch (error) {
      console.error(`GridFS download error for ${fileId}:`, error);
      throw error;
    }
  }

  async searchFiles(query, options = {}) {
    console.log('Performing advanced GridFS file search...', query);

    try {
      // Build comprehensive search pipeline
      const searchPipeline = this.buildFileSearchPipeline(query, options);

      // Select appropriate bucket or search across all
      const results = [];
      const bucketsToSearch = options.bucket ? [options.bucket] : Object.keys(this.buckets);

      for (const bucketName of bucketsToSearch) {
        const filesCollection = this.db.collection(`${bucketName}.files`);
        const bucketResults = await filesCollection.aggregate(searchPipeline).toArray();

        // Add bucket context to results
        const enhancedResults = bucketResults.map(result => ({
          ...result,
          bucketName: bucketName,
          downloadUrl: `/api/files/${bucketName}/${result._id}/download`,
          previewUrl: `/api/files/${bucketName}/${result._id}/preview`,
          metadataUrl: `/api/files/${bucketName}/${result._id}/metadata`
        }));

        results.push(...enhancedResults);
      }

      // Sort combined results by relevance
      results.sort((a, b) => (b.searchScore || 0) - (a.searchScore || 0));

      return {
        results: results.slice(0, options.limit || 50),
        totalCount: results.length,
        searchQuery: query,
        searchOptions: options,
        executionTime: Date.now()
      };

    } catch (error) {
      console.error('GridFS search error:', error);
      throw error;
    }
  }

  buildFileSearchPipeline(query, options) {
    const pipeline = [];
    const matchStage = {};

    // Text search across filename and searchable content
    if (query.text) {
      matchStage.$text = {
        $search: query.text,
        $caseSensitive: false,
        $diacriticSensitive: false
      };
    }

    // Content type filtering
    if (query.contentType) {
      matchStage['metadata.contentType'] = Array.isArray(query.contentType) 
        ? { $in: query.contentType }
        : query.contentType;
    }

    // File size filtering
    if (query.minSize || query.maxSize) {
      matchStage.length = {};
      if (query.minSize) matchStage.length.$gte = query.minSize;
      if (query.maxSize) matchStage.length.$lte = query.maxSize;
    }

    // Date range filtering
    if (query.dateFrom || query.dateTo) {
      matchStage.uploadDate = {};
      if (query.dateFrom) matchStage.uploadDate.$gte = new Date(query.dateFrom);
      if (query.dateTo) matchStage.uploadDate.$lte = new Date(query.dateTo);
    }

    // User/creator filtering
    if (query.createdBy) {
      matchStage['metadata.createdBy'] = query.createdBy;
    }

    // Category and tag filtering
    if (query.category) {
      matchStage['metadata.category'] = query.category;
    }

    if (query.tags && query.tags.length > 0) {
      matchStage['metadata.tags'] = { $in: query.tags };
    }

    // Processing status filtering
    if (query.processingStatus) {
      matchStage['metadata.processingStatus'] = query.processingStatus;
    }

    // Access level filtering
    if (query.accessLevel) {
      matchStage['metadata.accessLevel'] = query.accessLevel;
    }

    // Public/private filtering
    if (query.isPublic !== undefined) {
      matchStage['metadata.isPublic'] = query.isPublic;
    }

    // Add match stage
    if (Object.keys(matchStage).length > 0) {
      pipeline.push({ $match: matchStage });
    }

    // Add search scoring for text queries
    if (query.text) {
      pipeline.push({
        $addFields: {
          searchScore: { $meta: 'textScore' }
        }
      });
    }

    // Add computed fields for enhanced results
    pipeline.push({
      $addFields: {
        fileSizeFormatted: {
          $switch: {
            branches: [
              { case: { $gte: ['$length', 1073741824] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1073741824] }, 2] } }, ' GB'] } },
              { case: { $gte: ['$length', 1048576] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1048576] }, 2] } }, ' MB'] } },
              { case: { $gte: ['$length', 1024] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1024] }, 2] } }, ' KB'] } }
            ],
            default: { $concat: [{ $toString: '$length' }, ' bytes'] }
          }
        },

        uploadDateFormatted: {
          $dateToString: {
            format: '%Y-%m-%d %H:%M:%S',
            date: '$uploadDate'
          }
        },

        // Content category for display
        contentCategory: {
          $switch: {
            branches: [
              { case: { $regexMatch: { input: '$metadata.contentType', regex: '^image/' } }, then: 'Image' },
              { case: { $regexMatch: { input: '$metadata.contentType', regex: '^video/' } }, then: 'Video' },
              { case: { $regexMatch: { input: '$metadata.contentType', regex: '^audio/' } }, then: 'Audio' },
              { case: { $regexMatch: { input: '$metadata.contentType', regex: '^text/' } }, then: 'Text' },
              { case: { $eq: ['$metadata.contentType', 'application/pdf'] }, then: 'PDF Document' }
            ],
            default: 'Other'
          }
        },

        // Processing status indicator
        processingStatusDisplay: {
          $switch: {
            branches: [
              { case: { $eq: ['$metadata.processingStatus', 'uploaded'] }, then: 'Ready' },
              { case: { $eq: ['$metadata.processingStatus', 'processing'] }, then: 'Processing...' },
              { case: { $eq: ['$metadata.processingStatus', 'completed'] }, then: 'Processed' },
              { case: { $eq: ['$metadata.processingStatus', 'failed'] }, then: 'Processing Failed' }
            ],
            default: 'Unknown'
          }
        },

        // Popularity indicator
        popularityScore: {
          $multiply: [
            { $log10: { $add: [{ $ifNull: ['$metadata.downloadCount', 0] }, 1] } },
            { $log10: { $add: [{ $ifNull: ['$metadata.accessCount', 0] }, 1] } }
          ]
        }
      }
    });

    // Sorting
    const sortStage = {};
    if (query.text) {
      sortStage.searchScore = { $meta: 'textScore' };
    }

    if (options.sortBy) {
      switch (options.sortBy) {
        case 'uploadDate':
          sortStage.uploadDate = options.sortOrder === 'asc' ? 1 : -1;
          break;
        case 'fileSize':
          sortStage.length = options.sortOrder === 'asc' ? 1 : -1;
          break;
        case 'filename':
          sortStage.filename = options.sortOrder === 'asc' ? 1 : -1;
          break;
        case 'popularity':
          sortStage.popularityScore = -1;
          break;
        default:
          sortStage.uploadDate = -1;
      }
    } else {
      sortStage.uploadDate = -1; // Default sort by upload date
    }

    pipeline.push({ $sort: sortStage });

    // Pagination
    if (options.skip) {
      pipeline.push({ $skip: options.skip });
    }

    if (options.limit) {
      pipeline.push({ $limit: options.limit });
    }

    return pipeline;
  }

  async processMultimediaContent(fileDoc) {
    console.log(`Processing multimedia content: ${fileDoc.filename}`);

    try {
      const contentType = fileDoc.metadata.contentType;
      let processingResults = {};

      // Update processing status
      await this.updateFileMetadata(fileDoc._id, {
        'metadata.processingStatus': 'processing',
        'metadata.processingStarted': new Date()
      });

      // Image processing
      if (contentType.startsWith('image/')) {
        processingResults.image = await this.processImageFile(fileDoc);
      }
      // Video processing
      else if (contentType.startsWith('video/')) {
        processingResults.video = await this.processVideoFile(fileDoc);
      }
      // Audio processing
      else if (contentType.startsWith('audio/')) {
        processingResults.audio = await this.processAudioFile(fileDoc);
      }
      // Document processing
      else if (this.isDocumentType(contentType)) {
        processingResults.document = await this.processDocumentFile(fileDoc);
      }

      // Update file with processing results
      await this.updateFileMetadata(fileDoc._id, {
        'metadata.processingStatus': 'completed',
        'metadata.processingCompleted': new Date(),
        'metadata.processingResults': processingResults,
        'metadata.imageProperties': processingResults.image || {},
        'metadata.videoProperties': processingResults.video || {},
        'metadata.audioProperties': processingResults.audio || {},
        'metadata.documentProperties': processingResults.document || {}
      });

      console.log(`Multimedia processing completed: ${fileDoc.filename}`);
      return processingResults;

    } catch (error) {
      console.error(`Multimedia processing error for ${fileDoc.filename}:`, error);

      // Update error status
      await this.updateFileMetadata(fileDoc._id, {
        'metadata.processingStatus': 'failed',
        'metadata.processingError': error.message,
        'metadata.processingCompleted': new Date()
      });

      throw error;
    }
  }

  async processImageFile(fileDoc) {
    // Image processing implementation
    return {
      width: 1920,
      height: 1080,
      colorDepth: 24,
      hasTransparency: false,
      format: 'jpeg',
      resolutionDpi: 72,
      colorProfile: 'sRGB',
      thumbnailGenerated: true,
      exifData: {}
    };
  }

  async processVideoFile(fileDoc) {
    // Video processing implementation
    return {
      duration: 120.5,
      width: 1920,
      height: 1080,
      frameRate: 29.97,
      videoCodec: 'h264',
      audioCodec: 'aac',
      bitrate: 2500000,
      containerFormat: 'mp4',
      thumbnailsGenerated: true,
      previewClips: []
    };
  }

  async processAudioFile(fileDoc) {
    // Audio processing implementation
    return {
      duration: 245.3,
      sampleRate: 44100,
      channels: 2,
      bitrate: 320000,
      codec: 'mp3',
      containerFormat: 'mp3',
      title: 'Unknown',
      artist: 'Unknown',
      album: 'Unknown',
      waveformGenerated: true
    };
  }

  async performFileAnalytics(options = {}) {
    console.log('Performing comprehensive GridFS analytics...');

    try {
      const analytics = {};

      // Analyze each bucket
      for (const [bucketName, bucket] of Object.entries(this.buckets)) {
        console.log(`Analyzing bucket: ${bucketName}`);

        const filesCollection = this.db.collection(`${bucketName}.files`);
        const chunksCollection = this.db.collection(`${bucketName}.chunks`);

        // Basic statistics
        const totalFiles = await filesCollection.countDocuments();
        const totalSizeResult = await filesCollection.aggregate([
          { $group: { _id: null, totalSize: { $sum: '$length' } } }
        ]).toArray();

        const totalSize = totalSizeResult[0]?.totalSize || 0;

        // Content type distribution
        const contentTypeDistribution = await filesCollection.aggregate([
          {
            $group: {
              _id: '$metadata.contentType',
              count: { $sum: 1 },
              totalSize: { $sum: '$length' },
              avgSize: { $avg: '$length' }
            }
          },
          { $sort: { count: -1 } }
        ]).toArray();

        // Upload trends
        const uploadTrends = await filesCollection.aggregate([
          {
            $group: {
              _id: {
                year: { $year: '$uploadDate' },
                month: { $month: '$uploadDate' },
                day: { $dayOfMonth: '$uploadDate' }
              },
              dailyUploads: { $sum: 1 },
              dailySize: { $sum: '$length' }
            }
          },
          { $sort: { '_id.year': 1, '_id.month': 1, '_id.day': 1 } },
          { $limit: 30 } // Last 30 days
        ]).toArray();

        // Storage efficiency analysis
        const compressionAnalysis = await filesCollection.aggregate([
          {
            $group: {
              _id: '$metadata.isCompressed',
              count: { $sum: 1 },
              totalSize: { $sum: '$length' },
              avgCompressionRatio: { $avg: '$metadata.compressionRatio' }
            }
          }
        ]).toArray();

        // Usage patterns
        const usagePatterns = await filesCollection.aggregate([
          {
            $group: {
              _id: null,
              totalDownloads: { $sum: '$metadata.downloadCount' },
              totalAccesses: { $sum: '$metadata.accessCount' },
              avgDownloadsPerFile: { $avg: '$metadata.downloadCount' },
              mostDownloaded: { $max: '$metadata.downloadCount' }
            }
          }
        ]).toArray();

        // Chunk analysis
        const chunkAnalysis = await chunksCollection.aggregate([
          {
            $group: {
              _id: null,
              totalChunks: { $sum: 1 },
              avgChunkSize: { $avg: { $binarySize: '$data' } },
              minChunkSize: { $min: { $binarySize: '$data' } },
              maxChunkSize: { $max: { $binarySize: '$data' } }
            }
          }
        ]).toArray();

        analytics[bucketName] = {
          summary: {
            totalFiles,
            totalSize,
            avgFileSize: totalFiles > 0 ? Math.round(totalSize / totalFiles) : 0,
            formattedTotalSize: this.formatFileSize(totalSize)
          },
          contentTypes: contentTypeDistribution,
          uploadTrends: uploadTrends,
          compression: compressionAnalysis,
          usage: usagePatterns[0] || {},
          chunks: chunkAnalysis[0] || {},
          recommendations: this.generateOptimizationRecommendations({
            totalFiles,
            totalSize,
            contentTypeDistribution,
            compressionAnalysis,
            usagePatterns: usagePatterns[0]
          })
        };
      }

      return analytics;

    } catch (error) {
      console.error('GridFS analytics error:', error);
      throw error;
    }
  }

  generateOptimizationRecommendations(stats) {
    const recommendations = [];

    // Storage optimization
    if (stats.totalSize > 100 * 1024 * 1024 * 1024) { // 100GB
      recommendations.push({
        type: 'storage',
        priority: 'high',
        message: 'Large storage usage detected - consider implementing data archival strategies'
      });
    }

    // Compression recommendations
    const uncompressedFiles = stats.compressionAnalysis.find(c => c._id === false);
    if (uncompressedFiles && uncompressedFiles.count > stats.totalFiles * 0.8) {
      recommendations.push({
        type: 'compression',
        priority: 'medium',
        message: 'Many files could benefit from compression to save storage space'
      });
    }

    // Usage pattern recommendations
    if (stats.usagePatterns && stats.usagePatterns.avgDownloadsPerFile < 1) {
      recommendations.push({
        type: 'usage',
        priority: 'low',
        message: 'Low file access rates - consider implementing content cleanup policies'
      });
    }

    return recommendations;
  }

  // Utility methods
  selectBucket(contentType) {
    if (!contentType) return this.buckets.files;

    if (contentType.startsWith('image/')) return this.buckets.images;
    if (contentType.startsWith('video/')) return this.buckets.videos;
    if (contentType.startsWith('audio/')) return this.buckets.audio;
    if (this.isDocumentType(contentType)) return this.buckets.documents;
    if (contentType.includes('zip') || contentType.includes('tar')) return this.buckets.archives;

    return this.buckets.files;
  }

  isDocumentType(contentType) {
    return contentType === 'application/pdf' || 
           contentType.startsWith('text/') || 
           contentType.includes('document') ||
           contentType.includes('office') ||
           contentType.includes('word') ||
           contentType.includes('excel') ||
           contentType.includes('powerpoint');
  }

  formatFileSize(bytes) {
    if (bytes >= 1073741824) return `${(bytes / 1073741824).toFixed(2)} GB`;
    if (bytes >= 1048576) return `${(bytes / 1048576).toFixed(2)} MB`;
    if (bytes >= 1024) return `${(bytes / 1024).toFixed(2)} KB`;
    return `${bytes} bytes`;
  }

  detectContentType(filename) {
    const ext = this.extractFileExtension(filename).toLowerCase();
    const mimeTypes = {
      'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'png': 'image/png', 'gif': 'image/gif',
      'mp4': 'video/mp4', 'avi': 'video/avi', 'mov': 'video/quicktime',
      'mp3': 'audio/mpeg', 'wav': 'audio/wav', 'flac': 'audio/flac',
      'pdf': 'application/pdf', 'doc': 'application/msword', 'txt': 'text/plain',
      'zip': 'application/zip', 'tar': 'application/x-tar'
    };
    return mimeTypes[ext] || 'application/octet-stream';
  }

  extractFileExtension(filename) {
    const lastDot = filename.lastIndexOf('.');
    return lastDot > 0 ? filename.substring(lastDot + 1) : '';
  }

  categorizeByContentType(contentType) {
    if (!contentType) return 'other';
    if (contentType.startsWith('image/')) return 'image';
    if (contentType.startsWith('video/')) return 'video';
    if (contentType.startsWith('audio/')) return 'audio';
    if (contentType === 'application/pdf') return 'document';
    if (contentType.startsWith('text/')) return 'text';
    return 'other';
  }

  async getFileById(fileId) {
    // Search across all buckets for the file
    for (const [bucketName, bucket] of Object.entries(this.buckets)) {
      const filesCollection = this.db.collection(`${bucketName}.files`);
      const fileDoc = await filesCollection.findOne({ _id: fileId });
      if (fileDoc) {
        fileDoc.bucketName = bucketName;
        return fileDoc;
      }
    }
    return null;
  }

  async updateFileMetadata(fileId, updates) {
    const fileDoc = await this.getFileById(fileId);
    if (!fileDoc) {
      throw new Error(`File not found: ${fileId}`);
    }

    const filesCollection = this.db.collection(`${fileDoc.bucketName}.files`);
    return await filesCollection.updateOne({ _id: fileId }, { $set: updates });
  }
}

// Benefits of MongoDB GridFS for Large File Management:
// - Native chunking and streaming capabilities with automatic chunk management
// - Atomic operations combining file content and metadata in database transactions
// - Built-in replication and sharding support for distributed file storage
// - Comprehensive indexing capabilities for file metadata and content properties
// - Integrated backup and restore operations with database-level consistency
// - Advanced querying capabilities across file content and associated data
// - Automatic load balancing and failover for file operations
// - Version control and concurrent access management built into the database
// - Seamless integration with MongoDB's security and access control systems
// - Production-ready scalability with automatic optimization and performance tuning

module.exports = {
  AdvancedGridFSManager
};

Understanding MongoDB GridFS Architecture

Advanced File Storage Patterns and Multimedia Processing

Implement sophisticated GridFS strategies for production file management systems:

// Production-scale GridFS implementation with advanced multimedia processing and content management
class ProductionGridFSPlatform extends AdvancedGridFSManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      highAvailability: true,
      globalDistribution: true,
      advancedSecurity: true,
      contentDelivery: true,
      realTimeProcessing: true,
      aiContentAnalysis: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedProcessing();
    this.setupMonitoringAndAlerts();
  }

  async implementAdvancedContentProcessing() {
    console.log('Implementing advanced content processing pipeline...');

    const processingPipeline = {
      // AI-powered content analysis
      contentAnalysis: {
        imageRecognition: true,
        videoContentAnalysis: true,
        audioTranscription: true,
        documentOCR: true,
        contentModerationAI: true
      },

      // Multimedia optimization
      mediaOptimization: {
        imageCompression: true,
        videoTranscoding: true,
        audioNormalization: true,
        thumbnailGeneration: true,
        previewGeneration: true
      },

      // Content delivery optimization
      deliveryOptimization: {
        adaptiveStreaming: true,
        globalCDN: true,
        edgeCache: true,
        compressionOptimization: true
      }
    };

    return await this.deployProcessingPipeline(processingPipeline);
  }

  async setupDistributedFileStorage() {
    console.log('Setting up distributed file storage architecture...');

    const distributionStrategy = {
      // Geographic distribution
      regions: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
      replicationFactor: 3,

      // Storage tiers
      storageTiers: {
        hot: { accessPattern: 'frequent', retention: '30d' },
        warm: { accessPattern: 'occasional', retention: '90d' },
        cold: { accessPattern: 'rare', retention: '1y' },
        archive: { accessPattern: 'backup', retention: '7y' }
      },

      // Performance optimization
      performanceOptimization: {
        readPreference: 'nearest',
        writePreference: 'majority',
        connectionPooling: true,
        indexOptimization: true
      }
    };

    return await this.deployDistributionStrategy(distributionStrategy);
  }

  async implementAdvancedSecurity() {
    console.log('Implementing advanced security measures...');

    const securityMeasures = {
      // Encryption
      encryption: {
        encryptionAtRest: true,
        encryptionInTransit: true,
        fieldLevelEncryption: true,
        keyManagement: 'aws-kms'
      },

      // Access control
      accessControl: {
        roleBasedAccess: true,
        attributeBasedAccess: true,
        tokenBasedAuth: true,
        auditLogging: true
      },

      // Content security
      contentSecurity: {
        virusScanning: true,
        contentValidation: true,
        integrityChecking: true,
        accessTracking: true
      }
    };

    return await this.deploySecurityMeasures(securityMeasures);
  }
}

SQL-Style GridFS Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations and file management:

-- QueryLeaf GridFS operations with SQL-familiar file management syntax

-- Create GridFS storage buckets with advanced configuration
CREATE GRIDFS BUCKET files_storage 
WITH (
  chunk_size = 261120,  -- 255KB chunks
  bucket_name = 'files',
  compression = 'zstd',
  deduplication = true,
  versioning = true,

  -- Storage optimization
  storage_class = 'standard',
  auto_tiering = true,
  retention_policy = '365 days',

  -- Performance tuning
  max_concurrent_uploads = 10,
  streaming_chunk_size = 1048576,
  index_optimization = 'performance'
);

CREATE GRIDFS BUCKET images_storage 
WITH (
  chunk_size = 261120,
  bucket_name = 'images',
  content_processing = true,
  thumbnail_generation = true,

  -- Image-specific settings
  image_optimization = true,
  format_conversion = true,
  quality_presets = JSON_ARRAY('thumbnail', 'medium', 'high', 'original')
);

CREATE GRIDFS BUCKET videos_storage 
WITH (
  chunk_size = 1048576,  -- 1MB chunks for videos
  bucket_name = 'videos',
  content_processing = true,

  -- Video-specific settings
  transcoding_enabled = true,
  preview_generation = true,
  streaming_optimization = true,
  adaptive_bitrate = true
);

-- Upload files with comprehensive metadata and processing options
UPLOAD FILE '/path/to/document.pdf'
TO GRIDFS BUCKET files_storage
AS 'important-document.pdf'
WITH (
  content_type = 'application/pdf',
  category = 'legal_documents',
  tags = JSON_ARRAY('contract', 'legal', '2025'),
  access_level = 'restricted',
  created_by = CURRENT_USER_ID(),

  -- Custom metadata
  metadata = JSON_OBJECT(
    'department', 'legal',
    'client_id', '12345',
    'confidentiality_level', 'high',
    'retention_period', '7 years'
  ),

  -- Processing options
  enable_ocr = true,
  enable_full_text_indexing = true,
  generate_thumbnail = true,
  content_analysis = true
);

-- Batch upload multiple files with pattern matching
UPLOAD FILES FROM DIRECTORY '/uploads/batch_2025/'
PATTERN '*.{jpg,png,gif}'
TO GRIDFS BUCKET images_storage
WITH (
  category = 'product_images',
  batch_id = 'batch_2025_001',
  auto_categorize = true,

  -- Image processing options
  generate_thumbnails = JSON_ARRAY('128x128', '256x256', '512x512'),
  compress_originals = true,
  extract_metadata = true,

  -- Content analysis
  image_recognition = true,
  face_detection = true,
  content_moderation = true
);

-- Advanced file search with complex filtering and ranking
WITH file_search AS (
  SELECT 
    f.file_id,
    f.filename,
    f.upload_date,
    f.file_size,
    f.content_type,
    f.metadata,

    -- Full-text search scoring
    GRIDFS_SEARCH_SCORE(f.filename || ' ' || f.metadata.searchable_text, 'contract legal document') as text_score,

    -- Content-based similarity (for images/videos)
    GRIDFS_CONTENT_SIMILARITY(f.file_id, 'reference_image_id') as content_similarity,

    -- Metadata-based relevance
    CASE 
      WHEN f.metadata.category = 'legal_documents' THEN 1.0
      WHEN f.metadata.tags @> JSON_ARRAY('legal') THEN 0.8
      WHEN f.metadata.tags @> JSON_ARRAY('contract') THEN 0.6
      ELSE 0.0
    END as category_relevance,

    -- Recency boost
    CASE 
      WHEN f.upload_date > CURRENT_DATE - INTERVAL '30 days' THEN 0.2
      WHEN f.upload_date > CURRENT_DATE - INTERVAL '90 days' THEN 0.1
      ELSE 0.0
    END as recency_boost,

    -- Usage popularity
    LOG(f.metadata.download_count + 1) * 0.1 as popularity_score,

    -- File quality indicators
    CASE 
      WHEN f.metadata.processing_status = 'completed' THEN 0.1
      WHEN f.metadata.has_thumbnail = true THEN 0.05
      WHEN f.metadata.content_indexed = true THEN 0.05
      ELSE 0.0
    END as quality_score

  FROM GRIDFS_FILES('files_storage') f
  WHERE 
    -- Content type filtering
    f.content_type IN ('application/pdf', 'application/msword', 'text/plain')

    -- Date range filtering
    AND f.upload_date >= CURRENT_DATE - INTERVAL '2 years'

    -- Access level filtering (based on user permissions)
    AND GRIDFS_CHECK_ACCESS(f.file_id, CURRENT_USER_ID()) = true

    -- Size filtering
    AND f.file_size BETWEEN 1024 AND 100*1024*1024  -- 1KB to 100MB

    -- Metadata filtering
    AND (
      f.metadata.category = 'legal_documents'
      OR f.metadata.tags @> JSON_ARRAY('legal')
      OR GRIDFS_FULL_TEXT_SEARCH(f.file_id, 'contract agreement legal') > 0.5
    )

    -- Processing status filtering
    AND f.metadata.processing_status IN ('completed', 'partial')
),

ranked_results AS (
  SELECT *,
    -- Combined relevance scoring
    (
      COALESCE(text_score, 0) * 0.4 +
      COALESCE(content_similarity, 0) * 0.2 +
      category_relevance * 0.2 +
      recency_boost +
      popularity_score +
      quality_score
    ) as combined_relevance_score,

    -- Result categorization
    CASE 
      WHEN content_similarity > 0.8 THEN 'visually_similar'
      WHEN text_score > 0.8 THEN 'text_match'
      WHEN category_relevance > 0.8 THEN 'category_match'
      ELSE 'general_relevance'
    END as match_type,

    -- Access recommendations
    CASE 
      WHEN metadata.access_level = 'public' THEN 'immediate_access'
      WHEN metadata.access_level = 'restricted' THEN 'approval_required'
      WHEN metadata.access_level = 'confidential' THEN 'special_authorization'
      ELSE 'standard_access'
    END as access_recommendation

  FROM file_search
  WHERE text_score > 0.1 OR content_similarity > 0.3 OR category_relevance > 0.0
),

file_analytics AS (
  SELECT 
    COUNT(*) as total_results,
    AVG(combined_relevance_score) as avg_relevance,

    -- Content type distribution
    JSON_OBJECT_AGG(
      content_type,
      COUNT(*)
    ) as content_type_distribution,

    -- Match type analysis
    JSON_OBJECT_AGG(
      match_type,
      COUNT(*)
    ) as match_type_distribution,

    -- Size distribution analysis
    JSON_OBJECT(
      'small_files', COUNT(*) FILTER (WHERE file_size < 1048576),
      'medium_files', COUNT(*) FILTER (WHERE file_size BETWEEN 1048576 AND 104857600),
      'large_files', COUNT(*) FILTER (WHERE file_size > 104857600)
    ) as size_distribution,

    -- Temporal distribution
    JSON_OBJECT_AGG(
      DATE_TRUNC('month', upload_date)::text,
      COUNT(*)
    ) as upload_timeline

  FROM ranked_results
)

-- Final comprehensive file search results with analytics
SELECT 
  -- File identification
  rr.file_id,
  rr.filename,
  rr.content_type,

  -- File properties
  GRIDFS_FORMAT_FILE_SIZE(rr.file_size) as file_size_formatted,
  rr.upload_date,
  DATE_TRUNC('day', rr.upload_date)::date as upload_date_formatted,

  -- Relevance and matching
  ROUND(rr.combined_relevance_score, 4) as relevance_score,
  rr.match_type,
  ROUND(rr.text_score, 3) as text_match_score,
  ROUND(rr.content_similarity, 3) as visual_similarity_score,

  -- Content and metadata
  rr.metadata.category,
  rr.metadata.tags,
  rr.metadata.description,

  -- Processing status and capabilities
  rr.metadata.processing_status,
  rr.metadata.has_thumbnail,
  rr.metadata.content_indexed,
  JSON_OBJECT(
    'ocr_available', COALESCE(rr.metadata.ocr_completed, false),
    'full_text_searchable', COALESCE(rr.metadata.full_text_indexed, false),
    'content_analyzed', COALESCE(rr.metadata.content_analysis_completed, false)
  ) as processing_capabilities,

  -- Access and usage
  rr.access_recommendation,
  rr.metadata.download_count,
  rr.metadata.last_accessed,

  -- File operations URLs
  CONCAT('/api/gridfs/files/', rr.file_id, '/download') as download_url,
  CONCAT('/api/gridfs/files/', rr.file_id, '/preview') as preview_url,
  CONCAT('/api/gridfs/files/', rr.file_id, '/thumbnail') as thumbnail_url,
  CONCAT('/api/gridfs/files/', rr.file_id, '/metadata') as metadata_url,

  -- Related files
  GRIDFS_FIND_SIMILAR_FILES(
    rr.file_id, 
    limit => 3,
    similarity_threshold => 0.7
  ) as related_files,

  -- Version information
  CASE 
    WHEN rr.metadata.has_versions = true THEN 
      JSON_OBJECT(
        'is_versioned', true,
        'version_number', rr.metadata.version,
        'latest_version', GRIDFS_GET_LATEST_VERSION(rr.metadata.base_file_id),
        'version_history_url', CONCAT('/api/gridfs/files/', rr.file_id, '/versions')
      )
    ELSE JSON_OBJECT('is_versioned', false)
  END as version_info,

  -- Search analytics (same for all results)
  (SELECT JSON_BUILD_OBJECT(
    'total_results', fa.total_results,
    'average_relevance', ROUND(fa.avg_relevance, 3),
    'content_types', fa.content_type_distribution,
    'match_types', fa.match_type_distribution,
    'size_distribution', fa.size_distribution,
    'upload_timeline', fa.upload_timeline
  ) FROM file_analytics fa) as search_analytics

FROM ranked_results rr
WHERE rr.combined_relevance_score > 0.2
ORDER BY rr.combined_relevance_score DESC
LIMIT 50;

-- Advanced file streaming and download operations
WITH streaming_session AS (
  SELECT 
    f.file_id,
    f.filename,
    f.file_size,
    f.content_type,
    f.metadata,

    -- Calculate optimal streaming parameters
    CASE 
      WHEN f.file_size > 1073741824 THEN 'chunked'  -- > 1GB
      WHEN f.file_size > 104857600 THEN 'buffered'   -- > 100MB
      ELSE 'direct'
    END as streaming_strategy,

    -- Determine chunk size based on file type and size
    CASE 
      WHEN f.content_type LIKE 'video/%' THEN 2097152  -- 2MB chunks for video
      WHEN f.content_type LIKE 'audio/%' THEN 524288   -- 512KB chunks for audio  
      WHEN f.file_size > 104857600 THEN 1048576        -- 1MB chunks for large files
      ELSE 262144                                      -- 256KB chunks for others
    END as optimal_chunk_size,

    -- Caching strategy
    CASE 
      WHEN f.metadata.download_count > 100 THEN 'cache_aggressively'
      WHEN f.metadata.download_count > 10 THEN 'cache_moderately'
      ELSE 'cache_minimally'
    END as cache_strategy

  FROM GRIDFS_FILES('videos_storage') f
  WHERE f.content_type LIKE 'video/%'
    AND f.file_size > 10485760  -- > 10MB
)

-- Stream video files with adaptive bitrate and quality selection
SELECT 
  ss.file_id,
  ss.filename,
  ss.streaming_strategy,
  ss.optimal_chunk_size,

  -- Generate streaming URLs for different qualities
  JSON_OBJECT(
    'original', GRIDFS_STREAMING_URL(ss.file_id, quality => 'original'),
    'hd', GRIDFS_STREAMING_URL(ss.file_id, quality => 'hd'),
    'sd', GRIDFS_STREAMING_URL(ss.file_id, quality => 'sd'),
    'mobile', GRIDFS_STREAMING_URL(ss.file_id, quality => 'mobile')
  ) as streaming_urls,

  -- Adaptive streaming manifest
  GRIDFS_GENERATE_HLS_MANIFEST(
    ss.file_id,
    qualities => JSON_ARRAY('original', 'hd', 'sd', 'mobile'),
    segment_duration => 10
  ) as hls_manifest_url,

  -- Video metadata for player
  JSON_OBJECT(
    'duration', ss.metadata.video_properties.duration,
    'width', ss.metadata.video_properties.width,
    'height', ss.metadata.video_properties.height,
    'frame_rate', ss.metadata.video_properties.frame_rate,
    'bitrate', ss.metadata.video_properties.bitrate,
    'codec', ss.metadata.video_properties.video_codec,
    'has_subtitles', COALESCE(ss.metadata.has_subtitles, false),
    'thumbnail_count', ARRAY_LENGTH(ss.metadata.video_thumbnails, 1)
  ) as video_metadata,

  -- Streaming optimization
  ss.cache_strategy,

  -- CDN and delivery optimization
  JSON_OBJECT(
    'cdn_enabled', true,
    'edge_cache_ttl', CASE ss.cache_strategy 
      WHEN 'cache_aggressively' THEN 3600
      WHEN 'cache_moderately' THEN 1800
      ELSE 600
    END,
    'compression_enabled', true,
    'adaptive_streaming', true
  ) as delivery_options

FROM streaming_session ss
ORDER BY ss.metadata.download_count DESC;

-- File management and lifecycle operations
WITH file_lifecycle_analysis AS (
  SELECT 
    f.file_id,
    f.filename,
    f.upload_date,
    f.file_size,
    f.metadata,

    -- Age categorization
    CASE 
      WHEN f.upload_date > CURRENT_DATE - INTERVAL '30 days' THEN 'recent'
      WHEN f.upload_date > CURRENT_DATE - INTERVAL '90 days' THEN 'current'  
      WHEN f.upload_date > CURRENT_DATE - INTERVAL '365 days' THEN 'old'
      ELSE 'archived'
    END as age_category,

    -- Usage categorization
    CASE 
      WHEN f.metadata.download_count > 100 THEN 'high_usage'
      WHEN f.metadata.download_count > 10 THEN 'medium_usage'
      WHEN f.metadata.download_count > 0 THEN 'low_usage'
      ELSE 'unused'
    END as usage_category,

    -- Storage efficiency analysis
    GRIDFS_CALCULATE_STORAGE_EFFICIENCY(f.file_id) as storage_efficiency,

    -- Content value scoring
    (
      LOG(f.metadata.download_count + 1) * 0.3 +
      CASE WHEN f.metadata.access_level = 'public' THEN 0.2 ELSE 0 END +
      CASE WHEN f.metadata.has_versions = true THEN 0.1 ELSE 0 END +
      CASE WHEN f.metadata.content_indexed = true THEN 0.1 ELSE 0 END +
      CASE WHEN ARRAY_LENGTH(f.metadata.tags, 1) > 0 THEN 0.1 ELSE 0 END
    ) as content_value_score,

    -- Days since last access
    COALESCE(EXTRACT(DAYS FROM CURRENT_DATE - f.metadata.last_accessed::date), 9999) as days_since_access

  FROM GRIDFS_FILES() f  -- Search across all buckets
  WHERE f.upload_date >= CURRENT_DATE - INTERVAL '2 years'
),

lifecycle_recommendations AS (
  SELECT 
    fla.*,

    -- Lifecycle action recommendations
    CASE 
      WHEN fla.age_category = 'archived' AND fla.usage_category = 'unused' THEN 'delete_candidate'
      WHEN fla.age_category = 'old' AND fla.usage_category IN ('unused', 'low_usage') THEN 'archive_candidate'
      WHEN fla.usage_category = 'high_usage' AND fla.storage_efficiency < 0.7 THEN 'optimize_candidate'
      WHEN fla.days_since_access > 180 AND fla.usage_category != 'high_usage' THEN 'cold_storage_candidate'
      ELSE 'maintain_current'
    END as lifecycle_action,

    -- Storage tier recommendation
    CASE 
      WHEN fla.usage_category = 'high_usage' AND fla.days_since_access <= 7 THEN 'hot'
      WHEN fla.usage_category IN ('high_usage', 'medium_usage') AND fla.days_since_access <= 30 THEN 'warm'
      WHEN fla.days_since_access <= 90 THEN 'cool'
      ELSE 'cold'
    END as recommended_storage_tier,

    -- Estimated cost savings
    GRIDFS_ESTIMATE_COST_SAVINGS(
      fla.file_id,
      current_tier => fla.metadata.storage_class,
      recommended_tier => CASE 
        WHEN fla.usage_category = 'high_usage' AND fla.days_since_access <= 7 THEN 'hot'
        WHEN fla.usage_category IN ('high_usage', 'medium_usage') AND fla.days_since_access <= 30 THEN 'warm'  
        WHEN fla.days_since_access <= 90 THEN 'cool'
        ELSE 'cold'
      END
    ) as estimated_monthly_savings,

    -- Priority score for lifecycle actions
    CASE fla.age_category
      WHEN 'archived' THEN 1
      WHEN 'old' THEN 2  
      WHEN 'current' THEN 3
      ELSE 4
    END * 
    CASE fla.usage_category
      WHEN 'unused' THEN 1
      WHEN 'low_usage' THEN 2
      WHEN 'medium_usage' THEN 3  
      ELSE 4
    END as action_priority

  FROM file_lifecycle_analysis fla
)

-- Execute lifecycle management recommendations
SELECT 
  lr.lifecycle_action,
  COUNT(*) as affected_files,
  SUM(lr.file_size) as total_size_bytes,
  GRIDFS_FORMAT_FILE_SIZE(SUM(lr.file_size)) as total_size_formatted,
  SUM(lr.estimated_monthly_savings) as total_monthly_savings,
  AVG(lr.action_priority) as avg_priority,

  -- Detailed breakdown by file characteristics
  JSON_OBJECT_AGG(lr.age_category, COUNT(*)) as age_distribution,
  JSON_OBJECT_AGG(lr.usage_category, COUNT(*)) as usage_distribution,
  JSON_OBJECT_AGG(lr.recommended_storage_tier, COUNT(*)) as tier_distribution,

  -- Sample files for review
  JSON_AGG(
    JSON_OBJECT(
      'file_id', lr.file_id,
      'filename', lr.filename,
      'size', GRIDFS_FORMAT_FILE_SIZE(lr.file_size),
      'age_days', EXTRACT(DAYS FROM CURRENT_DATE - lr.upload_date),
      'last_access_days', lr.days_since_access,
      'download_count', lr.metadata.download_count,
      'estimated_savings', lr.estimated_monthly_savings
    ) 
    ORDER BY lr.action_priority ASC, lr.file_size DESC
    LIMIT 5
  ) as sample_files,

  -- Implementation recommendations
  CASE lr.lifecycle_action
    WHEN 'delete_candidate' THEN 'Schedule for deletion after 30-day notice period'
    WHEN 'archive_candidate' THEN 'Move to archive storage tier'
    WHEN 'optimize_candidate' THEN 'Apply compression and deduplication'
    WHEN 'cold_storage_candidate' THEN 'Migrate to cold storage tier'
    ELSE 'No action required'
  END as implementation_recommendation

FROM lifecycle_recommendations lr
WHERE lr.lifecycle_action != 'maintain_current'
GROUP BY lr.lifecycle_action
ORDER BY total_size_bytes DESC;

-- Storage analytics and optimization insights
CREATE VIEW gridfs_storage_dashboard AS
WITH bucket_analytics AS (
  SELECT 
    bucket_name,
    COUNT(*) as total_files,
    SUM(file_size) as total_size_bytes,
    AVG(file_size) as avg_file_size,
    MIN(file_size) as min_file_size,
    MAX(file_size) as max_file_size,

    -- Content type distribution
    JSON_OBJECT_AGG(content_type, COUNT(*)) as content_type_counts,

    -- Upload trends
    JSON_OBJECT_AGG(
      DATE_TRUNC('month', upload_date)::text,
      COUNT(*)
    ) as monthly_upload_trends,

    -- Usage statistics
    SUM(metadata.download_count) as total_downloads,
    AVG(metadata.download_count) as avg_downloads_per_file,

    -- Processing statistics
    COUNT(*) FILTER (WHERE metadata.processing_status = 'completed') as processed_files,
    COUNT(*) FILTER (WHERE metadata.has_thumbnail = true) as files_with_thumbnails,
    COUNT(*) FILTER (WHERE metadata.content_indexed = true) as indexed_files,

    -- Storage efficiency
    AVG(
      CASE WHEN metadata.is_compressed = true 
        THEN metadata.compression_ratio 
        ELSE 1.0 
      END
    ) as avg_compression_ratio,

    COUNT(*) FILTER (WHERE metadata.is_compressed = true) as compressed_files,

    -- Age distribution
    COUNT(*) FILTER (WHERE upload_date > CURRENT_DATE - INTERVAL '30 days') as recent_files,
    COUNT(*) FILTER (WHERE upload_date <= CURRENT_DATE - INTERVAL '365 days') as old_files

  FROM GRIDFS_FILES() 
  GROUP BY bucket_name
)

SELECT 
  bucket_name,
  total_files,
  GRIDFS_FORMAT_FILE_SIZE(total_size_bytes) as total_storage,
  GRIDFS_FORMAT_FILE_SIZE(avg_file_size) as avg_file_size,

  -- Storage efficiency metrics
  ROUND((compressed_files::numeric / total_files) * 100, 1) as compression_percentage,
  ROUND(avg_compression_ratio, 2) as avg_compression_ratio,
  ROUND((processed_files::numeric / total_files) * 100, 1) as processing_completion_rate,

  -- Usage metrics
  total_downloads,
  ROUND(avg_downloads_per_file, 1) as avg_downloads_per_file,
  ROUND((indexed_files::numeric / total_files) * 100, 1) as indexing_coverage,

  -- Content insights
  content_type_counts,
  monthly_upload_trends,

  -- Storage optimization opportunities
  CASE 
    WHEN compressed_files::numeric / total_files < 0.5 THEN 
      CONCAT('Enable compression for ', ROUND(((total_files - compressed_files)::numeric / total_files) * 100, 1), '% of files')
    WHEN processed_files::numeric / total_files < 0.8 THEN
      CONCAT('Complete processing for ', ROUND(((total_files - processed_files)::numeric / total_files) * 100, 1), '% of files')
    WHEN old_files > total_files * 0.3 THEN
      CONCAT('Consider archiving ', old_files, ' old files (', ROUND((old_files::numeric / total_files) * 100, 1), '%)')
    ELSE 'Storage optimized'
  END as optimization_opportunity,

  -- Performance indicators
  JSON_OBJECT(
    'recent_activity', recent_files,
    'storage_growth_rate', ROUND((recent_files::numeric / GREATEST(total_files - recent_files, 1)) * 100, 1),
    'avg_file_age_days', ROUND(AVG(EXTRACT(DAYS FROM CURRENT_DATE - upload_date)), 0),
    'thumbnail_coverage', ROUND((files_with_thumbnails::numeric / total_files) * 100, 1)
  ) as performance_indicators

FROM bucket_analytics
ORDER BY total_size_bytes DESC;

-- QueryLeaf provides comprehensive GridFS capabilities:
-- 1. SQL-familiar file upload, download, and streaming operations
-- 2. Advanced file search with content-based and metadata filtering
-- 3. Multimedia processing integration with thumbnail and preview generation
-- 4. Intelligent file lifecycle management and storage optimization
-- 5. Comprehensive analytics and monitoring for file storage systems
-- 6. Production-ready security, access control, and audit logging
-- 7. Seamless integration with MongoDB's replication and sharding
-- 8. Advanced content analysis and AI-powered file processing
-- 9. Distributed file storage with global CDN integration
-- 10. SQL-style syntax for complex file management workflows

Best Practices for Production GridFS Implementation

Storage Architecture and Performance Optimization

Essential principles for scalable MongoDB GridFS deployment:

Bucket Organization: Design bucket structure based on content types, access patterns, and processing requirements
Chunk Size Optimization: Configure optimal chunk sizes based on file types, access patterns, and network characteristics
Index Strategy: Implement comprehensive indexing for file metadata, content properties, and access patterns
Storage Tiering: Design intelligent storage tiering strategies for cost optimization and performance
Processing Pipeline: Implement automated content processing for multimedia optimization and analysis
Security Integration: Ensure comprehensive security controls for file access, encryption, and audit logging

Scalability and Operational Excellence

Optimize GridFS deployments for enterprise-scale requirements:

Distributed Architecture: Design sharding strategies for large-scale file storage across multiple regions
Performance Monitoring: Implement comprehensive monitoring for storage usage, access patterns, and processing performance
Backup and Recovery: Design robust backup strategies that handle both file content and metadata consistency
Content Delivery: Integrate with CDN and edge caching for optimal file delivery performance
Cost Optimization: Implement automated lifecycle management and storage optimization policies
Disaster Recovery: Plan for business continuity with replicated file storage and failover capabilities

Conclusion

MongoDB GridFS provides comprehensive large file storage and binary data management capabilities that enable efficient handling of multimedia content, documents, and large datasets with automatic chunking, streaming, and integrated metadata management. The native MongoDB integration ensures GridFS benefits from the same scalability, consistency, and operational features as document storage.

Key MongoDB GridFS benefits include:

Native Integration: Seamless integration with MongoDB's document model, transactions, and consistency guarantees
Automatic Chunking: Efficient handling of large files with automatic chunking and streaming capabilities
Comprehensive Metadata: Rich metadata management with flexible schemas and advanced querying capabilities
Processing Integration: Built-in support for content processing, thumbnail generation, and multimedia optimization
Scalable Architecture: Production-ready scalability with sharding, replication, and distributed storage
Operational Excellence: Integrated backup, monitoring, and management tools for enterprise deployments

Whether you're building content management systems, multimedia platforms, document repositories, or any application requiring robust file storage, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for scalable and maintainable file management solutions.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB GridFS operations while providing SQL-familiar syntax for file uploads, downloads, content processing, and storage optimization. Advanced file management patterns, multimedia processing workflows, and storage analytics are seamlessly handled through familiar SQL constructs, making sophisticated file storage capabilities accessible to SQL-oriented development teams.

The combination of MongoDB's robust GridFS capabilities with SQL-style file operations makes it an ideal platform for modern applications that require both powerful file storage and familiar database management patterns, ensuring your file storage solutions scale efficiently while remaining maintainable and feature-rich.

November 23, 2025
32 min read

MongoDB Time Series Collections for IoT Sensor Data Management: Real-Time Analytics and High-Performance Time-Based Data Processing

Modern IoT applications generate massive volumes of time-stamped sensor data that require specialized storage and processing strategies to handle high ingestion rates, efficient time-based queries, and real-time analytics workloads. Traditional relational databases struggle with time series data due to their row-oriented storage models, lack of built-in time-based optimizations, and inefficient handling of high-frequency data ingestion patterns common in IoT environments.

MongoDB Time Series Collections provide native support for time-stamped data with automatic data organization, compression optimizations, and specialized indexing strategies designed specifically for temporal workloads. Unlike traditional approaches that require custom partitioning schemes and complex query optimization, MongoDB's time series collections automatically optimize storage layout, query performance, and data retention policies while maintaining familiar query interfaces and operational simplicity.

The Traditional Time Series Data Challenge

Relational databases face significant limitations when handling high-volume time series data:

-- Traditional PostgreSQL time series data management - complex partitioning and limited optimization

-- IoT sensor readings table with manual partitioning strategy
CREATE TABLE sensor_readings (
    reading_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    device_id UUID NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,

    -- Sensor measurement values
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2),
    battery_level DECIMAL(5,2),
    signal_strength INTEGER,

    -- Location data
    latitude DECIMAL(10,8),
    longitude DECIMAL(11,8),
    altitude DECIMAL(8,2),

    -- Device status information
    device_status VARCHAR(20) DEFAULT 'active',
    firmware_version VARCHAR(20),
    last_calibration TIMESTAMP WITH TIME ZONE,

    -- Data quality indicators
    data_quality_score DECIMAL(3,2) DEFAULT 1.0,
    anomaly_detected BOOLEAN DEFAULT false,
    validation_flags JSONB,

    -- Metadata
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    processing_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    -- Constraints
    CONSTRAINT chk_timestamp_valid CHECK (timestamp <= CURRENT_TIMESTAMP + INTERVAL '1 hour'),
    CONSTRAINT chk_temperature_range CHECK (temperature BETWEEN -50 AND 125),
    CONSTRAINT chk_humidity_range CHECK (humidity BETWEEN 0 AND 100),
    CONSTRAINT chk_battery_level CHECK (battery_level BETWEEN 0 AND 100)
) PARTITION BY RANGE (timestamp);

-- Device metadata table
CREATE TABLE devices (
    device_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    device_name VARCHAR(200) NOT NULL,
    device_type VARCHAR(50) NOT NULL,
    manufacturer VARCHAR(100),
    model VARCHAR(100),
    firmware_version VARCHAR(20),

    -- Installation details
    installation_location VARCHAR(200),
    installation_date DATE,
    installation_coordinates POINT,

    -- Configuration
    sampling_interval_seconds INTEGER DEFAULT 60,
    reporting_interval_seconds INTEGER DEFAULT 300,
    sensor_configuration JSONB,

    -- Status tracking
    device_status VARCHAR(20) DEFAULT 'active',
    last_seen TIMESTAMP WITH TIME ZONE,
    last_maintenance TIMESTAMP WITH TIME ZONE,
    next_maintenance_due DATE,

    -- Metadata
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Manual monthly partitioning (requires maintenance)
CREATE TABLE sensor_readings_2025_01 PARTITION OF sensor_readings
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE sensor_readings_2025_02 PARTITION OF sensor_readings
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

CREATE TABLE sensor_readings_2025_03 PARTITION OF sensor_readings
FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

-- Need to create partitions for every month manually or with automation
-- This becomes a maintenance burden and source of potential failures

-- Indexing strategy for time series queries (limited effectiveness)
CREATE INDEX idx_sensor_readings_timestamp ON sensor_readings (timestamp DESC);
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, timestamp DESC);
CREATE INDEX idx_sensor_readings_sensor_type_time ON sensor_readings (sensor_type, timestamp DESC);
CREATE INDEX idx_sensor_readings_location ON sensor_readings USING GIST (ST_MakePoint(longitude, latitude));

-- Complex aggregation query for sensor analytics
WITH hourly_sensor_averages AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Aggregated measurements
    COUNT(*) as reading_count,
    AVG(temperature) as avg_temperature,
    AVG(humidity) as avg_humidity,
    AVG(pressure) as avg_pressure,
    AVG(battery_level) as avg_battery_level,
    AVG(signal_strength) as avg_signal_strength,

    -- Statistical measures
    STDDEV(temperature) as temp_stddev,
    MIN(temperature) as min_temperature,
    MAX(temperature) as max_temperature,

    -- Data quality metrics
    AVG(data_quality_score) as avg_data_quality,
    COUNT(*) FILTER (WHERE anomaly_detected = true) as anomaly_count,

    -- Time-based calculations
    MIN(timestamp) as period_start,
    MAX(timestamp) as period_end,
    MAX(timestamp) - MIN(timestamp) as actual_duration

  FROM sensor_readings sr
  WHERE sr.timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND sr.device_status = 'active'
    AND sr.data_quality_score >= 0.8
  GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

device_performance_analysis AS (
  SELECT 
    hsa.*,
    d.device_name,
    d.device_type,
    d.installation_location,
    d.sampling_interval_seconds,

    -- Performance calculations
    CASE 
      WHEN hsa.reading_count < (3600 / d.sampling_interval_seconds) * 0.8 THEN 'under_reporting'
      WHEN hsa.reading_count > (3600 / d.sampling_interval_seconds) * 1.2 THEN 'over_reporting'
      ELSE 'normal'
    END as reporting_status,

    -- Battery analysis
    CASE 
      WHEN hsa.avg_battery_level < 20 THEN 'critical'
      WHEN hsa.avg_battery_level < 40 THEN 'low'
      WHEN hsa.avg_battery_level < 60 THEN 'medium'
      ELSE 'good'
    END as battery_status,

    -- Signal strength analysis
    CASE 
      WHEN hsa.avg_signal_strength < -80 THEN 'poor'
      WHEN hsa.avg_signal_strength < -60 THEN 'fair'
      WHEN hsa.avg_signal_strength < -40 THEN 'good'
      ELSE 'excellent'
    END as signal_status,

    -- Calculate trends using window functions (expensive)
    LAG(hsa.avg_temperature) OVER (
      PARTITION BY hsa.device_id, hsa.sensor_type 
      ORDER BY hsa.hour_bucket
    ) as prev_hour_temperature,

    -- Rolling averages (very expensive across partitions)
    AVG(hsa.avg_temperature) OVER (
      PARTITION BY hsa.device_id, hsa.sensor_type 
      ORDER BY hsa.hour_bucket 
      ROWS 23 PRECEDING
    ) as rolling_24h_avg_temperature

  FROM hourly_sensor_averages hsa
  JOIN devices d ON hsa.device_id = d.device_id
),

environmental_conditions AS (
  -- Complex environmental analysis requiring expensive calculations
  SELECT 
    dpa.hour_bucket,
    dpa.installation_location,

    -- Location-based aggregations
    AVG(dpa.avg_temperature) as location_avg_temperature,
    AVG(dpa.avg_humidity) as location_avg_humidity,
    AVG(dpa.avg_pressure) as location_avg_pressure,

    -- Device count and health by location
    COUNT(*) as active_devices,
    COUNT(*) FILTER (WHERE dpa.battery_status = 'critical') as critical_battery_devices,
    COUNT(*) FILTER (WHERE dpa.signal_status = 'poor') as poor_signal_devices,
    COUNT(*) FILTER (WHERE dpa.anomaly_count > 0) as devices_with_anomalies,

    -- Environmental variance analysis
    STDDEV(dpa.avg_temperature) as temperature_variance,
    STDDEV(dpa.avg_humidity) as humidity_variance,

    -- Extreme conditions detection
    BOOL_OR(dpa.avg_temperature > 40 OR dpa.avg_temperature < -10) as extreme_temperature_detected,
    BOOL_OR(dpa.avg_humidity > 90 OR dpa.avg_humidity < 10) as extreme_humidity_detected,

    -- Data quality aggregation
    AVG(dpa.avg_data_quality) as location_data_quality,
    SUM(dpa.anomaly_count) as total_location_anomalies

  FROM device_performance_analysis dpa
  GROUP BY dpa.hour_bucket, dpa.installation_location
)

SELECT 
  ec.hour_bucket,
  ec.installation_location,
  ec.active_devices,

  -- Environmental metrics
  ROUND(ec.location_avg_temperature::NUMERIC, 2) as avg_temperature,
  ROUND(ec.location_avg_humidity::NUMERIC, 2) as avg_humidity,
  ROUND(ec.location_avg_pressure::NUMERIC, 2) as avg_pressure,

  -- Device health summary
  ec.critical_battery_devices,
  ec.poor_signal_devices,
  ec.devices_with_anomalies,

  -- Environmental conditions
  CASE 
    WHEN ec.extreme_temperature_detected OR ec.extreme_humidity_detected THEN 'extreme'
    WHEN ec.temperature_variance > 5 OR ec.humidity_variance > 15 THEN 'variable'
    ELSE 'stable'
  END as environmental_stability,

  -- Data quality indicators
  ROUND(ec.location_data_quality::NUMERIC, 3) as data_quality_score,
  ec.total_location_anomalies,

  -- Health scoring
  (
    100 - 
    (ec.critical_battery_devices * 20) - 
    (ec.poor_signal_devices * 10) - 
    (ec.devices_with_anomalies * 5) -
    CASE WHEN ec.location_data_quality < 0.9 THEN 15 ELSE 0 END
  ) as location_health_score,

  -- Operational recommendations
  CASE 
    WHEN ec.critical_battery_devices > 0 THEN 'URGENT: Replace batteries on ' || ec.critical_battery_devices || ' devices'
    WHEN ec.poor_signal_devices > ec.active_devices * 0.3 THEN 'Consider signal boosters for location'
    WHEN ec.total_location_anomalies > 10 THEN 'Investigate environmental factors causing anomalies'
    WHEN ec.location_data_quality < 0.8 THEN 'Review device calibration and maintenance schedules'
    ELSE 'Location operating within normal parameters'
  END as operational_recommendation

FROM environmental_conditions ec
ORDER BY ec.hour_bucket DESC, ec.location_health_score ASC;

-- Performance problems with traditional time series approaches:
-- 1. Manual partition management creates operational overhead and failure points
-- 2. Complex query plans across multiple partitions reduce performance
-- 3. Limited compression and storage optimization for time-stamped data
-- 4. No native support for time-based retention policies and archiving
-- 5. Expensive aggregation operations across large time ranges
-- 6. Poor performance for recent data queries due to partition pruning limitations
-- 7. Complex indexing strategies required for different time-based access patterns
-- 8. Difficult to optimize for both high-throughput writes and analytical reads
-- 9. No built-in support for downsampling and data compaction strategies
-- 10. Limited ability to handle irregular time intervals and sparse data efficiently

-- Attempt at data retention management (complex and error-prone)
CREATE OR REPLACE FUNCTION manage_sensor_data_retention()
RETURNS void AS $$
DECLARE
    partition_name text;
    retention_date timestamp;
BEGIN
    -- Calculate retention boundary (keep 1 year of data)
    retention_date := CURRENT_TIMESTAMP - INTERVAL '1 year';

    -- Find partitions older than retention period
    FOR partition_name IN
        SELECT schemaname||'.'||tablename
        FROM pg_tables 
        WHERE tablename LIKE 'sensor_readings_%'
        AND schemaname = 'public'
    LOOP
        -- Extract date from partition name (fragile parsing)
        IF partition_name ~ 'sensor_readings_[0-9]{4}_[0-9]{2}$' THEN
            -- This logic is complex and error-prone
            -- Need to parse partition name, validate dates, check data
            -- Then carefully drop partitions without losing data
            RAISE NOTICE 'Would evaluate partition % for retention', partition_name;
        END IF;
    END LOOP;

    -- Complex logic needed to:
    -- 1. Verify partition contains only old data
    -- 2. Archive data if needed before deletion  
    -- 3. Update constraints and metadata
    -- 4. Handle dependencies and foreign keys
    -- 5. Clean up indexes and statistics

EXCEPTION
    WHEN OTHERS THEN
        -- Error handling for partition management failures
        RAISE EXCEPTION 'Retention management failed: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;

-- Expensive real-time alerting query
WITH real_time_sensor_status AS (
  SELECT DISTINCT ON (device_id, sensor_type)
    device_id,
    sensor_type,
    timestamp,
    temperature,
    humidity,
    battery_level,
    signal_strength,
    anomaly_detected,
    data_quality_score
  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
  ORDER BY device_id, sensor_type, timestamp DESC
),

alert_conditions AS (
  SELECT 
    rtss.*,
    d.device_name,
    d.installation_location,

    -- Define alert conditions
    CASE 
      WHEN rtss.temperature > 50 OR rtss.temperature < -20 THEN 'temperature_extreme'
      WHEN rtss.humidity > 95 OR rtss.humidity < 5 THEN 'humidity_extreme'
      WHEN rtss.battery_level < 15 THEN 'battery_critical'
      WHEN rtss.signal_strength < -85 THEN 'signal_poor'
      WHEN rtss.anomaly_detected = true THEN 'anomaly_detected'
      WHEN rtss.data_quality_score < 0.7 THEN 'data_quality_poor'
      WHEN rtss.timestamp < CURRENT_TIMESTAMP - INTERVAL '10 minutes' THEN 'device_offline'
      ELSE null
    END as alert_type,

    -- Alert severity
    CASE 
      WHEN rtss.battery_level < 10 OR rtss.timestamp < CURRENT_TIMESTAMP - INTERVAL '20 minutes' THEN 'critical'
      WHEN rtss.temperature > 45 OR rtss.temperature < -15 OR rtss.anomaly_detected THEN 'high'
      WHEN rtss.battery_level < 20 OR rtss.signal_strength < -80 THEN 'medium'
      ELSE 'low'
    END as alert_severity

  FROM real_time_sensor_status rtss
  JOIN devices d ON rtss.device_id = d.device_id
  WHERE d.device_status = 'active'
)

SELECT 
  device_id,
  device_name,
  installation_location,
  sensor_type,
  alert_type,
  alert_severity,
  temperature,
  humidity,
  battery_level,
  signal_strength,
  timestamp,

  -- Alert message generation
  CASE alert_type
    WHEN 'temperature_extreme' THEN FORMAT('Temperature %s°C is outside safe range', temperature)
    WHEN 'humidity_extreme' THEN FORMAT('Humidity %s%% is at extreme level', humidity)
    WHEN 'battery_critical' THEN FORMAT('Battery level %s%% requires immediate attention', battery_level)
    WHEN 'signal_poor' THEN FORMAT('Signal strength %s dBm indicates connectivity issues', signal_strength)
    WHEN 'anomaly_detected' THEN 'Sensor readings show anomalous patterns'
    WHEN 'data_quality_poor' THEN FORMAT('Data quality score %s indicates sensor issues', data_quality_score)
    WHEN 'device_offline' THEN FORMAT('Device has not reported for %s minutes', EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp))/60)
    ELSE 'Unknown alert condition'
  END as alert_message,

  -- Recommended actions
  CASE alert_type
    WHEN 'temperature_extreme' THEN 'Check environmental conditions and sensor calibration'
    WHEN 'humidity_extreme' THEN 'Verify sensor operation and environmental factors'
    WHEN 'battery_critical' THEN 'Schedule immediate battery replacement'
    WHEN 'signal_poor' THEN 'Check device antenna and network infrastructure'
    WHEN 'anomaly_detected' THEN 'Investigate sensor readings and potential interference'
    WHEN 'data_quality_poor' THEN 'Perform sensor calibration and diagnostic checks'
    WHEN 'device_offline' THEN 'Check device power and network connectivity'
    ELSE 'Monitor device status'
  END as recommended_action

FROM alert_conditions
WHERE alert_type IS NOT NULL
ORDER BY 
  CASE alert_severity
    WHEN 'critical' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  timestamp DESC;

-- Traditional limitations for IoT time series data:
-- 1. Manual partition management and maintenance complexity
-- 2. Poor compression ratios for repetitive time series data patterns
-- 3. Expensive aggregation queries across large time ranges
-- 4. Limited real-time query performance for recent data analysis
-- 5. Complex retention policy implementation and data archiving
-- 6. No native support for irregular time intervals and sparse sensor data
-- 7. Difficult optimization for mixed analytical and operational workloads
-- 8. Limited scalability for high-frequency data ingestion (>1000 inserts/sec)
-- 9. Complex alerting and real-time monitoring query patterns
-- 10. Poor storage efficiency for IoT metadata and device information duplication

MongoDB Time Series Collections provide comprehensive optimization for temporal data workloads:

// MongoDB Time Series Collections - optimized for IoT sensor data and real-time analytics
const { MongoClient, ObjectId } = require('mongodb');

// Advanced IoT Time Series Data Manager
class IoTTimeSeriesDataManager {
  constructor() {
    this.client = null;
    this.db = null;
    this.collections = new Map();
    this.performanceMetrics = new Map();
    this.alertingRules = new Map();
    this.retentionPolicies = new Map();
  }

  async initialize() {
    console.log('Initializing IoT Time Series Data Manager...');

    // Connect with optimized settings for time series workloads
    this.client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017', {
      // Optimized for high-throughput time series writes
      maxPoolSize: 50,
      minPoolSize: 10,
      maxIdleTimeMS: 30000,
      serverSelectionTimeoutMS: 5000,

      // Write settings for time series data
      writeConcern: { 
        w: 1, 
        j: false, // Disable journaling for better write performance
        wtimeout: 5000
      },

      // Read preferences for time series queries
      readPreference: 'primaryPreferred',
      readConcern: { level: 'local' },

      // Compression for large time series datasets
      compressors: ['zstd', 'zlib'],

      appName: 'IoTTimeSeriesManager'
    });

    await this.client.connect();
    this.db = this.client.db('iot_platform');

    // Initialize time series collections with optimized configurations
    await this.setupTimeSeriesCollections();

    // Setup data retention policies
    await this.setupRetentionPolicies();

    // Initialize real-time alerting system
    await this.setupRealTimeAlerting();

    console.log('✅ IoT Time Series Data Manager initialized');
  }

  async setupTimeSeriesCollections() {
    console.log('Setting up optimized time series collections...');

    try {
      // Primary sensor readings time series collection
      const sensorReadingsTS = await this.db.createCollection('sensor_readings', {
        timeseries: {
          timeField: 'timestamp',           // Field containing the timestamp
          metaField: 'device',             // Field containing device metadata
          granularity: 'minutes',          // Optimized for minute-level granularity
          bucketMaxSpanSeconds: 3600       // 1-hour buckets for optimal compression
        },

        // Storage optimization
        storageEngine: {
          wiredTiger: {
            configString: 'block_compressor=zstd'  // Use zstd compression
          }
        }
      });

      // Device status and health time series
      const deviceHealthTS = await this.db.createCollection('device_health', {
        timeseries: {
          timeField: 'timestamp',
          metaField: 'device_info',
          granularity: 'hours',            // Less frequent health updates
          bucketMaxSpanSeconds: 86400      // 24-hour buckets
        }
      });

      // Environmental conditions aggregated data
      const environmentalTS = await this.db.createCollection('environmental_conditions', {
        timeseries: {
          timeField: 'timestamp',
          metaField: 'location',
          granularity: 'hours',
          bucketMaxSpanSeconds: 86400
        }
      });

      // Alert events time series
      const alertsTS = await this.db.createCollection('alert_events', {
        timeseries: {
          timeField: 'timestamp',
          metaField: 'alert_context',
          granularity: 'seconds',          // Fine-grained for alert analysis
          bucketMaxSpanSeconds: 3600
        }
      });

      // Store collection references
      this.collections.set('sensor_readings', this.db.collection('sensor_readings'));
      this.collections.set('device_health', this.db.collection('device_health'));
      this.collections.set('environmental_conditions', this.db.collection('environmental_conditions'));
      this.collections.set('alert_events', this.db.collection('alert_events'));

      // Create supporting collections for device metadata
      await this.setupDeviceCollections();

      // Create optimized indexes for time series queries
      await this.createTimeSeriesIndexes();

      console.log('✅ Time series collections configured with optimal settings');

    } catch (error) {
      console.error('Error setting up time series collections:', error);
      throw error;
    }
  }

  async setupDeviceCollections() {
    console.log('Setting up device metadata collections...');

    // Device registry collection (regular collection)
    const devicesCollection = this.db.collection('devices');
    await devicesCollection.createIndex({ device_id: 1 }, { unique: true });
    await devicesCollection.createIndex({ installation_location: 1 });
    await devicesCollection.createIndex({ device_type: 1 });
    await devicesCollection.createIndex({ "location_coordinates": "2dsphere" });

    // Location registry for environmental analytics
    const locationsCollection = this.db.collection('locations');
    await locationsCollection.createIndex({ location_id: 1 }, { unique: true });
    await locationsCollection.createIndex({ "coordinates": "2dsphere" });

    this.collections.set('devices', devicesCollection);
    this.collections.set('locations', locationsCollection);

    console.log('✅ Device metadata collections configured');
  }

  async createTimeSeriesIndexes() {
    console.log('Creating optimized time series indexes...');

    const sensorReadings = this.collections.get('sensor_readings');

    // Compound indexes optimized for common query patterns
    await sensorReadings.createIndex({ 
      'device.device_id': 1, 
      'timestamp': -1 
    }, { 
      name: 'device_timestamp_desc',
      background: true 
    });

    await sensorReadings.createIndex({ 
      'device.sensor_type': 1, 
      'timestamp': -1 
    }, { 
      name: 'sensor_type_timestamp_desc',
      background: true 
    });

    await sensorReadings.createIndex({ 
      'device.installation_location': 1, 
      'timestamp': -1 
    }, { 
      name: 'location_timestamp_desc',
      background: true 
    });

    // Partial indexes for alerting queries
    await sensorReadings.createIndex(
      { 'timestamp': -1 },
      { 
        name: 'recent_readings_partial',
        partialFilterExpression: { 
          timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
        },
        background: true
      }
    );

    console.log('✅ Time series indexes created');
  }

  async ingestSensorData(sensorDataBatch) {
    console.log(`Ingesting batch of ${sensorDataBatch.length} sensor readings...`);

    const startTime = Date.now();

    try {
      // Transform data for time series optimized format
      const timeSeriesDocuments = sensorDataBatch.map(reading => ({
        timestamp: new Date(reading.timestamp),

        // Device metadata (metaField for automatic bucketing)
        device: {
          device_id: reading.device_id,
          sensor_type: reading.sensor_type,
          installation_location: reading.installation_location,
          device_model: reading.device_model
        },

        // Measurement values (optimized for compression)
        measurements: {
          temperature: reading.temperature,
          humidity: reading.humidity,
          pressure: reading.pressure,
          battery_level: reading.battery_level,
          signal_strength: reading.signal_strength
        },

        // Location data (when available)
        ...(reading.latitude && reading.longitude && {
          location: {
            coordinates: [reading.longitude, reading.latitude],
            altitude: reading.altitude
          }
        }),

        // Data quality and status
        quality: {
          data_quality_score: reading.data_quality_score || 1.0,
          anomaly_detected: reading.anomaly_detected || false,
          validation_flags: reading.validation_flags || {}
        },

        // Processing metadata
        ingestion_time: new Date(),
        processing_version: '1.0'
      }));

      // Bulk insert with optimal batch size for time series
      const insertResult = await this.collections.get('sensor_readings').insertMany(
        timeSeriesDocuments,
        { 
          ordered: false,  // Allow parallel inserts
          writeConcern: { w: 1, j: false }  // Optimize for throughput
        }
      );

      // Update device health tracking
      await this.updateDeviceHealthTracking(sensorDataBatch);

      // Process real-time alerts
      await this.processRealTimeAlerts(timeSeriesDocuments);

      // Update performance metrics
      const ingestionTime = Date.now() - startTime;
      await this.trackIngestionMetrics({
        batchSize: sensorDataBatch.length,
        ingestionTimeMs: ingestionTime,
        documentsInserted: insertResult.insertedCount,
        timestamp: new Date()
      });

      console.log(`✅ Ingested ${insertResult.insertedCount} sensor readings in ${ingestionTime}ms`);

      return {
        success: true,
        documentsInserted: insertResult.insertedCount,
        ingestionTimeMs: ingestionTime,
        throughput: (insertResult.insertedCount / ingestionTime * 1000).toFixed(2) + ' docs/sec'
      };

    } catch (error) {
      console.error('Error ingesting sensor data:', error);
      return { success: false, error: error.message };
    }
  }

  async queryRecentSensorData(deviceId, timeRangeMinutes = 60) {
    console.log(`Querying recent sensor data for device ${deviceId} over ${timeRangeMinutes} minutes...`);

    const startTime = Date.now();
    const queryStartTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    try {
      const pipeline = [
        {
          $match: {
            'device.device_id': deviceId,
            timestamp: { $gte: queryStartTime }
          }
        },
        {
          $sort: { timestamp: -1 }
        },
        {
          $limit: 1000  // Reasonable limit for recent data
        },
        {
          $project: {
            timestamp: 1,
            'device.sensor_type': 1,
            'device.installation_location': 1,
            measurements: 1,
            location: 1,
            'quality.data_quality_score': 1,
            'quality.anomaly_detected': 1
          }
        }
      ];

      const results = await this.collections.get('sensor_readings')
        .aggregate(pipeline, { allowDiskUse: false })
        .toArray();

      const queryTime = Date.now() - startTime;

      console.log(`✅ Retrieved ${results.length} readings in ${queryTime}ms`);

      return {
        deviceId: deviceId,
        timeRangeMinutes: timeRangeMinutes,
        readingCount: results.length,
        queryTimeMs: queryTime,
        data: results,

        // Data summary
        summary: {
          latestReading: results[0]?.timestamp,
          oldestReading: results[results.length - 1]?.timestamp,
          sensorTypes: [...new Set(results.map(r => r.device.sensor_type))],
          averageDataQuality: results.reduce((sum, r) => sum + (r.quality.data_quality_score || 1), 0) / results.length,
          anomaliesDetected: results.filter(r => r.quality.anomaly_detected).length
        }
      };

    } catch (error) {
      console.error('Error querying recent sensor data:', error);
      return { success: false, error: error.message };
    }
  }

  async performTimeSeriesAggregation(aggregationOptions) {
    console.log('Performing optimized time series aggregation...');

    const {
      timeRange = { hours: 24 },
      granularity = 'hour',
      metrics = ['temperature', 'humidity', 'pressure'],
      groupBy = ['device.installation_location'],
      filters = {}
    } = aggregationOptions;

    const startTime = Date.now();

    try {
      // Calculate time range
      const timeRangeMs = (timeRange.days || 0) * 24 * 60 * 60 * 1000 +
                          (timeRange.hours || 0) * 60 * 60 * 1000 +
                          (timeRange.minutes || 0) * 60 * 1000;
      const queryStartTime = new Date(Date.now() - timeRangeMs);

      // Build granularity for $dateTrunc
      const granularityMap = {
        'minute': 'minute',
        'hour': 'hour', 
        'day': 'day',
        'week': 'week'
      };

      const pipeline = [
        // Match stage with time range and filters
        {
          $match: {
            timestamp: { $gte: queryStartTime },
            ...filters
          }
        },

        // Add time bucket field
        {
          $addFields: {
            timeBucket: {
              $dateTrunc: {
                date: '$timestamp',
                unit: granularityMap[granularity] || 'hour'
              }
            }
          }
        },

        // Group by time bucket and specified dimensions
        {
          $group: {
            _id: {
              timeBucket: '$timeBucket',
              ...Object.fromEntries(groupBy.map(field => [field.replace('.', '_'), `$${field}`]))
            },

            // Count and time range
            readingCount: { $sum: 1 },
            periodStart: { $min: '$timestamp' },
            periodEnd: { $max: '$timestamp' },

            // Dynamic metric aggregations
            ...Object.fromEntries(metrics.flatMap(metric => [
              [`avg_${metric}`, { $avg: `$measurements.${metric}` }],
              [`min_${metric}`, { $min: `$measurements.${metric}` }],
              [`max_${metric}`, { $max: `$measurements.${metric}` }],
              [`stdDev_${metric}`, { $stdDevPop: `$measurements.${metric}` }]
            ])),

            // Data quality metrics
            avgDataQuality: { $avg: '$quality.data_quality_score' },
            anomalyCount: {
              $sum: { $cond: [{ $eq: ['$quality.anomaly_detected', true] }, 1, 0] }
            },

            // Device diversity
            uniqueDevices: { $addToSet: '$device.device_id' },
            sensorTypes: { $addToSet: '$device.sensor_type' }
          }
        },

        // Calculate derived metrics
        {
          $addFields: {
            // Device count
            deviceCount: { $size: '$uniqueDevices' },
            sensorTypeCount: { $size: '$sensorTypes' },

            // Data coverage and reliability
            dataCoveragePercent: {
              $multiply: [
                { $divide: ['$readingCount', { $multiply: ['$deviceCount', 60] }] }, // Assuming 1-minute intervals
                100
              ]
            },

            // Anomaly rate
            anomalyRate: {
              $cond: [
                { $gt: ['$readingCount', 0] },
                { $divide: ['$anomalyCount', '$readingCount'] },
                0
              ]
            }
          }
        },

        // Sort by time bucket
        {
          $sort: { '_id.timeBucket': 1 }
        },

        // Project final structure
        {
          $project: {
            timeBucket: '$_id.timeBucket',
            grouping: {
              $objectToArray: {
                $arrayToObject: {
                  $filter: {
                    input: { $objectToArray: '$_id' },
                    cond: { $ne: ['$$this.k', 'timeBucket'] }
                  }
                }
              }
            },

            // Measurements
            measurements: Object.fromEntries(metrics.map(metric => [
              metric,
              {
                avg: { $round: [`$avg_${metric}`, 2] },
                min: { $round: [`$min_${metric}`, 2] },
                max: { $round: [`$max_${metric}`, 2] },
                stdDev: { $round: [`$stdDev_${metric}`, 3] }
              }
            ])),

            // Metadata
            metadata: {
              readingCount: '$readingCount',
              deviceCount: '$deviceCount',
              sensorTypeCount: '$sensorTypeCount',
              periodStart: '$periodStart',
              periodEnd: '$periodEnd',
              dataCoveragePercent: { $round: ['$dataCoveragePercent', 1] },
              avgDataQuality: { $round: ['$avgDataQuality', 3] },
              anomalyCount: '$anomalyCount',
              anomalyRate: { $round: ['$anomalyRate', 4] }
            }
          }
        }
      ];

      // Execute aggregation with optimization hints
      const results = await this.collections.get('sensor_readings').aggregate(pipeline, {
        allowDiskUse: false,  // Use memory for better performance
        maxTimeMS: 30000,     // 30-second timeout
        hint: 'location_timestamp_desc'  // Use optimized index
      }).toArray();

      const aggregationTime = Date.now() - startTime;

      console.log(`✅ Completed time series aggregation: ${results.length} buckets in ${aggregationTime}ms`);

      return {
        success: true,
        aggregationTimeMs: aggregationTime,
        bucketCount: results.length,
        timeRange: timeRange,
        granularity: granularity,
        data: results,

        // Performance metrics
        performance: {
          documentsScanned: results.reduce((sum, bucket) => sum + bucket.metadata.readingCount, 0),
          averageBucketProcessingTime: aggregationTime / results.length,
          throughput: (results.length / aggregationTime * 1000).toFixed(2) + ' buckets/sec'
        }
      };

    } catch (error) {
      console.error('Error performing time series aggregation:', error);
      return { success: false, error: error.message };
    }
  }

  async setupRetentionPolicies() {
    console.log('Setting up automated data retention policies...');

    const retentionConfigs = {
      sensor_readings: {
        rawDataRetentionDays: 90,      // Keep raw data for 90 days
        aggregatedDataRetentionDays: 365, // Keep aggregated data for 1 year
        archiveAfterDays: 30,          // Archive data older than 30 days
        compressionLevel: 'high'
      },

      device_health: {
        rawDataRetentionDays: 180,     // Keep device health for 6 months
        aggregatedDataRetentionDays: 730, // Keep aggregated health data for 2 years
        archiveAfterDays: 60
      },

      alert_events: {
        rawDataRetentionDays: 365,     // Keep alerts for 1 year
        archiveAfterDays: 90
      }
    };

    // Store retention policies
    for (const [collection, config] of Object.entries(retentionConfigs)) {
      this.retentionPolicies.set(collection, config);

      // Create TTL indexes for automatic deletion
      await this.collections.get(collection).createIndex(
        { timestamp: 1 },
        { 
          expireAfterSeconds: config.rawDataRetentionDays * 24 * 60 * 60,
          name: `ttl_${collection}`,
          background: true
        }
      );
    }

    console.log('✅ Retention policies configured');
  }

  async processRealTimeAlerts(sensorDocuments) {
    const alertingRules = [
      {
        name: 'temperature_extreme',
        condition: (doc) => doc.measurements.temperature > 50 || doc.measurements.temperature < -20,
        severity: 'critical',
        message: (doc) => `Extreme temperature ${doc.measurements.temperature}°C detected at ${doc.device.installation_location}`
      },
      {
        name: 'battery_critical',
        condition: (doc) => doc.measurements.battery_level < 15,
        severity: 'high',
        message: (doc) => `Critical battery level ${doc.measurements.battery_level}% on device ${doc.device.device_id}`
      },
      {
        name: 'anomaly_detected',
        condition: (doc) => doc.quality.anomaly_detected === true,
        severity: 'medium',
        message: (doc) => `Anomalous readings detected from device ${doc.device.device_id}`
      },
      {
        name: 'data_quality_poor',
        condition: (doc) => doc.quality.data_quality_score < 0.7,
        severity: 'medium',
        message: (doc) => `Poor data quality (${doc.quality.data_quality_score}) from device ${doc.device.device_id}`
      }
    ];

    const alerts = [];
    const currentTime = new Date();

    for (const document of sensorDocuments) {
      for (const rule of alertingRules) {
        if (rule.condition(document)) {
          alerts.push({
            timestamp: currentTime,
            alert_context: {
              rule_name: rule.name,
              device_id: document.device.device_id,
              location: document.device.installation_location,
              sensor_type: document.device.sensor_type
            },
            severity: rule.severity,
            message: rule.message(document),
            source_data: {
              measurements: document.measurements,
              quality: document.quality,
              reading_timestamp: document.timestamp
            },
            status: 'active',
            acknowledgment: null,
            created_at: currentTime
          });
        }
      }
    }

    // Insert alerts into time series collection
    if (alerts.length > 0) {
      await this.collections.get('alert_events').insertMany(alerts, {
        ordered: false,
        writeConcern: { w: 1, j: false }
      });

      console.log(`🚨 Generated ${alerts.length} real-time alerts`);
    }

    return alerts;
  }

  async updateDeviceHealthTracking(sensorDataBatch) {
    // Aggregate device health metrics from sensor readings
    const deviceHealthUpdates = {};

    for (const reading of sensorDataBatch) {
      if (!deviceHealthUpdates[reading.device_id]) {
        deviceHealthUpdates[reading.device_id] = {
          readings: [],
          location: reading.installation_location,
          deviceModel: reading.device_model
        };
      }
      deviceHealthUpdates[reading.device_id].readings.push(reading);
    }

    const healthDocuments = Object.entries(deviceHealthUpdates).map(([deviceId, data]) => {
      const readings = data.readings;
      const avgBattery = readings.reduce((sum, r) => sum + (r.battery_level || 0), 0) / readings.length;
      const avgSignal = readings.reduce((sum, r) => sum + (r.signal_strength || -50), 0) / readings.length;
      const avgDataQuality = readings.reduce((sum, r) => sum + (r.data_quality_score || 1), 0) / readings.length;

      return {
        timestamp: new Date(),
        device_info: {
          device_id: deviceId,
          installation_location: data.location,
          device_model: data.deviceModel
        },
        health_metrics: {
          battery_level: avgBattery,
          signal_strength: avgSignal,
          data_quality_score: avgDataQuality,
          reading_frequency: readings.length,
          last_reading_time: new Date(Math.max(...readings.map(r => new Date(r.timestamp))))
        },
        health_status: {
          battery_status: avgBattery > 40 ? 'good' : avgBattery > 20 ? 'warning' : 'critical',
          connectivity_status: avgSignal > -60 ? 'excellent' : avgSignal > -80 ? 'good' : 'poor',
          overall_health: avgBattery > 20 && avgSignal > -80 && avgDataQuality > 0.8 ? 'healthy' : 'attention_needed'
        }
      };
    });

    if (healthDocuments.length > 0) {
      await this.collections.get('device_health').insertMany(healthDocuments, {
        ordered: false,
        writeConcern: { w: 1, j: false }
      });
    }
  }

  async trackIngestionMetrics(metrics) {
    const key = `${metrics.timestamp.getFullYear()}-${metrics.timestamp.getMonth() + 1}-${metrics.timestamp.getDate()}`;

    if (!this.performanceMetrics.has(key)) {
      this.performanceMetrics.set(key, {
        totalBatches: 0,
        totalDocuments: 0,
        totalIngestionTime: 0,
        averageBatchSize: 0,
        averageThroughput: 0
      });
    }

    const dailyMetrics = this.performanceMetrics.get(key);
    dailyMetrics.totalBatches++;
    dailyMetrics.totalDocuments += metrics.documentsInserted;
    dailyMetrics.totalIngestionTime += metrics.ingestionTimeMs;
    dailyMetrics.averageBatchSize = dailyMetrics.totalDocuments / dailyMetrics.totalBatches;
    dailyMetrics.averageThroughput = dailyMetrics.totalDocuments / (dailyMetrics.totalIngestionTime / 1000);
  }

  async getPerformanceMetrics() {
    const metrics = {
      timestamp: new Date(),
      ingestionMetrics: Object.fromEntries(this.performanceMetrics),

      // Collection statistics
      collectionStats: {},

      // Current throughput (last 5 minutes)
      recentThroughput: await this.calculateRecentThroughput(),

      // Storage efficiency
      storageStats: await this.getStorageStatistics()
    };

    // Get collection stats
    for (const [name, collection] of this.collections) {
      try {
        if (name.includes('sensor_readings') || name.includes('device_health')) {
          const stats = await this.db.command({ collStats: collection.collectionName });
          metrics.collectionStats[name] = {
            documentCount: stats.count,
            storageSize: stats.storageSize,
            averageDocumentSize: stats.avgObjSize,
            indexSize: stats.totalIndexSize,
            compressionRatio: stats.storageSize > 0 ? (stats.size / stats.storageSize).toFixed(2) : 0
          };
        }
      } catch (error) {
        console.warn(`Could not get stats for collection ${name}:`, error.message);
      }
    }

    return metrics;
  }

  async calculateRecentThroughput() {
    const fiveMinutesAgo = new Date(Date.now() - 5 * 60 * 1000);

    try {
      const recentCount = await this.collections.get('sensor_readings').countDocuments({
        timestamp: { $gte: fiveMinutesAgo }
      });

      return {
        documentsLast5Minutes: recentCount,
        throughputDocsPerSecond: (recentCount / 300).toFixed(2)
      };
    } catch (error) {
      return { error: error.message };
    }
  }

  async getStorageStatistics() {
    try {
      const dbStats = await this.db.command({ dbStats: 1 });

      return {
        totalDataSize: dbStats.dataSize,
        totalStorageSize: dbStats.storageSize,
        totalIndexSize: dbStats.indexSize,
        compressionRatio: dbStats.dataSize > 0 ? (dbStats.dataSize / dbStats.storageSize).toFixed(2) : 0,
        collections: dbStats.collections,
        objects: dbStats.objects
      };
    } catch (error) {
      return { error: error.message };
    }
  }

  async shutdown() {
    console.log('Shutting down IoT Time Series Data Manager...');

    if (this.client) {
      await this.client.close();
      console.log('✅ MongoDB connection closed');
    }

    this.collections.clear();
    this.performanceMetrics.clear();
    this.alertingRules.clear();
    this.retentionPolicies.clear();
  }
}

// Export the IoT time series data manager
module.exports = { IoTTimeSeriesDataManager };

// Benefits of MongoDB Time Series Collections for IoT:
// - Native time series optimization with automatic bucketing and compression
// - High-throughput data ingestion optimized for IoT sensor data patterns
// - Efficient storage with specialized compression algorithms for time series data
// - Automatic data retention policies with TTL indexes
// - Real-time query performance for recent data analysis and alerting
// - Flexible metadata handling for diverse IoT device types and sensor configurations
// - Built-in support for time-based aggregations and analytics
// - Seamless integration with existing MongoDB tooling and ecosystem
// - SQL-compatible time series operations through QueryLeaf integration
// - Enterprise-grade scalability for high-volume IoT data ingestion

Understanding MongoDB Time Series Architecture

Advanced IoT Data Processing and Analytics Patterns

Implement sophisticated time series data management for production IoT deployments:

// Production-ready IoT time series processing with advanced analytics and monitoring
class ProductionIoTTimeSeriesProcessor extends IoTTimeSeriesDataManager {
  constructor(productionConfig) {
    super();

    this.productionConfig = {
      ...productionConfig,
      enableAdvancedAnalytics: true,
      enablePredictiveAlerting: true,
      enableDataQualityMonitoring: true,
      enableAutomaticScaling: true,
      enableCompliance: true,
      enableDisasterRecovery: true
    };

    this.analyticsEngine = new Map();
    this.predictionModels = new Map();
    this.qualityMetrics = new Map();
    this.complianceTracking = new Map();
  }

  async setupAdvancedAnalytics() {
    console.log('Setting up advanced IoT analytics pipeline...');

    // Time series analytics configurations
    const analyticsConfigs = {
      // Real-time stream processing for immediate insights
      realTimeAnalytics: {
        windowSizes: ['5min', '15min', '1hour'],
        metrics: ['avg', 'min', 'max', 'stddev', 'percentiles'],
        alertingThresholds: {
          temperature: { min: -20, max: 50, stddev: 5 },
          humidity: { min: 0, max: 100, stddev: 10 },
          battery: { critical: 15, warning: 30 }
        },
        processingLatency: 'sub_second'
      },

      // Predictive analytics for proactive maintenance
      predictiveAnalytics: {
        algorithms: ['linear_trend', 'seasonal_decomposition', 'anomaly_detection'],
        predictionHorizons: ['1hour', '6hours', '24hours', '7days'],
        confidenceIntervals: [0.95, 0.99],
        modelRetraining: 'daily'
      },

      // Environmental correlation analysis
      environmentalAnalytics: {
        correlationAnalysis: true,
        spatialAnalysis: true,
        temporalPatterns: true,
        crossDeviceAnalysis: true
      },

      // Device performance analytics
      devicePerformanceAnalytics: {
        batteryLifePrediction: true,
        connectivityAnalysis: true,
        sensorDriftDetection: true,
        maintenancePrediction: true
      }
    };

    // Initialize analytics pipelines
    for (const [analyticsType, config] of Object.entries(analyticsConfigs)) {
      await this.initializeAnalyticsPipeline(analyticsType, config);
    }

    console.log('✅ Advanced analytics pipelines configured');
  }

  async performPredictiveAnalytics(deviceId, predictionType, horizonHours = 24) {
    console.log(`Performing predictive analytics for device ${deviceId}...`);

    const historicalHours = horizonHours * 10; // Use 10x horizon for historical analysis
    const startTime = new Date(Date.now() - historicalHours * 60 * 60 * 1000);

    const pipeline = [
      {
        $match: {
          'device.device_id': deviceId,
          timestamp: { $gte: startTime }
        }
      },
      {
        $sort: { timestamp: 1 }
      },
      {
        $group: {
          _id: {
            hour: { $dateToString: { format: '%Y-%m-%d-%H', date: '$timestamp' } }
          },
          avgTemperature: { $avg: '$measurements.temperature' },
          avgHumidity: { $avg: '$measurements.humidity' },
          avgBattery: { $avg: '$measurements.battery_level' },
          avgSignal: { $avg: '$measurements.signal_strength' },
          readingCount: { $sum: 1 },
          anomalyCount: { $sum: { $cond: ['$quality.anomaly_detected', 1, 0] } },
          minTimestamp: { $min: '$timestamp' },
          maxTimestamp: { $max: '$timestamp' }
        }
      },
      {
        $sort: { '_id.hour': 1 }
      },
      {
        $project: {
          hour: '$_id.hour',
          metrics: {
            temperature: '$avgTemperature',
            humidity: '$avgHumidity',
            battery: '$avgBattery',
            signal: '$avgSignal',
            readingCount: '$readingCount',
            anomalyRate: { $divide: ['$anomalyCount', '$readingCount'] }
          },
          timeRange: {
            start: '$minTimestamp',
            end: '$maxTimestamp'
          }
        }
      }
    ];

    const historicalData = await this.collections.get('sensor_readings')
      .aggregate(pipeline)
      .toArray();

    // Perform predictive analysis based on historical patterns
    const predictions = await this.generatePredictions(deviceId, historicalData, predictionType, horizonHours);

    return {
      deviceId: deviceId,
      predictionType: predictionType,
      horizon: `${horizonHours} hours`,
      historicalDataPoints: historicalData.length,
      predictions: predictions,
      confidence: this.calculatePredictionConfidence(historicalData),
      recommendations: this.generateMaintenanceRecommendations(predictions)
    };
  }

  async generatePredictions(deviceId, historicalData, predictionType, horizonHours) {
    const predictions = {
      batteryLife: null,
      temperatureTrends: null,
      connectivityHealth: null,
      maintenanceNeeds: null
    };

    if (predictionType === 'battery_life' || predictionType === 'all') {
      predictions.batteryLife = this.predictBatteryLife(historicalData, horizonHours);
    }

    if (predictionType === 'temperature_trends' || predictionType === 'all') {
      predictions.temperatureTrends = this.predictTemperatureTrends(historicalData, horizonHours);
    }

    if (predictionType === 'connectivity' || predictionType === 'all') {
      predictions.connectivityHealth = this.predictConnectivityHealth(historicalData, horizonHours);
    }

    if (predictionType === 'maintenance' || predictionType === 'all') {
      predictions.maintenanceNeeds = this.predictMaintenanceNeeds(historicalData, horizonHours);
    }

    return predictions;
  }

  predictBatteryLife(historicalData, horizonHours) {
    if (historicalData.length < 24) { // Need at least 24 hours of data
      return { error: 'Insufficient historical data for battery prediction' };
    }

    // Simple linear regression on battery level
    const batteryData = historicalData
      .filter(d => d.metrics.battery !== null)
      .map((d, index) => ({ x: index, y: d.metrics.battery }));

    if (batteryData.length < 10) {
      return { error: 'Insufficient battery data points' };
    }

    // Calculate linear trend
    const n = batteryData.length;
    const sumX = batteryData.reduce((sum, d) => sum + d.x, 0);
    const sumY = batteryData.reduce((sum, d) => sum + d.y, 0);
    const sumXY = batteryData.reduce((sum, d) => sum + (d.x * d.y), 0);
    const sumXX = batteryData.reduce((sum, d) => sum + (d.x * d.x), 0);

    const slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    const intercept = (sumY - slope * sumX) / n;

    // Project battery level
    const currentBattery = batteryData[batteryData.length - 1].y;
    const futureX = batteryData.length - 1 + horizonHours;
    const projectedBattery = slope * futureX + intercept;

    // Calculate time until battery reaches critical level (15%)
    const criticalTime = slope !== 0 ? (15 - intercept) / slope : null;
    const hoursUntilCritical = criticalTime ? Math.max(0, criticalTime - (batteryData.length - 1)) : null;

    return {
      currentLevel: Math.round(currentBattery * 10) / 10,
      projectedLevel: Math.round(projectedBattery * 10) / 10,
      drainRate: Math.round((-slope) * 100) / 100, // % per hour
      hoursUntilCritical: hoursUntilCritical,
      confidence: this.calculateTrendConfidence(batteryData, slope, intercept),
      recommendation: hoursUntilCritical && hoursUntilCritical < 72 ? 
        'Schedule battery replacement within 3 days' : 
        'Battery levels normal'
    };
  }

  predictTemperatureTrends(historicalData, horizonHours) {
    const temperatureData = historicalData
      .filter(d => d.metrics.temperature !== null)
      .map((d, index) => ({ 
        x: index, 
        y: d.metrics.temperature,
        hour: d.hour
      }));

    if (temperatureData.length < 12) {
      return { error: 'Insufficient temperature data for trend analysis' };
    }

    // Calculate moving averages and trends
    const recentAvg = temperatureData.slice(-6).reduce((sum, d) => sum + d.y, 0) / 6;
    const historicalAvg = temperatureData.reduce((sum, d) => sum + d.y, 0) / temperatureData.length;
    const trend = recentAvg - historicalAvg;

    // Detect patterns (simplified seasonal detection)
    const hourlyAverages = this.calculateHourlyAverages(temperatureData);
    const dailyPattern = this.detectDailyPattern(hourlyAverages);

    return {
      currentAverage: Math.round(recentAvg * 100) / 100,
      historicalAverage: Math.round(historicalAvg * 100) / 100,
      trend: Math.round(trend * 100) / 100,
      trendDirection: trend > 1 ? 'increasing' : trend < -1 ? 'decreasing' : 'stable',
      dailyPattern: dailyPattern,
      extremeRisk: recentAvg > 40 || recentAvg < -10 ? 'high' : 'low',
      projectedRange: {
        min: Math.round((recentAvg + trend - 5) * 100) / 100,
        max: Math.round((recentAvg + trend + 5) * 100) / 100
      }
    };
  }

  predictConnectivityHealth(historicalData, horizonHours) {
    const signalData = historicalData
      .filter(d => d.metrics.signal !== null)
      .map(d => d.metrics.signal);

    const readingCountData = historicalData.map(d => d.metrics.readingCount);

    if (signalData.length < 6) {
      return { error: 'Insufficient connectivity data' };
    }

    const avgSignal = signalData.reduce((sum, s) => sum + s, 0) / signalData.length;
    const signalTrend = this.calculateSimpleTrend(signalData);

    const avgReadings = readingCountData.reduce((sum, r) => sum + r, 0) / readingCountData.length;
    const expectedReadings = 60; // Assuming 1-minute intervals
    const connectivityRatio = avgReadings / expectedReadings;

    return {
      averageSignalStrength: Math.round(avgSignal),
      signalTrend: Math.round(signalTrend * 100) / 100,
      connectivityRatio: Math.round(connectivityRatio * 1000) / 1000,
      connectivityStatus: connectivityRatio > 0.9 ? 'excellent' : 
                         connectivityRatio > 0.8 ? 'good' : 
                         connectivityRatio > 0.6 ? 'fair' : 'poor',
      projectedSignal: Math.round((avgSignal + signalTrend * horizonHours)),
      riskFactors: this.identifyConnectivityRisks(avgSignal, connectivityRatio, signalTrend)
    };
  }

  predictMaintenanceNeeds(historicalData, horizonHours) {
    const anomalyRates = historicalData.map(d => d.metrics.anomalyRate || 0);
    const recentAnomalyRate = anomalyRates.slice(-6).reduce((sum, r) => sum + r, 0) / 6;

    const batteryPrediction = this.predictBatteryLife(historicalData, horizonHours);
    const connectivityPrediction = this.predictConnectivityHealth(historicalData, horizonHours);

    const maintenanceScore = this.calculateMaintenanceScore(
      recentAnomalyRate,
      batteryPrediction,
      connectivityPrediction
    );

    return {
      maintenanceScore: Math.round(maintenanceScore),
      priority: maintenanceScore > 80 ? 'critical' :
                maintenanceScore > 60 ? 'high' :
                maintenanceScore > 40 ? 'medium' : 'low',

      recommendedActions: this.generateMaintenanceActions(
        maintenanceScore,
        batteryPrediction,
        connectivityPrediction,
        recentAnomalyRate
      ),

      estimatedMaintenanceWindow: this.estimateMaintenanceWindow(maintenanceScore),

      riskAssessment: {
        dataLossRisk: recentAnomalyRate > 0.1 ? 'high' : 'low',
        deviceFailureRisk: maintenanceScore > 70 ? 'high' : 'medium',
        serviceDisruptionRisk: connectivityPrediction.connectivityStatus === 'poor' ? 'high' : 'low'
      }
    };
  }

  calculateSimpleTrend(data) {
    if (data.length < 2) return 0;

    const n = data.length;
    const sumX = (n - 1) * n / 2; // Sum of 0,1,2,...,n-1
    const sumY = data.reduce((sum, y) => sum + y, 0);
    const sumXY = data.reduce((sum, y, x) => sum + (x * y), 0);
    const sumXX = (n - 1) * n * (2 * n - 1) / 6; // Sum of squares

    return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
  }

  calculateMaintenanceScore(anomalyRate, batteryPrediction, connectivityPrediction) {
    let score = 0;

    // Anomaly rate impact (0-30 points)
    score += Math.min(30, anomalyRate * 300);

    // Battery level impact (0-40 points)
    if (batteryPrediction && !batteryPrediction.error) {
      if (batteryPrediction.currentLevel < 20) score += 40;
      else if (batteryPrediction.currentLevel < 40) score += 25;
      else if (batteryPrediction.currentLevel < 60) score += 10;

      if (batteryPrediction.hoursUntilCritical && batteryPrediction.hoursUntilCritical < 72) {
        score += 30;
      }
    }

    // Connectivity impact (0-30 points)
    if (connectivityPrediction && !connectivityPrediction.error) {
      if (connectivityPrediction.connectivityStatus === 'poor') score += 30;
      else if (connectivityPrediction.connectivityStatus === 'fair') score += 20;
      else if (connectivityPrediction.connectivityStatus === 'good') score += 10;
    }

    return Math.min(100, score);
  }

  generateMaintenanceActions(score, batteryPrediction, connectivityPrediction, anomalyRate) {
    const actions = [];

    if (batteryPrediction && !batteryPrediction.error && batteryPrediction.currentLevel < 30) {
      actions.push({
        action: 'Replace device battery',
        priority: batteryPrediction.currentLevel < 15 ? 'urgent' : 'high',
        timeframe: batteryPrediction.currentLevel < 15 ? '24 hours' : '72 hours'
      });
    }

    if (connectivityPrediction && !connectivityPrediction.error && 
        connectivityPrediction.connectivityStatus === 'poor') {
      actions.push({
        action: 'Inspect device antenna and positioning',
        priority: 'high',
        timeframe: '48 hours'
      });
    }

    if (anomalyRate > 0.1) {
      actions.push({
        action: 'Perform sensor calibration and diagnostic check',
        priority: 'medium',
        timeframe: '1 week'
      });
    }

    if (score > 70) {
      actions.push({
        action: 'Comprehensive device health inspection',
        priority: 'high',
        timeframe: '48 hours'
      });
    }

    return actions.length > 0 ? actions : [{
      action: 'Continue routine monitoring',
      priority: 'low',
      timeframe: 'next_maintenance_cycle'
    }];
  }

  estimateMaintenanceWindow(score) {
    if (score > 80) return '0-24 hours';
    if (score > 60) return '1-3 days';
    if (score > 40) return '1-2 weeks';
    return '1-3 months';
  }

  calculateTrendConfidence(data, slope, intercept) {
    // Calculate R-squared for trend confidence
    const yMean = data.reduce((sum, d) => sum + d.y, 0) / data.length;
    const ssTotal = data.reduce((sum, d) => sum + Math.pow(d.y - yMean, 2), 0);
    const ssRes = data.reduce((sum, d) => {
      const predicted = slope * d.x + intercept;
      return sum + Math.pow(d.y - predicted, 2);
    }, 0);

    const rSquared = ssTotal > 0 ? 1 - (ssRes / ssTotal) : 0;

    if (rSquared > 0.8) return 'high';
    if (rSquared > 0.6) return 'medium';
    return 'low';
  }

  calculateHourlyAverages(temperatureData) {
    // Simplified hourly pattern detection
    const hourlyData = {};

    temperatureData.forEach(d => {
      const hour = d.hour.split('-')[3]; // Extract hour from YYYY-MM-DD-HH format
      if (!hourlyData[hour]) {
        hourlyData[hour] = [];
      }
      hourlyData[hour].push(d.y);
    });

    const hourlyAverages = {};
    for (const [hour, temps] of Object.entries(hourlyData)) {
      hourlyAverages[hour] = temps.reduce((sum, t) => sum + t, 0) / temps.length;
    }

    return hourlyAverages;
  }

  detectDailyPattern(hourlyAverages) {
    const hours = Object.keys(hourlyAverages).sort();
    if (hours.length < 6) return 'insufficient_data';

    const temperatures = hours.map(h => hourlyAverages[h]);
    const minTemp = Math.min(...temperatures);
    const maxTemp = Math.max(...temperatures);
    const range = maxTemp - minTemp;

    if (range > 10) return 'high_variation';
    if (range > 5) return 'moderate_variation';
    return 'stable';
  }

  identifyConnectivityRisks(avgSignal, connectivityRatio, signalTrend) {
    const risks = [];

    if (avgSignal < -80) {
      risks.push('Weak signal strength may cause intermittent connectivity');
    }

    if (connectivityRatio < 0.7) {
      risks.push('High packet loss affecting data reliability');
    }

    if (signalTrend < -1) {
      risks.push('Degrading signal strength trend detected');
    }

    if (risks.length === 0) {
      risks.push('No significant connectivity risks identified');
    }

    return risks;
  }

  generateMaintenanceRecommendations(predictions) {
    const recommendations = [];

    // Battery recommendations
    if (predictions.batteryLife && !predictions.batteryLife.error) {
      if (predictions.batteryLife.hoursUntilCritical && predictions.batteryLife.hoursUntilCritical < 168) {
        recommendations.push({
          type: 'battery',
          urgency: 'high',
          message: `Battery replacement needed within ${Math.ceil(predictions.batteryLife.hoursUntilCritical / 24)} days`,
          action: 'schedule_battery_replacement'
        });
      }
    }

    // Temperature recommendations
    if (predictions.temperatureTrends && !predictions.temperatureTrends.error) {
      if (predictions.temperatureTrends.extremeRisk === 'high') {
        recommendations.push({
          type: 'environmental',
          urgency: 'medium',
          message: 'Device operating in extreme temperature conditions',
          action: 'verify_installation_environment'
        });
      }
    }

    // Connectivity recommendations
    if (predictions.connectivityHealth && !predictions.connectivityHealth.error) {
      if (predictions.connectivityHealth.connectivityStatus === 'poor') {
        recommendations.push({
          type: 'connectivity',
          urgency: 'high',
          message: 'Poor connectivity affecting data transmission reliability',
          action: 'inspect_device_positioning_and_antenna'
        });
      }
    }

    // Maintenance recommendations
    if (predictions.maintenanceNeeds && !predictions.maintenanceNeeds.error) {
      if (predictions.maintenanceNeeds.priority === 'critical') {
        recommendations.push({
          type: 'maintenance',
          urgency: 'critical',
          message: 'Device requires immediate maintenance attention',
          action: 'schedule_emergency_maintenance'
        });
      }
    }

    return recommendations.length > 0 ? recommendations : [{
      type: 'status',
      urgency: 'none',
      message: 'Device operating within normal parameters',
      action: 'continue_monitoring'
    }];
  }

  calculatePredictionConfidence(historicalData) {
    if (historicalData.length < 12) return 'low';
    if (historicalData.length < 48) return 'medium';
    return 'high';
  }

  async getAdvancedAnalytics() {
    return {
      timestamp: new Date(),
      analyticsEngine: Object.fromEntries(this.analyticsEngine),
      predictionModels: Object.fromEntries(this.predictionModels),
      qualityMetrics: Object.fromEntries(this.qualityMetrics),
      systemHealth: await this.assessSystemHealth()
    };
  }

  async assessSystemHealth() {
    return {
      ingestionRate: 'optimal',
      queryPerformance: 'good',
      storageUtilization: 'normal',
      alertingSystem: 'operational',
      predictiveModels: 'trained'
    };
  }
}

// Export the production time series processor
module.exports = { ProductionIoTTimeSeriesProcessor };

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB time series operations:

-- QueryLeaf time series operations with SQL-familiar syntax for MongoDB

-- Create time series collection with SQL-style DDL
CREATE TIME SERIES COLLECTION sensor_readings (
  timestamp TIMESTAMP PRIMARY KEY,
  device OBJECT AS metaField (
    device_id VARCHAR(50),
    sensor_type VARCHAR(50),
    installation_location VARCHAR(200),
    device_model VARCHAR(100)
  ),
  measurements OBJECT (
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2), 
    pressure DECIMAL(7,2),
    battery_level DECIMAL(5,2),
    signal_strength INTEGER
  ),
  location OBJECT (
    coordinates ARRAY[2] OF DECIMAL(11,8),
    altitude DECIMAL(8,2)
  ),
  quality OBJECT (
    data_quality_score DECIMAL(3,2) DEFAULT 1.0,
    anomaly_detected BOOLEAN DEFAULT FALSE,
    validation_flags JSON
  )
)
WITH (
  granularity = 'minutes',
  bucket_max_span_seconds = 3600,
  compression = 'zstd',
  retention_days = 90
);

-- Insert time series data using familiar SQL syntax
INSERT INTO sensor_readings (timestamp, device, measurements, quality)
VALUES 
  (
    CURRENT_TIMESTAMP,
    JSON_OBJECT(
      'device_id', 'sensor_001',
      'sensor_type', 'environmental', 
      'installation_location', 'Warehouse A',
      'device_model', 'TempHumid Pro'
    ),
    JSON_OBJECT(
      'temperature', 23.5,
      'humidity', 45.2,
      'pressure', 1013.25,
      'battery_level', 87.5,
      'signal_strength', -65
    ),
    JSON_OBJECT(
      'data_quality_score', 0.95,
      'anomaly_detected', FALSE
    )
  );

-- Time-based queries with SQL window functions and time series optimizations
WITH recent_readings AS (
  SELECT 
    device->>'device_id' as device_id,
    device->>'installation_location' as location,
    timestamp,
    measurements->>'temperature'::DECIMAL as temperature,
    measurements->>'humidity'::DECIMAL as humidity,
    measurements->>'battery_level'::DECIMAL as battery_level,
    quality->>'data_quality_score'::DECIMAL as data_quality

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND quality->>'data_quality_score'::DECIMAL >= 0.8
),

time_series_analysis AS (
  SELECT 
    device_id,
    location,
    timestamp,
    temperature,
    humidity,
    battery_level,

    -- Time series window functions for trend analysis
    AVG(temperature) OVER (
      PARTITION BY device_id 
      ORDER BY timestamp 
      ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) as temperature_12_point_avg,

    LAG(temperature, 1) OVER (
      PARTITION BY device_id 
      ORDER BY timestamp
    ) as prev_temperature,

    -- Calculate rate of change
    (temperature - LAG(temperature, 1) OVER (
      PARTITION BY device_id ORDER BY timestamp
    )) / EXTRACT(EPOCH FROM (
      timestamp - LAG(timestamp, 1) OVER (
        PARTITION BY device_id ORDER BY timestamp
      )
    )) * 3600 as temp_change_per_hour,

    -- Moving standard deviation for anomaly detection
    STDDEV(temperature) OVER (
      PARTITION BY device_id 
      ORDER BY timestamp 
      ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
    ) as temperature_24_point_stddev,

    -- Battery drain analysis
    FIRST_VALUE(battery_level) OVER (
      PARTITION BY device_id 
      ORDER BY timestamp 
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) - battery_level as total_battery_drain,

    -- Time-based calculations
    EXTRACT(HOUR FROM timestamp) as hour_of_day,
    EXTRACT(DOW FROM timestamp) as day_of_week,
    DATE_TRUNC('hour', timestamp) as hour_bucket

  FROM recent_readings
),

hourly_aggregations AS (
  SELECT 
    hour_bucket,
    location,
    COUNT(*) as reading_count,

    -- Statistical aggregations optimized for time series
    AVG(temperature) as avg_temperature,
    MIN(temperature) as min_temperature,
    MAX(temperature) as max_temperature,
    STDDEV(temperature) as temp_stddev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) as temp_median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY temperature) as temp_p95,

    AVG(humidity) as avg_humidity,
    STDDEV(humidity) as humidity_stddev,

    -- Battery analysis
    AVG(battery_level) as avg_battery_level,
    MIN(battery_level) as min_battery_level,
    COUNT(*) FILTER (WHERE battery_level < 20) as low_battery_readings,

    -- Data quality metrics
    AVG(data_quality) as avg_data_quality,
    COUNT(*) FILTER (WHERE data_quality < 0.9) as poor_quality_readings,

    -- Anomaly detection indicators
    COUNT(*) FILTER (WHERE ABS(temp_change_per_hour) > 5) as rapid_temp_changes,
    COUNT(*) FILTER (WHERE ABS(temperature - temperature_12_point_avg) > 2 * temperature_24_point_stddev) as statistical_anomalies

  FROM time_series_analysis
  GROUP BY hour_bucket, location
),

location_health_analysis AS (
  SELECT 
    location,

    -- Time range analysis
    MIN(hour_bucket) as analysis_start,
    MAX(hour_bucket) as analysis_end,
    COUNT(*) as total_hours,

    -- Environmental conditions
    AVG(avg_temperature) as overall_avg_temperature,
    MAX(max_temperature) as peak_temperature,
    MIN(min_temperature) as lowest_temperature,
    AVG(temp_stddev) as avg_temperature_variability,

    AVG(avg_humidity) as overall_avg_humidity,
    AVG(humidity_stddev) as avg_humidity_variability,

    -- Device health indicators
    AVG(avg_battery_level) as location_avg_battery,
    SUM(low_battery_readings) as total_low_battery_readings,
    AVG(avg_data_quality) as location_data_quality,
    SUM(poor_quality_readings) as total_poor_quality_readings,

    -- Anomaly aggregations
    SUM(rapid_temp_changes) as total_rapid_changes,
    SUM(statistical_anomalies) as total_anomalies,
    SUM(reading_count) as total_readings,

    -- Calculated health metrics
    (SUM(reading_count) - SUM(poor_quality_readings)) * 100.0 / NULLIF(SUM(reading_count), 0) as data_reliability_percent,

    CASE 
      WHEN AVG(avg_temperature) BETWEEN 18 AND 25 AND AVG(avg_humidity) BETWEEN 40 AND 60 THEN 'optimal'
      WHEN AVG(avg_temperature) BETWEEN 15 AND 30 AND AVG(avg_humidity) BETWEEN 30 AND 70 THEN 'acceptable'
      WHEN AVG(avg_temperature) BETWEEN 10 AND 35 AND AVG(avg_humidity) BETWEEN 20 AND 80 THEN 'suboptimal'
      ELSE 'extreme'
    END as environmental_classification,

    CASE 
      WHEN AVG(avg_battery_level) > 60 THEN 'healthy'
      WHEN AVG(avg_battery_level) > 30 THEN 'moderate'
      WHEN AVG(avg_battery_level) > 15 THEN 'concerning'
      ELSE 'critical'
    END as battery_health_status

  FROM hourly_aggregations
  GROUP BY location
)

SELECT 
  location,
  TO_CHAR(analysis_start, 'YYYY-MM-DD HH24:MI') as period_start,
  TO_CHAR(analysis_end, 'YYYY-MM-DD HH24:MI') as period_end,
  total_hours,

  -- Environmental metrics
  ROUND(overall_avg_temperature::NUMERIC, 2) as avg_temperature_c,
  ROUND(peak_temperature::NUMERIC, 2) as max_temperature_c,
  ROUND(lowest_temperature::NUMERIC, 2) as min_temperature_c,
  ROUND(avg_temperature_variability::NUMERIC, 2) as temperature_stability,

  ROUND(overall_avg_humidity::NUMERIC, 2) as avg_humidity_percent,
  ROUND(avg_humidity_variability::NUMERIC, 2) as humidity_stability,

  environmental_classification,

  -- Device health metrics
  ROUND(location_avg_battery::NUMERIC, 2) as avg_battery_percent,
  battery_health_status,
  total_low_battery_readings,

  -- Data quality metrics
  ROUND(location_data_quality::NUMERIC, 3) as avg_data_quality,
  ROUND(data_reliability_percent::NUMERIC, 2) as data_reliability_percent,
  total_poor_quality_readings,

  -- Anomaly metrics
  total_rapid_changes,
  total_anomalies,
  ROUND((total_anomalies * 100.0 / NULLIF(total_readings, 0))::NUMERIC, 3) as anomaly_rate_percent,

  -- Overall location health score
  (
    CASE environmental_classification
      WHEN 'optimal' THEN 25
      WHEN 'acceptable' THEN 20
      WHEN 'suboptimal' THEN 15
      ELSE 5
    END +
    CASE battery_health_status
      WHEN 'healthy' THEN 25
      WHEN 'moderate' THEN 20
      WHEN 'concerning' THEN 10
      ELSE 5
    END +
    CASE 
      WHEN data_reliability_percent >= 95 THEN 25
      WHEN data_reliability_percent >= 90 THEN 20
      WHEN data_reliability_percent >= 85 THEN 15
      ELSE 10
    END +
    CASE 
      WHEN total_anomalies * 100.0 / NULLIF(total_readings, 0) < 1 THEN 25
      WHEN total_anomalies * 100.0 / NULLIF(total_readings, 0) < 3 THEN 20
      WHEN total_anomalies * 100.0 / NULLIF(total_readings, 0) < 5 THEN 15
      ELSE 10
    END
  ) as location_health_score,

  -- Operational recommendations
  CASE 
    WHEN environmental_classification = 'extreme' THEN 'URGENT: Review environmental conditions'
    WHEN battery_health_status = 'critical' THEN 'URGENT: Multiple devices need battery replacement'
    WHEN data_reliability_percent < 90 THEN 'HIGH: Investigate data quality issues'
    WHEN total_anomalies * 100.0 / NULLIF(total_readings, 0) > 5 THEN 'MEDIUM: High anomaly rate needs investigation'
    WHEN environmental_classification = 'suboptimal' THEN 'LOW: Monitor environmental conditions'
    ELSE 'Location operating within normal parameters'
  END as operational_recommendation,

  total_readings

FROM location_health_analysis
ORDER BY location_health_score ASC, location;

-- Real-time alerting with time series optimizations
WITH latest_device_readings AS (
  SELECT DISTINCT ON (device->>'device_id')
    device->>'device_id' as device_id,
    device->>'installation_location' as location,
    device->>'sensor_type' as sensor_type,
    timestamp,
    measurements->>'temperature'::DECIMAL as temperature,
    measurements->>'humidity'::DECIMAL as humidity,
    measurements->>'battery_level'::DECIMAL as battery_level,
    measurements->>'signal_strength'::INTEGER as signal_strength,
    quality->>'anomaly_detected'::BOOLEAN as anomaly_detected,
    quality->>'data_quality_score'::DECIMAL as data_quality_score

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
  ORDER BY device->>'device_id', timestamp DESC
),

alert_conditions AS (
  SELECT 
    device_id,
    location,
    sensor_type,
    timestamp,
    temperature,
    humidity,
    battery_level,
    signal_strength,
    data_quality_score,

    -- Alert condition evaluation
    CASE 
      WHEN temperature > 50 OR temperature < -20 THEN 'temperature_extreme'
      WHEN humidity > 95 OR humidity < 5 THEN 'humidity_extreme'
      WHEN battery_level < 15 THEN 'battery_critical'
      WHEN signal_strength < -85 THEN 'signal_poor'
      WHEN anomaly_detected = TRUE THEN 'anomaly_detected'
      WHEN data_quality_score < 0.7 THEN 'data_quality_poor'
      WHEN timestamp < CURRENT_TIMESTAMP - INTERVAL '10 minutes' THEN 'device_offline'
    END as alert_type,

    -- Alert severity calculation
    CASE 
      WHEN battery_level < 10 OR timestamp < CURRENT_TIMESTAMP - INTERVAL '20 minutes' THEN 'critical'
      WHEN temperature > 45 OR temperature < -15 OR anomaly_detected THEN 'high'
      WHEN battery_level < 20 OR signal_strength < -80 THEN 'medium'
      ELSE 'low'
    END as alert_severity,

    -- Time since last reading
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp))/60 as minutes_since_reading

  FROM latest_device_readings
)

SELECT 
  device_id,
  location,
  sensor_type,
  alert_type,
  alert_severity,
  TO_CHAR(timestamp, 'YYYY-MM-DD HH24:MI:SS') as last_reading_time,
  ROUND(minutes_since_reading::NUMERIC, 1) as minutes_ago,

  -- Current sensor values
  temperature,
  humidity,
  battery_level,
  signal_strength,
  ROUND(data_quality_score::NUMERIC, 3) as data_quality,

  -- Alert message generation
  CASE alert_type
    WHEN 'temperature_extreme' THEN 
      FORMAT('Temperature %s°C exceeds safe operating range', temperature)
    WHEN 'humidity_extreme' THEN 
      FORMAT('Humidity %s%% is at extreme level', humidity)
    WHEN 'battery_critical' THEN 
      FORMAT('Battery level %s%% requires immediate replacement', battery_level)
    WHEN 'signal_poor' THEN 
      FORMAT('Signal strength %s dBm indicates connectivity issues', signal_strength)
    WHEN 'anomaly_detected' THEN 
      'Sensor anomaly detected in recent readings'
    WHEN 'data_quality_poor' THEN 
      FORMAT('Data quality score %s indicates sensor problems', data_quality_score)
    WHEN 'device_offline' THEN 
      FORMAT('Device offline for %s minutes', ROUND(minutes_since_reading::NUMERIC, 0))
    ELSE 'Unknown alert condition'
  END as alert_message,

  -- Recommended actions with time urgency
  CASE alert_type
    WHEN 'temperature_extreme' THEN 'Verify environmental conditions and sensor calibration within 2 hours'
    WHEN 'humidity_extreme' THEN 'Check sensor operation and environmental factors within 4 hours'
    WHEN 'battery_critical' THEN 'Replace battery immediately (within 24 hours)'
    WHEN 'signal_poor' THEN 'Inspect antenna and network infrastructure within 48 hours'
    WHEN 'anomaly_detected' THEN 'Investigate sensor readings and interference within 24 hours'
    WHEN 'data_quality_poor' THEN 'Perform sensor calibration within 48 hours'
    WHEN 'device_offline' THEN 'Check power and connectivity immediately'
    ELSE 'Monitor device status'
  END as recommended_action

FROM alert_conditions
WHERE alert_type IS NOT NULL
ORDER BY 
  CASE alert_severity
    WHEN 'critical' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  timestamp DESC;

-- Time series aggregation with downsampling for long-term analysis
WITH daily_device_summary AS (
  SELECT 
    DATE_TRUNC('day', timestamp) as day,
    device->>'device_id' as device_id,
    device->>'installation_location' as location,

    -- Daily statistical aggregations
    COUNT(*) as daily_reading_count,

    -- Temperature analysis
    AVG(measurements->>'temperature'::DECIMAL) as avg_temperature,
    MIN(measurements->>'temperature'::DECIMAL) as min_temperature,
    MAX(measurements->>'temperature'::DECIMAL) as max_temperature,
    STDDEV(measurements->>'temperature'::DECIMAL) as temp_daily_stddev,

    -- Humidity analysis
    AVG(measurements->>'humidity'::DECIMAL) as avg_humidity,
    STDDEV(measurements->>'humidity'::DECIMAL) as humidity_daily_stddev,

    -- Battery degradation tracking
    FIRST_VALUE(measurements->>'battery_level'::DECIMAL) OVER (
      PARTITION BY device->>'device_id', DATE_TRUNC('day', timestamp)
      ORDER BY timestamp ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as day_start_battery,

    LAST_VALUE(measurements->>'battery_level'::DECIMAL) OVER (
      PARTITION BY device->>'device_id', DATE_TRUNC('day', timestamp)
      ORDER BY timestamp ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING  
    ) as day_end_battery,

    -- Data quality metrics
    AVG(quality->>'data_quality_score'::DECIMAL) as avg_daily_data_quality,
    COUNT(*) FILTER (WHERE quality->>'anomaly_detected'::BOOLEAN = TRUE) as daily_anomaly_count,

    -- Connectivity metrics
    AVG(measurements->>'signal_strength'::INTEGER) as avg_signal_strength,
    COUNT(*) as expected_readings, -- Based on device configuration

    -- Environmental stability
    CASE 
      WHEN STDDEV(measurements->>'temperature'::DECIMAL) < 2 AND 
           STDDEV(measurements->>'humidity'::DECIMAL) < 5 THEN 'stable'
      WHEN STDDEV(measurements->>'temperature'::DECIMAL) < 5 AND 
           STDDEV(measurements->>'humidity'::DECIMAL) < 15 THEN 'moderate'
      ELSE 'variable'
    END as environmental_stability

  FROM sensor_readings
  WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY DATE_TRUNC('day', timestamp), device->>'device_id', device->>'installation_location'
),

device_trend_analysis AS (
  SELECT 
    device_id,
    location,

    -- Time period
    MIN(day) as analysis_start_date,
    MAX(day) as analysis_end_date,
    COUNT(*) as total_days,

    -- Reading consistency
    AVG(daily_reading_count) as avg_daily_readings,
    STDDEV(daily_reading_count) as reading_count_consistency,

    -- Temperature trends
    AVG(avg_temperature) as overall_avg_temperature,
    STDDEV(avg_temperature) as temperature_day_to_day_variation,

    -- Linear regression on daily averages for trend detection
    REGR_SLOPE(avg_temperature, EXTRACT(EPOCH FROM day)) * 86400 as temp_trend_per_day,
    REGR_R2(avg_temperature, EXTRACT(EPOCH FROM day)) as temp_trend_confidence,

    -- Battery degradation analysis  
    AVG(day_end_battery - day_start_battery) as avg_daily_battery_drain,

    -- Battery trend analysis
    REGR_SLOPE(day_end_battery, EXTRACT(EPOCH FROM day)) * 86400 as battery_trend_per_day,
    REGR_R2(day_end_battery, EXTRACT(EPOCH FROM day)) as battery_trend_confidence,

    -- Data quality trends
    AVG(avg_daily_data_quality) as overall_avg_data_quality,
    AVG(daily_anomaly_count) as avg_daily_anomalies,

    -- Connectivity trends
    AVG(avg_signal_strength) as overall_avg_signal,
    REGR_SLOPE(avg_signal_strength, EXTRACT(EPOCH FROM day)) * 86400 as signal_trend_per_day,

    -- Environmental stability assessment
    MODE() WITHIN GROUP (ORDER BY environmental_stability) as predominant_stability,
    COUNT(*) FILTER (WHERE environmental_stability = 'stable') * 100.0 / COUNT(*) as stable_days_percent

  FROM daily_device_summary
  GROUP BY device_id, location
)

SELECT 
  device_id,
  location,
  TO_CHAR(analysis_start_date, 'YYYY-MM-DD') as period_start,
  TO_CHAR(analysis_end_date, 'YYYY-MM-DD') as period_end,
  total_days,

  -- Reading consistency metrics
  ROUND(avg_daily_readings::NUMERIC, 1) as avg_daily_readings,
  CASE 
    WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.1 THEN 'very_consistent'
    WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.2 THEN 'consistent'
    WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.4 THEN 'variable'
    ELSE 'inconsistent'
  END as reading_consistency,

  -- Temperature analysis
  ROUND(overall_avg_temperature::NUMERIC, 2) as avg_temperature_c,
  ROUND(temperature_day_to_day_variation::NUMERIC, 2) as temp_daily_variation,
  ROUND((temp_trend_per_day * 30)::NUMERIC, 3) as temp_trend_per_month_c,
  CASE 
    WHEN temp_trend_confidence > 0.7 THEN 'high_confidence'
    WHEN temp_trend_confidence > 0.4 THEN 'medium_confidence'
    ELSE 'low_confidence'
  END as temp_trend_reliability,

  -- Battery health analysis
  ROUND(avg_daily_battery_drain::NUMERIC, 2) as avg_daily_drain_percent,
  ROUND((battery_trend_per_day * 30)::NUMERIC, 2) as battery_degradation_per_month,

  -- Estimated battery life (assuming linear degradation)
  CASE 
    WHEN battery_trend_per_day < -0.1 THEN 
      ROUND((50.0 / ABS(battery_trend_per_day))::NUMERIC, 0) -- Days until 50% from current
    ELSE NULL
  END as estimated_days_until_50_percent,

  CASE 
    WHEN battery_trend_per_day < -0.05 THEN 
      ROUND((85.0 / ABS(battery_trend_per_day))::NUMERIC, 0) -- Days until need replacement
    ELSE NULL
  END as estimated_days_until_replacement,

  -- Data quality assessment
  ROUND(overall_avg_data_quality::NUMERIC, 3) as avg_data_quality,
  ROUND(avg_daily_anomalies::NUMERIC, 1) as avg_daily_anomalies,

  -- Connectivity assessment
  ROUND(overall_avg_signal::NUMERIC, 0) as avg_signal_strength_dbm,
  CASE 
    WHEN signal_trend_per_day < -0.5 THEN 'degrading'
    WHEN signal_trend_per_day > 0.5 THEN 'improving'
    ELSE 'stable'
  END as signal_trend,

  -- Environmental assessment
  predominant_stability as environmental_stability,
  ROUND(stable_days_percent::NUMERIC, 1) as stable_days_percent,

  -- Overall device health scoring
  (
    -- Reading consistency (0-25 points)
    CASE 
      WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.1 THEN 25
      WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.2 THEN 20
      WHEN reading_count_consistency / NULLIF(avg_daily_readings, 0) < 0.4 THEN 15
      ELSE 10
    END +

    -- Battery health (0-25 points)
    CASE 
      WHEN battery_trend_per_day > -0.05 THEN 25
      WHEN battery_trend_per_day > -0.1 THEN 20
      WHEN battery_trend_per_day > -0.2 THEN 15
      ELSE 10
    END +

    -- Data quality (0-25 points)
    CASE 
      WHEN overall_avg_data_quality >= 0.95 THEN 25
      WHEN overall_avg_data_quality >= 0.90 THEN 20
      WHEN overall_avg_data_quality >= 0.85 THEN 15
      ELSE 10
    END +

    -- Environmental stability (0-25 points)
    CASE 
      WHEN stable_days_percent >= 80 THEN 25
      WHEN stable_days_percent >= 60 THEN 20
      WHEN stable_days_percent >= 40 THEN 15
      ELSE 10
    END
  ) as device_health_score,

  -- Maintenance recommendations
  CASE 
    WHEN battery_trend_per_day < -0.2 OR overall_avg_data_quality < 0.8 THEN 'URGENT: Schedule maintenance'
    WHEN battery_trend_per_day < -0.1 OR stable_days_percent < 50 THEN 'HIGH: Review device status'
    WHEN overall_avg_signal < -80 OR avg_daily_anomalies > 5 THEN 'MEDIUM: Monitor closely'
    ELSE 'LOW: Continue routine monitoring'
  END as maintenance_priority

FROM device_trend_analysis
WHERE total_days >= 7  -- Only analyze devices with sufficient data
ORDER BY device_health_score ASC, device_id;

-- QueryLeaf provides comprehensive time series capabilities:
-- 1. Native time series collection creation with SQL DDL syntax
-- 2. Optimized time-based queries with window functions and aggregations  
-- 3. Real-time alerting with complex condition evaluation
-- 4. Long-term trend analysis with statistical functions
-- 5. Automated data retention and lifecycle management
-- 6. High-performance ingestion optimized for IoT data patterns
-- 7. Advanced analytics with predictive capabilities
-- 8. Integration with MongoDB's time series optimizations
-- 9. Familiar SQL syntax for complex temporal operations
-- 10. Enterprise-grade scalability for high-volume IoT applications

Best Practices for MongoDB Time Series Implementation

IoT Data Architecture and Optimization Strategies

Essential practices for production MongoDB time series deployments:

Granularity Selection: Choose appropriate time series granularity based on data frequency and query patterns
Metadata Organization: Structure device metadata efficiently in the metaField for optimal bucketing and compression
Index Strategy: Create compound indexes on metaField components and timestamp for optimal query performance
Retention Policies: Implement TTL indexes and automated data archiving based on business requirements
Compression Optimization: Use zstd compression for maximum storage efficiency with time series data
Query Optimization: Design aggregation pipelines that leverage time series collection optimizations

Scalability and Production Deployment

Optimize time series collections for enterprise IoT requirements:

High-Throughput Ingestion: Configure write settings and batch sizes for optimal data ingestion rates
Real-Time Analytics: Implement efficient real-time query patterns that leverage time series optimizations
Predictive Analytics: Build statistical models using historical time series data for proactive maintenance
Multi-Tenant Architecture: Design time series schemas that support multiple device types and customers
Compliance Integration: Ensure time series data meets regulatory retention and audit requirements
Disaster Recovery: Implement backup and recovery strategies optimized for time series data volumes

Conclusion

MongoDB Time Series Collections provide comprehensive optimization for IoT sensor data management through native time-stamped data support, automatic compression algorithms, and specialized indexing strategies designed specifically for temporal workloads. The integrated approach eliminates complex manual partitioning while delivering superior performance for high-frequency data ingestion and time-based analytics operations.

Key MongoDB Time Series benefits for IoT applications include:

Native IoT Optimization: Purpose-built time series collections with automatic bucketing and compression for sensor data
High-Performance Ingestion: Optimized write paths capable of handling thousands of sensor readings per second
Intelligent Storage Management: Automatic data compression and retention policies that scale with IoT data volumes
Real-Time Analytics: Efficient time-based queries and aggregations optimized for recent data analysis and alerting
Predictive Capabilities: Advanced analytics support for device maintenance, trend analysis, and anomaly detection
SQL Compatibility: Familiar time series operations accessible through SQL-style interfaces for operational simplicity

Whether you're managing environmental sensors, industrial equipment monitoring, smart city infrastructure, or consumer IoT devices, MongoDB Time Series Collections with QueryLeaf's SQL-familiar interface provide the foundation for scalable IoT data architecture that maintains high performance while supporting sophisticated real-time analytics and predictive maintenance workflows.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB Time Series Collections while providing SQL-familiar syntax for time series data management, real-time analytics, and predictive maintenance operations. Advanced temporal query patterns, statistical analysis, and IoT-specific optimizations are seamlessly accessible through familiar SQL constructs, making sophisticated time series data management both powerful and approachable for SQL-oriented development teams.

The combination of MongoDB's specialized time series optimizations with familiar SQL-style operations makes it an ideal platform for IoT applications that require both high-performance temporal data processing and operational simplicity, ensuring your sensor data architecture scales efficiently while maintaining familiar development and analytical patterns.

November 22, 2025
23 min read

MongoDB Atlas Vector Search for AI and Embedding Similarity: Building Intelligent Search Applications with SQL-Compatible Vector Operations

Modern AI applications require sophisticated search capabilities that go beyond traditional keyword matching to understand semantic meaning, user intent, and content similarity. Traditional database search approaches struggle with high-dimensional vector data, semantic relationships, and the complex similarity calculations required for recommendation systems, content discovery, and AI-powered features.

MongoDB Atlas Vector Search provides native vector database capabilities that enable efficient storage, indexing, and querying of high-dimensional embeddings generated by machine learning models. Unlike traditional search engines that require separate vector databases and complex data synchronization, Atlas Vector Search integrates seamlessly with your existing MongoDB data while delivering enterprise-grade performance for AI applications.

The Traditional Vector Search Challenge

Implementing vector similarity search with conventional approaches creates significant architectural complexity and performance challenges:

-- Traditional PostgreSQL vector search - complex and limited

-- Attempting vector similarity with PostgreSQL extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Product embeddings table with limited vector support
CREATE TABLE product_embeddings (
    product_id BIGINT PRIMARY KEY,
    product_name VARCHAR(500) NOT NULL,
    description TEXT,
    category VARCHAR(100),
    price DECIMAL(10,2),

    -- Vector embeddings (limited to 2000 dimensions in pg_vector)
    title_embedding vector(384),        -- Limited dimensionality
    description_embedding vector(768),  -- Separate embeddings
    image_embedding vector(512),

    -- Traditional text search fallbacks
    search_vector tsvector,
    keywords TEXT[],

    -- Metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Manual similarity caching (performance workaround)
    similar_products JSONB,
    similarity_last_computed TIMESTAMP
);

-- Create vector indexes (limited optimization)
CREATE INDEX idx_title_embedding ON product_embeddings 
USING ivfflat (title_embedding vector_cosine_ops)
WITH (lists = 100);  -- Fixed index parameters

CREATE INDEX idx_description_embedding ON product_embeddings 
USING ivfflat (description_embedding vector_cosine_ops)
WITH (lists = 100);

-- Traditional text search as fallback
CREATE INDEX idx_search_vector ON product_embeddings 
USING GIN (search_vector);

-- Complex similarity search query with poor performance
WITH query_vector AS (
    SELECT vector('[0.1, 0.2, 0.3, ...]'::vector) as embedding
),
similarity_scores AS (
    SELECT 
        pe.product_id,
        pe.product_name,
        pe.description,
        pe.category,
        pe.price,

        -- Expensive similarity calculations
        1 - (pe.title_embedding <=> qv.embedding) as title_similarity,
        1 - (pe.description_embedding <=> qv.embedding) as desc_similarity,
        1 - (pe.image_embedding <=> qv.embedding) as image_similarity,

        -- Combined scoring (manual implementation)
        (
            (1 - (pe.title_embedding <=> qv.embedding)) * 0.4 +
            (1 - (pe.description_embedding <=> qv.embedding)) * 0.4 +
            (1 - (pe.image_embedding <=> qv.embedding)) * 0.2
        ) as combined_similarity_score,

        -- Traditional text relevance as fallback
        ts_rank_cd(pe.search_vector, plainto_tsquery('search query')) as text_relevance,

        -- Distance calculations for debugging
        pe.title_embedding <=> qv.embedding as title_distance,
        pe.description_embedding <=> qv.embedding as desc_distance

    FROM product_embeddings pe
    CROSS JOIN query_vector qv
    WHERE 
        -- Pre-filtering to reduce computation (limited effectiveness)
        pe.category IN ('electronics', 'clothing', 'books')
        AND pe.price BETWEEN 10 AND 1000
        AND pe.updated_at >= CURRENT_DATE - INTERVAL '1 year'
),

ranked_results AS (
    SELECT *,
        -- Manual ranking logic
        ROW_NUMBER() OVER (ORDER BY combined_similarity_score DESC) as similarity_rank,
        ROW_NUMBER() OVER (ORDER BY text_relevance DESC) as text_rank,

        -- Hybrid scoring attempt
        (combined_similarity_score * 0.7 + text_relevance * 0.3) as hybrid_score

    FROM similarity_scores
    WHERE combined_similarity_score > 0.6  -- Arbitrary threshold
)

SELECT 
    product_id,
    product_name,
    category,
    price,

    -- Similarity metrics
    ROUND(combined_similarity_score::NUMERIC, 4) as similarity_score,
    ROUND(title_similarity::NUMERIC, 4) as title_sim,
    ROUND(desc_similarity::NUMERIC, 4) as desc_sim,
    ROUND(image_similarity::NUMERIC, 4) as image_sim,

    -- Ranking information
    similarity_rank,
    hybrid_score,

    -- Performance debugging
    title_distance as debug_title_dist,
    desc_distance as debug_desc_dist

FROM ranked_results
ORDER BY hybrid_score DESC
LIMIT 20;

-- Problems with traditional vector search approaches:
-- 1. Limited vector dimensionality and poor performance scaling
-- 2. Complex manual similarity calculations and scoring logic
-- 3. No native support for advanced similarity algorithms
-- 4. Poor integration with existing application data
-- 5. Manual index optimization and maintenance
-- 6. Limited filtering capabilities during vector search
-- 7. No built-in support for multiple embedding models
-- 8. Complex hybrid search implementation
-- 9. Poor performance with large vector datasets
-- 10. Limited support for real-time embedding updates

-- Attempt at recommendation system (extremely inefficient)
WITH user_preferences AS (
    SELECT 
        user_id,
        -- Compute average embedding from user's purchase history
        AVG(pe.title_embedding::vector) as preference_embedding,
        COUNT(*) as purchase_count,
        ARRAY_AGG(DISTINCT pe.category) as preferred_categories
    FROM user_purchases up
    JOIN product_embeddings pe ON pe.product_id = up.product_id
    WHERE up.purchase_date >= CURRENT_DATE - INTERVAL '6 months'
    GROUP BY user_id
    HAVING COUNT(*) >= 5  -- Minimum purchase history
),

candidate_products AS (
    SELECT DISTINCT
        pe.product_id,
        pe.product_name,
        pe.category,
        pe.price,
        pe.title_embedding,
        pe.description_embedding
    FROM product_embeddings pe
    WHERE pe.product_id NOT IN (
        -- Exclude already purchased products
        SELECT product_id 
        FROM user_purchases 
        WHERE user_id = $1 
        AND purchase_date >= CURRENT_DATE - INTERVAL '3 months'
    )
),

recommendations AS (
    SELECT 
        up.user_id,
        cp.product_id,
        cp.product_name,
        cp.category,
        cp.price,

        -- Expensive similarity calculation for each user-product pair
        1 - (up.preference_embedding <=> cp.title_embedding) as title_preference_sim,
        1 - (up.preference_embedding <=> cp.description_embedding) as desc_preference_sim,

        -- Category matching bonus
        CASE 
            WHEN cp.category = ANY(up.preferred_categories) THEN 0.2
            ELSE 0.0
        END as category_bonus,

        -- Purchase history influence
        up.purchase_count,

        -- Combined recommendation score
        (
            (1 - (up.preference_embedding <=> cp.title_embedding)) * 0.5 +
            (1 - (up.preference_embedding <=> cp.description_embedding)) * 0.3 +
            CASE WHEN cp.category = ANY(up.preferred_categories) THEN 0.2 ELSE 0.0 END
        ) as recommendation_score

    FROM user_preferences up
    CROSS JOIN candidate_products cp
    WHERE cp.category = ANY(up.preferred_categories)  -- Basic filtering
)

SELECT 
    user_id,
    product_id,
    product_name,
    category,
    price,
    ROUND(recommendation_score::NUMERIC, 4) as score,
    ROUND(title_preference_sim::NUMERIC, 4) as title_sim,
    ROUND(desc_preference_sim::NUMERIC, 4) as desc_sim
FROM recommendations
WHERE recommendation_score > 0.5
ORDER BY user_id, recommendation_score DESC;

-- This approach is extremely slow and doesn't scale beyond small datasets
-- Vector operations are not optimized for recommendation workloads
-- Manual preference modeling lacks sophistication
-- No support for real-time recommendation updates
-- Limited ability to incorporate multiple signals and features

MongoDB Atlas Vector Search provides comprehensive vector database capabilities with enterprise performance:

// MongoDB Atlas Vector Search - advanced AI-powered search capabilities
const { MongoClient } = require('mongodb');

class AtlasVectorSearchManager {
  constructor() {
    this.client = null;
    this.db = null;
    this.searchIndexes = new Map();
    this.embeddingModels = new Map();
    this.searchPerformanceMetrics = new Map();
  }

  async initialize() {
    console.log('Initializing MongoDB Atlas Vector Search Manager...');

    // Connect to Atlas with vector search optimization
    this.client = new MongoClient(process.env.MONGODB_ATLAS_URI, {
      // Optimized connection settings for vector operations
      maxPoolSize: 20,
      minPoolSize: 5,
      maxIdleTimeMS: 30000,

      // Read preference for vector search workloads
      readPreference: 'primary',
      readConcern: { level: 'local' },

      // Compression for large vector payloads
      compression: ['zlib', 'snappy'],

      appName: 'VectorSearchApplication'
    });

    await this.client.connect();
    this.db = this.client.db('ai_application');

    // Initialize vector search indexes and models
    await this.setupVectorSearchIndexes();
    await this.initializeEmbeddingModels();

    console.log('✅ Atlas Vector Search Manager initialized');
  }

  async setupVectorSearchIndexes() {
    console.log('Setting up Atlas Vector Search indexes...');

    const productsCollection = this.db.collection('products');

    // Create comprehensive vector search index
    const productVectorIndex = {
      name: 'products_vector_search',
      type: 'vectorSearch',
      definition: {
        // Multi-field vector search configuration
        fields: [
          {
            // Primary product embedding for semantic search
            type: 'vector',
            path: 'embeddings.combined',
            numDimensions: 1536,          // OpenAI ada-002 dimensions
            similarity: 'cosine'           // Cosine similarity for semantic search
          },
          {
            // Title-specific embedding for title-focused search
            type: 'vector', 
            path: 'embeddings.title',
            numDimensions: 384,            // Sentence transformers dimensions
            similarity: 'euclidean'
          },
          {
            // Description embedding for content-based search
            type: 'vector',
            path: 'embeddings.description', 
            numDimensions: 768,            // BERT-based embeddings
            similarity: 'dotProduct'
          },
          {
            // Visual embedding for image similarity
            type: 'vector',
            path: 'embeddings.image',
            numDimensions: 512,            // Vision transformer embeddings
            similarity: 'cosine'
          },

          // Filterable fields for hybrid search
          {
            type: 'filter',
            path: 'category'
          },
          {
            type: 'filter', 
            path: 'brand'
          },
          {
            type: 'filter',
            path: 'pricing.basePrice'
          },
          {
            type: 'filter',
            path: 'availability.isActive'
          },
          {
            type: 'filter',
            path: 'ratings.averageRating'
          },
          {
            type: 'filter',
            path: 'metadata.tags'
          }
        ]
      }
    };

    // Create user preferences vector index
    const userPreferencesIndex = {
      name: 'user_preferences_vector_search',
      type: 'vectorSearch', 
      definition: {
        fields: [
          {
            // User preference embedding for personalization
            type: 'vector',
            path: 'preferences.embedding',
            numDimensions: 1536,
            similarity: 'cosine'
          },
          {
            // Session-based embedding for short-term preferences
            type: 'vector',
            path: 'preferences.sessionEmbedding', 
            numDimensions: 384,
            similarity: 'cosine'
          },

          // User demographic and behavioral filters
          {
            type: 'filter',
            path: 'demographics.ageRange'
          },
          {
            type: 'filter',
            path: 'demographics.location'
          },
          {
            type: 'filter', 
            path: 'behavior.purchaseFrequency'
          },
          {
            type: 'filter',
            path: 'preferences.categories'
          }
        ]
      }
    };

    // Store index configurations
    this.searchIndexes.set('products', productVectorIndex);
    this.searchIndexes.set('user_preferences', userPreferencesIndex);

    console.log('✅ Vector search indexes configured');
  }

  async initializeEmbeddingModels() {
    console.log('Initializing embedding models...');

    // Configure different embedding models for different use cases
    const embeddingConfigs = {
      'openai-ada-002': {
        provider: 'openai',
        model: 'text-embedding-ada-002',
        dimensions: 1536,
        maxTokens: 8192,
        useCase: 'general_semantic_search',
        costPerToken: 0.0001
      },

      'sentence-transformers': {
        provider: 'huggingface',
        model: 'all-MiniLM-L6-v2', 
        dimensions: 384,
        maxTokens: 256,
        useCase: 'title_and_short_text',
        costPerToken: 0.0  // Free local model
      },

      'cohere-embed-v3': {
        provider: 'cohere',
        model: 'embed-english-v3.0',
        dimensions: 1024,
        maxTokens: 512,
        useCase: 'multilingual_content',
        costPerToken: 0.0001
      },

      'vision-transformer': {
        provider: 'openai',
        model: 'clip-vit-base-patch32',
        dimensions: 512,
        useCase: 'image_similarity',
        costPerToken: 0.0002
      }
    };

    for (const [modelName, config] of Object.entries(embeddingConfigs)) {
      this.embeddingModels.set(modelName, config);
    }

    console.log('✅ Embedding models initialized');
  }

  async performSemanticProductSearch(queryText, options = {}) {
    console.log(`Performing semantic product search: "${queryText}"`);

    const startTime = Date.now();

    // Generate query embedding using configured model
    const queryEmbedding = await this.generateEmbedding(
      queryText, 
      options.embeddingModel || 'openai-ada-002'
    );

    const productsCollection = this.db.collection('products');

    // Construct vector search pipeline with advanced filtering
    const searchPipeline = [
      {
        $vectorSearch: {
          index: 'products_vector_search',
          path: 'embeddings.combined',
          queryVector: queryEmbedding,
          numCandidates: options.numCandidates || 1000,
          limit: options.limit || 20,

          // Advanced filtering during vector search
          filter: {
            $and: [
              { 'availability.isActive': true },
              ...(options.categories ? [{ category: { $in: options.categories } }] : []),
              ...(options.priceRange ? [{ 
                'pricing.basePrice': { 
                  $gte: options.priceRange.min, 
                  $lte: options.priceRange.max 
                }
              }] : []),
              ...(options.minRating ? [{ 
                'ratings.averageRating': { $gte: options.minRating }
              }] : []),
              ...(options.brands ? [{ brand: { $in: options.brands } }] : [])
            ]
          }
        }
      },

      // Add similarity score and additional fields
      {
        $addFields: {
          vectorSearchScore: { $meta: 'vectorSearchScore' },

          // Calculate additional similarity metrics
          titleRelevance: {
            $function: {
              body: function(title, query) {
                // Custom relevance scoring function
                const titleLower = title.toLowerCase();
                const queryLower = query.toLowerCase();
                const words = queryLower.split(/\s+/);
                let score = 0;

                words.forEach(word => {
                  if (titleLower.includes(word)) {
                    score += word.length / title.length;
                  }
                });

                return Math.min(score, 1.0);
              },
              args: ['$name', queryText],
              lang: 'js'
            }
          },

          // Boost scoring based on business rules
          businessBoost: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$featured', true] },
                  then: 0.2  // Featured products get boost
                },
                {
                  case: { $gte: ['$inventory.stockQuantity', 100] },
                  then: 0.1  // Well-stocked items get boost
                },
                {
                  case: { $gte: ['$ratings.averageRating', 4.5] },
                  then: 0.15 // Highly rated products get boost
                }
              ],
              default: 0.0
            }
          }
        }
      },

      // Calculate final relevance score
      {
        $addFields: {
          finalRelevanceScore: {
            $add: [
              { $multiply: ['$vectorSearchScore', 0.7] },  // Vector similarity weight
              { $multiply: ['$titleRelevance', 0.2] },     // Title relevance weight
              '$businessBoost'                             // Business rule boost
            ]
          }
        }
      },

      // Re-sort by final relevance score
      {
        $sort: { finalRelevanceScore: -1 }
      },

      // Project final result structure
      {
        $project: {
          productId: '$_id',
          name: 1,
          description: 1,
          category: 1,
          brand: 1,
          pricing: 1,
          ratings: 1,
          images: 1,
          availability: 1,

          // Search relevance metrics
          relevance: {
            vectorScore: '$vectorSearchScore',
            titleRelevance: '$titleRelevance', 
            businessBoost: '$businessBoost',
            finalScore: '$finalRelevanceScore'
          },

          // Additional context
          searchContext: {
            query: queryText,
            embeddingModel: options.embeddingModel || 'openai-ada-002',
            searchTimestamp: new Date()
          }
        }
      }
    ];

    // Execute search with performance tracking
    const searchResults = await productsCollection.aggregate(searchPipeline).toArray();

    const searchDuration = Date.now() - startTime;

    // Track search performance metrics
    await this.trackSearchMetrics({
      queryText: queryText,
      resultsCount: searchResults.length,
      searchDurationMs: searchDuration,
      embeddingModel: options.embeddingModel || 'openai-ada-002',
      filters: options
    });

    console.log(`✅ Semantic search completed: ${searchResults.length} results in ${searchDuration}ms`);

    return {
      results: searchResults,
      metadata: {
        query: queryText,
        totalResults: searchResults.length,
        searchDurationMs: searchDuration,
        embeddingModel: options.embeddingModel || 'openai-ada-002',
        searchTimestamp: new Date()
      }
    };
  }

  async generatePersonalizedRecommendations(userId, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    const startTime = Date.now();

    // Get user preference embedding
    const userProfile = await this.getUserPreferenceEmbedding(userId);

    if (!userProfile || !userProfile.preferences?.embedding) {
      console.log('No user preference data available, falling back to popularity-based recommendations');
      return await this.getPopularityBasedRecommendations(options);
    }

    const productsCollection = this.db.collection('products');

    // Generate recommendations using vector similarity
    const recommendationPipeline = [
      {
        $vectorSearch: {
          index: 'products_vector_search',
          path: 'embeddings.combined',
          queryVector: userProfile.preferences.embedding,
          numCandidates: options.numCandidates || 2000,
          limit: options.limit || 50,

          // Exclude previously purchased/viewed products
          filter: {
            $and: [
              { 'availability.isActive': true },
              { '_id': { $not: { $in: userProfile.excludeProductIds || [] } } },
              ...(options.categories ? [{ category: { $in: options.categories } }] : []),
              ...(userProfile.preferences?.priceRange ? [{ 
                'pricing.basePrice': { 
                  $gte: userProfile.preferences.priceRange.min,
                  $lte: userProfile.preferences.priceRange.max
                }
              }] : [])
            ]
          }
        }
      },

      // Add personalization scoring
      {
        $addFields: {
          vectorSimilarity: { $meta: 'vectorSearchScore' },

          // Category preference matching
          categoryPreferenceScore: {
            $switch: {
              branches: userProfile.preferences.categories?.map(cat => ({
                case: { $eq: ['$category', cat.name] },
                then: cat.score || 0.5
              })) || [],
              default: 0.1
            }
          },

          // Brand preference scoring
          brandPreferenceScore: {
            $cond: {
              if: { $in: ['$brand', userProfile.preferences.brands || []] },
              then: 0.3,
              else: 0.0
            }
          },

          // Price preference scoring
          pricePreferenceScore: {
            $cond: {
              if: {
                $and: [
                  { $gte: ['$pricing.basePrice', userProfile.preferences.priceRange?.min || 0] },
                  { $lte: ['$pricing.basePrice', userProfile.preferences.priceRange?.max || 999999] }
                ]
              },
              then: 0.2,
              else: 0.0
            }
          },

          // Recency bias for trending products
          recencyScore: {
            $cond: {
              if: { $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] },
              then: 0.1,
              else: 0.0
            }
          }
        }
      },

      // Calculate final recommendation score
      {
        $addFields: {
          recommendationScore: {
            $add: [
              { $multiply: ['$vectorSimilarity', 0.5] },        // Vector similarity weight
              { $multiply: ['$categoryPreferenceScore', 0.2] }, // Category preference
              '$brandPreferenceScore',                          // Brand preference
              '$pricePreferenceScore',                          // Price preference  
              '$recencyScore'                                   // Recency boost
            ]
          }
        }
      },

      // Sort by final recommendation score
      {
        $sort: { recommendationScore: -1 }
      },

      // Limit to requested number of recommendations
      {
        $limit: options.limit || 20
      },

      // Project final recommendation structure
      {
        $project: {
          productId: '$_id',
          name: 1,
          description: 1,
          category: 1,
          brand: 1,
          pricing: 1,
          ratings: 1,
          images: 1,

          // Recommendation scoring details
          recommendation: {
            score: '$recommendationScore',
            vectorSimilarity: '$vectorSimilarity',
            categoryMatch: '$categoryPreferenceScore',
            brandMatch: '$brandPreferenceScore',
            priceMatch: '$pricePreferenceScore',
            recencyBoost: '$recencyScore',
            reason: {
              $switch: {
                branches: [
                  {
                    case: { $gt: ['$categoryPreferenceScore', 0.3] },
                    then: 'Based on your interest in this category'
                  },
                  {
                    case: { $gt: ['$brandPreferenceScore', 0.2] },
                    then: 'From a brand you like'
                  },
                  {
                    case: { $gt: ['$vectorSimilarity', 0.8] },
                    then: 'Similar to products you\'ve liked'
                  }
                ],
                default: 'Recommended for you'
              }
            }
          },

          // Recommendation metadata
          recommendationContext: {
            userId: userId,
            basedOn: 'user_preferences',
            generatedAt: new Date()
          }
        }
      }
    ];

    const recommendations = await productsCollection.aggregate(recommendationPipeline).toArray();

    const generationDuration = Date.now() - startTime;

    console.log(`✅ Generated ${recommendations.length} personalized recommendations in ${generationDuration}ms`);

    return {
      recommendations: recommendations,
      userProfile: {
        userId: userId,
        preferences: userProfile.preferences,
        excludedProducts: userProfile.excludeProductIds?.length || 0
      },
      metadata: {
        generationDurationMs: generationDuration,
        totalRecommendations: recommendations.length,
        algorithm: 'vector_similarity_personalized',
        generatedAt: new Date()
      }
    };
  }

  async performHybridSearch(queryText, userId, options = {}) {
    console.log(`Performing hybrid search for query: "${queryText}", user: ${userId}`);

    // Execute both semantic search and personalized recommendations
    const [semanticResults, personalizedResults] = await Promise.all([
      this.performSemanticProductSearch(queryText, {
        ...options,
        limit: Math.ceil((options.limit || 20) * 0.7)  // 70% semantic results
      }),
      userId ? this.generatePersonalizedRecommendations(userId, {
        ...options,
        limit: Math.ceil((options.limit || 20) * 0.3)  // 30% personalized results
      }) : Promise.resolve({ recommendations: [] })
    ]);

    // Merge and re-rank results
    const hybridResults = this.mergeAndRankHybridResults(
      semanticResults.results,
      personalizedResults.recommendations || [],
      options
    );

    return {
      results: hybridResults,
      sources: {
        semantic: semanticResults.results.length,
        personalized: personalizedResults.recommendations?.length || 0
      },
      metadata: {
        query: queryText,
        userId: userId,
        algorithm: 'hybrid_semantic_personalized',
        searchTimestamp: new Date()
      }
    };
  }

  async generateEmbedding(text, modelName) {
    const model = this.embeddingModels.get(modelName);
    if (!model) {
      throw new Error(`Unknown embedding model: ${modelName}`);
    }

    // Implementation would integrate with actual embedding service
    // This is a placeholder for the actual embedding generation
    console.log(`Generating embedding with ${modelName} for text: ${text.substring(0, 50)}...`);

    // Return mock embedding vector for demonstration
    return Array.from({ length: model.dimensions }, () => Math.random() - 0.5);
  }

  async getUserPreferenceEmbedding(userId) {
    const userPreferencesCollection = this.db.collection('user_preferences');

    const userProfile = await userPreferencesCollection.findOne(
      { userId: userId },
      { 
        projection: {
          preferences: 1,
          excludeProductIds: 1,
          lastUpdated: 1
        }
      }
    );

    return userProfile;
  }

  async trackSearchMetrics(metrics) {
    const searchMetricsCollection = this.db.collection('search_metrics');

    await searchMetricsCollection.insertOne({
      ...metrics,
      timestamp: new Date()
    });

    // Update performance tracking
    if (!this.searchPerformanceMetrics.has(metrics.embeddingModel)) {
      this.searchPerformanceMetrics.set(metrics.embeddingModel, {
        totalSearches: 0,
        totalDurationMs: 0,
        avgResults: 0
      });
    }

    const modelMetrics = this.searchPerformanceMetrics.get(metrics.embeddingModel);
    modelMetrics.totalSearches++;
    modelMetrics.totalDurationMs += metrics.searchDurationMs;
    modelMetrics.avgResults = (modelMetrics.avgResults + metrics.resultsCount) / 2;
  }

  mergeAndRankHybridResults(semanticResults, personalizedResults, options) {
    // Combine results with hybrid scoring
    const combinedResults = new Map();

    // Add semantic results with base score
    semanticResults.forEach((result, index) => {
      combinedResults.set(result.productId.toString(), {
        ...result,
        hybridScore: (result.relevance?.finalScore || 0) * 0.7 + (1 - index / semanticResults.length) * 0.3,
        sources: ['semantic']
      });
    });

    // Add personalized results, boosting score if already present
    personalizedResults.forEach((result, index) => {
      const productId = result.productId.toString();
      const personalizedScore = (result.recommendation?.score || 0) * 0.6 + (1 - index / personalizedResults.length) * 0.4;

      if (combinedResults.has(productId)) {
        // Boost existing result
        const existing = combinedResults.get(productId);
        existing.hybridScore = existing.hybridScore * 0.8 + personalizedScore * 0.2;
        existing.sources.push('personalized');
        existing.personalization = result.recommendation;
      } else {
        // Add new personalized result
        combinedResults.set(productId, {
          ...result,
          hybridScore: personalizedScore * 0.8,  // Slightly lower weight for pure personalized
          sources: ['personalized'],
          relevance: { finalScore: personalizedScore }
        });
      }
    });

    // Convert to array and sort by hybrid score
    return Array.from(combinedResults.values())
      .sort((a, b) => b.hybridScore - a.hybridScore)
      .slice(0, options.limit || 20);
  }

  async getSearchAnalytics() {
    const searchMetricsCollection = this.db.collection('search_metrics');

    const analytics = await searchMetricsCollection.aggregate([
      {
        $match: {
          timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
        }
      },
      {
        $group: {
          _id: '$embeddingModel',
          totalSearches: { $sum: 1 },
          avgDuration: { $avg: '$searchDurationMs' },
          avgResults: { $avg: '$resultsCount' },
          minDuration: { $min: '$searchDurationMs' },
          maxDuration: { $max: '$searchDurationMs' }
        }
      },
      {
        $sort: { totalSearches: -1 }
      }
    ]).toArray();

    return {
      timestamp: new Date(),
      period: '24_hours',
      models: analytics
    };
  }

  async shutdown() {
    console.log('Shutting down Atlas Vector Search Manager...');

    if (this.client) {
      await this.client.close();
      console.log('✅ MongoDB Atlas connection closed');
    }

    this.searchIndexes.clear();
    this.embeddingModels.clear();
    this.searchPerformanceMetrics.clear();
  }
}

// Export the Atlas Vector Search manager
module.exports = { AtlasVectorSearchManager };

// Benefits of MongoDB Atlas Vector Search:
// - Native vector database capabilities integrated with document data
// - High-performance vector indexing and similarity search
// - Advanced filtering during vector search operations
// - Multiple embedding model support with configurable algorithms
// - Hybrid search combining semantic and traditional approaches
// - Real-time personalization with user preference embeddings
// - Enterprise-grade scalability and performance optimization
// - Comprehensive analytics and performance monitoring
// - SQL-compatible vector operations through QueryLeaf integration
// - Zero additional infrastructure for vector search capabilities

Understanding MongoDB Atlas Vector Search Architecture

Advanced Vector Search Implementation Patterns

Implement sophisticated vector search strategies for different AI application scenarios:

// Advanced Atlas Vector Search patterns for enterprise AI applications
class EnterpriseVectorSearchOrchestrator {
  constructor() {
    this.searchStrategies = new Map();
    this.embeddingPipelines = new Map();
    this.performanceOptimizer = new Map();
    this.cacheManager = new Map();
  }

  async initializeSearchStrategies() {
    console.log('Initializing enterprise vector search strategies...');

    const strategies = {
      // E-commerce product discovery
      'product_discovery': {
        primaryEmbeddingModel: 'openai-ada-002',
        fallbackEmbeddingModel: 'sentence-transformers',

        searchConfiguration: {
          numCandidates: 2000,
          similarity: 'cosine',
          indexName: 'products_vector_search',

          scoringWeights: {
            vectorSimilarity: 0.6,
            textRelevance: 0.2,
            popularityBoost: 0.1,
            businessRules: 0.1
          },

          filterPriority: ['availability', 'category', 'priceRange', 'ratings'],
          resultDiversification: true
        },

        performanceTargets: {
          maxLatencyMs: 500,
          minResultCount: 10,
          maxResultCount: 50
        }
      },

      // Content recommendation system
      'content_recommendations': {
        primaryEmbeddingModel: 'cohere-embed-v3',
        fallbackEmbeddingModel: 'sentence-transformers',

        searchConfiguration: {
          numCandidates: 5000,
          similarity: 'dotProduct',
          indexName: 'content_vector_search',

          scoringWeights: {
            vectorSimilarity: 0.7,
            userEngagement: 0.15,
            recency: 0.1,
            contentQuality: 0.05
          },

          filterPriority: ['contentType', 'publishDate', 'userPreferences'],
          resultDiversification: false
        },

        performanceTargets: {
          maxLatencyMs: 300,
          minResultCount: 20,
          maxResultCount: 100
        }
      },

      // Customer support knowledge base
      'knowledge_search': {
        primaryEmbeddingModel: 'openai-ada-002',
        fallbackEmbeddingModel: 'sentence-transformers',

        searchConfiguration: {
          numCandidates: 1000,
          similarity: 'cosine',
          indexName: 'knowledge_vector_search',

          scoringWeights: {
            vectorSimilarity: 0.8,
            documentAuthority: 0.1,
            recency: 0.05,
            userFeedback: 0.05
          },

          filterPriority: ['category', 'difficulty', 'department'],
          resultDiversification: true
        },

        performanceTargets: {
          maxLatencyMs: 200,
          minResultCount: 5,
          maxResultCount: 15
        }
      },

      // Image similarity search
      'image_similarity': {
        primaryEmbeddingModel: 'vision-transformer',
        fallbackEmbeddingModel: null,

        searchConfiguration: {
          numCandidates: 3000,
          similarity: 'cosine',
          indexName: 'images_vector_search',

          scoringWeights: {
            vectorSimilarity: 0.9,
            imageMetadata: 0.05,
            userPreferences: 0.05
          },

          filterPriority: ['imageType', 'resolution', 'tags'],
          resultDiversification: false
        },

        performanceTargets: {
          maxLatencyMs: 800,
          minResultCount: 10,
          maxResultCount: 30
        }
      }
    };

    for (const [strategyName, strategy] of Object.entries(strategies)) {
      this.searchStrategies.set(strategyName, strategy);
    }

    console.log('✅ Enterprise search strategies initialized');
  }

  async executeVectorSearchStrategy(strategyName, query, context = {}) {
    const strategy = this.searchStrategies.get(strategyName);
    if (!strategy) {
      throw new Error(`Unknown search strategy: ${strategyName}`);
    }

    console.log(`Executing ${strategyName} strategy for query: "${query}"`);

    const searchContext = {
      strategy: strategyName,
      query: query,
      userId: context.userId,
      sessionId: context.sessionId,
      startTime: Date.now(),
      filters: context.filters || {},
      options: context.options || {}
    };

    try {
      // Generate embeddings using primary model
      let queryEmbedding;
      try {
        queryEmbedding = await this.generateEmbedding(
          query,
          strategy.primaryEmbeddingModel
        );
      } catch (primaryError) {
        console.warn(`Primary embedding model failed, using fallback:`, primaryError.message);

        if (strategy.fallbackEmbeddingModel) {
          queryEmbedding = await this.generateEmbedding(
            query,
            strategy.fallbackEmbeddingModel
          );
        } else {
          throw primaryError;
        }
      }

      // Execute vector search with strategy-specific configuration
      const searchResults = await this.performOptimizedVectorSearch(
        queryEmbedding,
        strategy,
        searchContext
      );

      // Apply strategy-specific post-processing
      const processedResults = await this.applySearchPostProcessing(
        searchResults,
        strategy,
        searchContext
      );

      const searchDuration = Date.now() - searchContext.startTime;

      console.log(`✅ ${strategyName} completed: ${processedResults.length} results in ${searchDuration}ms`);

      return {
        results: processedResults,
        strategy: strategyName,
        metadata: {
          ...searchContext,
          searchDurationMs: searchDuration,
          resultCount: processedResults.length,
          embeddingModel: strategy.primaryEmbeddingModel
        }
      };

    } catch (error) {
      console.error(`Search strategy ${strategyName} failed:`, error);
      return {
        results: [],
        strategy: strategyName,
        error: error.message,
        metadata: searchContext
      };
    }
  }

  async performOptimizedVectorSearch(queryEmbedding, strategy, context) {
    const collection = this.getCollectionForStrategy(strategy.searchConfiguration.indexName);

    // Build vector search aggregation pipeline
    const searchPipeline = [
      {
        $vectorSearch: {
          index: strategy.searchConfiguration.indexName,
          path: this.getVectorPathForStrategy(strategy),
          queryVector: queryEmbedding,
          numCandidates: strategy.searchConfiguration.numCandidates,
          limit: strategy.performanceTargets.maxResultCount,

          // Apply context-aware filtering
          ...(Object.keys(context.filters).length > 0 && {
            filter: this.buildFilterExpression(context.filters, strategy)
          })
        }
      },

      // Add vector search score
      {
        $addFields: {
          vectorSearchScore: { $meta: 'vectorSearchScore' }
        }
      },

      // Apply strategy-specific scoring enhancements
      ...this.buildScoringEnhancements(strategy, context),

      // Sort by enhanced score
      {
        $sort: { enhancedScore: -1 }
      },

      // Limit to strategy target
      {
        $limit: strategy.performanceTargets.maxResultCount
      }
    ];

    return await collection.aggregate(searchPipeline).toArray();
  }

  buildScoringEnhancements(strategy, context) {
    const enhancements = [];
    const weights = strategy.searchConfiguration.scoringWeights;

    // Base scoring calculation
    enhancements.push({
      $addFields: {
        enhancedScore: {
          $add: [
            { $multiply: ['$vectorSearchScore', weights.vectorSimilarity] },

            // Add text relevance if applicable
            ...(weights.textRelevance ? [{
              $multiply: [
                { $ifNull: ['$textRelevanceScore', 0] },
                weights.textRelevance
              ]
            }] : []),

            // Add popularity boost
            ...(weights.popularityBoost ? [{
              $multiply: [
                { $ifNull: ['$popularityScore', 0] },
                weights.popularityBoost
              ]
            }] : []),

            // Add business rule adjustments
            ...(weights.businessRules ? [{
              $multiply: [
                { $ifNull: ['$businessRuleScore', 0] },
                weights.businessRules
              ]
            }] : []),

            // Add user engagement metrics
            ...(weights.userEngagement ? [{
              $multiply: [
                { $ifNull: ['$userEngagementScore', 0] },
                weights.userEngagement
              ]
            }] : []),

            // Add recency boost
            ...(weights.recency ? [{
              $multiply: [
                { $ifNull: ['$recencyScore', 0] },
                weights.recency
              ]
            }] : [])
          ]
        }
      }
    });

    // Add strategy-specific scoring fields
    if (strategy.searchConfiguration.indexName === 'products_vector_search') {
      enhancements.unshift({
        $addFields: {
          popularityScore: {
            $divide: [
              { $ln: { $add: ['$metrics.totalSales', 1] } },
              10
            ]
          },

          businessRuleScore: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$featured', true] },
                  then: 0.3
                },
                {
                  case: { $gte: ['$ratings.averageRating', 4.5] },
                  then: 0.2
                },
                {
                  case: { $gte: ['$inventory.stockQuantity', 50] },
                  then: 0.1
                }
              ],
              default: 0.0
            }
          }
        }
      });
    } else if (strategy.searchConfiguration.indexName === 'content_vector_search') {
      enhancements.unshift({
        $addFields: {
          userEngagementScore: {
            $divide: [
              { $add: ['$metrics.views', '$metrics.likes', '$metrics.shares'] },
              1000
            ]
          },

          recencyScore: {
            $cond: {
              if: { 
                $gte: [
                  '$publishedAt',
                  new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
                ]
              },
              then: 0.2,
              else: 0.0
            }
          }
        }
      });
    }

    return enhancements;
  }

  async applySearchPostProcessing(results, strategy, context) {
    // Apply result diversification if enabled
    if (strategy.searchConfiguration.resultDiversification) {
      results = await this.diversifyResults(results, strategy);
    }

    // Apply user personalization
    if (context.userId && strategy.searchConfiguration.personalizeResults !== false) {
      results = await this.personalizeResults(results, context.userId, strategy);
    }

    // Apply business rules and filters
    results = this.applyBusinessRules(results, strategy, context);

    // Ensure minimum result count
    if (results.length < strategy.performanceTargets.minResultCount) {
      console.warn(`Insufficient results for ${strategy.strategy}: ${results.length} < ${strategy.performanceTargets.minResultCount}`);
    }

    return results;
  }

  async diversifyResults(results, strategy) {
    if (results.length <= 5) return results; // Skip diversification for small result sets

    const diversified = [];
    const categories = new Set();
    const maxPerCategory = Math.ceil(strategy.performanceTargets.maxResultCount / 5);
    const categoryCount = new Map();

    // First, add top results ensuring category diversity
    for (const result of results) {
      const category = result.category || result.contentType || 'default';
      const currentCount = categoryCount.get(category) || 0;

      if (currentCount < maxPerCategory || categories.size < 3) {
        diversified.push(result);
        categories.add(category);
        categoryCount.set(category, currentCount + 1);

        if (diversified.length >= strategy.performanceTargets.maxResultCount) {
          break;
        }
      }
    }

    // Fill remaining slots with best remaining results
    const remaining = strategy.performanceTargets.maxResultCount - diversified.length;
    const usedIds = new Set(diversified.map(r => r._id?.toString() || r.productId?.toString()));

    for (const result of results) {
      if (remaining <= 0) break;

      const resultId = result._id?.toString() || result.productId?.toString();
      if (!usedIds.has(resultId)) {
        diversified.push(result);
        usedIds.add(resultId);
      }
    }

    return diversified;
  }

  async personalizeResults(results, userId, strategy) {
    // Get user preferences and behavior
    const userProfile = await this.getUserProfile(userId);

    if (!userProfile) return results;

    // Apply personalization scoring
    return results.map(result => {
      let personalizationBoost = 0;

      // Category preferences
      if (userProfile.preferredCategories?.includes(result.category)) {
        personalizationBoost += 0.1;
      }

      // Brand preferences
      if (userProfile.preferredBrands?.includes(result.brand)) {
        personalizationBoost += 0.05;
      }

      // Price range preferences
      if (result.pricing && userProfile.priceRange) {
        if (result.pricing.basePrice >= userProfile.priceRange.min && 
            result.pricing.basePrice <= userProfile.priceRange.max) {
          personalizationBoost += 0.05;
        }
      }

      // Update enhanced score with personalization
      result.enhancedScore = (result.enhancedScore || result.vectorSearchScore || 0) + personalizationBoost;

      return result;
    }).sort((a, b) => (b.enhancedScore || 0) - (a.enhancedScore || 0));
  }

  applyBusinessRules(results, strategy, context) {
    // Apply business-specific filtering and boosting rules
    return results
      .filter(result => {
        // Basic availability check
        if (result.availability && !result.availability.isActive) {
          return false;
        }

        // Inventory check for products
        if (result.inventory && result.inventory.stockQuantity === 0 && 
            !result.inventory.allowBackorder) {
          return false;
        }

        // Content moderation check
        if (result.moderation && result.moderation.status === 'rejected') {
          return false;
        }

        return true;
      })
      .map(result => {
        // Apply business rule boosts
        if (result.featured || result.promoted) {
          result.enhancedScore = (result.enhancedScore || 0) * 1.2;
        }

        if (result.ratings && result.ratings.averageRating >= 4.5) {
          result.enhancedScore = (result.enhancedScore || 0) * 1.1;
        }

        return result;
      });
  }

  getCollectionForStrategy(indexName) {
    const collectionMap = {
      'products_vector_search': 'products',
      'content_vector_search': 'content_items',
      'knowledge_vector_search': 'knowledge_articles',
      'images_vector_search': 'images'
    };

    const collectionName = collectionMap[indexName];
    if (!collectionName) {
      throw new Error(`Unknown index name: ${indexName}`);
    }

    return this.db.collection(collectionName);
  }

  getVectorPathForStrategy(strategy) {
    const pathMap = {
      'products_vector_search': 'embeddings.combined',
      'content_vector_search': 'embeddings.content',
      'knowledge_vector_search': 'embeddings.article',
      'images_vector_search': 'embeddings.visual'
    };

    return pathMap[strategy.searchConfiguration.indexName] || 'embeddings.default';
  }

  buildFilterExpression(filters, strategy) {
    const filterExpression = { $and: [] };

    // Apply filters based on strategy priority
    for (const filterType of strategy.searchConfiguration.filterPriority) {
      if (filters[filterType] !== undefined) {
        switch (filterType) {
          case 'category':
            if (Array.isArray(filters.category)) {
              filterExpression.$and.push({ category: { $in: filters.category } });
            } else {
              filterExpression.$and.push({ category: filters.category });
            }
            break;

          case 'priceRange':
            if (filters.priceRange.min !== undefined || filters.priceRange.max !== undefined) {
              const priceFilter = {};
              if (filters.priceRange.min !== undefined) {
                priceFilter.$gte = filters.priceRange.min;
              }
              if (filters.priceRange.max !== undefined) {
                priceFilter.$lte = filters.priceRange.max;
              }
              filterExpression.$and.push({ 'pricing.basePrice': priceFilter });
            }
            break;

          case 'availability':
            filterExpression.$and.push({ 'availability.isActive': true });
            break;

          case 'ratings':
            if (filters.ratings?.min !== undefined) {
              filterExpression.$and.push({ 
                'ratings.averageRating': { $gte: filters.ratings.min }
              });
            }
            break;
        }
      }
    }

    return filterExpression.$and.length > 0 ? filterExpression : undefined;
  }

  async getUserProfile(userId) {
    const userProfilesCollection = this.db.collection('user_profiles');
    return await userProfilesCollection.findOne(
      { userId: userId },
      { 
        projection: {
          preferredCategories: 1,
          preferredBrands: 1,
          priceRange: 1,
          behaviors: 1
        }
      }
    );
  }

  async getPerformanceMetrics() {
    const searchMetrics = [];

    for (const [strategyName, strategy] of this.searchStrategies) {
      const metrics = await this.getStrategyMetrics(strategyName);
      searchMetrics.push({
        strategy: strategyName,
        configuration: strategy,
        metrics: metrics
      });
    }

    return {
      timestamp: new Date(),
      strategies: searchMetrics
    };
  }

  async getStrategyMetrics(strategyName) {
    const searchLogs = this.db.collection('search_logs');

    const metrics = await searchLogs.aggregate([
      {
        $match: {
          strategy: strategyName,
          timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
        }
      },
      {
        $group: {
          _id: null,
          totalSearches: { $sum: 1 },
          avgDuration: { $avg: '$searchDurationMs' },
          avgResults: { $avg: '$resultCount' },
          p95Duration: { $percentile: { input: '$searchDurationMs', p: [0.95], method: 'approximate' } },
          successRate: { 
            $avg: { 
              $cond: [{ $gt: ['$resultCount', 0] }, 1, 0]
            }
          }
        }
      }
    ]).toArray();

    return metrics[0] || {
      totalSearches: 0,
      avgDuration: 0,
      avgResults: 0,
      successRate: 0
    };
  }
}

// Export the enterprise vector search orchestrator
module.exports = { EnterpriseVectorSearchOrchestrator };

SQL-Style Vector Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Atlas Vector Search operations:

-- QueryLeaf vector search operations with SQL-familiar syntax

-- Create vector search indexes with SQL-style DDL
CREATE VECTOR INDEX products_semantic_search ON products (
  -- Primary embedding field
  embeddings.combined VECTOR(1536) USING cosine,

  -- Additional vector fields for multi-modal search
  embeddings.title VECTOR(384) USING euclidean,
  embeddings.description VECTOR(768) USING dotProduct,
  embeddings.image VECTOR(512) USING cosine,

  -- Filterable fields for hybrid search
  category FILTER,
  brand FILTER,
  pricing.basePrice FILTER,
  availability.isActive FILTER,
  ratings.averageRating FILTER
) WITH (
  similarity_algorithm = 'cosine',
  num_candidates = 2000
);

-- Vector similarity search with SQL syntax
WITH semantic_search AS (
  SELECT 
    p.*,
    VECTOR_SIMILARITY(
      p.embeddings.combined,
      VECTOR_EMBED('openai-ada-002', 'wireless bluetooth headphones with noise cancellation'),
      'cosine'
    ) as semantic_similarity,

    -- Multi-modal similarity scoring
    VECTOR_SIMILARITY(
      p.embeddings.title,
      VECTOR_EMBED('sentence-transformer', 'wireless bluetooth headphones with noise cancellation'), 
      'euclidean'
    ) as title_similarity,

    VECTOR_SIMILARITY(
      p.embeddings.description,
      VECTOR_EMBED('cohere-embed-v3', 'wireless bluetooth headphones with noise cancellation'),
      'dotProduct' 
    ) as description_similarity

  FROM products p
  WHERE 
    -- Vector search with filtering
    VECTOR_SEARCH(
      p.embeddings.combined,
      VECTOR_EMBED('openai-ada-002', 'wireless bluetooth headphones with noise cancellation'),
      num_candidates = 1000,
      limit = 50
    )
    AND p.availability.isActive = true
    AND p.category IN ('electronics', 'audio', 'headphones')
    AND p.pricing.basePrice BETWEEN 50 AND 500
    AND p.ratings.averageRating >= 4.0
),

scored_results AS (
  SELECT *,
    -- Hybrid scoring combining multiple similarity measures
    (
      semantic_similarity * 0.6 +
      title_similarity * 0.25 +
      description_similarity * 0.15
    ) as combined_similarity_score,

    -- Business rule adjustments
    CASE 
      WHEN featured = true THEN 0.2
      WHEN ratings.averageRating >= 4.5 THEN 0.1
      WHEN inventory.stockQuantity > 100 THEN 0.05
      ELSE 0.0
    END as business_boost,

    -- Popularity scoring
    LOG(COALESCE(metrics.totalSales, 1) + 1) / 10.0 as popularity_score,

    -- Recency boost for new products
    CASE 
      WHEN created_at >= CURRENT_DATE - INTERVAL '30 days' THEN 0.1
      WHEN created_at >= CURRENT_DATE - INTERVAL '90 days' THEN 0.05
      ELSE 0.0
    END as recency_boost

  FROM semantic_search
),

final_rankings AS (
  SELECT *,
    -- Calculate final relevance score
    combined_similarity_score + business_boost + popularity_score + recency_boost as final_score,

    -- Ranking within categories for diversification
    ROW_NUMBER() OVER (
      PARTITION BY category 
      ORDER BY combined_similarity_score DESC
    ) as category_rank,

    -- Overall ranking
    ROW_NUMBER() OVER (ORDER BY final_score DESC) as overall_rank

  FROM scored_results
)

SELECT 
  product_id,
  name,
  description,
  category,
  brand,
  pricing.basePrice as price,
  ratings.averageRating as avg_rating,

  -- Similarity and scoring details
  ROUND(semantic_similarity, 4) as semantic_sim,
  ROUND(title_similarity, 4) as title_sim,
  ROUND(description_similarity, 4) as desc_sim,
  ROUND(final_score, 4) as relevance_score,

  -- Ranking information
  overall_rank,
  category_rank,

  -- Business context
  CASE
    WHEN business_boost > 0 THEN 'Featured/Highly Rated'
    WHEN popularity_score > 0.5 THEN 'Popular Choice'
    WHEN recency_boost > 0 THEN 'New Product'
    ELSE 'Standard'
  END as recommendation_reason,

  -- Search context
  'semantic_vector_search' as search_method,
  CURRENT_TIMESTAMP as search_timestamp

FROM final_rankings
WHERE overall_rank <= 20
ORDER BY final_score DESC, overall_rank ASC;

-- Personalized recommendations using vector similarity
WITH user_preference_embedding AS (
  SELECT 
    user_id,
    preferences.embedding as preference_vector,
    preferences.categories as preferred_categories,
    preferences.price_range as price_range
  FROM user_preferences
  WHERE user_id = $1
),

personalized_candidates AS (
  SELECT 
    p.*,
    VECTOR_SIMILARITY(
      upe.preference_vector,
      p.embeddings.combined,
      'cosine'
    ) as preference_similarity,

    -- Category preference matching
    CASE 
      WHEN p.category = ANY(upe.preferred_categories) THEN 0.3
      ELSE 0.0
    END as category_preference_score,

    -- Price preference alignment
    CASE 
      WHEN p.pricing.basePrice BETWEEN upe.price_range.min AND upe.price_range.max THEN 0.2
      ELSE 0.0
    END as price_preference_score

  FROM products p
  CROSS JOIN user_preference_embedding upe
  WHERE 
    VECTOR_SEARCH(
      p.embeddings.combined,
      upe.preference_vector,
      num_candidates = 2000,
      limit = 100
    )
    AND p.availability.isActive = true
    AND p.product_id NOT IN (
      -- Exclude recently purchased/viewed products
      SELECT product_id 
      FROM user_interactions ui
      WHERE ui.user_id = $1 
      AND ui.interaction_type IN ('purchase', 'view')
      AND ui.interaction_date >= CURRENT_DATE - INTERVAL '30 days'
    )
),

recommendation_scores AS (
  SELECT *,
    -- Combined recommendation score
    (
      preference_similarity * 0.5 +
      category_preference_score +
      price_preference_score +
      (ratings.averageRating / 5.0) * 0.1
    ) as recommendation_score,

    -- Diversification ranking
    ROW_NUMBER() OVER (
      PARTITION BY category 
      ORDER BY preference_similarity DESC
    ) as category_diversity_rank

  FROM personalized_candidates
)

SELECT 
  product_id,
  name,
  category,
  brand,
  pricing.basePrice as price,
  ratings.averageRating as rating,

  -- Recommendation metrics
  ROUND(preference_similarity, 4) as preference_match,
  ROUND(recommendation_score, 4) as recommendation_score,

  -- Explanation
  CASE 
    WHEN category_preference_score > 0 THEN 'Based on your category preferences'
    WHEN price_preference_score > 0 THEN 'Within your preferred price range'
    WHEN preference_similarity > 0.8 THEN 'Highly similar to your preferences'
    ELSE 'Recommended for you'
  END as recommendation_reason,

  category_diversity_rank,
  'personalized_vector_recommendation' as recommendation_type

FROM recommendation_scores
WHERE category_diversity_rank <= 5  -- Max 5 per category for diversity
ORDER BY recommendation_score DESC
LIMIT 20;

-- Hybrid search combining semantic search with traditional text search
WITH vector_search_results AS (
  SELECT 
    p.*,
    VECTOR_SIMILARITY(
      p.embeddings.combined,
      VECTOR_EMBED('openai-ada-002', $1),  -- Query parameter
      'cosine'
    ) as vector_score,
    'vector_search' as source
  FROM products p
  WHERE 
    VECTOR_SEARCH(
      p.embeddings.combined,
      VECTOR_EMBED('openai-ada-002', $1),
      num_candidates = 1000,
      limit = 30
    )
    AND p.availability.isActive = true
),

text_search_results AS (
  SELECT 
    p.*,
    MATCH_SCORE(p.search_text, $1) as text_score,
    'text_search' as source
  FROM products p
  WHERE 
    MATCH(p.search_text) AGAINST ($1 IN BOOLEAN MODE)
    AND p.availability.isActive = true
  ORDER BY text_score DESC
  LIMIT 30
),

combined_results AS (
  SELECT *, vector_score as relevance_score FROM vector_search_results
  UNION ALL
  SELECT *, text_score as relevance_score FROM text_search_results
),

deduplicated_results AS (
  SELECT 
    product_id,
    name,
    description,
    category,
    brand,
    pricing,
    ratings,

    -- Aggregate scores from multiple sources
    MAX(relevance_score) as max_score,
    AVG(relevance_score) as avg_score,
    COUNT(*) as source_count,
    ARRAY_AGG(DISTINCT source) as search_sources,

    -- Hybrid scoring - boost items found by multiple methods
    CASE 
      WHEN COUNT(*) > 1 THEN MAX(relevance_score) * 1.2  -- Multi-source boost
      ELSE MAX(relevance_score)
    END as hybrid_score

  FROM combined_results
  GROUP BY product_id, name, description, category, brand, pricing, ratings
)

SELECT 
  product_id,
  name,
  category,
  brand,
  pricing.basePrice as price,
  ratings.averageRating as rating,

  -- Scoring details
  ROUND(hybrid_score, 4) as relevance_score,
  ROUND(max_score, 4) as max_individual_score,
  source_count,
  search_sources,

  -- Search method classification
  CASE 
    WHEN source_count > 1 THEN 'hybrid_match'
    WHEN 'vector_search' = ANY(search_sources) THEN 'semantic_match'
    WHEN 'text_search' = ANY(search_sources) THEN 'keyword_match'
    ELSE 'unknown'
  END as match_type,

  'hybrid_search' as search_algorithm

FROM deduplicated_results
ORDER BY hybrid_score DESC, source_count DESC
LIMIT 25;

-- Vector search performance analysis and optimization
WITH search_performance AS (
  SELECT 
    embedding_model,
    search_type,
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,

    -- Performance metrics
    COUNT(*) as total_searches,
    AVG(search_duration_ms) as avg_duration_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY search_duration_ms) as p95_duration_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY search_duration_ms) as p99_duration_ms,

    -- Result quality metrics
    AVG(result_count) as avg_result_count,
    AVG(avg_similarity_score) as avg_similarity,
    COUNT(*) FILTER (WHERE result_count = 0) as zero_result_searches,

    -- User engagement metrics
    AVG(click_through_rate) as avg_ctr,
    AVG(conversion_rate) as avg_conversion_rate,

    -- Resource utilization
    AVG(candidates_examined) as avg_candidates,
    AVG(memory_usage_mb) as avg_memory_usage

  FROM vector_search_logs vsl
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY embedding_model, search_type, DATE_TRUNC('hour', search_timestamp)
),

performance_trends AS (
  SELECT *,
    -- Calculate performance trends
    LAG(avg_duration_ms) OVER (
      PARTITION BY embedding_model, search_type 
      ORDER BY hour_bucket
    ) as prev_avg_duration,

    LAG(avg_similarity) OVER (
      PARTITION BY embedding_model, search_type 
      ORDER BY hour_bucket
    ) as prev_avg_similarity,

    LAG(avg_ctr) OVER (
      PARTITION BY embedding_model, search_type 
      ORDER BY hour_bucket
    ) as prev_avg_ctr

  FROM search_performance
)

SELECT 
  embedding_model,
  search_type,
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as analysis_hour,

  -- Volume metrics
  total_searches,
  ROUND(avg_result_count, 1) as avg_results,

  -- Performance metrics
  ROUND(avg_duration_ms, 0) as avg_duration_ms,
  ROUND(p95_duration_ms, 0) as p95_duration_ms,
  ROUND(p99_duration_ms, 0) as p99_duration_ms,

  -- Quality metrics
  ROUND(avg_similarity, 3) as avg_similarity,
  ROUND((zero_result_searches::DECIMAL / total_searches) * 100, 1) as zero_result_rate_pct,

  -- Engagement metrics
  ROUND(avg_ctr * 100, 2) as avg_ctr_pct,
  ROUND(avg_conversion_rate * 100, 2) as avg_conversion_pct,

  -- Resource metrics
  ROUND(avg_candidates, 0) as avg_candidates_examined,
  ROUND(avg_memory_usage, 1) as avg_memory_mb,

  -- Trend analysis
  CASE 
    WHEN prev_avg_duration IS NOT NULL THEN
      ROUND(((avg_duration_ms - prev_avg_duration) / prev_avg_duration) * 100, 1)
    ELSE NULL
  END as duration_change_pct,

  CASE 
    WHEN prev_avg_similarity IS NOT NULL THEN
      ROUND(((avg_similarity - prev_avg_similarity) / prev_avg_similarity) * 100, 1) 
    ELSE NULL
  END as similarity_change_pct,

  -- Performance assessment
  CASE 
    WHEN avg_duration_ms > 1000 THEN 'slow'
    WHEN avg_duration_ms > 500 THEN 'moderate'
    ELSE 'fast'
  END as performance_rating,

  -- Optimization recommendations
  CASE 
    WHEN avg_duration_ms > 1000 THEN 'Reduce num_candidates or optimize index'
    WHEN zero_result_searches > total_searches * 0.1 THEN 'Review embedding quality or expand corpus'
    WHEN avg_ctr < 0.05 THEN 'Improve result relevance ranking'
    WHEN avg_memory_usage > 1000 THEN 'Consider batch size optimization'
    ELSE 'Performance within acceptable parameters'
  END as optimization_recommendation

FROM performance_trends
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY embedding_model, search_type, hour_bucket DESC;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. SQL-familiar syntax for MongoDB Atlas Vector Search operations
-- 2. Multi-modal vector similarity search with configurable algorithms
-- 3. Hybrid search combining semantic and traditional text matching
-- 4. Personalized recommendations using user preference embeddings
-- 5. Advanced filtering and ranking with business rule integration
-- 6. Performance monitoring with comprehensive analytics and optimization
-- 7. Real-time vector search with enterprise-grade scalability
-- 8. Integration with popular embedding models and AI services
-- 9. Familiar SQL constructs for complex vector operations
-- 10. Production-ready vector database capabilities through MongoDB Atlas

Best Practices for MongoDB Atlas Vector Search Implementation

Vector Search Optimization Strategies

Essential practices for maximizing vector search performance and accuracy:

Embedding Model Selection: Choose appropriate embedding models based on data type and use case requirements
Index Configuration: Optimize vector indexes for similarity algorithms and dimensionality
Hybrid Search Design: Combine vector similarity with traditional search methods for comprehensive results
Performance Monitoring: Track search latency, result quality, and user engagement metrics
Result Diversification: Implement strategies to ensure diverse and relevant search results
Personalization Integration: Leverage user preference embeddings for customized experiences

Production Deployment Considerations

Key factors for enterprise vector search deployments:

Scalability Planning: Design for high-concurrency vector search workloads
Embedding Management: Implement efficient embedding generation and update strategies
Quality Assurance: Monitor search result quality and user satisfaction metrics
Cost Optimization: Balance embedding model costs with search performance requirements
Security Implementation: Secure vector data and search operations appropriately
Disaster Recovery: Plan for vector index backup and recovery procedures

Conclusion

MongoDB Atlas Vector Search provides enterprise-grade vector database capabilities that seamlessly integrate AI-powered search with traditional database operations. The combination of high-performance vector indexing, advanced similarity algorithms, and familiar SQL-style interfaces enables applications to deliver sophisticated semantic search, personalization, and recommendation features without additional infrastructure complexity.

Key Atlas Vector Search benefits include:

Native AI Integration: Vector database capabilities built into MongoDB Atlas with zero additional infrastructure
High-Performance Search: Optimized vector indexing and similarity algorithms for enterprise-scale workloads
Hybrid Search Capabilities: Seamless integration of semantic and traditional search methodologies
Advanced Personalization: User preference embeddings enable sophisticated recommendation systems
SQL Compatibility: Familiar vector operations accessible through SQL-style query interfaces
Comprehensive Analytics: Real-time monitoring and optimization recommendations for vector search performance

Whether you're building e-commerce recommendation engines, content discovery platforms, customer support systems, or AI-powered search applications, MongoDB Atlas Vector Search with QueryLeaf's familiar SQL interface provides the foundation for intelligent search experiences that scale efficiently while maintaining familiar development patterns.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB Atlas Vector Search operations while providing SQL-familiar syntax for vector similarity search, hybrid search strategies, and personalized recommendations. Advanced vector indexing, embedding management, and performance analytics are seamlessly accessible through familiar SQL constructs, making sophisticated AI-powered search both powerful and approachable for SQL-oriented development teams.

The integration of MongoDB's vector search capabilities with SQL-style operations makes it an ideal platform for applications that require both advanced AI functionality and operational simplicity, ensuring your search and recommendation systems deliver intelligent user experiences while maintaining familiar development and deployment patterns.

November 21, 2025
25 min read

MongoDB Aggregation Framework for Geospatial Analytics: Advanced Location-Based Data Processing and Spatial Intelligence for Enterprise Applications

Modern applications increasingly rely on location-based services and spatial analytics to deliver personalized experiences, optimize operations, and generate business insights. Traditional databases struggle with complex geospatial queries, requiring specialized GIS systems or expensive spatial extensions that complicate development and deployment workflows. Processing location data at scale often requires complex spatial calculations, distance computations, and geographic aggregations that are inefficient in conventional relational systems.

MongoDB's Aggregation Framework provides comprehensive geospatial analytics capabilities that enable sophisticated location-based data processing directly within the database. Unlike traditional databases that require external spatial libraries or complex workarounds, MongoDB's native spatial operators integrate seamlessly with aggregation pipelines while delivering enterprise-grade performance and scalability for real-world location intelligence applications.

The Traditional Geospatial Analytics Challenge

Implementing location-based analytics with conventional database systems creates significant development and performance challenges:

-- Traditional PostgreSQL geospatial analytics - complex spatial extensions required

-- Store location tracking data with PostGIS spatial extensions
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE user_locations (
    location_id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    device_id VARCHAR(100),

    -- Location coordinates using PostGIS geometry types
    coordinates GEOMETRY(POINT, 4326) NOT NULL,
    accuracy_meters DECIMAL(8,2),
    altitude_meters DECIMAL(8,2),

    -- Address information
    street_address VARCHAR(500),
    city VARCHAR(100),
    state_province VARCHAR(100),
    postal_code VARCHAR(20),
    country VARCHAR(3),

    -- Location metadata
    location_source VARCHAR(50), -- 'gps', 'network', 'passive'
    location_method VARCHAR(50), -- 'check_in', 'automatic', 'manual'

    -- Timestamps and tracking
    recorded_at TIMESTAMP NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    -- Indexes for spatial operations
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Create spatial indexes (essential for performance)
CREATE INDEX idx_user_locations_gist ON user_locations USING GIST (coordinates);
CREATE INDEX idx_user_locations_user_recorded ON user_locations(user_id, recorded_at);

-- Store points of interest for proximity analysis
CREATE TABLE points_of_interest (
    poi_id SERIAL PRIMARY KEY,
    poi_name VARCHAR(200) NOT NULL,
    poi_category VARCHAR(100) NOT NULL,
    poi_subcategory VARCHAR(100),

    -- POI location
    coordinates GEOMETRY(POINT, 4326) NOT NULL,

    -- POI details
    description TEXT,
    website VARCHAR(500),
    phone VARCHAR(20),

    -- Business hours and metadata
    business_hours JSONB,
    amenities TEXT[],
    rating_average DECIMAL(3,2),
    review_count INTEGER DEFAULT 0,

    -- Address and contact
    street_address VARCHAR(500),
    city VARCHAR(100),
    state_province VARCHAR(100),
    postal_code VARCHAR(20),
    country VARCHAR(3),

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_poi_gist ON points_of_interest USING GIST (coordinates);
CREATE INDEX idx_poi_category ON points_of_interest(poi_category, poi_subcategory);

-- Complex spatial analytics query - expensive operations
WITH user_daily_paths AS (
  -- Calculate daily movement patterns for users
  SELECT 
    user_id,
    DATE(recorded_at) as tracking_date,

    -- Aggregate location points into daily paths
    ST_MakeLine(
      ST_Transform(coordinates, 3857) 
      ORDER BY recorded_at
    ) as daily_path_meters,

    -- Calculate total distance traveled
    SUM(
      ST_Distance(
        ST_Transform(coordinates, 3857),
        ST_Transform(
          LAG(coordinates) OVER (
            PARTITION BY user_id, DATE(recorded_at) 
            ORDER BY recorded_at
          ), 3857
        )
      )
    ) as total_distance_meters,

    -- Time-based calculations
    MIN(recorded_at) as first_location_time,
    MAX(recorded_at) as last_location_time,
    COUNT(*) as location_points,

    -- Geographic bounds
    ST_XMin(ST_Extent(coordinates)) as min_longitude,
    ST_YMin(ST_Extent(coordinates)) as min_latitude,
    ST_XMax(ST_Extent(coordinates)) as max_longitude,
    ST_YMax(ST_Extent(coordinates)) as max_latitude

  FROM user_locations
  WHERE recorded_at >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY user_id, DATE(recorded_at)
  HAVING COUNT(*) >= 5 -- Only days with sufficient tracking
),

poi_proximity_analysis AS (
  -- Analyze proximity to points of interest
  SELECT 
    ul.user_id,
    ul.recorded_at,
    poi.poi_id,
    poi.poi_name,
    poi.poi_category,

    -- Distance calculations (expensive spatial operations)
    ST_Distance(
      ST_Transform(ul.coordinates, 3857),
      ST_Transform(poi.coordinates, 3857)
    ) as distance_meters,

    -- Determine if user is within proximity zones
    CASE 
      WHEN ST_DWithin(
        ST_Transform(ul.coordinates, 3857),
        ST_Transform(poi.coordinates, 3857),
        100  -- 100 meters
      ) THEN 'immediate_vicinity'
      WHEN ST_DWithin(
        ST_Transform(ul.coordinates, 3857),
        ST_Transform(poi.coordinates, 3857),
        500  -- 500 meters
      ) THEN 'nearby'
      WHEN ST_DWithin(
        ST_Transform(ul.coordinates, 3857),
        ST_Transform(poi.coordinates, 3857),
        1000 -- 1 kilometer
      ) THEN 'walking_distance'
      ELSE 'distant'
    END as proximity_category,

    -- Time spent near POI (complex window calculations)
    CASE 
      WHEN ST_DWithin(
        ST_Transform(ul.coordinates, 3857),
        ST_Transform(poi.coordinates, 3857),
        200
      ) THEN 
        EXTRACT(EPOCH FROM (
          LEAD(ul.recorded_at) OVER (
            PARTITION BY ul.user_id 
            ORDER BY ul.recorded_at
          ) - ul.recorded_at
        )) / 60.0 -- Convert to minutes
      ELSE 0
    END as time_spent_minutes

  FROM user_locations ul
  CROSS JOIN points_of_interest poi
  WHERE ul.recorded_at >= CURRENT_DATE - INTERVAL '7 days'
    AND ST_DWithin(
      ST_Transform(ul.coordinates, 3857),
      ST_Transform(poi.coordinates, 3857),
      5000 -- Only POIs within 5km
    )
),

geospatial_insights AS (
  -- Generate comprehensive geospatial insights
  SELECT 
    udp.user_id,
    udp.tracking_date,

    -- Daily movement analysis
    udp.total_distance_meters,
    ROUND(udp.total_distance_meters / 1000.0, 2) as total_distance_km,

    -- Calculate movement velocity and patterns
    CASE 
      WHEN EXTRACT(EPOCH FROM (udp.last_location_time - udp.first_location_time)) > 0
      THEN ROUND(
        udp.total_distance_meters / 
        EXTRACT(EPOCH FROM (udp.last_location_time - udp.first_location_time)) * 3.6,
        2
      )
      ELSE 0
    END as average_speed_kmh,

    -- Geographic coverage analysis
    ST_Area(
      ST_Transform(
        ST_MakeEnvelope(
          udp.min_longitude, udp.min_latitude,
          udp.max_longitude, udp.max_latitude,
          4326
        ), 3857
      )
    ) / 1000000.0 as coverage_area_km2, -- Convert to km²

    udp.location_points,
    udp.first_location_time,
    udp.last_location_time,

    -- POI interaction analysis
    (
      SELECT COUNT(DISTINCT poi_id)
      FROM poi_proximity_analysis ppa
      WHERE ppa.user_id = udp.user_id
        AND DATE(ppa.recorded_at) = udp.tracking_date
        AND ppa.proximity_category IN ('immediate_vicinity', 'nearby')
    ) as unique_pois_visited,

    (
      SELECT SUM(time_spent_minutes)
      FROM poi_proximity_analysis ppa
      WHERE ppa.user_id = udp.user_id
        AND DATE(ppa.recorded_at) = udp.tracking_date
        AND ppa.time_spent_minutes > 0
    ) as total_poi_time_minutes,

    -- Most frequent POI categories
    (
      SELECT STRING_AGG(poi_category, ', ' ORDER BY visit_count DESC)
      FROM (
        SELECT poi_category, COUNT(*) as visit_count
        FROM poi_proximity_analysis ppa
        WHERE ppa.user_id = udp.user_id
          AND DATE(ppa.recorded_at) = udp.tracking_date
          AND ppa.proximity_category = 'immediate_vicinity'
        GROUP BY poi_category
        ORDER BY visit_count DESC
        LIMIT 3
      ) top_categories
    ) as top_poi_categories,

    -- Mobility pattern classification
    CASE 
      WHEN udp.total_distance_meters < 1000 THEN 'stationary'
      WHEN udp.total_distance_meters < 5000 THEN 'local_movement'
      WHEN udp.total_distance_meters < 20000 THEN 'moderate_travel'
      ELSE 'extensive_travel'
    END as mobility_pattern

  FROM user_daily_paths udp
)

SELECT 
  gi.user_id,
  gi.tracking_date,
  gi.total_distance_km,
  gi.average_speed_kmh,
  ROUND(gi.coverage_area_km2, 3) as coverage_area_km2,
  gi.location_points,
  gi.unique_pois_visited,
  ROUND(gi.total_poi_time_minutes, 1) as total_poi_time_minutes,
  gi.top_poi_categories,
  gi.mobility_pattern,

  -- Daily insights and categorization
  CASE 
    WHEN gi.unique_pois_visited > 5 AND gi.total_distance_km > 10 THEN 'high_mobility_explorer'
    WHEN gi.unique_pois_visited > 3 AND gi.total_distance_km < 5 THEN 'local_explorer'
    WHEN gi.total_distance_km > 20 THEN 'long_distance_traveler'
    WHEN gi.total_poi_time_minutes > 60 THEN 'poi_focused_user'
    ELSE 'standard_user'
  END as user_behavior_profile,

  -- Environmental and context factors
  EXTRACT(DOW FROM gi.tracking_date) as day_of_week,
  CASE 
    WHEN EXTRACT(DOW FROM gi.tracking_date) IN (0, 6) THEN 'weekend'
    ELSE 'weekday'
  END as day_type,

  -- Performance and data quality metrics
  CASE 
    WHEN gi.location_points < 10 THEN 'insufficient_data'
    WHEN gi.location_points < 50 THEN 'moderate_tracking'
    ELSE 'comprehensive_tracking'
  END as tracking_quality

FROM geospatial_insights gi
ORDER BY gi.user_id, gi.tracking_date;

-- Problems with traditional geospatial analytics:
-- 1. Complex spatial extension dependencies (PostGIS) add deployment overhead
-- 2. Expensive coordinate transformations required for accurate distance calculations
-- 3. Multiple complex JOINs and spatial operations create performance bottlenecks
-- 4. Difficult to scale horizontally due to spatial index requirements
-- 5. Limited aggregation capabilities for complex spatial analytics
-- 6. Complex query syntax that's hard to optimize and maintain
-- 7. Poor integration with application object models
-- 8. Expensive spatial index maintenance and storage overhead
-- 9. Limited real-time processing capabilities for streaming location data
-- 10. Complex deployment and configuration of spatial extensions

MongoDB provides native geospatial analytics with powerful aggregation framework integration:

// MongoDB Geospatial Aggregation Framework - native spatial analytics
const { MongoClient } = require('mongodb');

// Advanced MongoDB Geospatial Analytics Manager
class MongoGeospatialAnalyticsManager {
  constructor() {
    this.client = null;
    this.db = null;
    this.spatialCollections = new Map();
    this.geospatialIndexes = new Map();
    this.analyticsCache = new Map();
  }

  async initialize() {
    console.log('Initializing MongoDB Geospatial Analytics Manager...');

    // Connect with geospatial optimization settings
    this.client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017', {
      // Connection optimization for geospatial operations
      maxPoolSize: 15,
      minPoolSize: 5,
      maxIdleTimeMS: 30000,

      // Read preferences optimized for analytics
      readPreference: 'secondaryPreferred',
      readConcern: { level: 'local' },

      // Write concern for location tracking
      writeConcern: { w: 1, j: false }, // Optimized for high-volume tracking

      appName: 'GeospatialAnalyticsManager'
    });

    await this.client.connect();
    this.db = this.client.db('location_intelligence');

    // Initialize geospatial collections and indexes
    await this.setupGeospatialCollections();
    await this.createSpatialIndexes();

    console.log('✅ MongoDB Geospatial Analytics Manager initialized');
  }

  async setupGeospatialCollections() {
    console.log('Setting up geospatial collections...');

    const collections = {
      // User location tracking with geospatial data
      userLocations: {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['userId', 'coordinates', 'recordedAt'],
            properties: {
              userId: { bsonType: 'objectId' },
              deviceId: { bsonType: 'string' },
              coordinates: {
                bsonType: 'object',
                required: ['type', 'coordinates'],
                properties: {
                  type: { enum: ['Point'] },
                  coordinates: {
                    bsonType: 'array',
                    minItems: 2,
                    maxItems: 2,
                    items: { bsonType: 'double' }
                  }
                }
              },
              accuracy: { bsonType: 'double', minimum: 0 },
              altitude: { bsonType: 'double' },
              address: {
                bsonType: 'object',
                properties: {
                  street: { bsonType: 'string' },
                  city: { bsonType: 'string' },
                  state: { bsonType: 'string' },
                  postalCode: { bsonType: 'string' },
                  country: { bsonType: 'string' }
                }
              },
              locationSource: { enum: ['gps', 'network', 'passive', 'manual'] },
              locationMethod: { enum: ['checkin', 'automatic', 'manual', 'passive'] },
              recordedAt: { bsonType: 'date' }
            }
          }
        }
      },

      // Points of interest for proximity analysis
      pointsOfInterest: {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['name', 'category', 'coordinates'],
            properties: {
              name: { bsonType: 'string' },
              category: { bsonType: 'string' },
              subcategory: { bsonType: 'string' },
              coordinates: {
                bsonType: 'object',
                required: ['type', 'coordinates'],
                properties: {
                  type: { enum: ['Point'] },
                  coordinates: {
                    bsonType: 'array',
                    minItems: 2,
                    maxItems: 2,
                    items: { bsonType: 'double' }
                  }
                }
              },
              address: {
                bsonType: 'object',
                properties: {
                  street: { bsonType: 'string' },
                  city: { bsonType: 'string' },
                  state: { bsonType: 'string' },
                  postalCode: { bsonType: 'string' },
                  country: { bsonType: 'string' }
                }
              },
              businessHours: { bsonType: 'object' },
              amenities: { bsonType: 'array' },
              rating: { bsonType: 'double', minimum: 0, maximum: 5 },
              reviewCount: { bsonType: 'int', minimum: 0 }
            }
          }
        }
      },

      // Geographic regions and boundaries
      geographicRegions: {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['name', 'regionType', 'geometry'],
            properties: {
              name: { bsonType: 'string' },
              regionType: { enum: ['city', 'district', 'neighborhood', 'zone', 'boundary'] },
              geometry: {
                bsonType: 'object',
                required: ['type', 'coordinates'],
                properties: {
                  type: { enum: ['Polygon', 'MultiPolygon'] },
                  coordinates: { bsonType: 'array' }
                }
              },
              properties: { bsonType: 'object' },
              population: { bsonType: 'int', minimum: 0 },
              area: { bsonType: 'double', minimum: 0 }
            }
          }
        }
      }
    };

    for (const [collectionName, options] of Object.entries(collections)) {
      await this.db.createCollection(collectionName, options);
      this.spatialCollections.set(collectionName, this.db.collection(collectionName));
    }

    console.log('✅ Geospatial collections created');
  }

  async createSpatialIndexes() {
    console.log('Creating optimized geospatial indexes...');

    const userLocations = this.spatialCollections.get('userLocations');
    const pointsOfInterest = this.spatialCollections.get('pointsOfInterest');
    const geographicRegions = this.spatialCollections.get('geographicRegions');

    // User locations indexes
    await userLocations.createIndex({ coordinates: '2dsphere' });
    await userLocations.createIndex({ userId: 1, recordedAt: -1 });
    await userLocations.createIndex({ recordedAt: -1 });
    await userLocations.createIndex({ 
      coordinates: '2dsphere', 
      userId: 1, 
      recordedAt: -1 
    });

    // Points of interest indexes  
    await pointsOfInterest.createIndex({ coordinates: '2dsphere' });
    await pointsOfInterest.createIndex({ category: 1, subcategory: 1 });
    await pointsOfInterest.createIndex({ 
      coordinates: '2dsphere', 
      category: 1 
    });

    // Geographic regions indexes
    await geographicRegions.createIndex({ geometry: '2dsphere' });
    await geographicRegions.createIndex({ regionType: 1 });

    console.log('✅ Geospatial indexes created');
  }

  async performUserMobilityAnalysis(userId, dateRange = { days: 7 }) {
    console.log(`Performing mobility analysis for user ${userId}...`);

    const startDate = new Date();
    startDate.setDate(startDate.getDate() - dateRange.days);

    const userLocations = this.spatialCollections.get('userLocations');

    const mobilityPipeline = [
      // Match user locations within date range
      {
        $match: {
          userId: userId,
          recordedAt: { $gte: startDate }
        }
      },

      // Sort by recording time for path analysis
      { $sort: { recordedAt: 1 } },

      // Group by day for daily analysis
      {
        $group: {
          _id: {
            userId: '$userId',
            day: { $dateToString: { format: '%Y-%m-%d', date: '$recordedAt' } }
          },

          // Collect all locations for the day
          locations: {
            $push: {
              coordinates: '$coordinates',
              recordedAt: '$recordedAt',
              accuracy: '$accuracy',
              locationSource: '$locationSource'
            }
          },

          // Basic aggregations
          locationCount: { $sum: 1 },
          firstLocation: { $first: '$recordedAt' },
          lastLocation: { $last: '$recordedAt' },

          // Calculate geographic bounds
          minLongitude: { $min: { $arrayElemAt: ['$coordinates.coordinates', 0] } },
          maxLongitude: { $max: { $arrayElemAt: ['$coordinates.coordinates', 0] } },
          minLatitude: { $min: { $arrayElemAt: ['$coordinates.coordinates', 1] } },
          maxLatitude: { $max: { $arrayElemAt: ['$coordinates.coordinates', 1] } }
        }
      },

      // Calculate daily movement metrics
      {
        $addFields: {
          // Calculate coverage area using geographic bounds
          coverageArea: {
            $multiply: [
              { $subtract: ['$maxLongitude', '$minLongitude'] },
              { $subtract: ['$maxLatitude', '$minLatitude'] },
              111320 // Approximate meters per degree (varies by latitude)
            ]
          },

          // Calculate time span for velocity analysis
          timeSpanMinutes: {
            $divide: [
              { $subtract: ['$lastLocation', '$firstLocation'] },
              60000 // Convert milliseconds to minutes
            ]
          },

          // Data quality assessment
          trackingQuality: {
            $switch: {
              branches: [
                { case: { $gte: ['$locationCount', 50] }, then: 'comprehensive' },
                { case: { $gte: ['$locationCount', 20] }, then: 'moderate' },
                { case: { $gte: ['$locationCount', 5] }, then: 'basic' }
              ],
              default: 'insufficient'
            }
          }
        }
      },

      // Calculate movement distances using geospatial operations
      {
        $addFields: {
          movementAnalysis: {
            $reduce: {
              input: { $range: [1, { $size: '$locations' }] },
              initialValue: { 
                totalDistance: 0, 
                segments: [] 
              },
              in: {
                totalDistance: {
                  $add: [
                    '$$value.totalDistance',
                    {
                      $let: {
                        vars: {
                          currentLoc: { $arrayElemAt: ['$locations', '$$this'] },
                          previousLoc: { $arrayElemAt: ['$locations', { $subtract: ['$$this', 1] }] }
                        },
                        in: {
                          // Use $geoNear equivalent calculation for distance
                          $multiply: [
                            {
                              $sqrt: {
                                $add: [
                                  {
                                    $pow: [
                                      {
                                        $multiply: [
                                          { $subtract: [
                                            { $arrayElemAt: ['$$currentLoc.coordinates.coordinates', 0] },
                                            { $arrayElemAt: ['$$previousLoc.coordinates.coordinates', 0] }
                                          ] },
                                          111320 // Meters per degree longitude (approximate)
                                        ]
                                      },
                                      2
                                    ]
                                  },
                                  {
                                    $pow: [
                                      {
                                        $multiply: [
                                          { $subtract: [
                                            { $arrayElemAt: ['$$currentLoc.coordinates.coordinates', 1] },
                                            { $arrayElemAt: ['$$previousLoc.coordinates.coordinates', 1] }
                                          ] },
                                          110540 // Meters per degree latitude
                                        ]
                                      },
                                      2
                                    ]
                                  }
                                ]
                              }
                            },
                            1 // Simplified distance calculation
                          ]
                        }
                      }
                    }
                  ]
                },
                segments: {
                  $concatArrays: [
                    '$$value.segments',
                    [{
                      from: { $arrayElemAt: ['$locations', { $subtract: ['$$this', 1] }] },
                      to: { $arrayElemAt: ['$locations', '$$this'] },
                      distance: '$$value.totalDistance'
                    }]
                  ]
                }
              }
            }
          }
        }
      },

      // Calculate velocity and mobility patterns
      {
        $addFields: {
          totalDistanceMeters: '$movementAnalysis.totalDistance',
          totalDistanceKm: { $divide: ['$movementAnalysis.totalDistance', 1000] },
          averageSpeedKmh: {
            $cond: {
              if: { $gt: ['$timeSpanMinutes', 0] },
              then: {
                $multiply: [
                  { $divide: ['$movementAnalysis.totalDistance', 1000] },
                  { $divide: [60, '$timeSpanMinutes'] }
                ]
              },
              else: 0
            }
          },

          // Classify mobility pattern
          mobilityPattern: {
            $switch: {
              branches: [
                { case: { $lt: ['$movementAnalysis.totalDistance', 1000] }, then: 'stationary' },
                { case: { $lt: ['$movementAnalysis.totalDistance', 5000] }, then: 'local_movement' },
                { case: { $lt: ['$movementAnalysis.totalDistance', 20000] }, then: 'moderate_travel' }
              ],
              default: 'extensive_travel'
            }
          }
        }
      },

      // Final projection with insights
      {
        $project: {
          userId: '$_id.userId',
          analysisDate: '$_id.day',
          totalDistanceKm: { $round: ['$totalDistanceKm', 2] },
          averageSpeedKmh: { $round: ['$averageSpeedKmh', 2] },
          coverageAreaKm2: { $round: [{ $divide: ['$coverageArea', 1000000] }, 4] },
          locationCount: 1,
          timeSpanHours: { $round: [{ $divide: ['$timeSpanMinutes', 60] }, 2] },
          mobilityPattern: 1,
          trackingQuality: 1,

          // Geographic bounds for mapping
          bounds: {
            northeast: { lat: '$maxLatitude', lng: '$maxLongitude' },
            southwest: { lat: '$minLatitude', lng: '$minLongitude' }
          },

          // Movement insights
          insights: {
            isHighMobility: { $gt: ['$totalDistanceKm', 10] },
            isLocalExplorer: {
              $and: [
                { $lt: ['$totalDistanceKm', 5] },
                { $gt: ['$locationCount', 20] }
              ]
            },
            hasLongPeriods: { $gt: ['$timeSpanMinutes', 480] }, // More than 8 hours
            dataQualitySufficient: { $ne: ['$trackingQuality', 'insufficient'] }
          }
        }
      },

      { $sort: { analysisDate: -1 } }
    ];

    const results = await userLocations.aggregate(mobilityPipeline).toArray();

    // Calculate summary statistics across all days
    const summaryPipeline = [
      { $match: { userId: userId, recordedAt: { $gte: startDate } } },
      {
        $group: {
          _id: '$userId',
          totalLocations: { $sum: 1 },
          uniqueDays: { $addToSet: { $dateToString: { format: '%Y-%m-%d', date: '$recordedAt' } } },
          averageAccuracy: { $avg: '$accuracy' },
          locationSources: { $addToSet: '$locationSource' },
          timeRange: {
            $push: '$recordedAt'
          }
        }
      },
      {
        $addFields: {
          uniqueDaysCount: { $size: '$uniqueDays' },
          totalTimeSpan: {
            $subtract: [{ $max: '$timeRange' }, { $min: '$timeRange' }]
          }
        }
      }
    ];

    const summary = await userLocations.aggregate(summaryPipeline).toArray();

    return {
      userId: userId,
      analysisTimestamp: new Date(),
      dateRange: { startDate, endDate: new Date() },
      dailyAnalysis: results,
      summary: summary.length > 0 ? summary[0] : null,
      recommendations: this.generateMobilityRecommendations(results, summary[0])
    };
  }

  generateMobilityRecommendations(dailyData, summary) {
    const recommendations = [];

    if (!summary || !dailyData.length) {
      recommendations.push({
        type: 'data_quality',
        priority: 'high',
        message: 'Insufficient location data for meaningful analysis'
      });
      return recommendations;
    }

    // Analyze movement patterns
    const highMobilityDays = dailyData.filter(day => day.insights.isHighMobility);
    const localExplorerDays = dailyData.filter(day => day.insights.isLocalExplorer);

    if (highMobilityDays.length > dailyData.length * 0.5) {
      recommendations.push({
        type: 'behavioral',
        priority: 'medium',
        message: 'High mobility pattern detected - consider travel optimization features',
        insight: `${highMobilityDays.length} of ${dailyData.length} days showed extensive travel`
      });
    }

    if (localExplorerDays.length > dailyData.length * 0.3) {
      recommendations.push({
        type: 'behavioral',
        priority: 'low',
        message: 'Local exploration pattern - recommend nearby POI discovery features',
        insight: `${localExplorerDays.length} days showed intensive local exploration`
      });
    }

    // Data quality recommendations
    const lowQualityDays = dailyData.filter(day => day.trackingQuality === 'insufficient' || day.trackingQuality === 'basic');
    if (lowQualityDays.length > dailyData.length * 0.3) {
      recommendations.push({
        type: 'data_quality',
        priority: 'medium',
        message: 'Consider improving location tracking frequency or accuracy',
        insight: `${lowQualityDays.length} days had insufficient tracking data`
      });
    }

    return recommendations;
  }

  async performPOIProximityAnalysis(userId, proximityRadius = 500) {
    console.log(`Performing POI proximity analysis for user ${userId} within ${proximityRadius}m...`);

    const userLocations = this.spatialCollections.get('userLocations');
    const pointsOfInterest = this.spatialCollections.get('pointsOfInterest');

    const proximityPipeline = [
      // Start with user locations
      {
        $match: {
          userId: userId,
          recordedAt: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) } // Last 7 days
        }
      },

      // Geospatial lookup to find nearby POIs
      {
        $lookup: {
          from: 'pointsOfInterest',
          let: { userLocation: '$coordinates' },
          pipeline: [
            {
              $geoNear: {
                near: '$$userLocation',
                distanceField: 'distance',
                maxDistance: proximityRadius,
                spherical: true
              }
            },
            {
              $project: {
                name: 1,
                category: 1,
                subcategory: 1,
                coordinates: 1,
                rating: 1,
                distance: 1
              }
            }
          ],
          as: 'nearbyPOIs'
        }
      },

      // Filter locations that have nearby POIs
      {
        $match: {
          nearbyPOIs: { $ne: [] }
        }
      },

      // Unwind to analyze each POI interaction
      { $unwind: '$nearbyPOIs' },

      // Calculate visit duration and interaction metrics
      {
        $addFields: {
          proximityCategory: {
            $switch: {
              branches: [
                { case: { $lte: ['$nearbyPOIs.distance', 100] }, then: 'immediate_vicinity' },
                { case: { $lte: ['$nearbyPOIs.distance', 200] }, then: 'very_close' },
                { case: { $lte: ['$nearbyPOIs.distance', proximityRadius] }, then: 'nearby' }
              ],
              default: 'distant'
            }
          }
        }
      },

      // Group by POI to analyze visit patterns
      {
        $group: {
          _id: {
            poiId: '$nearbyPOIs._id',
            poiName: '$nearbyPOIs.name',
            poiCategory: '$nearbyPOIs.category',
            poiSubcategory: '$nearbyPOIs.subcategory'
          },

          visitCount: { $sum: 1 },
          averageDistance: { $avg: '$nearbyPOIs.distance' },
          minDistance: { $min: '$nearbyPOIs.distance' },
          maxDistance: { $max: '$nearbyPOIs.distance' },

          visits: {
            $push: {
              timestamp: '$recordedAt',
              distance: '$nearbyPOIs.distance',
              proximityCategory: '$proximityCategory',
              accuracy: '$accuracy'
            }
          },

          firstVisit: { $min: '$recordedAt' },
          lastVisit: { $max: '$recordedAt' },

          // Calculate proximity engagement
          closeVisits: {
            $sum: {
              $cond: [
                { $lte: ['$nearbyPOIs.distance', 100] },
                1,
                0
              ]
            }
          }
        }
      },

      // Calculate visit duration and patterns
      {
        $addFields: {
          visitDurationMinutes: {
            $divide: [
              { $subtract: ['$lastVisit', '$firstVisit'] },
              60000 // Convert to minutes
            ]
          },

          engagementLevel: {
            $switch: {
              branches: [
                { case: { $gte: ['$closeVisits', 5] }, then: 'high' },
                { case: { $gte: ['$closeVisits', 2] }, then: 'medium' }
              ],
              default: 'low'
            }
          },

          visitFrequency: {
            $switch: {
              branches: [
                { case: { $gte: ['$visitCount', 10] }, then: 'frequent' },
                { case: { $gte: ['$visitCount', 3] }, then: 'occasional' }
              ],
              default: 'rare'
            }
          }
        }
      },

      // Final projection and insights
      {
        $project: {
          poiId: '$_id.poiId',
          poiName: '$_id.poiName',
          category: '$_id.poiCategory',
          subcategory: '$_id.poiSubcategory',

          visitMetrics: {
            visitCount: '$visitCount',
            visitFrequency: '$visitFrequency',
            averageDistanceMeters: { $round: ['$averageDistance', 1] },
            minDistanceMeters: { $round: ['$minDistance', 1] },
            closeVisits: '$closeVisits',
            engagementLevel: '$engagementLevel'
          },

          timeMetrics: {
            firstVisit: '$firstVisit',
            lastVisit: '$lastVisit',
            visitDurationMinutes: { $round: ['$visitDurationMinutes', 1] }
          },

          insights: {
            isRegularDestination: { $gte: ['$visitCount', 5] },
            hasCloseInteraction: { $gte: ['$closeVisits', 1] },
            isLongTermRelationship: {
              $gte: [
                { $divide: [{ $subtract: ['$lastVisit', '$firstVisit'] }, 86400000] }, // Days
                7
              ]
            }
          }
        }
      },

      { $sort: { 'visitMetrics.visitCount': -1, 'visitMetrics.averageDistanceMeters': 1 } }
    ];

    const poiAnalysis = await userLocations.aggregate(proximityPipeline).toArray();

    // Generate category-level insights
    const categoryPipeline = [
      ...proximityPipeline.slice(0, -2), // Reuse pipeline up to final projection
      {
        $group: {
          _id: '$_id.poiCategory',

          totalPOIs: { $sum: 1 },
          totalVisits: { $sum: '$visitCount' },
          averageVisitsPerPOI: { $avg: '$visitCount' },

          highEngagementPOIs: {
            $sum: {
              $cond: [{ $eq: ['$engagementLevel', 'high'] }, 1, 0]
            }
          },

          categories: { $addToSet: '$_id.poiSubcategory' }
        }
      },
      {
        $project: {
          category: '$_id',
          totalPOIs: 1,
          totalVisits: 1,
          averageVisitsPerPOI: { $round: ['$averageVisitsPerPOI', 2] },
          highEngagementPOIs: 1,
          subcategories: '$categories',

          categoryInsight: {
            $switch: {
              branches: [
                { case: { $gte: ['$totalVisits', 50] }, then: 'primary_interest' },
                { case: { $gte: ['$totalVisits', 20] }, then: 'significant_interest' },
                { case: { $gte: ['$totalVisits', 5] }, then: 'moderate_interest' }
              ],
              default: 'limited_interest'
            }
          }
        }
      },
      { $sort: { totalVisits: -1 } }
    ];

    const categoryInsights = await userLocations.aggregate(categoryPipeline).toArray();

    return {
      userId: userId,
      analysisTimestamp: new Date(),
      proximityRadius: proximityRadius,
      poiAnalysis: poiAnalysis,
      categoryInsights: categoryInsights,
      summary: {
        totalPOIsInteracted: poiAnalysis.length,
        totalVisits: poiAnalysis.reduce((sum, poi) => sum + poi.visitMetrics.visitCount, 0),
        topCategories: categoryInsights.slice(0, 5),
        behaviorProfile: this.generatePOIBehaviorProfile(poiAnalysis, categoryInsights)
      }
    };
  }

  generatePOIBehaviorProfile(poiAnalysis, categoryInsights) {
    if (!poiAnalysis.length) {
      return 'insufficient_data';
    }

    const frequentPOIs = poiAnalysis.filter(poi => poi.visitMetrics.visitFrequency === 'frequent');
    const highEngagementPOIs = poiAnalysis.filter(poi => poi.visitMetrics.engagementLevel === 'high');
    const primaryCategories = categoryInsights.filter(cat => cat.categoryInsight === 'primary_interest');

    if (frequentPOIs.length >= 3 && highEngagementPOIs.length >= 2) {
      return 'habitual_visitor';
    } else if (primaryCategories.length >= 2) {
      return 'diverse_explorer';
    } else if (poiAnalysis.length > 10 && frequentPOIs.length === 0) {
      return 'casual_explorer';
    } else if (frequentPOIs.length >= 1) {
      return 'routine_focused';
    } else {
      return 'occasional_visitor';
    }
  }

  async performGeographicRegionAnalysis(regionFilters = {}) {
    console.log('Performing geographic region analysis...');

    const userLocations = this.spatialCollections.get('userLocations');
    const geographicRegions = this.spatialCollections.get('geographicRegions');

    const regionAnalysisPipeline = [
      // Start with geographic regions
      {
        $match: {
          regionType: regionFilters.regionType || { $exists: true }
        }
      },

      // Lookup users within each region
      {
        $lookup: {
          from: 'userLocations',
          let: { regionGeometry: '$geometry' },
          pipeline: [
            {
              $match: {
                recordedAt: { 
                  $gte: regionFilters.startDate || new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) // Default 30 days
                }
              }
            },
            {
              $match: {
                $expr: {
                  $geoWithin: {
                    $geometry: '$$regionGeometry',
                    $centerSphere: ['$coordinates', 0.01] // Simplified spatial match
                  }
                }
              }
            }
          ],
          as: 'locationVisits'
        }
      },

      // Calculate region activity metrics
      {
        $addFields: {
          totalVisits: { $size: '$locationVisits' },
          uniqueUsers: {
            $size: {
              $setUnion: {
                $map: {
                  input: '$locationVisits',
                  as: 'visit',
                  in: '$$visit.userId'
                }
              }
            }
          },

          visitsByTimeOfDay: {
            $reduce: {
              input: '$locationVisits',
              initialValue: { morning: 0, afternoon: 0, evening: 0, night: 0 },
              in: {
                morning: {
                  $add: [
                    '$$value.morning',
                    {
                      $cond: [
                        { $and: [
                          { $gte: [{ $hour: '$$this.recordedAt' }, 6] },
                          { $lt: [{ $hour: '$$this.recordedAt' }, 12] }
                        ]},
                        1, 0
                      ]
                    }
                  ]
                },
                afternoon: {
                  $add: [
                    '$$value.afternoon',
                    {
                      $cond: [
                        { $and: [
                          { $gte: [{ $hour: '$$this.recordedAt' }, 12] },
                          { $lt: [{ $hour: '$$this.recordedAt' }, 18] }
                        ]},
                        1, 0
                      ]
                    }
                  ]
                },
                evening: {
                  $add: [
                    '$$value.evening',
                    {
                      $cond: [
                        { $and: [
                          { $gte: [{ $hour: '$$this.recordedAt' }, 18] },
                          { $lt: [{ $hour: '$$this.recordedAt' }, 22] }
                        ]},
                        1, 0
                      ]
                    }
                  ]
                },
                night: {
                  $add: [
                    '$$value.night',
                    {
                      $cond: [
                        { $or: [
                          { $gte: [{ $hour: '$$this.recordedAt' }, 22] },
                          { $lt: [{ $hour: '$$this.recordedAt' }, 6] }
                        ]},
                        1, 0
                      ]
                    }
                  ]
                }
              }
            }
          }
        }
      },

      // Calculate activity patterns and insights
      {
        $addFields: {
          averageVisitsPerUser: {
            $cond: {
              if: { $gt: ['$uniqueUsers', 0] },
              then: { $divide: ['$totalVisits', '$uniqueUsers'] },
              else: 0
            }
          },

          activityLevel: {
            $switch: {
              branches: [
                { case: { $gte: ['$totalVisits', 1000] }, then: 'very_high' },
                { case: { $gte: ['$totalVisits', 500] }, then: 'high' },
                { case: { $gte: ['$totalVisits', 100] }, then: 'medium' },
                { case: { $gte: ['$totalVisits', 10] }, then: 'low' }
              ],
              default: 'very_low'
            }
          },

          primaryTimeOfDay: {
            $let: {
              vars: { times: '$visitsByTimeOfDay' },
              in: {
                $switch: {
                  branches: [
                    { case: { $gte: ['$$times.morning', '$$times.afternoon'] }, then: 'morning' },
                    { case: { $gte: ['$$times.afternoon', '$$times.evening'] }, then: 'afternoon' },
                    { case: { $gte: ['$$times.evening', '$$times.night'] }, then: 'evening' }
                  ],
                  default: 'night'
                }
              }
            }
          }
        }
      },

      // Final projection
      {
        $project: {
          regionId: '$_id',
          regionName: '$name',
          regionType: '$regionType',

          activityMetrics: {
            totalVisits: '$totalVisits',
            uniqueUsers: '$uniqueUsers',
            averageVisitsPerUser: { $round: ['$averageVisitsPerUser', 2] },
            activityLevel: '$activityLevel',
            primaryTimeOfDay: '$primaryTimeOfDay'
          },

          timeDistribution: '$visitsByTimeOfDay',

          insights: {
            isPopularDestination: { $gte: ['$totalVisits', 500] },
            hasRegularVisitors: { $gte: ['$averageVisitsPerUser', 3] },
            isDaytimeActive: {
              $gt: [
                { $add: ['$visitsByTimeOfDay.morning', '$visitsByTimeOfDay.afternoon'] },
                { $add: ['$visitsByTimeOfDay.evening', '$visitsByTimeOfDay.night'] }
              ]
            }
          },

          regionProperties: '$properties',
          area: '$area',
          population: '$population'
        }
      },

      { $sort: { 'activityMetrics.totalVisits': -1 } }
    ];

    const regionAnalysis = await geographicRegions.aggregate(regionAnalysisPipeline).toArray();

    return {
      analysisTimestamp: new Date(),
      regionFilters: regionFilters,
      regionAnalysis: regionAnalysis,
      summary: {
        totalRegionsAnalyzed: regionAnalysis.length,
        mostActiveRegion: regionAnalysis.length > 0 ? regionAnalysis[0] : null,
        averageActivityLevel: this.calculateAverageActivityLevel(regionAnalysis)
      }
    };
  }

  calculateAverageActivityLevel(regions) {
    if (!regions.length) return 'no_data';

    const activityScores = {
      'very_high': 5,
      'high': 4,
      'medium': 3,
      'low': 2,
      'very_low': 1
    };

    const totalScore = regions.reduce((sum, region) => {
      return sum + (activityScores[region.activityMetrics.activityLevel] || 0);
    }, 0);

    const averageScore = totalScore / regions.length;

    if (averageScore >= 4.5) return 'very_high';
    if (averageScore >= 3.5) return 'high';
    if (averageScore >= 2.5) return 'medium';
    if (averageScore >= 1.5) return 'low';
    return 'very_low';
  }

  async getGeospatialAnalyticsMetrics() {
    console.log('Generating geospatial analytics metrics...');

    const collections = [
      { name: 'userLocations', collection: this.spatialCollections.get('userLocations') },
      { name: 'pointsOfInterest', collection: this.spatialCollections.get('pointsOfInterest') },
      { name: 'geographicRegions', collection: this.spatialCollections.get('geographicRegions') }
    ];

    const metrics = {
      timestamp: new Date(),
      collections: {},
      geospatialIndexes: {},
      queryPerformance: {},
      dataQuality: {}
    };

    for (const { name, collection } of collections) {
      try {
        // Basic collection statistics
        const stats = await this.db.command({ collStats: name });

        // Document count and size
        metrics.collections[name] = {
          documentCount: stats.count,
          storageSize: stats.storageSize,
          avgDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          indexSize: stats.totalIndexSize
        };

        // Geospatial-specific metrics
        if (name === 'userLocations') {
          const locationMetrics = await collection.aggregate([
            {
              $group: {
                _id: null,
                totalLocations: { $sum: 1 },
                uniqueUsers: { $addToSet: '$userId' },
                dateRange: { $push: '$recordedAt' },
                locationSources: { $addToSet: '$locationSource' },
                averageAccuracy: { $avg: '$accuracy' }
              }
            },
            {
              $addFields: {
                uniqueUsersCount: { $size: '$uniqueUsers' },
                timeSpanDays: {
                  $divide: [
                    { $subtract: [{ $max: '$dateRange' }, { $min: '$dateRange' }] },
                    86400000
                  ]
                }
              }
            }
          ]).toArray();

          metrics.dataQuality.userLocations = locationMetrics[0] || {};
        }

      } catch (error) {
        metrics.collections[name] = { error: error.message };
      }
    }

    return metrics;
  }

  async shutdown() {
    console.log('Shutting down MongoDB Geospatial Analytics Manager...');

    if (this.client) {
      await this.client.close();
      console.log('✅ MongoDB connection closed');
    }

    this.spatialCollections.clear();
    this.geospatialIndexes.clear();
    this.analyticsCache.clear();
  }
}

// Export the geospatial analytics manager
module.exports = { MongoGeospatialAnalyticsManager };

// Benefits of MongoDB Geospatial Analytics:
// - Native 2dsphere indexes provide optimized spatial query performance
// - Aggregation framework enables complex geospatial calculations and analytics
// - Built-in distance calculations and proximity analysis without external libraries
// - Seamless integration of spatial operations with business logic and data transformations
// - Scalable geospatial processing that works across sharded deployments
// - Advanced spatial aggregation patterns for location intelligence applications
// - Real-time geospatial analytics with high-performance spatial indexing
// - Flexible coordinate system support and projection transformations
// - Production-ready spatial analytics with comprehensive monitoring and optimization
// - SQL-compatible geospatial operations through QueryLeaf integration

Understanding MongoDB Geospatial Aggregation Capabilities

Advanced Spatial Analytics Patterns

Implement sophisticated geospatial analytics with MongoDB's native spatial operators:

// Advanced geospatial aggregation patterns for enterprise location intelligence
class LocationIntelligenceProcessor extends MongoGeospatialAnalyticsManager {
  constructor() {
    super();
    this.spatialProcessors = new Map();
    this.heatmapGenerators = new Map();
    this.routeAnalyzers = new Map();
  }

  async generateLocationHeatmaps(parameters = {}) {
    console.log('Generating location density heatmaps...');

    const { 
      gridSize = 0.01, // Degrees for grid cells
      minDensity = 5,   // Minimum locations per cell
      dateRange = 30    // Days of data
    } = parameters;

    const userLocations = this.spatialCollections.get('userLocations');

    const heatmapPipeline = [
      // Filter recent locations
      {
        $match: {
          recordedAt: { 
            $gte: new Date(Date.now() - dateRange * 24 * 60 * 60 * 1000) 
          }
        }
      },

      // Create grid-based grouping for heatmap
      {
        $addFields: {
          gridCell: {
            lat: {
              $multiply: [
                { $floor: { $divide: [{ $arrayElemAt: ['$coordinates.coordinates', 1] }, gridSize] } },
                gridSize
              ]
            },
            lng: {
              $multiply: [
                { $floor: { $divide: [{ $arrayElemAt: ['$coordinates.coordinates', 0] }, gridSize] } },
                gridSize
              ]
            }
          }
        }
      },

      // Group by grid cells
      {
        $group: {
          _id: {
            lat: '$gridCell.lat',
            lng: '$gridCell.lng'
          },

          locationCount: { $sum: 1 },
          uniqueUsers: { $addToSet: '$userId' },
          averageAccuracy: { $avg: '$accuracy' },

          timeDistribution: {
            $push: {
              $dateToString: { format: '%H', date: '$recordedAt' }
            }
          },

          locationSources: { $addToSet: '$locationSource' }
        }
      },

      // Filter cells with minimum density
      {
        $match: {
          locationCount: { $gte: minDensity }
        }
      },

      // Calculate density metrics
      {
        $addFields: {
          density: '$locationCount',
          uniqueUsersCount: { $size: '$uniqueUsers' },

          // Calculate time-based activity patterns
          hourlyActivity: {
            $reduce: {
              input: { $range: [0, 24] },
              initialValue: {},
              in: {
                $mergeObjects: [
                  '$$value',
                  {
                    $let: {
                      vars: { hour: '$$this' },
                      in: {
                        $arrayToObject: [[{
                          k: { $toString: '$$hour' },
                          v: {
                            $size: {
                              $filter: {
                                input: '$timeDistribution',
                                cond: { $eq: ['$$this', { $toString: '$$hour' }] }
                              }
                            }
                          }
                        }]]
                      }
                    }
                  }
                ]
              }
            }
          },

          // Density classification
          densityLevel: {
            $switch: {
              branches: [
                { case: { $gte: ['$locationCount', 100] }, then: 'very_high' },
                { case: { $gte: ['$locationCount', 50] }, then: 'high' },
                { case: { $gte: ['$locationCount', 20] }, then: 'medium' }
              ],
              default: 'low'
            }
          }
        }
      },

      // Generate heatmap coordinates
      {
        $project: {
          coordinates: {
            lat: '$_id.lat',
            lng: '$_id.lng'
          },

          heatmapMetrics: {
            density: '$density',
            uniqueUsers: '$uniqueUsersCount',
            densityLevel: '$densityLevel',
            averageAccuracy: { $round: ['$averageAccuracy', 2] }
          },

          temporalPatterns: {
            hourlyActivity: '$hourlyActivity',
            peakHour: {
              $let: {
                vars: { 
                  maxHour: { $max: { $objectToArray: '$hourlyActivity' } } 
                },
                in: '$$maxHour.k'
              }
            }
          },

          insights: {
            isHighTrafficArea: { $gte: ['$locationCount', 50] },
            hasRegularActivity: { $gte: ['$uniqueUsersCount', 5] },
            isDataRich: { $gte: ['$averageAccuracy', 50] }
          }
        }
      },

      { $sort: { 'heatmapMetrics.density': -1 } }
    ];

    const heatmapData = await userLocations.aggregate(heatmapPipeline).toArray();

    return {
      generatedAt: new Date(),
      parameters: { gridSize, minDensity, dateRange },
      heatmapData: heatmapData,
      summary: {
        totalHotspots: heatmapData.length,
        highestDensity: heatmapData.length > 0 ? heatmapData[0].heatmapMetrics.density : 0,
        averageDensity: heatmapData.length > 0 ? 
          Math.round(heatmapData.reduce((sum, cell) => sum + cell.heatmapMetrics.density, 0) / heatmapData.length) : 0
      }
    };
  }

  async performRouteAnalysis(userId, analysisType = 'daily') {
    console.log(`Performing ${analysisType} route analysis for user ${userId}...`);

    const userLocations = this.spatialCollections.get('userLocations');

    const routeAnalysisPipeline = [
      // Match user locations
      { $match: { userId: userId } },
      { $sort: { recordedAt: 1 } },

      // Group by analysis period (daily, weekly, etc.)
      {
        $group: {
          _id: {
            period: {
              $dateToString: { 
                format: analysisType === 'daily' ? '%Y-%m-%d' : '%Y-%W',
                date: '$recordedAt' 
              }
            }
          },

          locations: {
            $push: {
              coordinates: '$coordinates',
              timestamp: '$recordedAt',
              accuracy: '$accuracy'
            }
          },

          startTime: { $first: '$recordedAt' },
          endTime: { $last: '$recordedAt' }
        }
      },

      // Calculate route metrics
      {
        $addFields: {
          routeMetrics: {
            $let: {
              vars: {
                locationCount: { $size: '$locations' }
              },
              in: {
                totalPoints: '$$locationCount',
                timeSpanHours: {
                  $divide: [
                    { $subtract: ['$endTime', '$startTime'] },
                    3600000 // Convert to hours
                  ]
                },

                // Simplified route distance calculation
                estimatedDistance: {
                  $reduce: {
                    input: { $range: [1, '$$locationCount'] },
                    initialValue: 0,
                    in: {
                      $add: [
                        '$$value',
                        {
                          $let: {
                            vars: {
                              current: { $arrayElemAt: ['$locations', '$$this'] },
                              previous: { $arrayElemAt: ['$locations', { $subtract: ['$$this', 1] }] }
                            },
                            in: {
                              // Simplified distance using coordinate differences
                              $sqrt: {
                                $add: [
                                  {
                                    $pow: [
                                      {
                                        $multiply: [
                                          { $subtract: [
                                            { $arrayElemAt: ['$$current.coordinates.coordinates', 0] },
                                            { $arrayElemAt: ['$$previous.coordinates.coordinates', 0] }
                                          ] },
                                          111320 // Approximate meters per degree
                                        ]
                                      },
                                      2
                                    ]
                                  },
                                  {
                                    $pow: [
                                      {
                                        $multiply: [
                                          { $subtract: [
                                            { $arrayElemAt: ['$$current.coordinates.coordinates', 1] },
                                            { $arrayElemAt: ['$$previous.coordinates.coordinates', 1] }
                                          ] },
                                          110540
                                        ]
                                      },
                                      2
                                    ]
                                  }
                                ]
                              }
                            }
                          }
                        }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      },

      // Calculate route insights
      {
        $addFields: {
          routeInsights: {
            distanceKm: { $divide: ['$routeMetrics.estimatedDistance', 1000] },
            averageSpeed: {
              $cond: {
                if: { $gt: ['$routeMetrics.timeSpanHours', 0] },
                then: {
                  $divide: [
                    { $divide: ['$routeMetrics.estimatedDistance', 1000] },
                    '$routeMetrics.timeSpanHours'
                  ]
                },
                else: 0
              }
            },

            routeComplexity: {
              $switch: {
                branches: [
                  { case: { $gte: ['$routeMetrics.totalPoints', 100] }, then: 'complex' },
                  { case: { $gte: ['$routeMetrics.totalPoints', 50] }, then: 'moderate' }
                ],
                default: 'simple'
              }
            },

            mobilityType: {
              $let: {
                vars: { avgSpeed: '$averageSpeed' },
                in: {
                  $switch: {
                    branches: [
                      { case: { $lte: ['$$avgSpeed', 5] }, then: 'walking' },
                      { case: { $lte: ['$$avgSpeed', 25] }, then: 'cycling' },
                      { case: { $lte: ['$$avgSpeed', 60] }, then: 'driving' }
                    ],
                    default: 'high_speed_transport'
                  }
                }
              }
            }
          }
        }
      },

      {
        $project: {
          period: '$_id.period',
          startTime: 1,
          endTime: 1,

          route: {
            totalPoints: '$routeMetrics.totalPoints',
            distanceKm: { $round: ['$routeInsights.distanceKm', 2] },
            timeSpanHours: { $round: ['$routeMetrics.timeSpanHours', 2] },
            averageSpeedKmh: { $round: ['$routeInsights.averageSpeed', 2] }
          },

          classification: {
            routeComplexity: '$routeInsights.routeComplexity',
            mobilityType: '$routeInsights.mobilityType'
          },

          insights: {
            isLongDistance: { $gte: ['$routeInsights.distanceKm', 10] },
            isHighSpeed: { $gte: ['$routeInsights.averageSpeed', 30] },
            hasExtendedActivity: { $gte: ['$routeMetrics.timeSpanHours', 4] }
          }
        }
      },

      { $sort: { startTime: -1 } }
    ];

    const routeAnalysis = await userLocations.aggregate(routeAnalysisPipeline).toArray();

    return {
      userId: userId,
      analysisType: analysisType,
      analysisTimestamp: new Date(),
      routes: routeAnalysis,
      summary: this.generateRouteSummary(routeAnalysis)
    };
  }

  generateRouteSummary(routes) {
    if (!routes.length) {
      return { message: 'No routes found for analysis period' };
    }

    const totalDistance = routes.reduce((sum, route) => sum + route.route.distanceKm, 0);
    const mobilityTypes = routes.map(route => route.classification.mobilityType);
    const mostCommonMobility = this.findMostCommon(mobilityTypes);

    return {
      totalRoutes: routes.length,
      totalDistanceKm: Math.round(totalDistance * 100) / 100,
      averageRouteDistance: Math.round((totalDistance / routes.length) * 100) / 100,
      primaryMobilityType: mostCommonMobility,
      longDistanceRoutes: routes.filter(r => r.insights.isLongDistance).length,
      highSpeedRoutes: routes.filter(r => r.insights.isHighSpeed).length
    };
  }

  findMostCommon(array) {
    return array.reduce((acc, val) => {
      acc[val] = (acc[val] || 0) + 1;
      return acc;
    }, {});
  }
}

// Export the enhanced location intelligence processor
module.exports = { LocationIntelligenceProcessor };

SQL-Style Geospatial Analytics with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB geospatial analytics and spatial operations:

-- QueryLeaf geospatial analytics with SQL-familiar patterns

-- Create location tracking collection with geospatial indexes
CREATE COLLECTION user_locations (
  user_id OBJECTID NOT NULL,
  device_id VARCHAR(100),

  -- GeoJSON point coordinates
  coordinates GEOMETRY(POINT) NOT NULL,
  accuracy_meters DECIMAL(8,2),
  altitude_meters DECIMAL(8,2),

  -- Address information
  address JSON,

  -- Location metadata
  location_source ENUM('gps', 'network', 'passive', 'manual'),
  location_method ENUM('checkin', 'automatic', 'manual', 'passive'),

  -- Timestamps
  recorded_at TIMESTAMP NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create geospatial indexes for optimal query performance
CREATE INDEX idx_user_locations_geo ON user_locations (coordinates) 
WITH (index_type = '2dsphere');

CREATE INDEX idx_user_locations_user_time ON user_locations (user_id, recorded_at);

-- Points of interest collection with spatial data
CREATE COLLECTION points_of_interest (
  poi_name VARCHAR(200) NOT NULL,
  category VARCHAR(100) NOT NULL,
  subcategory VARCHAR(100),

  coordinates GEOMETRY(POINT) NOT NULL,

  -- POI details
  description TEXT,
  business_hours JSON,
  amenities ARRAY<STRING>,
  rating_average DECIMAL(3,2),
  review_count INTEGER DEFAULT 0,

  -- Address
  address JSON,

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_poi_geo ON points_of_interest (coordinates) 
WITH (index_type = '2dsphere');

CREATE INDEX idx_poi_category ON points_of_interest (category, subcategory);

-- Advanced geospatial analytics with proximity analysis
WITH user_mobility_analysis AS (
  SELECT 
    user_id,
    DATE(recorded_at) as tracking_date,

    -- Calculate daily movement patterns using geospatial functions
    COUNT(*) as location_points,
    MIN(recorded_at) as first_location,
    MAX(recorded_at) as last_location,

    -- Geographic bounds calculation
    ST_X(ST_ENVELOPE(ST_COLLECT(coordinates))) as min_longitude,
    ST_Y(ST_ENVELOPE(ST_COLLECT(coordinates))) as min_latitude,
    ST_X(ST_ENVELOPE(ST_COLLECT(coordinates))) as max_longitude,
    ST_Y(ST_ENVELOPE(ST_COLLECT(coordinates))) as max_latitude,

    -- Distance calculations using geospatial aggregation
    SUM(
      ST_DISTANCE(
        coordinates,
        LAG(coordinates) OVER (
          PARTITION BY user_id, DATE(recorded_at)
          ORDER BY recorded_at
        )
      )
    ) as total_distance_meters,

    -- Create daily path geometry
    ST_MAKELINE(coordinates ORDER BY recorded_at) as daily_path,

    -- Average location accuracy
    AVG(accuracy_meters) as avg_accuracy

  FROM user_locations
  WHERE recorded_at >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY user_id, DATE(recorded_at)
  HAVING COUNT(*) >= 5  -- Only days with sufficient tracking data
),

poi_proximity_analysis AS (
  -- Analyze user proximity to points of interest using spatial operations
  SELECT 
    ul.user_id,
    ul.recorded_at,
    poi.poi_name,
    poi.category,
    poi.subcategory,

    -- Distance calculation using native geospatial functions
    ST_DISTANCE(ul.coordinates, poi.coordinates) as distance_meters,

    -- Proximity categorization using geospatial predicates
    CASE 
      WHEN ST_DWITHIN(ul.coordinates, poi.coordinates, 100) THEN 'immediate_vicinity'
      WHEN ST_DWITHIN(ul.coordinates, poi.coordinates, 500) THEN 'nearby'
      WHEN ST_DWITHIN(ul.coordinates, poi.coordinates, 1000) THEN 'walking_distance'
      ELSE 'distant'
    END as proximity_category,

    -- Calculate time spent near POI using window functions
    CASE 
      WHEN ST_DWITHIN(ul.coordinates, poi.coordinates, 200) THEN
        EXTRACT(EPOCH FROM (
          LEAD(ul.recorded_at) OVER (
            PARTITION BY ul.user_id 
            ORDER BY ul.recorded_at
          ) - ul.recorded_at
        )) / 60.0  -- Convert to minutes
      ELSE 0
    END as time_spent_minutes

  FROM user_locations ul
  -- Use spatial join for efficient proximity queries
  INNER JOIN points_of_interest poi ON ST_DWITHIN(ul.coordinates, poi.coordinates, 5000)
  WHERE ul.recorded_at >= CURRENT_DATE - INTERVAL '7 days'
),

geospatial_insights AS (
  -- Generate comprehensive location-based insights
  SELECT 
    uma.user_id,
    uma.tracking_date,

    -- Movement analysis
    uma.total_distance_meters,
    ROUND(uma.total_distance_meters / 1000.0, 2) as total_distance_km,

    -- Calculate movement velocity
    CASE 
      WHEN EXTRACT(EPOCH FROM (uma.last_location - uma.first_location)) > 0 THEN
        ROUND(
          uma.total_distance_meters / 
          EXTRACT(EPOCH FROM (uma.last_location - uma.first_location)) * 3.6,
          2
        )
      ELSE 0
    END as average_speed_kmh,

    -- Geographic coverage analysis using spatial functions
    ST_AREA(
      ST_MAKEENVELOPE(
        uma.min_longitude, uma.min_latitude,
        uma.max_longitude, uma.max_latitude
      )
    ) / 1000000.0 as coverage_area_km2,  -- Convert to square kilometers

    uma.location_points,
    uma.avg_accuracy,

    -- POI interaction aggregations
    (
      SELECT COUNT(DISTINCT poi_name)
      FROM poi_proximity_analysis ppa
      WHERE ppa.user_id = uma.user_id
        AND DATE(ppa.recorded_at) = uma.tracking_date
        AND ppa.proximity_category IN ('immediate_vicinity', 'nearby')
    ) as unique_pois_visited,

    (
      SELECT SUM(time_spent_minutes)
      FROM poi_proximity_analysis ppa
      WHERE ppa.user_id = uma.user_id
        AND DATE(ppa.recorded_at) = uma.tracking_date
        AND ppa.time_spent_minutes > 0
    ) as total_poi_time_minutes,

    -- Most frequent POI categories using spatial aggregation
    (
      SELECT JSON_ARRAYAGG(
        JSON_OBJECT(
          'category', category,
          'visit_count', visit_count
        ) ORDER BY visit_count DESC
      )
      FROM (
        SELECT category, COUNT(*) as visit_count
        FROM poi_proximity_analysis ppa
        WHERE ppa.user_id = uma.user_id
          AND DATE(ppa.recorded_at) = uma.tracking_date
          AND ppa.proximity_category = 'immediate_vicinity'
        GROUP BY category
        ORDER BY visit_count DESC
        LIMIT 3
      ) top_categories
    ) as top_poi_categories,

    -- Mobility pattern classification
    CASE 
      WHEN uma.total_distance_meters < 1000 THEN 'stationary'
      WHEN uma.total_distance_meters < 5000 THEN 'local_movement'
      WHEN uma.total_distance_meters < 20000 THEN 'moderate_travel'
      ELSE 'extensive_travel'
    END as mobility_pattern,

    -- Route complexity analysis
    CASE 
      WHEN uma.location_points > 100 THEN 'complex_route'
      WHEN uma.location_points > 50 THEN 'moderate_route'
      ELSE 'simple_route'
    END as route_complexity

  FROM user_mobility_analysis uma
)

SELECT 
  gi.user_id,
  gi.tracking_date,
  gi.total_distance_km,
  gi.average_speed_kmh,
  ROUND(gi.coverage_area_km2, 4) as coverage_area_km2,
  gi.location_points,
  gi.unique_pois_visited,
  ROUND(gi.total_poi_time_minutes, 1) as total_poi_time_minutes,
  gi.top_poi_categories,
  gi.mobility_pattern,
  gi.route_complexity,

  -- User behavior profiling using geospatial insights
  CASE 
    WHEN gi.unique_pois_visited > 5 AND gi.total_distance_km > 10 THEN 'high_mobility_explorer'
    WHEN gi.unique_pois_visited > 3 AND gi.total_distance_km < 5 THEN 'local_explorer'
    WHEN gi.total_distance_km > 20 THEN 'long_distance_traveler'
    WHEN gi.total_poi_time_minutes > 60 THEN 'poi_focused_user'
    ELSE 'standard_user'
  END as user_behavior_profile,

  -- Transportation mode inference
  CASE 
    WHEN gi.average_speed_kmh <= 6 THEN 'walking'
    WHEN gi.average_speed_kmh <= 25 THEN 'cycling'
    WHEN gi.average_speed_kmh <= 60 THEN 'driving'
    ELSE 'high_speed_transport'
  END as inferred_transport_mode,

  -- Temporal patterns
  EXTRACT(DOW FROM gi.tracking_date) as day_of_week,
  CASE 
    WHEN EXTRACT(DOW FROM gi.tracking_date) IN (0, 6) THEN 'weekend'
    ELSE 'weekday'
  END as day_type,

  -- Data quality assessment
  CASE 
    WHEN gi.location_points < 10 THEN 'insufficient_data'
    WHEN gi.location_points < 50 THEN 'moderate_tracking'
    ELSE 'comprehensive_tracking'
  END as tracking_quality,

  ROUND(gi.avg_accuracy, 1) as average_accuracy_meters

FROM geospatial_insights gi
ORDER BY gi.user_id, gi.tracking_date DESC;

-- Location density heatmap generation using spatial aggregation
WITH location_density_grid AS (
  SELECT 
    -- Create spatial grid for heatmap visualization
    FLOOR(ST_X(coordinates) / 0.01) * 0.01 as grid_lng,
    FLOOR(ST_Y(coordinates) / 0.01) * 0.01 as grid_lat,

    COUNT(*) as location_count,
    COUNT(DISTINCT user_id) as unique_users,
    AVG(accuracy_meters) as avg_accuracy,

    -- Time-based activity patterns
    COUNT(*) FILTER (
      WHERE EXTRACT(HOUR FROM recorded_at) BETWEEN 6 AND 11
    ) as morning_activity,

    COUNT(*) FILTER (
      WHERE EXTRACT(HOUR FROM recorded_at) BETWEEN 12 AND 17
    ) as afternoon_activity,

    COUNT(*) FILTER (
      WHERE EXTRACT(HOUR FROM recorded_at) BETWEEN 18 AND 21
    ) as evening_activity,

    COUNT(*) FILTER (
      WHERE EXTRACT(HOUR FROM recorded_at) BETWEEN 22 AND 5
    ) as night_activity,

    -- Location source analysis
    JSON_OBJECTAGG(
      location_source, 
      COUNT(*)
    ) as source_distribution

  FROM user_locations
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY 
    FLOOR(ST_X(coordinates) / 0.01) * 0.01,
    FLOOR(ST_Y(coordinates) / 0.01) * 0.01
  HAVING COUNT(*) >= 5  -- Minimum density threshold
),

heatmap_analysis AS (
  SELECT 
    grid_lng,
    grid_lat,

    -- Create heatmap center point
    ST_POINT(grid_lng + 0.005, grid_lat + 0.005) as cell_center,

    location_count,
    unique_users,
    ROUND(avg_accuracy, 1) as avg_accuracy,

    -- Activity distribution
    morning_activity,
    afternoon_activity, 
    evening_activity,
    night_activity,

    -- Calculate peak activity time
    CASE 
      WHEN morning_activity >= GREATEST(afternoon_activity, evening_activity, night_activity) 
        THEN 'morning'
      WHEN afternoon_activity >= GREATEST(evening_activity, night_activity) 
        THEN 'afternoon'
      WHEN evening_activity >= night_activity 
        THEN 'evening'
      ELSE 'night'
    END as peak_activity_period,

    -- Density classification for visualization
    CASE 
      WHEN location_count >= 100 THEN 'very_high'
      WHEN location_count >= 50 THEN 'high'
      WHEN location_count >= 20 THEN 'medium'
      ELSE 'low'
    END as density_level,

    source_distribution,

    -- Calculate density score for heatmap intensity
    LOG(location_count) * LOG(unique_users + 1) as heatmap_intensity

  FROM location_density_grid
  ORDER BY location_count DESC
)

SELECT 
  -- Heatmap coordinates
  ST_X(cell_center) as center_longitude,
  ST_Y(cell_center) as center_latitude,

  -- Density metrics for visualization
  location_count as total_locations,
  unique_users,
  density_level,
  ROUND(heatmap_intensity, 2) as intensity_score,

  -- Temporal activity patterns
  JSON_OBJECT(
    'morning', morning_activity,
    'afternoon', afternoon_activity,
    'evening', evening_activity,
    'night', night_activity,
    'peak_period', peak_activity_period
  ) as activity_patterns,

  -- Data quality indicators
  avg_accuracy,
  source_distribution,

  -- Insights for location intelligence
  CASE 
    WHEN unique_users > location_count * 0.8 THEN 'transient_area'
    WHEN unique_users < location_count * 0.2 THEN 'regular_destination'
    ELSE 'mixed_usage'
  END as area_usage_pattern,

  CASE 
    WHEN peak_activity_period IN ('morning', 'evening') THEN 'commuter_zone'
    WHEN peak_activity_period = 'afternoon' THEN 'business_district'
    WHEN peak_activity_period = 'night' THEN 'entertainment_area'
    ELSE 'residential_area'
  END as inferred_area_type

FROM heatmap_analysis
WHERE density_level IN ('medium', 'high', 'very_high')
ORDER BY intensity_score DESC;

-- Geographic region analysis with spatial joins
CREATE COLLECTION geographic_regions (
  region_name VARCHAR(200) NOT NULL,
  region_type ENUM('city', 'district', 'neighborhood', 'zone'),
  geometry GEOMETRY(POLYGON) NOT NULL,
  properties JSON,
  population INTEGER,
  area_km2 DECIMAL(10,4),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_geographic_regions_geo ON geographic_regions (geometry) 
WITH (index_type = '2dsphere');

-- Analyze user activity within geographic regions
WITH regional_activity_analysis AS (
  SELECT 
    gr.region_name,
    gr.region_type,
    gr.population,
    gr.area_km2,

    -- Count location visits within each region using spatial containment
    COUNT(ul.user_id) as total_visits,
    COUNT(DISTINCT ul.user_id) as unique_visitors,

    -- Calculate visit density
    ROUND(
      COUNT(ul.user_id)::DECIMAL / NULLIF(gr.area_km2, 0),
      2
    ) as visits_per_km2,

    -- Time-based visit patterns
    COUNT(*) FILTER (
      WHERE EXTRACT(DOW FROM ul.recorded_at) IN (0, 6)
    ) as weekend_visits,

    COUNT(*) FILTER (
      WHERE EXTRACT(DOW FROM ul.recorded_at) BETWEEN 1 AND 5
    ) as weekday_visits,

    -- Peak activity times within regions
    MODE() WITHIN GROUP (
      ORDER BY EXTRACT(HOUR FROM ul.recorded_at)
    ) as peak_hour,

    -- Average visit duration (simplified calculation)
    AVG(
      EXTRACT(EPOCH FROM (
        LEAD(ul.recorded_at) OVER (
          PARTITION BY ul.user_id, gr.region_name
          ORDER BY ul.recorded_at
        ) - ul.recorded_at
      )) / 60.0
    ) as avg_visit_duration_minutes,

    -- First and last visits
    MIN(ul.recorded_at) as first_visit,
    MAX(ul.recorded_at) as last_visit

  FROM geographic_regions gr
  -- Spatial join to find locations within regions
  LEFT JOIN user_locations ul ON ST_CONTAINS(gr.geometry, ul.coordinates)
  WHERE ul.recorded_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY gr.region_name, gr.region_type, gr.population, gr.area_km2
),

region_insights AS (
  SELECT 
    region_name,
    region_type,
    population,
    area_km2,
    total_visits,
    unique_visitors,
    visits_per_km2,
    weekend_visits,
    weekday_visits,
    peak_hour,
    ROUND(avg_visit_duration_minutes, 2) as avg_visit_duration_minutes,
    first_visit,
    last_visit,

    -- Calculate regional activity metrics
    ROUND(
      total_visits::DECIMAL / NULLIF(unique_visitors, 0),
      2
    ) as avg_visits_per_user,

    ROUND(
      weekend_visits::DECIMAL / NULLIF(total_visits, 0) * 100,
      1
    ) as weekend_activity_percent,

    -- Population-adjusted activity
    CASE 
      WHEN population > 0 THEN 
        ROUND(unique_visitors::DECIMAL / population * 100, 2)
      ELSE NULL
    END as visitor_penetration_percent,

    -- Activity level classification
    CASE 
      WHEN total_visits >= 1000 THEN 'very_high'
      WHEN total_visits >= 500 THEN 'high'
      WHEN total_visits >= 100 THEN 'medium'
      WHEN total_visits >= 10 THEN 'low'
      ELSE 'very_low'
    END as activity_level,

    -- Regional characteristics inference
    CASE 
      WHEN weekend_visits > weekday_visits * 1.2 THEN 'leisure_destination'
      WHEN weekday_visits > weekend_visits * 1.5 THEN 'business_district'
      WHEN peak_hour BETWEEN 7 AND 9 OR peak_hour BETWEEN 17 AND 19 THEN 'commuter_area'
      ELSE 'mixed_use'
    END as inferred_usage_type

  FROM regional_activity_analysis
  WHERE total_visits > 0
)

SELECT 
  region_name,
  region_type,
  activity_level,
  inferred_usage_type,

  -- Core metrics
  total_visits,
  unique_visitors,
  avg_visits_per_user,
  visits_per_km2,

  -- Temporal patterns
  weekend_activity_percent,
  CONCAT(peak_hour, ':00') as peak_hour,
  avg_visit_duration_minutes,

  -- Geographic and demographic insights
  area_km2,
  population,
  visitor_penetration_percent,

  -- Activity insights
  CASE 
    WHEN visits_per_km2 > 100 THEN 'high_density_destination'
    WHEN visitor_penetration_percent > 10 THEN 'popular_local_area'
    WHEN avg_visit_duration_minutes > 60 THEN 'extended_stay_location'
    ELSE 'standard_activity_area'
  END as location_characteristic,

  -- Planning and optimization insights
  CASE 
    WHEN activity_level = 'very_high' AND inferred_usage_type = 'business_district' 
      THEN 'Consider infrastructure optimization'
    WHEN activity_level IN ('low', 'very_low') AND region_type = 'commercial'
      THEN 'Potential for development or marketing focus'
    WHEN weekend_activity_percent > 70 AND inferred_usage_type = 'leisure_destination'
      THEN 'Tourism and recreation optimization opportunity'
    ELSE 'Monitor for changes in activity patterns'
  END as planning_recommendation

FROM region_insights
ORDER BY total_visits DESC, visits_per_km2 DESC;

-- QueryLeaf provides comprehensive geospatial analytics capabilities:
-- 1. Native MongoDB 2dsphere indexing with SQL-familiar syntax
-- 2. Advanced spatial queries using ST_ geospatial functions
-- 3. Location-based proximity analysis with distance calculations
-- 4. Geographic aggregation and density heatmap generation
-- 5. Route analysis and mobility pattern detection
-- 6. Regional activity analysis with spatial joins
-- 7. Real-time location intelligence with temporal pattern analysis
-- 8. Integration with MongoDB's native geospatial capabilities
-- 9. Production-ready spatial analytics with performance optimization
-- 10. Enterprise-grade location-based insights accessible through familiar SQL patterns

Best Practices for MongoDB Geospatial Analytics Implementation

Spatial Index Optimization Strategies

Essential practices for maximizing geospatial query performance:

2dsphere Index Creation: Use 2dsphere indexes for all location-based queries and proximity analysis
Compound Spatial Indexes: Combine geospatial indexes with other frequently queried fields
Query Pattern Analysis: Design indexes to match common spatial query patterns and proximity searches
Coordinate System Consistency: Ensure consistent coordinate reference systems across all spatial data
Distance Calculation Optimization: Use appropriate distance calculations based on accuracy requirements
Spatial Data Validation: Implement comprehensive validation for coordinate data quality and accuracy

Production Deployment Considerations

Key factors for enterprise geospatial analytics deployments:

Real-time Processing: Design aggregation pipelines for real-time location intelligence and stream processing
Scalability Planning: Implement sharding strategies that work effectively with geospatial data distribution
Privacy and Security: Ensure location data privacy compliance and implement appropriate access controls
Data Quality Management: Monitor and maintain high-quality location data with accuracy tracking
Performance Monitoring: Track spatial query performance and optimize based on usage patterns
Integration Architecture: Design seamless integration with mapping services and location-based applications

Conclusion

MongoDB's Aggregation Framework provides comprehensive geospatial analytics capabilities that enable sophisticated location intelligence applications through native spatial operators, advanced proximity analysis, and high-performance spatial indexing. The combination of 2dsphere indexes, spatial aggregation operators, and flexible coordinate system support delivers enterprise-grade geospatial processing without the complexity of traditional GIS systems.

Key MongoDB geospatial analytics benefits include:

Native Spatial Processing: Built-in 2dsphere indexes and spatial operators provide optimized geospatial query performance
Advanced Location Intelligence: Sophisticated proximity analysis, route tracking, and mobility pattern detection
Real-time Spatial Analytics: High-performance aggregation pipelines for streaming location data processing
Flexible Coordinate Support: Comprehensive support for different coordinate systems and projection transformations
Scalable Spatial Operations: Distributed geospatial processing that scales across sharded MongoDB deployments
SQL Compatibility: Familiar spatial operations accessible through SQL-style syntax and functions

Whether you're building location-based services, analyzing user mobility patterns, or implementing proximity marketing applications, MongoDB's geospatial capabilities with QueryLeaf's SQL-familiar interface provide the foundation for scalable location intelligence that maintains high performance while simplifying spatial analytics development.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB geospatial operations while providing SQL-familiar syntax for spatial queries, proximity analysis, and location-based aggregations. Advanced geospatial analytics patterns, heatmap generation, and route analysis are seamlessly accessible through familiar SQL constructs, making sophisticated location intelligence both powerful and approachable for SQL-oriented teams.

The combination of MongoDB's robust geospatial capabilities with familiar SQL-style management makes it an ideal platform for applications that require both advanced spatial analytics and operational simplicity, ensuring your location-based applications can scale efficiently while maintaining familiar development and operational patterns.

November 20, 2025
27 min read

MongoDB Compound Indexes and Multi-Field Query Optimization: Advanced Performance Tuning for Complex Query Patterns

Modern database applications require sophisticated query optimization strategies that can efficiently handle complex multi-field queries, range operations, and sorting requirements across multiple dimensions. Traditional single-field indexing approaches often fail to provide optimal performance for real-world query patterns that involve filtering, sorting, and projecting across multiple document fields simultaneously, leading to inefficient query execution plans and degraded application performance.

MongoDB provides comprehensive compound indexing capabilities that enable optimal query performance through intelligent multi-field index construction, advanced query planning optimization, and sophisticated index intersection strategies. Unlike traditional databases that require manual index tuning and complex optimization hints, MongoDB's compound indexes automatically optimize query execution paths while supporting complex query patterns with minimal configuration overhead.

The Traditional Multi-Field Query Challenge

Conventional database indexing approaches often struggle with multi-field query optimization:

-- Traditional PostgreSQL multi-field query challenges - limited compound index effectiveness

-- Basic product catalog with traditional indexing approach
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    sku VARCHAR(100) UNIQUE NOT NULL,
    product_name VARCHAR(200) NOT NULL,
    category VARCHAR(100) NOT NULL,
    subcategory VARCHAR(100),
    brand VARCHAR(100) NOT NULL,

    -- Price and inventory fields
    price DECIMAL(10,2) NOT NULL,
    sale_price DECIMAL(10,2),
    cost DECIMAL(10,2),
    stock_quantity INTEGER NOT NULL DEFAULT 0,
    reserved_quantity INTEGER NOT NULL DEFAULT 0,

    -- Product attributes
    weight_kg DECIMAL(8,3),
    dimensions_length_cm DECIMAL(8,2),
    dimensions_width_cm DECIMAL(8,2),  
    dimensions_height_cm DECIMAL(8,2),
    color VARCHAR(50),
    size VARCHAR(50),

    -- Status and lifecycle
    status VARCHAR(50) DEFAULT 'active',
    availability VARCHAR(50) DEFAULT 'in_stock',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    published_at TIMESTAMP,
    discontinued_at TIMESTAMP,

    -- SEO and marketing
    seo_title VARCHAR(200),
    seo_description TEXT,
    marketing_tags TEXT[],
    featured BOOLEAN DEFAULT false,

    -- Supplier information
    supplier_id INTEGER,
    supplier_sku VARCHAR(100),
    lead_time_days INTEGER,
    minimum_order_quantity INTEGER DEFAULT 1,

    -- Performance tracking
    view_count INTEGER DEFAULT 0,
    purchase_count INTEGER DEFAULT 0,
    rating_average DECIMAL(3,2) DEFAULT 0,
    rating_count INTEGER DEFAULT 0
);

-- Traditional indexing approach with limited compound effectiveness
CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_products_brand ON products(brand);  
CREATE INDEX idx_products_price ON products(price);
CREATE INDEX idx_products_status ON products(status);
CREATE INDEX idx_products_created_at ON products(created_at);
CREATE INDEX idx_products_stock ON products(stock_quantity);

-- Attempt at compound indexes with limited optimization
CREATE INDEX idx_products_category_brand ON products(category, brand);
CREATE INDEX idx_products_price_range ON products(price, status);
CREATE INDEX idx_products_inventory ON products(status, stock_quantity, availability);

-- Complex multi-field query that struggles with traditional indexing
WITH product_search_filters AS (
  SELECT 
    p.*,

    -- Calculate derived fields for filtering
    (p.stock_quantity - p.reserved_quantity) as available_quantity,
    CASE 
      WHEN p.sale_price IS NOT NULL AND p.sale_price < p.price THEN p.sale_price
      ELSE p.price
    END as effective_price,

    -- Performance scoring
    (p.rating_average * p.rating_count + p.view_count * 0.1 + p.purchase_count * 2) as performance_score,

    -- Age calculation
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - p.created_at) as days_since_created,

    -- Availability status
    CASE 
      WHEN p.stock_quantity <= 0 THEN 'out_of_stock'
      WHEN p.stock_quantity <= 5 THEN 'low_stock'
      ELSE 'in_stock'
    END as stock_status

  FROM products p
  WHERE 
    -- Multiple filtering conditions that don't align with index structure
    p.status = 'active'
    AND p.category IN ('Electronics', 'Computers', 'Mobile')
    AND p.brand IN ('Apple', 'Samsung', 'Sony', 'Microsoft')
    AND p.price BETWEEN 100.00 AND 2000.00
    AND p.stock_quantity > 0
    AND p.created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND (p.featured = true OR p.rating_average >= 4.0)
    AND p.availability = 'in_stock'
),

price_range_analysis AS (
  -- Complex price range queries that can't utilize compound indexes effectively
  SELECT 
    psf.*,

    -- Price tier classification
    CASE 
      WHEN effective_price < 200 THEN 'budget'
      WHEN effective_price < 500 THEN 'mid_range'  
      WHEN effective_price < 1000 THEN 'premium'
      ELSE 'luxury'
    END as price_tier,

    -- Discount calculations
    CASE 
      WHEN psf.sale_price IS NOT NULL THEN
        ROUND(((psf.price - psf.sale_price) / psf.price * 100), 1)
      ELSE 0
    END as discount_percentage,

    -- Competitive analysis (requires additional queries)
    (
      SELECT AVG(price) 
      FROM products p2 
      WHERE p2.category = psf.category 
      AND p2.brand = psf.brand
      AND p2.status = 'active'
    ) as category_brand_avg_price,

    -- Inventory velocity
    CASE 
      WHEN psf.purchase_count > 0 AND psf.days_since_created > 0 THEN
        ROUND(psf.purchase_count::DECIMAL / (psf.days_since_created / 30), 2)
      ELSE 0
    END as monthly_sales_velocity

  FROM product_search_filters psf
),

sorting_and_ranking AS (
  -- Complex sorting requirements that bypass index usage
  SELECT 
    pra.*,

    -- Multi-criteria ranking
    (
      -- Price competitiveness (lower is better)
      CASE 
        WHEN pra.category_brand_avg_price > 0 THEN
          (pra.category_brand_avg_price - pra.effective_price) / pra.category_brand_avg_price * 20
        ELSE 0
      END +

      -- Rating score
      pra.rating_average * 10 +

      -- Stock availability score
      CASE 
        WHEN pra.available_quantity > 10 THEN 20
        WHEN pra.available_quantity > 5 THEN 15
        WHEN pra.available_quantity > 0 THEN 10
        ELSE 0
      END +

      -- Recency bonus
      CASE 
        WHEN pra.days_since_created <= 30 THEN 15
        WHEN pra.days_since_created <= 90 THEN 10
        WHEN pra.days_since_created <= 180 THEN 5
        ELSE 0
      END +

      -- Featured bonus
      CASE WHEN pra.featured THEN 25 ELSE 0 END

    ) as composite_ranking_score

  FROM price_range_analysis pra
)

SELECT 
  sar.product_id,
  sar.sku,
  sar.product_name,
  sar.category,
  sar.subcategory,
  sar.brand,
  sar.effective_price,
  sar.price_tier,
  sar.discount_percentage,
  sar.available_quantity,
  sar.stock_status,
  sar.rating_average,
  sar.rating_count,
  sar.performance_score,
  sar.monthly_sales_velocity,
  sar.composite_ranking_score,

  -- Additional computed fields
  CASE 
    WHEN sar.discount_percentage > 20 THEN 'great_deal'
    WHEN sar.discount_percentage > 10 THEN 'good_deal'
    ELSE 'regular_price'
  END as deal_status,

  -- Recommendation priority
  CASE 
    WHEN sar.composite_ranking_score > 80 THEN 'highly_recommended'
    WHEN sar.composite_ranking_score > 60 THEN 'recommended'  
    WHEN sar.composite_ranking_score > 40 THEN 'consider'
    ELSE 'standard'
  END as recommendation_level

FROM sorting_and_ranking sar

-- Complex ordering that can't utilize compound indexes effectively
ORDER BY 
  CASE 
    WHEN sar.featured = true THEN 0 
    ELSE 1 
  END,  -- Featured first
  sar.composite_ranking_score DESC,  -- Then by composite score
  sar.rating_average DESC,  -- Then by rating
  sar.available_quantity DESC,  -- Then by stock availability
  sar.created_at DESC  -- Finally by recency

LIMIT 50 OFFSET 0;

-- Performance analysis of the traditional approach
EXPLAIN (ANALYZE, BUFFERS) 
WITH product_search_filters AS (
  SELECT p.*
  FROM products p
  WHERE 
    p.status = 'active'
    AND p.category = 'Electronics'
    AND p.brand IN ('Apple', 'Samsung')
    AND p.price BETWEEN 200.00 AND 1000.00
    AND p.stock_quantity > 0
    AND p.rating_average >= 4.0
)
SELECT * FROM product_search_filters 
ORDER BY price, rating_average DESC, created_at DESC
LIMIT 10;

-- Problems with traditional multi-field indexing:
-- 1. Index selection conflicts - query planner struggles to choose optimal index
-- 2. Poor compound index utilization due to filter order mismatches
-- 3. Expensive sorting operations that can't use index ordering
-- 4. Multiple index scans and expensive merge operations
-- 5. Large intermediate result sets that require post-filtering
-- 6. Inability to optimize across different query patterns efficiently
-- 7. Complex explain plans with sequential scans and sorts
-- 8. High I/O overhead from multiple index lookups
-- 9. Limited flexibility for dynamic filtering combinations
-- 10. Poor performance scaling with data volume growth

-- Attempt to create better compound indexes (still limited)
CREATE INDEX idx_products_comprehensive_search ON products(
  status, category, brand, price, stock_quantity, rating_average, created_at
);

-- But this creates issues:
-- 1. Index becomes too wide and expensive to maintain
-- 2. Only effective for queries that match the exact field order
-- 3. Partial index usage for queries with different filter combinations  
-- 4. High storage overhead and slow write performance
-- 5. Still can't optimize for different sort orders efficiently

-- Complex aggregation query with traditional limitations
WITH category_brand_performance AS (
  SELECT 
    p.category,
    p.brand,

    -- Aggregated metrics that require full collection scans
    COUNT(*) as total_products,
    AVG(p.price) as avg_price,
    AVG(p.rating_average) as avg_rating,
    SUM(p.stock_quantity) as total_stock,
    SUM(p.purchase_count) as total_sales,

    -- Percentile calculations (expensive without proper indexing)
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY p.price) as price_p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY p.price) as price_median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY p.price) as price_p75,

    -- Time-based analysis
    COUNT(*) FILTER (WHERE p.created_at >= CURRENT_DATE - INTERVAL '30 days') as new_products_30d,
    COUNT(*) FILTER (WHERE p.featured = true) as featured_count,
    COUNT(*) FILTER (WHERE p.stock_quantity = 0) as out_of_stock_count

  FROM products p
  WHERE p.status = 'active'
  GROUP BY p.category, p.brand
  HAVING COUNT(*) >= 5  -- Only brands with significant presence
),

performance_ranking AS (
  SELECT 
    cbp.*,

    -- Performance scoring
    (
      (cbp.avg_rating * 20) +
      (CASE WHEN cbp.avg_price > 0 THEN (cbp.total_sales * 1000.0 / cbp.total_products) ELSE 0 END) +
      ((cbp.featured_count * 100.0 / cbp.total_products)) +
      (100 - (cbp.out_of_stock_count * 100.0 / cbp.total_products))
    ) as performance_index,

    -- Market position analysis
    RANK() OVER (
      PARTITION BY cbp.category 
      ORDER BY cbp.total_sales DESC, cbp.avg_rating DESC
    ) as category_sales_rank,

    RANK() OVER (
      PARTITION BY cbp.category 
      ORDER BY cbp.avg_price DESC
    ) as category_price_rank

  FROM category_brand_performance cbp
)

SELECT 
  pr.category,
  pr.brand,
  pr.total_products,
  ROUND(pr.avg_price, 2) as avg_price,
  ROUND(pr.avg_rating, 2) as avg_rating,
  pr.total_stock,
  pr.total_sales,
  ROUND(pr.price_median, 2) as median_price,
  pr.new_products_30d,
  pr.featured_count,
  ROUND(pr.performance_index, 1) as performance_score,
  pr.category_sales_rank,
  pr.category_price_rank,

  -- Market analysis
  CASE 
    WHEN pr.category_price_rank <= 3 THEN 'premium_positioning'
    WHEN pr.category_price_rank <= pr.total_products * 0.5 THEN 'mid_market'
    ELSE 'value_positioning'
  END as market_position,

  CASE 
    WHEN pr.performance_index > 200 THEN 'top_performer'
    WHEN pr.performance_index > 150 THEN 'strong_performer'
    WHEN pr.performance_index > 100 THEN 'average_performer'
    ELSE 'underperformer'
  END as performance_tier

FROM performance_ranking pr
ORDER BY pr.category, pr.performance_index DESC;

-- Traditional limitations in complex analytics:
-- 1. Multiple full table scans required for different aggregations
-- 2. Expensive sorting and ranking operations
-- 3. No ability to use covering indexes for complex calculations
-- 4. High CPU and memory usage for large result set processing
-- 5. Poor query plan reusability across similar analytical queries
-- 6. Limited optimization for mixed OLTP and OLAP workloads
-- 7. Complex join operations that can't utilize compound indexes effectively
-- 8. Expensive window function calculations without proper index support
-- 9. Limited ability to optimize across time-series and categorical dimensions
-- 10. Inefficient handling of sparse data and optional field queries

MongoDB provides sophisticated compound indexing capabilities with advanced optimization:

// MongoDB Advanced Compound Indexing - high-performance multi-field query optimization
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_compound_indexing');

// Comprehensive MongoDB Compound Index Manager
class AdvancedCompoundIndexManager {
  constructor(db, config = {}) {
    this.db = db;
    this.collections = {
      products: db.collection('products'),
      categories: db.collection('categories'),
      brands: db.collection('brands'),
      indexAnalytics: db.collection('index_analytics'),
      queryPerformance: db.collection('query_performance'),
      indexRecommendations: db.collection('index_recommendations')
    };

    // Advanced compound indexing configuration
    this.config = {
      enableIntelligentIndexing: config.enableIntelligentIndexing !== false,
      enableQueryAnalysis: config.enableQueryAnalysis !== false,
      enablePerformanceTracking: config.enablePerformanceTracking !== false,
      enableIndexOptimization: config.enableIndexOptimization !== false,

      // Index management settings
      maxCompoundIndexFields: config.maxCompoundIndexFields || 8,
      backgroundIndexCreation: config.backgroundIndexCreation !== false,
      sparseIndexOptimization: config.sparseIndexOptimization !== false,
      partialIndexSupport: config.partialIndexSupport !== false,

      // Performance thresholds
      slowQueryThreshold: config.slowQueryThreshold || 1000, // milliseconds
      indexEfficiencyThreshold: config.indexEfficiencyThreshold || 0.8,
      cardinalityAnalysisThreshold: config.cardinalityAnalysisThreshold || 1000,

      // Optimization strategies
      enableESRRule: config.enableESRRule !== false, // Equality, Sort, Range
      enableQueryPlanCaching: config.enableQueryPlanCaching !== false,
      enableIndexIntersection: config.enableIndexIntersection !== false,
      enableCoveringIndexes: config.enableCoveringIndexes !== false,

      // Monitoring and maintenance
      indexMaintenanceInterval: config.indexMaintenanceInterval || 86400000, // 24 hours
      performanceAnalysisInterval: config.performanceAnalysisInterval || 3600000, // 1 hour
      enableAutomaticOptimization: config.enableAutomaticOptimization || false
    };

    // Performance tracking
    this.performanceMetrics = {
      queryPatterns: new Map(),
      indexUsage: new Map(),
      slowQueries: [],
      optimizationHistory: []
    };

    // Index strategy patterns
    this.indexStrategies = {
      searchOptimized: ['status', 'category', 'brand', 'price', 'rating_average'],
      analyticsOptimized: ['created_at', 'category', 'brand', 'total_sales'],
      inventoryOptimized: ['status', 'availability', 'stock_quantity', 'updated_at'],
      sortingOptimized: ['featured', 'price', 'rating_average', 'created_at']
    };

    this.initializeCompoundIndexing();
  }

  async initializeCompoundIndexing() {
    console.log('Initializing advanced compound indexing system...');

    try {
      // Create comprehensive compound indexes
      await this.createAdvancedCompoundIndexes();

      // Setup query pattern analysis
      if (this.config.enableQueryAnalysis) {
        await this.setupQueryPatternAnalysis();
      }

      // Initialize performance monitoring
      if (this.config.enablePerformanceTracking) {
        await this.setupPerformanceMonitoring();
      }

      // Enable automatic optimization if configured
      if (this.config.enableAutomaticOptimization) {
        await this.setupAutomaticOptimization();
      }

      console.log('Advanced compound indexing system initialized successfully');

    } catch (error) {
      console.error('Error initializing compound indexing:', error);
      throw error;
    }
  }

  async createAdvancedCompoundIndexes() {
    console.log('Creating advanced compound indexes with intelligent optimization...');

    try {
      const products = this.collections.products;

      // Core Search and Filtering Compound Index (ESR Pattern: Equality, Sort, Range)
      await products.createIndex({
        status: 1,              // Equality filter (most selective first)
        category: 1,            // Equality filter  
        brand: 1,               // Equality filter
        featured: -1,           // Sort field (featured items first)
        price: 1,               // Range filter
        rating_average: -1,     // Sort field
        created_at: -1          // Final sort field
      }, {
        name: 'idx_products_search_optimized',
        background: this.config.backgroundIndexCreation,
        sparse: this.config.sparseIndexOptimization
      });

      // E-commerce Product Catalog Compound Index
      await products.createIndex({
        status: 1,
        availability: 1,
        category: 1,
        subcategory: 1,
        price: 1,
        stock_quantity: -1,
        rating_average: -1
      }, {
        name: 'idx_products_catalog_comprehensive',
        background: this.config.backgroundIndexCreation,
        partialFilterExpression: {
          status: 'active',
          stock_quantity: { $gt: 0 }
        }
      });

      // Advanced Analytics and Reporting Index
      await products.createIndex({
        created_at: -1,         // Time-based queries (most recent first)
        category: 1,            // Grouping dimension
        brand: 1,               // Grouping dimension  
        purchase_count: -1,     // Performance metric
        rating_average: -1,     // Quality metric
        price: 1                // Financial metric
      }, {
        name: 'idx_products_analytics_optimized',
        background: this.config.backgroundIndexCreation
      });

      // Inventory Management Compound Index
      await products.createIndex({
        status: 1,
        availability: 1,
        stock_quantity: 1,
        reserved_quantity: 1,
        supplier_id: 1,
        updated_at: -1
      }, {
        name: 'idx_products_inventory_management',
        background: this.config.backgroundIndexCreation,
        sparse: true
      });

      // Text Search and SEO Optimization Index
      await products.createIndex({
        status: 1,
        category: 1,
        '$text': {
          product_name: 'text',
          seo_title: 'text', 
          seo_description: 'text',
          marketing_tags: 'text'
        },
        rating_average: -1,
        view_count: -1
      }, {
        name: 'idx_products_text_search_optimized',
        background: this.config.backgroundIndexCreation,
        weights: {
          product_name: 10,
          seo_title: 8,
          seo_description: 5,
          marketing_tags: 3
        }
      });

      // Price and Discount Analysis Index
      await products.createIndex({
        status: 1,
        category: 1,
        brand: 1,
        'pricing.effective_price': 1,
        'pricing.discount_percentage': -1,
        'pricing.price_tier': 1,
        featured: -1
      }, {
        name: 'idx_products_pricing_analysis',
        background: this.config.backgroundIndexCreation
      });

      // Performance and Popularity Tracking Index
      await products.createIndex({
        status: 1,
        performance_score: -1,
        view_count: -1,
        purchase_count: -1,
        rating_count: -1,
        created_at: -1
      }, {
        name: 'idx_products_performance_tracking',
        background: this.config.backgroundIndexCreation
      });

      // Supplier and Procurement Index
      await products.createIndex({
        supplier_id: 1,
        status: 1,
        lead_time_days: 1,
        minimum_order_quantity: 1,
        cost: 1,
        updated_at: -1
      }, {
        name: 'idx_products_supplier_procurement',
        background: this.config.backgroundIndexCreation,
        sparse: true
      });

      // Geographic and Dimensional Analysis Index (for products with physical attributes)
      await products.createIndex({
        status: 1,
        category: 1,
        'dimensions.weight_kg': 1,
        'dimensions.volume_cubic_cm': 1,
        'shipping.shipping_class': 1,
        availability: 1
      }, {
        name: 'idx_products_dimensional_analysis',
        background: this.config.backgroundIndexCreation,
        sparse: true,
        partialFilterExpression: {
          'dimensions.weight_kg': { $exists: true }
        }
      });

      // Customer Behavior and Recommendation Index
      await products.createIndex({
        status: 1,
        'analytics.recommendation_score': -1,
        'analytics.conversion_rate': -1,
        'analytics.customer_segment_affinity': 1,
        price: 1,
        rating_average: -1
      }, {
        name: 'idx_products_recommendation_engine',
        background: this.config.backgroundIndexCreation,
        sparse: true
      });

      // Seasonal and Temporal Analysis Index
      await products.createIndex({
        status: 1,
        'lifecycle.seasonality_pattern': 1,
        'lifecycle.peak_season_months': 1,
        created_at: -1,
        discontinued_at: 1,
        'analytics.seasonal_performance_score': -1
      }, {
        name: 'idx_products_seasonal_analysis',
        background: this.config.backgroundIndexCreation,
        sparse: true
      });

      console.log('Advanced compound indexes created successfully');

    } catch (error) {
      console.error('Error creating compound indexes:', error);
      throw error;
    }
  }

  async optimizeQueryWithCompoundIndexes(queryPattern, options = {}) {
    console.log('Optimizing query with intelligent compound index selection...');
    const startTime = Date.now();

    try {
      // Analyze query pattern and select optimal index strategy
      const indexStrategy = await this.analyzeQueryPattern(queryPattern);

      // Build optimized aggregation pipeline
      const optimizedPipeline = await this.buildOptimizedPipeline(queryPattern, indexStrategy, options);

      // Execute query with performance monitoring
      const results = await this.executeOptimizedQuery(optimizedPipeline, indexStrategy);

      // Track performance metrics
      await this.trackQueryPerformance({
        queryPattern: queryPattern,
        indexStrategy: indexStrategy,
        executionTime: Date.now() - startTime,
        results: results.length,
        pipeline: optimizedPipeline
      });

      return {
        results: results,
        performance: {
          executionTime: Date.now() - startTime,
          indexesUsed: indexStrategy.recommendedIndexes,
          optimizationStrategy: indexStrategy.strategy,
          queryPlan: indexStrategy.queryPlan
        },
        optimization: {
          pipelineOptimized: true,
          indexHintsApplied: indexStrategy.hintsApplied,
          coveringIndexUsed: indexStrategy.coveringIndexUsed,
          sortOptimized: indexStrategy.sortOptimized
        }
      };

    } catch (error) {
      console.error('Error optimizing query:', error);

      // Track failed query for analysis
      await this.trackQueryPerformance({
        queryPattern: queryPattern,
        executionTime: Date.now() - startTime,
        error: error.message,
        failed: true
      });

      throw error;
    }
  }

  async analyzeQueryPattern(queryPattern) {
    console.log('Analyzing query pattern for optimal index selection...');

    const analysis = {
      filterFields: [],
      sortFields: [],
      rangeFields: [],
      equalityFields: [],
      textSearchFields: [],
      recommendedIndexes: [],
      strategy: 'compound_optimized',
      hintsApplied: false,
      coveringIndexUsed: false,
      sortOptimized: false
    };

    // Extract filter conditions
    if (queryPattern.match) {
      Object.keys(queryPattern.match).forEach(field => {
        const condition = queryPattern.match[field];

        if (typeof condition === 'object' && condition.$gte !== undefined || condition.$lte !== undefined || condition.$lt !== undefined || condition.$gt !== undefined) {
          analysis.rangeFields.push(field);
        } else if (typeof condition === 'object' && condition.$in !== undefined) {
          analysis.equalityFields.push(field);
        } else if (typeof condition === 'object' && condition.$text !== undefined) {
          analysis.textSearchFields.push(field);
        } else {
          analysis.equalityFields.push(field);
        }

        analysis.filterFields.push(field);
      });
    }

    // Extract sort conditions
    if (queryPattern.sort) {
      Object.keys(queryPattern.sort).forEach(field => {
        analysis.sortFields.push(field);
      });
    }

    // Determine optimal index strategy based on ESR (Equality, Sort, Range) pattern
    analysis.recommendedIndexes = this.selectOptimalIndexes(analysis);

    // Determine if covering index can be used
    analysis.coveringIndexUsed = await this.canUseCoveringIndex(queryPattern, analysis);

    // Check if sort can be optimized
    analysis.sortOptimized = this.canOptimizeSort(analysis);

    return analysis;
  }

  selectOptimalIndexes(analysis) {
    const recommendedIndexes = [];

    // Check if query matches existing optimized compound indexes
    const fieldSet = new Set([...analysis.equalityFields, ...analysis.sortFields, ...analysis.rangeFields]);

    // Search-optimized pattern
    if (fieldSet.has('status') && fieldSet.has('category') && fieldSet.has('price')) {
      recommendedIndexes.push('idx_products_search_optimized');
    }

    // Analytics pattern
    if (fieldSet.has('created_at') && fieldSet.has('category') && fieldSet.has('brand')) {
      recommendedIndexes.push('idx_products_analytics_optimized');
    }

    // Inventory pattern
    if (fieldSet.has('availability') && fieldSet.has('stock_quantity')) {
      recommendedIndexes.push('idx_products_inventory_management');
    }

    // Pricing pattern
    if (fieldSet.has('pricing.effective_price') || fieldSet.has('price')) {
      recommendedIndexes.push('idx_products_pricing_analysis');
    }

    // Text search pattern
    if (analysis.textSearchFields.length > 0) {
      recommendedIndexes.push('idx_products_text_search_optimized');
    }

    // Default to comprehensive catalog index if no specific pattern matches
    if (recommendedIndexes.length === 0) {
      recommendedIndexes.push('idx_products_catalog_comprehensive');
    }

    return recommendedIndexes;
  }

  async buildOptimizedPipeline(queryPattern, indexStrategy, options = {}) {
    console.log('Building optimized aggregation pipeline...');

    const pipeline = [];

    // Match stage with index-optimized field ordering
    if (queryPattern.match) {
      const optimizedMatch = this.optimizeMatchStage(queryPattern.match, indexStrategy);
      pipeline.push({ $match: optimizedMatch });
    }

    // Add computed fields stage if needed
    if (options.addComputedFields) {
      pipeline.push({
        $addFields: {
          effective_price: {
            $cond: {
              if: { $and: [{ $ne: ['$sale_price', null] }, { $lt: ['$sale_price', '$price'] }] },
              then: '$sale_price',
              else: '$price'
            }
          },
          available_quantity: { $subtract: ['$stock_quantity', { $ifNull: ['$reserved_quantity', 0] }] },
          performance_score: {
            $add: [
              { $multiply: ['$rating_average', { $ifNull: ['$rating_count', 0] }] },
              { $multiply: ['$view_count', 0.1] },
              { $multiply: ['$purchase_count', 2] }
            ]
          },
          days_since_created: {
            $divide: [{ $subtract: [new Date(), '$created_at'] }, 1000 * 60 * 60 * 24]
          }
        }
      });
    }

    // Advanced filtering stage with computed fields
    if (options.advancedFilters) {
      pipeline.push({
        $match: {
          available_quantity: { $gt: 0 },
          effective_price: { $gte: options.minPrice || 0, $lte: options.maxPrice || Number.MAX_SAFE_INTEGER }
        }
      });
    }

    // Sorting stage optimized for index usage
    if (queryPattern.sort || options.sort) {
      const sortSpec = queryPattern.sort || options.sort;
      const optimizedSort = this.optimizeSortStage(sortSpec, indexStrategy);
      pipeline.push({ $sort: optimizedSort });
    }

    // Faceting stage for complex analytics
    if (options.enableFaceting) {
      pipeline.push({
        $facet: {
          results: [
            { $limit: options.limit || 50 },
            { $skip: options.skip || 0 }
          ],
          totalCount: [{ $count: 'count' }],
          categoryStats: [
            { $group: { _id: '$category', count: { $sum: 1 }, avgPrice: { $avg: '$effective_price' } } }
          ],
          brandStats: [
            { $group: { _id: '$brand', count: { $sum: 1 }, avgRating: { $avg: '$rating_average' } } }
          ],
          priceStats: [
            {
              $group: {
                _id: null,
                minPrice: { $min: '$effective_price' },
                maxPrice: { $max: '$effective_price' },
                avgPrice: { $avg: '$effective_price' }
              }
            }
          ]
        }
      });
    } else {
      // Standard pagination
      if (options.skip) pipeline.push({ $skip: options.skip });
      if (options.limit) pipeline.push({ $limit: options.limit });
    }

    // Projection stage for covering index optimization
    if (options.projection) {
      pipeline.push({ $project: options.projection });
    }

    return pipeline;
  }

  optimizeMatchStage(matchConditions, indexStrategy) {
    console.log('Optimizing match stage for compound index efficiency...');

    const optimizedMatch = {};

    // Reorder match conditions to align with compound index field order
    // ESR Pattern: Equality conditions first, then sort fields, then range conditions

    // Add equality conditions first (most selective)
    const equalityFields = ['status', 'category', 'brand', 'availability'];
    equalityFields.forEach(field => {
      if (matchConditions[field] !== undefined) {
        optimizedMatch[field] = matchConditions[field];
      }
    });

    // Add other non-range conditions
    Object.keys(matchConditions).forEach(field => {
      if (!equalityFields.includes(field) && !this.isRangeCondition(matchConditions[field])) {
        optimizedMatch[field] = matchConditions[field];
      }
    });

    // Add range conditions last
    Object.keys(matchConditions).forEach(field => {
      if (this.isRangeCondition(matchConditions[field])) {
        optimizedMatch[field] = matchConditions[field];
      }
    });

    return optimizedMatch;
  }

  optimizeSortStage(sortSpec, indexStrategy) {
    console.log('Optimizing sort stage for index-supported ordering...');

    // Reorder sort fields to match compound index ordering when possible
    const optimizedSort = {};

    // Priority order based on common compound index patterns
    const sortPriority = ['featured', 'status', 'category', 'brand', 'price', 'rating_average', 'created_at'];

    // Add sort fields in optimized order
    sortPriority.forEach(field => {
      if (sortSpec[field] !== undefined) {
        optimizedSort[field] = sortSpec[field];
      }
    });

    // Add any remaining sort fields
    Object.keys(sortSpec).forEach(field => {
      if (!sortPriority.includes(field)) {
        optimizedSort[field] = sortSpec[field];
      }
    });

    return optimizedSort;
  }

  isRangeCondition(condition) {
    if (typeof condition !== 'object' || condition === null) return false;
    return condition.$gte !== undefined || condition.$lte !== undefined || 
           condition.$gt !== undefined || condition.$lt !== undefined || 
           condition.$in !== undefined;
  }

  async canUseCoveringIndex(queryPattern, analysis) {
    // Determine if the query can be satisfied entirely by index fields
    // This is a simplified check - in production, analyze the actual query projection
    const projectionFields = queryPattern.projection ? Object.keys(queryPattern.projection) : [];
    const requiredFields = [...analysis.filterFields, ...analysis.sortFields, ...projectionFields];

    // Check against known covering indexes
    const coveringIndexes = [
      'idx_products_search_optimized',
      'idx_products_catalog_comprehensive'
    ];

    // Simplified check - in practice, would verify actual index field coverage
    return requiredFields.length <= 6; // Assume reasonable covering index size
  }

  canOptimizeSort(analysis) {
    // Check if sort fields align with compound index ordering
    return analysis.sortFields.length > 0 && analysis.sortFields.length <= 3;
  }

  async executeOptimizedQuery(pipeline, indexStrategy) {
    console.log('Executing optimized query with performance monitoring...');

    try {
      // Apply index hints if specified
      const aggregateOptions = {
        allowDiskUse: false, // Force memory-based operations for better performance
        maxTimeMS: 30000
      };

      if (indexStrategy.recommendedIndexes.length > 0) {
        aggregateOptions.hint = indexStrategy.recommendedIndexes[0];
      }

      const cursor = this.collections.products.aggregate(pipeline, aggregateOptions);
      const results = await cursor.toArray();

      return results;

    } catch (error) {
      console.error('Error executing optimized query:', error);
      throw error;
    }
  }

  async performCompoundIndexAnalysis() {
    console.log('Performing comprehensive compound index analysis...');

    try {
      // Analyze current index usage and effectiveness
      const indexStats = await this.analyzeIndexUsage();

      // Identify slow queries and optimization opportunities
      const slowQueryAnalysis = await this.analyzeSlowQueries();

      // Generate index recommendations
      const recommendations = await this.generateIndexRecommendations(indexStats, slowQueryAnalysis);

      // Perform index efficiency analysis
      const efficiencyAnalysis = await this.analyzeIndexEfficiency();

      return {
        indexUsage: indexStats,
        slowQueries: slowQueryAnalysis,
        recommendations: recommendations,
        efficiency: efficiencyAnalysis,

        // Summary metrics
        summary: {
          totalIndexes: indexStats.totalIndexes,
          efficientIndexes: indexStats.efficientIndexes,
          underutilizedIndexes: indexStats.underutilizedIndexes,
          recommendedOptimizations: recommendations.length,
          averageQueryTime: slowQueryAnalysis.averageExecutionTime
        }
      };

    } catch (error) {
      console.error('Error performing compound index analysis:', error);
      throw error;
    }
  }

  async analyzeIndexUsage() {
    console.log('Analyzing compound index usage patterns...');

    try {
      // Get index statistics from MongoDB
      const indexStats = await this.collections.products.aggregate([
        { $indexStats: {} }
      ]).toArray();

      const analysis = {
        totalIndexes: indexStats.length,
        efficientIndexes: 0,
        underutilizedIndexes: 0,
        indexDetails: [],
        usagePatterns: new Map()
      };

      for (const indexStat of indexStats) {
        const indexAnalysis = {
          name: indexStat.name,
          accessCount: indexStat.accesses.ops,
          lastAccessed: indexStat.accesses.since,
          keyPattern: indexStat.key,

          // Calculate efficiency metrics
          efficiency: this.calculateIndexEfficiency(indexStat),

          // Usage classification
          usageClass: this.classifyIndexUsage(indexStat)
        };

        analysis.indexDetails.push(indexAnalysis);

        if (indexAnalysis.efficiency > this.config.indexEfficiencyThreshold) {
          analysis.efficientIndexes++;
        }

        if (indexAnalysis.usageClass === 'underutilized') {
          analysis.underutilizedIndexes++;
        }
      }

      return analysis;

    } catch (error) {
      console.error('Error analyzing index usage:', error);
      return { error: error.message };
    }
  }

  calculateIndexEfficiency(indexStat) {
    // Calculate index efficiency based on access patterns and selectivity
    const accessCount = indexStat.accesses.ops || 0;
    const timeSinceCreation = Date.now() - (indexStat.accesses.since ? indexStat.accesses.since.getTime() : Date.now());
    const daysSinceCreation = Math.max(1, timeSinceCreation / (1000 * 60 * 60 * 24));

    const accessesPerDay = accessCount / daysSinceCreation;

    // Efficiency score based on access frequency and recency
    return Math.min(1.0, accessesPerDay / 100); // Normalize to 0-1 scale
  }

  classifyIndexUsage(indexStat) {
    const accessCount = indexStat.accesses.ops || 0;
    const efficiency = this.calculateIndexEfficiency(indexStat);

    if (efficiency > 0.8) return 'highly_utilized';
    if (efficiency > 0.5) return 'moderately_utilized';
    if (efficiency > 0.1) return 'lightly_utilized';
    return 'underutilized';
  }

  async analyzeSlowQueries() {
    console.log('Analyzing slow queries for optimization opportunities...');

    try {
      // This would typically analyze MongoDB profiler data
      // For demo purposes, we'll simulate slow query analysis

      const slowQueryAnalysis = {
        averageExecutionTime: 150, // milliseconds
        slowestQueries: [
          {
            pattern: 'complex_search',
            averageTime: 500,
            count: 25,
            indexMisses: ['category + brand + price range']
          },
          {
            pattern: 'analytics_aggregation', 
            averageTime: 800,
            count: 12,
            indexMisses: ['created_at + category grouping']
          }
        ],
        totalSlowQueries: 37,
        optimizationOpportunities: [
          'Add compound index for category + brand + price filtering',
          'Optimize sort operations with index-aligned ordering',
          'Consider covering indexes for frequent projections'
        ]
      };

      return slowQueryAnalysis;

    } catch (error) {
      console.error('Error analyzing slow queries:', error);
      return { error: error.message };
    }
  }

  async generateIndexRecommendations(indexStats, slowQueryAnalysis) {
    console.log('Generating intelligent index recommendations...');

    const recommendations = [];

    // Analyze missing compound indexes based on slow query patterns
    if (slowQueryAnalysis.slowestQueries) {
      for (const slowQuery of slowQueryAnalysis.slowestQueries) {
        if (slowQuery.indexMisses) {
          for (const missingIndex of slowQuery.indexMisses) {
            recommendations.push({
              type: 'create_compound_index',
              priority: 'high',
              description: `Create compound index for: ${missingIndex}`,
              estimatedImprovement: '60-80% query time reduction',
              implementation: this.generateIndexCreationCommand(missingIndex)
            });
          }
        }
      }
    }

    // Recommend removal of underutilized indexes
    if (indexStats.indexDetails) {
      for (const indexDetail of indexStats.indexDetails) {
        if (indexDetail.usageClass === 'underutilized' && !indexDetail.name.includes('_id_')) {
          recommendations.push({
            type: 'remove_index',
            priority: 'medium',
            description: `Consider removing underutilized index: ${indexDetail.name}`,
            estimatedImprovement: 'Reduced storage overhead and faster writes',
            implementation: `db.collection.dropIndex("${indexDetail.name}")`
          });
        }
      }
    }

    // Recommend index consolidation opportunities
    recommendations.push({
      type: 'consolidate_indexes',
      priority: 'medium',
      description: 'Consolidate multiple single-field indexes into compound indexes',
      estimatedImprovement: 'Better query optimization and reduced storage',
      implementation: 'Analyze query patterns and create strategic compound indexes'
    });

    return recommendations;
  }

  generateIndexCreationCommand(indexDescription) {
    // Generate MongoDB index creation command based on description
    // This is a simplified implementation
    return `db.products.createIndex({/* fields based on: ${indexDescription} */}, {background: true})`;
  }

  async analyzeIndexEfficiency() {
    console.log('Analyzing compound index efficiency and optimization potential...');

    const efficiencyAnalysis = {
      overallEfficiency: 0.75, // Simulated metric
      topPerformingIndexes: [
        'idx_products_search_optimized',
        'idx_products_catalog_comprehensive'
      ],
      improvementOpportunities: [
        {
          index: 'idx_products_analytics_optimized',
          currentEfficiency: 0.6,
          potentialImprovement: 'Reorder fields to better match query patterns',
          estimatedGain: '25% performance improvement'
        }
      ],
      resourceUtilization: {
        totalIndexSize: '450 MB',
        memoryUsage: '180 MB',
        maintenanceOverhead: 'Low'
      }
    };

    return efficiencyAnalysis;
  }

  async trackQueryPerformance(performanceData) {
    try {
      const performanceRecord = {
        ...performanceData,
        timestamp: new Date(),
        collection: 'products'
      };

      await this.collections.queryPerformance.insertOne(performanceRecord);

      // Update in-memory performance tracking
      const pattern = performanceData.queryPattern.match ? 
        Object.keys(performanceData.queryPattern.match).join('+') : 'unknown';

      if (!this.performanceMetrics.queryPatterns.has(pattern)) {
        this.performanceMetrics.queryPatterns.set(pattern, {
          count: 0,
          totalTime: 0,
          averageTime: 0
        });
      }

      const patternMetrics = this.performanceMetrics.queryPatterns.get(pattern);
      patternMetrics.count++;
      patternMetrics.totalTime += performanceData.executionTime;
      patternMetrics.averageTime = patternMetrics.totalTime / patternMetrics.count;

    } catch (error) {
      console.warn('Error tracking query performance:', error);
      // Don't throw - performance tracking shouldn't break queries
    }
  }

  async setupPerformanceMonitoring() {
    console.log('Setting up compound index performance monitoring...');

    setInterval(async () => {
      try {
        await this.performPeriodicAnalysis();
      } catch (error) {
        console.warn('Error in periodic performance analysis:', error);
      }
    }, this.config.performanceAnalysisInterval);
  }

  async performPeriodicAnalysis() {
    console.log('Performing periodic compound index performance analysis...');

    try {
      // Analyze recent query performance
      const recentPerformance = await this.analyzeRecentPerformance();

      // Generate optimization recommendations
      const recommendations = await this.generateOptimizationRecommendations(recentPerformance);

      // Log analysis results
      await this.collections.indexAnalytics.insertOne({
        timestamp: new Date(),
        analysisType: 'periodic_performance',
        performance: recentPerformance,
        recommendations: recommendations
      });

      // Apply automatic optimizations if enabled
      if (this.config.enableAutomaticOptimization && recommendations.length > 0) {
        await this.applyAutomaticOptimizations(recommendations);
      }

    } catch (error) {
      console.error('Error in periodic analysis:', error);
    }
  }

  async analyzeRecentPerformance() {
    const oneHourAgo = new Date(Date.now() - 60 * 60 * 1000);

    const recentQueries = await this.collections.queryPerformance.find({
      timestamp: { $gte: oneHourAgo }
    }).toArray();

    const analysis = {
      totalQueries: recentQueries.length,
      averageExecutionTime: 0,
      slowQueries: recentQueries.filter(q => q.executionTime > this.config.slowQueryThreshold).length,
      queryPatterns: new Map()
    };

    if (recentQueries.length > 0) {
      analysis.averageExecutionTime = recentQueries.reduce((sum, q) => sum + q.executionTime, 0) / recentQueries.length;
    }

    return analysis;
  }

  async generateOptimizationRecommendations(performanceAnalysis) {
    const recommendations = [];

    if (performanceAnalysis.averageExecutionTime > this.config.slowQueryThreshold) {
      recommendations.push({
        type: 'performance_degradation',
        priority: 'high',
        description: 'Average query time has increased significantly',
        action: 'analyze_index_usage'
      });
    }

    if (performanceAnalysis.slowQueries > performanceAnalysis.totalQueries * 0.1) {
      recommendations.push({
        type: 'slow_query_threshold',
        priority: 'medium', 
        description: 'High percentage of slow queries detected',
        action: 'review_compound_indexes'
      });
    }

    return recommendations;
  }

  async applyAutomaticOptimizations(recommendations) {
    console.log('Applying automatic compound index optimizations...');

    for (const recommendation of recommendations) {
      try {
        if (recommendation.action === 'analyze_index_usage') {
          await this.performCompoundIndexAnalysis();
        } else if (recommendation.action === 'review_compound_indexes') {
          await this.reviewIndexConfiguration();
        }
      } catch (error) {
        console.warn(`Error applying optimization ${recommendation.action}:`, error);
      }
    }
  }

  async reviewIndexConfiguration() {
    console.log('Reviewing compound index configuration for optimization opportunities...');

    // This would analyze current indexes and suggest improvements
    const review = {
      timestamp: new Date(),
      reviewType: 'automated_compound_index_review',
      findings: [
        'All critical compound indexes are present',
        'Index usage patterns are within expected ranges',
        'No immediate optimization required'
      ]
    };

    await this.collections.indexAnalytics.insertOne(review);
  }
}

// Benefits of MongoDB Advanced Compound Indexing:
// - Intelligent multi-field query optimization with ESR (Equality, Sort, Range) pattern adherence
// - Advanced compound index strategies tailored for different query patterns and use cases
// - Comprehensive performance monitoring and automatic optimization recommendations  
// - Sophisticated index selection and query plan optimization
// - Built-in covering index support for maximum query performance
// - Automatic index efficiency analysis and maintenance recommendations
// - Production-ready compound indexing with minimal configuration overhead
// - Advanced query pattern analysis for optimal index design
// - Resource-aware index management with storage and memory optimization
// - SQL-compatible compound indexing through QueryLeaf integration

module.exports = {
  AdvancedCompoundIndexManager
};

Understanding MongoDB Compound Index Architecture

Advanced Multi-Field Indexing and Query Optimization Strategies

Implement sophisticated compound indexing patterns for production MongoDB deployments:

// Production-ready MongoDB compound indexing with enterprise-grade optimization and monitoring
class ProductionCompoundIndexOptimizer extends AdvancedCompoundIndexManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableDistributedIndexing: true,
      enableShardKeyOptimization: true,
      enableCrossCollectionIndexing: true,
      enableIndexPartitioning: true,
      enableAutomaticIndexMaintenance: true,
      enableComplianceIndexing: true
    };

    this.setupProductionOptimizations();
    this.initializeDistributedIndexing();
    this.setupAdvancedAnalytics();
  }

  async implementDistributedCompoundIndexing(shardingStrategy) {
    console.log('Implementing distributed compound indexing across sharded clusters...');

    const distributedStrategy = {
      // Shard-aware compound indexing
      shardKeyAlignment: {
        enableShardKeyPrefixing: true,
        optimizeForShardDistribution: true,
        minimizeCrossShardQueries: true,
        balanceIndexEfficiency: true
      },

      // Cross-shard optimization
      crossShardOptimization: {
        enableGlobalIndexes: true,
        optimizeForLatency: true,
        minimizeNetworkTraffic: true,
        enableIntelligentRouting: true
      },

      // High availability indexing
      highAvailabilityIndexing: {
        replicationAwareIndexing: true,
        automaticFailoverIndexing: true,
        consistencyLevelOptimization: true,
        geographicDistributionSupport: true
      }
    };

    return await this.deployDistributedIndexing(distributedStrategy);
  }

  async setupAdvancedIndexOptimization() {
    console.log('Setting up advanced compound index optimization...');

    const optimizationStrategies = {
      // Query pattern learning
      queryPatternLearning: {
        enableMachineLearningOptimization: true,
        adaptiveIndexCreation: true,
        predictiveIndexManagement: true,
        workloadPatternRecognition: true
      },

      // Resource optimization
      resourceOptimization: {
        memoryAwareIndexing: true,
        storageOptimization: true,
        cpuUtilizationOptimization: true,
        networkBandwidthOptimization: true
      },

      // Index lifecycle management
      lifecycleManagement: {
        automaticIndexAging: true,
        indexArchiving: true,
        historicalDataIndexing: true,
        complianceRetention: true
      }
    };

    return await this.deployOptimizationStrategies(optimizationStrategies);
  }

  async implementAdvancedQueryOptimization() {
    console.log('Implementing advanced query optimization with compound index intelligence...');

    const queryOptimizationStrategy = {
      // Query plan optimization
      queryPlanOptimization: {
        enableCostBasedOptimization: true,
        statisticsBasedPlanning: true,
        adaptiveQueryExecution: true,
        parallelExecutionOptimization: true
      },

      // Index intersection optimization
      indexIntersection: {
        enableIntelligentIntersection: true,
        costAwareIntersection: true,
        selectivityBasedOptimization: true,
        memoryCacheOptimization: true
      },

      // Covering index strategies
      coveringIndexStrategies: {
        automaticCoveringDetection: true,
        projectionOptimization: true,
        fieldOrderOptimization: true,
        sparseIndexOptimization: true
      }
    };

    return await this.deployQueryOptimizationStrategy(queryOptimizationStrategy);
  }
}

SQL-Style Compound Indexing with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB compound indexing and multi-field query optimization:

-- QueryLeaf advanced compound indexing with SQL-familiar syntax for MongoDB

-- Create comprehensive compound indexes with intelligent field ordering
CREATE COMPOUND INDEX idx_products_search_optimization ON products (
  -- ESR Pattern: Equality, Sort, Range
  status ASC,              -- Equality filter (highest selectivity)
  category ASC,            -- Equality filter
  brand ASC,               -- Equality filter  
  featured DESC,           -- Sort field (featured items first)
  price ASC,               -- Range filter
  rating_average DESC,     -- Sort field
  created_at DESC          -- Final sort field
) 
WITH (
  index_type = 'compound_optimized',
  background_creation = true,
  sparse_optimization = true,

  -- Advanced indexing options
  enable_covering_index = true,
  optimize_for_sorting = true,
  memory_usage_limit = '256MB',

  -- Performance tuning
  selectivity_threshold = 0.1,
  cardinality_analysis = true,
  enable_statistics_collection = true
);

-- E-commerce catalog compound index with partial filtering
CREATE COMPOUND INDEX idx_products_catalog_comprehensive ON products (
  status ASC,
  availability ASC,
  category ASC,
  subcategory ASC,
  price ASC,
  stock_quantity DESC,
  rating_average DESC
)
WITH (
  -- Partial index for active products only
  partial_filter = 'status = "active" AND stock_quantity > 0',
  include_null_values = false,

  -- Optimization settings  
  enable_prefix_compression = true,
  block_size = '16KB',
  fill_factor = 90
);

-- Advanced analytics compound index optimized for time-series queries
CREATE COMPOUND INDEX idx_products_analytics_temporal ON products (
  created_at DESC,         -- Time dimension (most recent first)
  category ASC,            -- Grouping dimension
  brand ASC,               -- Grouping dimension
  purchase_count DESC,     -- Performance metric
  rating_average DESC,     -- Quality metric
  price ASC                -- Financial metric
)
WITH (
  index_type = 'time_series_optimized',
  background_creation = true,

  -- Time-series specific optimizations
  enable_temporal_partitioning = true,
  partition_granularity = 'month',
  retention_policy = '2 years',

  -- Analytics optimization
  enable_aggregation_pipeline_optimization = true,
  support_window_functions = true
);

-- Text search compound index with weighted fields
CREATE COMPOUND INDEX idx_products_text_search_optimized ON products (
  status ASC,
  category ASC,

  -- Full-text search fields with weights
  FULLTEXT(
    product_name WEIGHT 10,
    seo_title WEIGHT 8,
    seo_description WEIGHT 5,
    marketing_tags WEIGHT 3
  ),

  rating_average DESC,
  view_count DESC
)
WITH (
  index_type = 'compound_text_search',
  text_search_language = 'english',
  enable_stemming = true,
  enable_stop_words = true,

  -- Text search optimization
  phrase_search_optimization = true,
  fuzzy_search_support = true,
  enable_search_analytics = true
);

-- Complex multi-field query optimization using compound indexes
WITH optimized_product_search AS (
  SELECT 
    p.*,

    -- Utilize compound index for efficient filtering
    -- idx_products_search_optimization will be used for this query pattern
    ROW_NUMBER() OVER (
      PARTITION BY p.category 
      ORDER BY 
        p.featured DESC,           -- Index-supported sort
        p.rating_average DESC,     -- Index-supported sort  
        p.price ASC                -- Index-supported sort
    ) as category_rank,

    -- Calculate derived metrics that benefit from covering indexes
    CASE 
      WHEN p.sale_price IS NOT NULL AND p.sale_price < p.price THEN p.sale_price
      ELSE p.price
    END as effective_price,

    (p.stock_quantity - COALESCE(p.reserved_quantity, 0)) as available_quantity,

    -- Performance score calculation
    (
      p.rating_average * p.rating_count + 
      p.view_count * 0.1 + 
      p.purchase_count * 2.0
    ) as performance_score,

    -- Age-based scoring
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - p.created_at) as days_since_created

  FROM products p
  WHERE 
    -- Compound index will efficiently handle this filter combination
    p.status = 'active'                    -- Uses index prefix
    AND p.category IN ('Electronics', 'Computers', 'Mobile')  -- Uses index
    AND p.brand IN ('Apple', 'Samsung', 'Sony')              -- Uses index
    AND p.featured = true OR p.rating_average >= 4.0         -- Uses index
    AND p.price BETWEEN 100.00 AND 2000.00                  -- Range condition last
    AND p.stock_quantity > 0                                 -- Additional filter

  -- Index hint to ensure optimal compound index usage
  USE INDEX (idx_products_search_optimization)
),

category_analytics AS (
  -- Utilize analytics compound index for efficient aggregation
  SELECT 
    category,
    brand,

    -- Aggregation operations optimized by compound index
    COUNT(*) as product_count,
    AVG(effective_price) as avg_price,
    AVG(rating_average) as avg_rating,
    SUM(purchase_count) as total_sales,

    -- Percentile calculations using index ordering
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY effective_price) as price_p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY effective_price) as price_median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY effective_price) as price_p75,

    -- Performance metrics
    AVG(performance_score) as avg_performance_score,
    COUNT(*) FILTER (WHERE days_since_created <= 30) as new_products_30d,

    -- Stock analysis
    AVG(available_quantity) as avg_stock_level,
    COUNT(*) FILTER (WHERE available_quantity = 0) as out_of_stock_count

  FROM optimized_product_search
  GROUP BY category, brand

  -- Use analytics compound index for grouping optimization
  USE INDEX (idx_products_analytics_temporal)
),

performance_ranking AS (
  SELECT 
    ca.*,

    -- Ranking within category using index-optimized ordering
    RANK() OVER (
      PARTITION BY ca.category 
      ORDER BY ca.total_sales DESC, ca.avg_rating DESC
    ) as category_sales_rank,

    RANK() OVER (
      PARTITION BY ca.category
      ORDER BY ca.avg_price DESC
    ) as category_price_rank,

    -- Performance classification
    CASE 
      WHEN ca.avg_performance_score > 200 THEN 'top_performer'
      WHEN ca.avg_performance_score > 150 THEN 'strong_performer'
      WHEN ca.avg_performance_score > 100 THEN 'average_performer'
      ELSE 'underperformer'
    END as performance_tier,

    -- Market position analysis
    CASE 
      WHEN ca.category_price_rank <= 3 THEN 'premium_positioning'
      WHEN ca.category_price_rank <= (SELECT COUNT(DISTINCT brand) FROM optimized_product_search WHERE category = ca.category) / 2 THEN 'mid_market'
      ELSE 'value_positioning'
    END as market_position

  FROM category_analytics ca
)

SELECT 
  pr.category,
  pr.brand,
  pr.product_count,
  ROUND(pr.avg_price, 2) as average_price,
  ROUND(pr.price_median, 2) as median_price,
  ROUND(pr.avg_rating, 2) as average_rating,
  pr.total_sales,
  pr.new_products_30d as new_products_last_30_days,
  ROUND(pr.avg_stock_level, 1) as average_stock_level,
  pr.out_of_stock_count,

  -- Performance and positioning metrics
  pr.performance_tier,
  pr.market_position,
  pr.category_sales_rank,
  pr.category_price_rank,

  -- Efficiency metrics
  CASE 
    WHEN pr.product_count > 0 THEN ROUND((pr.total_sales * 1.0 / pr.product_count), 2)
    ELSE 0
  END as sales_per_product,

  ROUND(
    CASE 
      WHEN pr.out_of_stock_count = 0 THEN 100
      ELSE ((pr.product_count - pr.out_of_stock_count) * 100.0 / pr.product_count)
    END, 
    1
  ) as stock_availability_percent,

  -- Competitive analysis
  CASE 
    WHEN pr.avg_price > pr.price_p75 AND pr.avg_rating >= 4.0 THEN 'premium_quality'
    WHEN pr.avg_price < pr.price_p25 AND pr.total_sales > pr.avg_performance_score THEN 'value_leader'
    WHEN pr.avg_rating >= 4.5 THEN 'quality_leader'
    WHEN pr.total_sales > (SELECT AVG(total_sales) FROM performance_ranking WHERE category = pr.category) * 1.5 THEN 'market_leader'
    ELSE 'standard_competitor'
  END as competitive_position

FROM performance_ranking pr
WHERE pr.product_count >= 5  -- Only brands with significant presence

-- Optimize ordering using compound index
ORDER BY 
  pr.category ASC,
  pr.total_sales DESC,
  pr.avg_rating DESC,
  pr.avg_price ASC

-- Query execution will benefit from multiple compound indexes
WITH (
  -- Query optimization hints
  enable_compound_index_intersection = true,
  prefer_covering_indexes = true,
  optimize_for_sort_performance = true,

  -- Performance monitoring
  track_index_usage = true,
  collect_execution_statistics = true,
  enable_query_plan_caching = true,

  -- Resource management
  max_memory_usage = '512MB',
  enable_spill_to_disk = false,
  parallel_processing = true
);

-- Advanced compound index performance analysis and optimization
WITH compound_index_performance AS (
  SELECT 
    index_name,
    index_type,
    key_pattern,

    -- Usage statistics
    total_accesses,
    accesses_per_day,
    last_accessed,

    -- Performance metrics
    avg_execution_time_ms,
    index_hit_ratio,
    selectivity_factor,

    -- Resource utilization
    index_size_mb,
    memory_usage_mb,
    maintenance_overhead_percent,

    -- Efficiency calculations
    (total_accesses * 1.0 / NULLIF(index_size_mb, 0)) as access_efficiency,
    (index_hit_ratio * selectivity_factor) as effectiveness_score

  FROM index_statistics
  WHERE index_type = 'compound'
  AND created_date >= CURRENT_DATE - INTERVAL '30 days'
),

index_optimization_analysis AS (
  SELECT 
    cip.*,

    -- Performance classification
    CASE 
      WHEN effectiveness_score > 0.8 THEN 'highly_effective'
      WHEN effectiveness_score > 0.6 THEN 'moderately_effective'
      WHEN effectiveness_score > 0.4 THEN 'somewhat_effective'
      ELSE 'ineffective'
    END as effectiveness_classification,

    -- Resource efficiency
    CASE 
      WHEN access_efficiency > 100 THEN 'highly_efficient'
      WHEN access_efficiency > 50 THEN 'moderately_efficient'
      WHEN access_efficiency > 10 THEN 'somewhat_efficient'
      ELSE 'inefficient'
    END as efficiency_classification,

    -- Optimization recommendations
    CASE 
      WHEN accesses_per_day < 10 AND index_size_mb > 50 THEN 'consider_removal'
      WHEN avg_execution_time_ms > 1000 THEN 'needs_optimization'
      WHEN index_hit_ratio < 0.7 THEN 'review_field_order'
      WHEN maintenance_overhead_percent > 20 THEN 'reduce_complexity'
      ELSE 'performing_well'
    END as optimization_recommendation,

    -- Priority scoring for optimization
    (
      CASE effectiveness_classification
        WHEN 'ineffective' THEN 40
        WHEN 'somewhat_effective' THEN 20  
        WHEN 'moderately_effective' THEN 10
        ELSE 0
      END +
      CASE efficiency_classification
        WHEN 'inefficient' THEN 30
        WHEN 'somewhat_efficient' THEN 15
        WHEN 'moderately_efficient' THEN 5
        ELSE 0
      END +
      CASE 
        WHEN avg_execution_time_ms > 2000 THEN 25
        WHEN avg_execution_time_ms > 1000 THEN 15
        WHEN avg_execution_time_ms > 500 THEN 5
        ELSE 0
      END
    ) as optimization_priority_score

  FROM compound_index_performance cip
),

optimization_recommendations AS (
  SELECT 
    ioa.index_name,
    ioa.effectiveness_classification,
    ioa.efficiency_classification,
    ioa.optimization_recommendation,
    ioa.optimization_priority_score,

    -- Detailed recommendations based on analysis
    CASE ioa.optimization_recommendation
      WHEN 'consider_removal' THEN 
        FORMAT('Index %s is rarely used (%s accesses/day) but consumes %s MB - consider removal',
               ioa.index_name, ioa.accesses_per_day, ioa.index_size_mb)
      WHEN 'needs_optimization' THEN
        FORMAT('Index %s has slow execution time (%s ms avg) - review field order and selectivity',
               ioa.index_name, ioa.avg_execution_time_ms)
      WHEN 'review_field_order' THEN
        FORMAT('Index %s has low hit ratio (%s) - consider reordering fields for better selectivity',
               ioa.index_name, ROUND(ioa.index_hit_ratio * 100, 1))
      WHEN 'reduce_complexity' THEN
        FORMAT('Index %s has high maintenance overhead (%s%%) - consider simplifying or partitioning',
               ioa.index_name, ioa.maintenance_overhead_percent)
      ELSE
        FORMAT('Index %s is performing well - no immediate action required', ioa.index_name)
    END as detailed_recommendation,

    -- Implementation guidance
    CASE ioa.optimization_recommendation
      WHEN 'consider_removal' THEN 'DROP INDEX ' || ioa.index_name
      WHEN 'needs_optimization' THEN 'REINDEX ' || ioa.index_name || ' WITH (optimization_level = high)'
      WHEN 'review_field_order' THEN 'RECREATE INDEX ' || ioa.index_name || ' WITH (field_order_optimization = true)'
      WHEN 'reduce_complexity' THEN 'PARTITION INDEX ' || ioa.index_name || ' BY (time_range = monthly)'
      ELSE 'MAINTAIN INDEX ' || ioa.index_name || ' WITH (current_settings)'
    END as implementation_command,

    -- Expected impact
    CASE ioa.optimization_recommendation
      WHEN 'consider_removal' THEN 'Reduced storage cost and faster write operations'
      WHEN 'needs_optimization' THEN 'Improved query performance by 30-50%'
      WHEN 'review_field_order' THEN 'Better index utilization and reduced scan time'
      WHEN 'reduce_complexity' THEN 'Lower maintenance overhead and better scalability'
      ELSE 'Continued optimal performance'
    END as expected_impact

  FROM index_optimization_analysis ioa
  WHERE ioa.optimization_priority_score > 0
)

SELECT 
  or_rec.index_name,
  or_rec.effectiveness_classification,
  or_rec.efficiency_classification,
  or_rec.optimization_recommendation,
  or_rec.optimization_priority_score,
  or_rec.detailed_recommendation,
  or_rec.implementation_command,
  or_rec.expected_impact,

  -- Priority classification
  CASE 
    WHEN or_rec.optimization_priority_score > 60 THEN 'critical_priority'
    WHEN or_rec.optimization_priority_score > 40 THEN 'high_priority'
    WHEN or_rec.optimization_priority_score > 20 THEN 'medium_priority'
    ELSE 'low_priority'
  END as optimization_priority,

  -- Timeline recommendation  
  CASE 
    WHEN or_rec.optimization_priority_score > 60 THEN 'immediate_action_required'
    WHEN or_rec.optimization_priority_score > 40 THEN 'address_within_week'
    WHEN or_rec.optimization_priority_score > 20 THEN 'address_within_month'
    ELSE 'monitor_and_review_quarterly'
  END as timeline_recommendation

FROM optimization_recommendations or_rec

-- Order by priority for implementation planning
ORDER BY 
  or_rec.optimization_priority_score DESC,
  or_rec.index_name ASC;

-- Real-time compound index monitoring dashboard
CREATE VIEW compound_index_health_dashboard AS
WITH real_time_index_metrics AS (
  SELECT 
    -- Current timestamp for dashboard refresh
    CURRENT_TIMESTAMP as dashboard_time,

    -- Index overview statistics
    (SELECT COUNT(*) FROM compound_indexes WHERE status = 'active') as total_compound_indexes,
    (SELECT COUNT(*) FROM compound_indexes WHERE effectiveness_score > 0.8) as high_performing_indexes,
    (SELECT COUNT(*) FROM compound_indexes WHERE effectiveness_score < 0.4) as underperforming_indexes,

    -- Performance aggregates
    (SELECT AVG(avg_execution_time_ms) FROM index_performance_metrics 
     WHERE measurement_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as system_avg_execution_time,

    (SELECT AVG(index_hit_ratio) FROM index_performance_metrics
     WHERE measurement_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as system_avg_hit_ratio,

    -- Resource utilization
    (SELECT SUM(index_size_mb) FROM compound_indexes) as total_index_size_mb,
    (SELECT SUM(memory_usage_mb) FROM compound_indexes) as total_memory_usage_mb,

    -- Query optimization metrics
    (SELECT COUNT(*) FROM slow_queries 
     WHERE query_time >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as recent_slow_queries,

    (SELECT AVG(optimization_effectiveness) FROM query_optimizations
     WHERE applied_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as recent_optimization_effectiveness
),

index_health_summary AS (
  SELECT 
    index_name,
    effectiveness_score,
    efficiency_classification,
    avg_execution_time_ms,
    accesses_per_hour,
    last_optimized,

    -- Health indicators
    CASE 
      WHEN effectiveness_score > 0.8 AND avg_execution_time_ms < 100 THEN '🟢 Excellent'
      WHEN effectiveness_score > 0.6 AND avg_execution_time_ms < 500 THEN '🟡 Good' 
      WHEN effectiveness_score > 0.4 AND avg_execution_time_ms < 1000 THEN '🟠 Fair'
      ELSE '🔴 Poor'
    END as health_status,

    -- Trend indicators
    LAG(effectiveness_score) OVER (
      PARTITION BY index_name 
      ORDER BY measurement_timestamp
    ) as prev_effectiveness_score

  FROM compound_index_performance 
  WHERE measurement_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  QUALIFY ROW_NUMBER() OVER (PARTITION BY index_name ORDER BY measurement_timestamp DESC) = 1
)

SELECT 
  dashboard_time,

  -- System overview
  total_compound_indexes,
  high_performing_indexes,
  underperforming_indexes,
  FORMAT('%s high, %s underperforming', high_performing_indexes, underperforming_indexes) as performance_summary,

  -- Performance metrics
  ROUND(system_avg_execution_time, 1) as avg_execution_time_ms,
  ROUND(system_avg_hit_ratio * 100, 1) as avg_hit_ratio_percent,
  recent_slow_queries,

  -- Resource utilization
  ROUND(total_index_size_mb, 1) as total_index_size_mb,
  ROUND(total_memory_usage_mb, 1) as memory_usage_mb,
  ROUND((total_memory_usage_mb / NULLIF(total_index_size_mb, 0)) * 100, 1) as memory_efficiency_percent,

  -- Health status distribution
  (SELECT COUNT(*) FROM index_health_summary WHERE health_status LIKE '%Excellent%') as excellent_indexes,
  (SELECT COUNT(*) FROM index_health_summary WHERE health_status LIKE '%Good%') as good_indexes,
  (SELECT COUNT(*) FROM index_health_summary WHERE health_status LIKE '%Fair%') as fair_indexes,
  (SELECT COUNT(*) FROM index_health_summary WHERE health_status LIKE '%Poor%') as poor_indexes,

  -- Overall system health
  CASE 
    WHEN recent_slow_queries > 10 OR underperforming_indexes > total_compound_indexes * 0.3 THEN 'CRITICAL'
    WHEN recent_slow_queries > 5 OR underperforming_indexes > total_compound_indexes * 0.2 THEN 'WARNING'
    WHEN high_performing_indexes >= total_compound_indexes * 0.8 THEN 'EXCELLENT'
    ELSE 'HEALTHY'
  END as system_health_status,

  -- Active alerts
  ARRAY[
    CASE WHEN recent_slow_queries > 10 THEN FORMAT('%s slow queries in last 5 minutes', recent_slow_queries) END,
    CASE WHEN underperforming_indexes > 5 THEN FORMAT('%s compound indexes need optimization', underperforming_indexes) END,
    CASE WHEN system_avg_execution_time > 500 THEN 'High average query execution time detected' END,
    CASE WHEN memory_efficiency_percent < 50 THEN 'Low memory efficiency - consider index optimization' END
  ]::TEXT[] as active_alerts,

  -- Top performing indexes
  (SELECT JSON_AGG(
    JSON_BUILD_OBJECT(
      'index_name', index_name,
      'health', health_status,
      'execution_time', avg_execution_time_ms || 'ms',
      'accesses_per_hour', accesses_per_hour
    )
  ) FROM index_health_summary ORDER BY effectiveness_score DESC LIMIT 5) as top_performing_indexes,

  -- Indexes needing attention
  (SELECT JSON_AGG(
    JSON_BUILD_OBJECT(
      'index_name', index_name,
      'health', health_status,
      'execution_time', avg_execution_time_ms || 'ms',
      'issue', 'Performance degradation detected'
    )
  ) FROM index_health_summary WHERE health_status LIKE '%Poor%') as indexes_needing_attention

FROM real_time_index_metrics;

-- QueryLeaf provides comprehensive compound indexing capabilities:
-- 1. Advanced multi-field compound indexing with ESR pattern optimization
-- 2. Intelligent index selection and query plan optimization  
-- 3. Comprehensive performance monitoring and analytics
-- 4. Automated index maintenance and optimization recommendations
-- 5. SQL-familiar compound index creation and management syntax
-- 6. Advanced covering index strategies for maximum query performance
-- 7. Time-series and analytics-optimized compound indexing patterns
-- 8. Production-ready index monitoring with real-time health dashboards
-- 9. Cross-collection and distributed indexing optimization
-- 10. Integration with MongoDB's native compound indexing optimizations

Best Practices for Production Compound Indexing

Index Design Strategy and Performance Optimization

Essential principles for effective MongoDB compound indexing deployment:

ESR Pattern Adherence: Design compound indexes following Equality, Sort, Range field ordering for optimal query performance
Selectivity Analysis: Place most selective fields first in compound indexes to minimize document scan overhead
Query Pattern Alignment: Analyze application query patterns and create compound indexes that match common filter combinations
Covering Index Strategy: Design covering indexes that include all fields needed by frequent queries to eliminate document lookups
Index Intersection Planning: Understand when MongoDB will use index intersection vs. compound indexes for query optimization
Sort Optimization: Align sort operations with compound index field ordering to avoid expensive in-memory sorting

Scalability and Production Deployment

Optimize compound indexing for enterprise-scale requirements:

Shard Key Integration: Design compound indexes that work effectively with shard key distributions in sharded deployments
Resource Management: Monitor index size, memory usage, and maintenance overhead for optimal system resource utilization
Automated Optimization: Implement automated index analysis and optimization based on changing query patterns and performance metrics
Cross-Collection Strategy: Design compound indexing strategies that optimize queries spanning multiple collections
Compliance Integration: Ensure compound indexing meets audit, security, and data governance requirements
Operational Integration: Integrate compound index monitoring with existing alerting and operational workflows

Conclusion

MongoDB compound indexes provide comprehensive multi-field query optimization capabilities that enable optimal performance for complex query patterns through intelligent field ordering, advanced selectivity analysis, and sophisticated query plan optimization. The native compound indexing support ensures that multi-field queries benefit from MongoDB's optimized index intersection, covering index strategies, and ESR pattern adherence with minimal configuration overhead.

Key MongoDB Compound Indexing benefits include:

Intelligent Query Optimization: Advanced compound indexing with ESR pattern adherence for optimal multi-field query performance
Comprehensive Performance Analysis: Built-in index usage monitoring with automated optimization recommendations
Production-Ready Scalability: Enterprise-grade compound indexing strategies that scale efficiently across distributed deployments
Resource-Aware Management: Intelligent index size and memory optimization for optimal system resource utilization
Advanced Query Planning: Sophisticated query plan optimization with covering index support and index intersection intelligence
SQL Accessibility: Familiar SQL-style compound indexing operations through QueryLeaf for accessible database optimization

Whether you're optimizing e-commerce search queries, analytics aggregations, time-series data analysis, or complex multi-dimensional filtering operations, MongoDB compound indexes with QueryLeaf's familiar SQL interface provide the foundation for efficient, scalable, and high-performance multi-field query optimization.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB compound indexing while providing SQL-familiar syntax for index creation, performance monitoring, and optimization strategies. Advanced compound indexing patterns, ESR rule adherence, and covering index strategies are seamlessly handled through familiar SQL constructs, making sophisticated multi-field query optimization accessible to SQL-oriented development teams.

The combination of MongoDB's robust compound indexing capabilities with SQL-style index management operations makes it an ideal platform for applications requiring both complex multi-field query performance and familiar database optimization patterns, ensuring your queries can scale efficiently while maintaining optimal performance across diverse query patterns and data access requirements.

November 19, 2025
25 min read

MongoDB Document Embedding vs Referencing Patterns: Advanced Data Modeling for High-Performance Applications

Effective data modeling is fundamental to application performance, scalability, and maintainability, particularly in document-oriented databases where developers have the flexibility to structure data in multiple ways. Traditional relational databases enforce normalized structures through foreign key relationships, but this approach often requires complex joins that become performance bottlenecks as data volumes and query complexity increase.

MongoDB's document model enables sophisticated data modeling strategies through document embedding and referencing patterns that can dramatically improve query performance and simplify application logic. Unlike relational databases that require expensive joins across multiple tables, MongoDB allows developers to embed related data directly within documents or use efficient referencing patterns optimized for document retrieval and updates.

The Traditional Relational Modeling Challenge

Relational database modeling often requires complex normalization and expensive join operations:

-- Traditional PostgreSQL normalized schema - complex joins required

-- Order management system with multiple related tables
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    company_name VARCHAR(500) NOT NULL,
    email VARCHAR(320) NOT NULL UNIQUE,
    phone VARCHAR(20),

    -- Address information (could be separate table)
    billing_address_line1 VARCHAR(200) NOT NULL,
    billing_address_line2 VARCHAR(200),
    billing_city VARCHAR(100) NOT NULL,
    billing_state VARCHAR(50) NOT NULL,
    billing_postal_code VARCHAR(20) NOT NULL,
    billing_country VARCHAR(3) NOT NULL DEFAULT 'USA',

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    sku VARCHAR(100) NOT NULL UNIQUE,
    name VARCHAR(500) NOT NULL,
    description TEXT,
    base_price DECIMAL(10,2) NOT NULL,
    category_id UUID NOT NULL,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (category_id) REFERENCES product_categories(category_id)
);

CREATE TABLE product_categories (
    category_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    category_name VARCHAR(200) NOT NULL,
    parent_category_id UUID,
    category_path TEXT, -- Materialized path for hierarchy

    FOREIGN KEY (parent_category_id) REFERENCES product_categories(category_id)
);

CREATE TABLE orders (
    order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    order_number VARCHAR(50) NOT NULL UNIQUE,
    order_status VARCHAR(20) NOT NULL DEFAULT 'pending',

    -- Order totals
    subtotal DECIMAL(12,2) NOT NULL DEFAULT 0.00,
    tax_amount DECIMAL(12,2) NOT NULL DEFAULT 0.00,
    shipping_amount DECIMAL(12,2) NOT NULL DEFAULT 0.00,
    discount_amount DECIMAL(12,2) NOT NULL DEFAULT 0.00,
    total_amount DECIMAL(12,2) NOT NULL DEFAULT 0.00,

    -- Shipping information
    shipping_address_line1 VARCHAR(200),
    shipping_address_line2 VARCHAR(200),
    shipping_city VARCHAR(100),
    shipping_state VARCHAR(50),
    shipping_postal_code VARCHAR(20),
    shipping_country VARCHAR(3),

    -- Timestamps
    order_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    shipped_date TIMESTAMP,
    delivered_date TIMESTAMP,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),

    CONSTRAINT chk_order_status 
        CHECK (order_status IN ('pending', 'confirmed', 'processing', 'shipped', 'delivered', 'cancelled'))
);

CREATE TABLE order_items (
    order_item_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID NOT NULL,
    product_id UUID NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    line_total DECIMAL(12,2) NOT NULL,

    -- Product snapshot data (denormalized for historical accuracy)
    product_sku VARCHAR(100) NOT NULL,
    product_name VARCHAR(500) NOT NULL,
    product_description TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(product_id),

    CONSTRAINT chk_quantity_positive CHECK (quantity > 0),
    CONSTRAINT chk_unit_price_positive CHECK (unit_price >= 0),
    CONSTRAINT chk_line_total_calculation CHECK (line_total = quantity * unit_price)
);

CREATE TABLE order_status_history (
    status_history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID NOT NULL,
    previous_status VARCHAR(20),
    new_status VARCHAR(20) NOT NULL,
    status_changed_by UUID, -- User who changed the status
    status_change_reason TEXT,
    status_changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE
);

-- Complex query to get complete order information - expensive joins
SELECT 
    -- Order information
    o.order_id,
    o.order_number,
    o.order_status,
    o.total_amount,
    o.order_date,

    -- Customer information (requires join)
    c.customer_id,
    c.company_name,
    c.email as customer_email,
    c.phone as customer_phone,

    -- Customer billing address
    c.billing_address_line1,
    c.billing_address_line2,
    c.billing_city,
    c.billing_state,
    c.billing_postal_code,
    c.billing_country,

    -- Order shipping address
    o.shipping_address_line1,
    o.shipping_address_line2,
    o.shipping_city,
    o.shipping_state,
    o.shipping_postal_code,
    o.shipping_country,

    -- Order items aggregation (requires complex subquery)
    (
        SELECT JSON_AGG(
            JSON_BUILD_OBJECT(
                'product_id', oi.product_id,
                'product_sku', oi.product_sku,
                'product_name', oi.product_name,
                'quantity', oi.quantity,
                'unit_price', oi.unit_price,
                'line_total', oi.line_total,
                'category', pc.category_name,
                'category_path', pc.category_path
            ) ORDER BY oi.created_at
        )
        FROM order_items oi
        JOIN products p ON oi.product_id = p.product_id
        JOIN product_categories pc ON p.category_id = pc.category_id
        WHERE oi.order_id = o.order_id
    ) as order_items,

    -- Order status history (requires another subquery)
    (
        SELECT JSON_AGG(
            JSON_BUILD_OBJECT(
                'previous_status', osh.previous_status,
                'new_status', osh.new_status,
                'changed_at', osh.status_changed_at,
                'reason', osh.status_change_reason
            ) ORDER BY osh.status_changed_at DESC
        )
        FROM order_status_history osh
        WHERE osh.order_id = o.order_id
    ) as status_history

FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_id = $1;

-- Performance problems with traditional approach:
-- 1. Multiple JOIN operations create expensive query execution plans
-- 2. N+1 query problems when loading order lists with items
-- 3. Complex aggregations require subqueries and temporary result sets
-- 4. Schema changes require coordinated migrations across multiple tables
-- 5. Maintaining referential integrity across tables adds overhead
-- 6. Distributed transactions become complex with multiple related tables
-- 7. Caching strategies are complicated by normalized data spread across tables
-- 8. Application code becomes complex managing relationships between entities
-- 9. Query optimization requires deep understanding of join algorithms and indexing
-- 10. Scaling reads requires careful replication and read-replica routing strategies

MongoDB provides flexible document modeling with embedding and referencing patterns:

// MongoDB Document Modeling - flexible embedding and referencing patterns
const { MongoClient, ObjectId } = require('mongodb');

// Advanced MongoDB Document Modeling Manager
class MongoDocumentModelingManager {
  constructor() {
    this.client = null;
    this.db = null;
    this.modelingStrategies = new Map();
    this.performanceMetrics = new Map();
  }

  async initialize() {
    console.log('Initializing MongoDB Document Modeling Manager...');

    // Connect with optimized settings for document operations
    this.client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017', {
      // Optimized for document operations
      maxPoolSize: 20,
      minPoolSize: 5,
      maxIdleTimeMS: 30000,

      // Read preferences for document modeling
      readPreference: 'primaryPreferred',
      readConcern: { level: 'local' },

      // Write concern for consistency
      writeConcern: { w: 1, j: true },

      // Compression for large documents
      compressors: ['zlib'],

      appName: 'DocumentModelingManager'
    });

    await this.client.connect();
    this.db = this.client.db('ecommerce');

    // Initialize document modeling strategies
    await this.setupModelingStrategies();

    console.log('✅ MongoDB Document Modeling Manager initialized');
  }

  async setupModelingStrategies() {
    console.log('Setting up document modeling strategies...');

    const strategies = {
      // Embedding strategy for one-to-few relationships
      'embed_small_related': {
        name: 'Embed Small Related Data',
        description: 'Embed small, frequently accessed related documents',
        useCases: ['order_items', 'user_preferences', 'product_variants'],
        benefits: ['Single query retrieval', 'Atomic updates', 'Better performance'],
        limitations: ['Document size limits', 'Update complexity for arrays'],
        maxDocumentSize: 16777216, // 16MB MongoDB limit
        maxArrayElements: 1000     // Practical limit for embedded arrays
      },

      // Referencing strategy for one-to-many relationships
      'reference_large_collections': {
        name: 'Reference Large Collections',
        description: 'Use references for large or independently managed collections',
        useCases: ['user_orders', 'product_reviews', 'transaction_history'],
        benefits: ['Flexible querying', 'Independent updates', 'Smaller documents'],
        limitations: ['Multiple queries needed', 'No atomic cross-document updates'],
        maxReferencedDocuments: 1000000 // Practical query limit
      },

      // Hybrid strategy for complex relationships
      'hybrid_denormalization': {
        name: 'Hybrid Denormalization',
        description: 'Combine embedding and referencing with selective denormalization',
        useCases: ['order_with_customer_summary', 'product_with_category_details'],
        benefits: ['Optimized for read patterns', 'Reduced query complexity', 'Good performance'],
        limitations: ['Data duplication', 'Update coordination needed'],
        denormalizationFields: ['frequently_accessed', 'rarely_changed', 'small_size']
      }
    };

    for (const [strategyKey, strategy] of Object.entries(strategies)) {
      this.modelingStrategies.set(strategyKey, strategy);
    }

    console.log('✅ Document modeling strategies initialized');
  }

  // Embedding Pattern Implementation
  async createEmbeddedOrderDocument(orderData) {
    console.log('Creating embedded order document...');

    try {
      const embeddedOrder = {
        _id: new ObjectId(),
        orderNumber: orderData.orderNumber,
        orderDate: new Date(),
        status: 'pending',

        // Embedded customer information (frequently accessed, rarely changes)
        customer: {
          customerId: orderData.customer.customerId,
          companyName: orderData.customer.companyName,
          email: orderData.customer.email,
          phone: orderData.customer.phone,

          // Embedded billing address
          billingAddress: {
            line1: orderData.customer.billingAddress.line1,
            line2: orderData.customer.billingAddress.line2,
            city: orderData.customer.billingAddress.city,
            state: orderData.customer.billingAddress.state,
            postalCode: orderData.customer.billingAddress.postalCode,
            country: orderData.customer.billingAddress.country
          }
        },

        // Embedded shipping address
        shippingAddress: {
          line1: orderData.shippingAddress.line1,
          line2: orderData.shippingAddress.line2,
          city: orderData.shippingAddress.city,
          state: orderData.shippingAddress.state,
          postalCode: orderData.shippingAddress.postalCode,
          country: orderData.shippingAddress.country
        },

        // Embedded order items (one-to-few relationship)
        items: orderData.items.map(item => ({
          itemId: new ObjectId(),
          productId: item.productId,

          // Denormalized product information (snapshot for historical accuracy)
          product: {
            sku: item.product.sku,
            name: item.product.name,
            description: item.product.description,
            category: item.product.category,
            categoryPath: item.product.categoryPath
          },

          quantity: item.quantity,
          unitPrice: item.unitPrice,
          lineTotal: item.quantity * item.unitPrice,

          // Item-specific metadata
          addedAt: new Date(),
          notes: item.notes || null
        })),

        // Order totals (calculated from items)
        totals: {
          subtotal: orderData.items.reduce((sum, item) => sum + (item.quantity * item.unitPrice), 0),
          taxAmount: orderData.taxAmount || 0,
          shippingAmount: orderData.shippingAmount || 0,
          discountAmount: orderData.discountAmount || 0,
          total: 0 // Will be calculated
        },

        // Embedded status history
        statusHistory: [{
          previousStatus: null,
          newStatus: 'pending',
          changedAt: new Date(),
          changedBy: orderData.createdBy,
          reason: 'Order created'
        }],

        // Payment information (embedded for atomic updates)
        payment: {
          paymentMethod: orderData.payment.method,
          paymentStatus: 'pending',
          transactions: []
        },

        // Metadata
        metadata: {
          createdAt: new Date(),
          updatedAt: new Date(),
          createdBy: orderData.createdBy,
          version: 1,
          dataModelingStrategy: 'embedded'
        }
      };

      // Calculate total
      embeddedOrder.totals.total = 
        embeddedOrder.totals.subtotal + 
        embeddedOrder.totals.taxAmount + 
        embeddedOrder.totals.shippingAmount - 
        embeddedOrder.totals.discountAmount;

      // Insert the complete embedded document
      const result = await this.db.collection('orders_embedded').insertOne(embeddedOrder);

      console.log('✅ Embedded order document created:', result.insertedId);

      return {
        success: true,
        orderId: result.insertedId,
        strategy: 'embedded',
        documentSize: JSON.stringify(embeddedOrder).length,
        embeddedCollections: ['customer', 'items', 'statusHistory', 'payment']
      };

    } catch (error) {
      console.error('Error creating embedded order document:', error);
      return { success: false, error: error.message };
    }
  }

  // Referencing Pattern Implementation
  async createReferencedOrderDocument(orderData) {
    console.log('Creating referenced order document...');

    const session = this.client.startSession();

    try {
      return await session.withTransaction(async () => {
        // Create main order document with references
        const referencedOrder = {
          _id: new ObjectId(),
          orderNumber: orderData.orderNumber,
          orderDate: new Date(),
          status: 'pending',

          // Reference to customer document
          customerId: new ObjectId(orderData.customer.customerId),

          // Embedded shipping address (order-specific, not shared)
          shippingAddress: {
            line1: orderData.shippingAddress.line1,
            line2: orderData.shippingAddress.line2,
            city: orderData.shippingAddress.city,
            state: orderData.shippingAddress.state,
            postalCode: orderData.shippingAddress.postalCode,
            country: orderData.shippingAddress.country
          },

          // Order totals
          totals: {
            subtotal: 0, // Will be calculated from items
            taxAmount: orderData.taxAmount || 0,
            shippingAmount: orderData.shippingAmount || 0,
            discountAmount: orderData.discountAmount || 0,
            total: 0
          },

          // Reference to payment document
          paymentId: null, // Will be set after payment creation

          // Metadata
          metadata: {
            createdAt: new Date(),
            updatedAt: new Date(),
            createdBy: orderData.createdBy,
            version: 1,
            dataModelingStrategy: 'referenced'
          }
        };

        // Insert main order document
        const orderResult = await this.db.collection('orders_referenced')
          .insertOne(referencedOrder, { session });

        // Create separate order items documents
        const orderItems = orderData.items.map(item => ({
          _id: new ObjectId(),
          orderId: orderResult.insertedId,
          productId: new ObjectId(item.productId),

          quantity: item.quantity,
          unitPrice: item.unitPrice,
          lineTotal: item.quantity * item.unitPrice,

          // Metadata
          addedAt: new Date(),
          notes: item.notes || null
        }));

        const itemsResult = await this.db.collection('order_items_referenced')
          .insertMany(orderItems, { session });

        // Calculate and update order totals
        const subtotal = orderItems.reduce((sum, item) => sum + item.lineTotal, 0);
        const total = subtotal + referencedOrder.totals.taxAmount + 
                     referencedOrder.totals.shippingAmount - 
                     referencedOrder.totals.discountAmount;

        await this.db.collection('orders_referenced').updateOne(
          { _id: orderResult.insertedId },
          { 
            $set: { 
              'totals.subtotal': subtotal,
              'totals.total': total,
              'metadata.updatedAt': new Date()
            }
          },
          { session }
        );

        // Create initial status history document
        await this.db.collection('order_status_history').insertOne({
          _id: new ObjectId(),
          orderId: orderResult.insertedId,
          previousStatus: null,
          newStatus: 'pending',
          changedAt: new Date(),
          changedBy: orderData.createdBy,
          reason: 'Order created'
        }, { session });

        console.log('✅ Referenced order documents created');

        return {
          success: true,
          orderId: orderResult.insertedId,
          strategy: 'referenced',
          relatedCollections: {
            orderItems: Object.values(itemsResult.insertedIds).length,
            statusHistory: 1
          }
        };
      });

    } catch (error) {
      console.error('Error creating referenced order document:', error);
      return { success: false, error: error.message };
    } finally {
      await session.endSession();
    }
  }

  // Hybrid Pattern Implementation
  async createHybridOrderDocument(orderData) {
    console.log('Creating hybrid order document with selective denormalization...');

    try {
      const hybridOrder = {
        _id: new ObjectId(),
        orderNumber: orderData.orderNumber,
        orderDate: new Date(),
        status: 'pending',

        // Hybrid customer approach: embed summary, reference full document
        customer: {
          // Denormalized frequently accessed fields
          customerId: new ObjectId(orderData.customer.customerId),
          companyName: orderData.customer.companyName,
          email: orderData.customer.email,

          // Reference for complete customer information
          customerRef: {
            collection: 'customers',
            id: new ObjectId(orderData.customer.customerId)
          }
        },

        // Embedded shipping address
        shippingAddress: {
          line1: orderData.shippingAddress.line1,
          line2: orderData.shippingAddress.line2,
          city: orderData.shippingAddress.city,
          state: orderData.shippingAddress.state,
          postalCode: orderData.shippingAddress.postalCode,
          country: orderData.shippingAddress.country
        },

        // Hybrid items approach: embed summary, reference details for large catalogs
        itemsSummary: {
          totalItems: orderData.items.length,
          totalQuantity: orderData.items.reduce((sum, item) => sum + item.quantity, 0),
          uniqueProducts: [...new Set(orderData.items.map(item => item.productId))].length,

          // Embed small item summaries for quick display
          quickView: orderData.items.slice(0, 5).map(item => ({
            productId: new ObjectId(item.productId),
            productName: item.product.name,
            quantity: item.quantity,
            unitPrice: item.unitPrice,
            lineTotal: item.quantity * item.unitPrice
          }))
        },

        // Reference to complete items collection for large orders
        itemsCollection: orderData.items.length > 10 ? {
          collection: 'order_items_detailed',
          orderId: null // Will be set after document creation
        } : null,

        // Embed all items for small orders (< 10 items)
        items: orderData.items.length <= 10 ? orderData.items.map(item => ({
          itemId: new ObjectId(),
          productId: new ObjectId(item.productId),

          // Denormalized product summary
          product: {
            sku: item.product.sku,
            name: item.product.name,
            category: item.product.category
          },

          quantity: item.quantity,
          unitPrice: item.unitPrice,
          lineTotal: item.quantity * item.unitPrice
        })) : [],

        // Order totals with embedded calculations
        totals: {
          subtotal: orderData.items.reduce((sum, item) => sum + (item.quantity * item.unitPrice), 0),
          taxAmount: orderData.taxAmount || 0,
          shippingAmount: orderData.shippingAmount || 0,
          discountAmount: orderData.discountAmount || 0,
          total: 0 // Will be calculated
        },

        // Recent status embedded, full history referenced
        currentStatus: {
          status: 'pending',
          updatedAt: new Date(),
          updatedBy: orderData.createdBy
        },

        statusHistoryRef: {
          collection: 'order_status_history',
          orderId: null // Will be set after document creation
        },

        // Metadata with modeling strategy information
        metadata: {
          createdAt: new Date(),
          updatedAt: new Date(),
          createdBy: orderData.createdBy,
          version: 1,
          dataModelingStrategy: 'hybrid',
          embeddingDecisions: {
            customer: 'partial_denormalization',
            items: orderData.items.length <= 10 ? 'embedded' : 'referenced_with_summary',
            statusHistory: 'current_embedded_history_referenced'
          }
        }
      };

      // Calculate total
      hybridOrder.totals.total = 
        hybridOrder.totals.subtotal + 
        hybridOrder.totals.taxAmount + 
        hybridOrder.totals.shippingAmount - 
        hybridOrder.totals.discountAmount;

      // Insert the hybrid order document
      const result = await this.db.collection('orders_hybrid').insertOne(hybridOrder);

      // If large order, create separate detailed items collection
      if (orderData.items.length > 10) {
        const detailedItems = orderData.items.map(item => ({
          _id: new ObjectId(),
          orderId: result.insertedId,
          productId: new ObjectId(item.productId),

          // Full product information for detailed operations
          product: {
            sku: item.product.sku,
            name: item.product.name,
            description: item.product.description,
            category: item.product.category,
            categoryPath: item.product.categoryPath,
            specifications: item.product.specifications
          },

          quantity: item.quantity,
          unitPrice: item.unitPrice,
          lineTotal: item.quantity * item.unitPrice,

          // Additional item metadata
          addedAt: new Date(),
          notes: item.notes || null,
          customizations: item.customizations || {}
        }));

        await this.db.collection('order_items_detailed').insertMany(detailedItems);

        // Update reference in main document
        await this.db.collection('orders_hybrid').updateOne(
          { _id: result.insertedId },
          { $set: { 'itemsCollection.orderId': result.insertedId } }
        );
      }

      // Create initial status history
      await this.db.collection('order_status_history').insertOne({
        _id: new ObjectId(),
        orderId: result.insertedId,
        previousStatus: null,
        newStatus: 'pending',
        changedAt: new Date(),
        changedBy: orderData.createdBy,
        reason: 'Order created'
      });

      // Update status history reference
      await this.db.collection('orders_hybrid').updateOne(
        { _id: result.insertedId },
        { $set: { 'statusHistoryRef.orderId': result.insertedId } }
      );

      console.log('✅ Hybrid order document created:', result.insertedId);

      return {
        success: true,
        orderId: result.insertedId,
        strategy: 'hybrid',
        embeddingDecisions: hybridOrder.metadata.embeddingDecisions,
        documentSize: JSON.stringify(hybridOrder).length,
        separateCollections: orderData.items.length > 10 ? ['order_items_detailed'] : []
      };

    } catch (error) {
      console.error('Error creating hybrid order document:', error);
      return { success: false, error: error.message };
    }
  }

  async performModelingStrategyComparison(orderData) {
    console.log('Performing modeling strategy comparison...');

    const strategies = ['embedded', 'referenced', 'hybrid'];
    const results = {};

    for (const strategy of strategies) {
      const startTime = Date.now();
      let result;

      try {
        switch (strategy) {
          case 'embedded':
            result = await this.createEmbeddedOrderDocument(orderData);
            break;
          case 'referenced':
            result = await this.createReferencedOrderDocument(orderData);
            break;
          case 'hybrid':
            result = await this.createHybridOrderDocument(orderData);
            break;
        }

        const executionTime = Date.now() - startTime;

        // Perform read performance test
        const readStartTime = Date.now();
        await this.retrieveOrderByStrategy(result.orderId, strategy);
        const readTime = Date.now() - readStartTime;

        results[strategy] = {
          success: result.success,
          orderId: result.orderId,
          creationTime: executionTime,
          readTime: readTime,
          documentSize: result.documentSize || null,
          collections: this.getCollectionCountForStrategy(strategy),
          advantages: this.getStrategyAdvantages(strategy),
          limitations: this.getStrategyLimitations(strategy)
        };

      } catch (error) {
        results[strategy] = {
          success: false,
          error: error.message,
          creationTime: Date.now() - startTime
        };
      }
    }

    // Generate comparison analysis
    const analysis = this.analyzeStrategyComparison(results, orderData);

    return {
      timestamp: new Date(),
      orderData: {
        itemCount: orderData.items.length,
        customerType: orderData.customer.type,
        orderValue: orderData.items.reduce((sum, item) => sum + (item.quantity * item.unitPrice), 0)
      },
      results: results,
      analysis: analysis,
      recommendation: this.generateStrategyRecommendation(results, orderData)
    };
  }

  async retrieveOrderByStrategy(orderId, strategy) {
    switch (strategy) {
      case 'embedded':
        return await this.db.collection('orders_embedded').findOne({ _id: orderId });

      case 'referenced':
        const order = await this.db.collection('orders_referenced').findOne({ _id: orderId });
        if (order) {
          // Simulate additional queries needed for referenced data
          const items = await this.db.collection('order_items_referenced')
            .find({ orderId: orderId }).toArray();
          const customer = await this.db.collection('customers')
            .findOne({ _id: order.customerId });
          const statusHistory = await this.db.collection('order_status_history')
            .find({ orderId: orderId }).toArray();

          return { ...order, items, customer, statusHistory };
        }
        return order;

      case 'hybrid':
        const hybridOrder = await this.db.collection('orders_hybrid').findOne({ _id: orderId });
        if (hybridOrder && hybridOrder.itemsCollection) {
          // Load detailed items if referenced
          const detailedItems = await this.db.collection('order_items_detailed')
            .find({ orderId: orderId }).toArray();
          hybridOrder.detailedItems = detailedItems;
        }
        return hybridOrder;

      default:
        throw new Error(`Unknown strategy: ${strategy}`);
    }
  }

  getCollectionCountForStrategy(strategy) {
    const collectionCounts = {
      embedded: 1,    // Only orders_embedded
      referenced: 3,  // orders_referenced, order_items_referenced, order_status_history
      hybrid: 2       // orders_hybrid, order_status_history (+ conditional order_items_detailed)
    };
    return collectionCounts[strategy] || 0;
  }

  getStrategyAdvantages(strategy) {
    const advantages = {
      embedded: [
        'Single query retrieval for complete order',
        'Atomic updates across related data',
        'Better read performance for order details',
        'Simplified application logic',
        'Natural data locality'
      ],
      referenced: [
        'Flexible independent querying of entities',
        'Smaller individual document sizes',
        'Easy to update individual components',
        'Better for large item collections',
        'Familiar relational-style patterns'
      ],
      hybrid: [
        'Optimized for specific access patterns',
        'Best read performance for common operations',
        'Balanced document sizes',
        'Flexibility for both embedded and referenced data',
        'Adaptive to data volume changes'
      ]
    };
    return advantages[strategy] || [];
  }

  getStrategyLimitations(strategy) {
    const limitations = {
      embedded: [
        'Document size limits (16MB)',
        'Complex updates for large arrays',
        'Potential for data duplication',
        'Less flexible for independent querying',
        'Growth limitations for embedded collections'
      ],
      referenced: [
        'Multiple queries needed for complete data',
        'No atomic cross-document transactions',
        'More complex application logic',
        'Potential N+1 query problems',
        'Reduced read performance'
      ],
      hybrid: [
        'Increased complexity in data modeling decisions',
        'Potential for data synchronization issues',
        'More maintenance overhead',
        'Complexity in query optimization',
        'Requires careful planning for access patterns'
      ]
    };
    return limitations[strategy] || [];
  }

  analyzeStrategyComparison(results, orderData) {
    const analysis = {
      performance: {},
      scalability: {},
      complexity: {},
      dataIntegrity: {}
    };

    // Performance analysis
    const fastest = Object.entries(results).reduce((fastest, [strategy, result]) => {
      if (result.success && (!fastest || result.readTime < fastest.readTime)) {
        return { strategy, readTime: result.readTime };
      }
      return fastest;
    }, null);

    analysis.performance.fastestRead = fastest;
    analysis.performance.readTimeComparison = Object.fromEntries(
      Object.entries(results).map(([strategy, result]) => [
        strategy, 
        result.success ? result.readTime : 'failed'
      ])
    );

    // Scalability analysis based on order characteristics
    const itemCount = orderData.items.length;
    if (itemCount <= 5) {
      analysis.scalability.recommendation = 'embedded';
      analysis.scalability.reason = 'Small item count ideal for embedding';
    } else if (itemCount <= 20) {
      analysis.scalability.recommendation = 'hybrid';
      analysis.scalability.reason = 'Medium item count benefits from hybrid approach';
    } else {
      analysis.scalability.recommendation = 'referenced';
      analysis.scalability.reason = 'Large item count requires referencing to avoid document size limits';
    }

    // Complexity analysis
    analysis.complexity.applicationLogic = {
      embedded: 'Low - single document operations',
      referenced: 'High - multiple collection coordination',
      hybrid: 'Medium - selective complexity based on data size'
    };

    // Data integrity analysis
    analysis.dataIntegrity = {
      embedded: 'High - atomic document updates',
      referenced: 'Medium - requires transaction coordination',
      hybrid: 'Medium - mixed atomic and coordinated updates'
    };

    return analysis;
  }

  generateStrategyRecommendation(results, orderData) {
    const itemCount = orderData.items.length;
    const orderValue = orderData.items.reduce((sum, item) => sum + (item.quantity * item.unitPrice), 0);

    // Decision matrix based on order characteristics
    if (itemCount <= 5 && orderValue < 1000) {
      return {
        recommendedStrategy: 'embedded',
        confidence: 'high',
        reasoning: 'Small order with few items ideal for embedded document pattern',
        benefits: [
          'Fastest read performance',
          'Simplest application logic',
          'Atomic updates',
          'Best for order display and processing'
        ],
        considerations: [
          'Monitor document growth over time',
          'Consider hybrid if order complexity increases'
        ]
      };
    } else if (itemCount <= 20) {
      return {
        recommendedStrategy: 'hybrid',
        confidence: 'high',
        reasoning: 'Medium-sized order benefits from selective embedding and referencing',
        benefits: [
          'Optimized for common access patterns',
          'Balanced performance and flexibility',
          'Handles growth well',
          'Good read performance with manageable complexity'
        ],
        considerations: [
          'Requires careful design of embedded vs referenced data',
          'Monitor access patterns to optimize embedding decisions'
        ]
      };
    } else {
      return {
        recommendedStrategy: 'referenced',
        confidence: 'medium',
        reasoning: 'Large order requires referencing to manage document size and complexity',
        benefits: [
          'Avoids document size limits',
          'Flexible querying of individual components',
          'Better for large-scale data management',
          'Easier to update individual items'
        ],
        considerations: [
          'Requires multiple queries for complete order data',
          'Consider caching strategies for performance',
          'Use transactions for data consistency'
        ]
      };
    }
  }

  async getModelingMetrics() {
    const collections = [
      'orders_embedded',
      'orders_referenced', 
      'order_items_referenced',
      'orders_hybrid',
      'order_items_detailed',
      'order_status_history'
    ];

    const metrics = {
      timestamp: new Date(),
      collections: {},
      summary: {
        totalOrders: 0,
        embeddedOrders: 0,
        referencedOrders: 0,
        hybridOrders: 0,
        averageDocumentSize: 0
      }
    };

    for (const collectionName of collections) {
      try {
        const collection = this.db.collection(collectionName);
        const stats = await this.db.command({ collStats: collectionName });

        metrics.collections[collectionName] = {
          documentCount: stats.count,
          storageSize: stats.storageSize,
          averageDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          indexSize: stats.totalIndexSize
        };

        // Count orders by strategy
        if (collectionName.includes('orders')) {
          if (collectionName.includes('embedded')) {
            metrics.summary.embeddedOrders = stats.count;
          } else if (collectionName.includes('referenced')) {
            metrics.summary.referencedOrders = stats.count;
          } else if (collectionName.includes('hybrid')) {
            metrics.summary.hybridOrders = stats.count;
          }
        }

      } catch (error) {
        metrics.collections[collectionName] = { error: error.message };
      }
    }

    metrics.summary.totalOrders = 
      metrics.summary.embeddedOrders + 
      metrics.summary.referencedOrders + 
      metrics.summary.hybridOrders;

    return metrics;
  }

  async shutdown() {
    console.log('Shutting down MongoDB Document Modeling Manager...');

    if (this.client) {
      await this.client.close();
      console.log('✅ MongoDB connection closed');
    }

    this.modelingStrategies.clear();
    this.performanceMetrics.clear();
  }
}

// Export the document modeling manager
module.exports = { MongoDocumentModelingManager };

// Benefits of MongoDB Document Modeling:
// - Flexible embedding and referencing patterns eliminate complex join operations
// - Atomic document operations provide strong consistency for related data
// - Optimized read performance through denormalization and data locality
// - Adaptive modeling strategies that scale with data volume and access patterns
// - Simplified application logic through single-document operations
// - Natural data relationships that map to application object models
// - Hybrid approaches that balance performance, flexibility, and maintainability
// - Reduced query complexity through strategic data embedding
// - Better cache utilization through document-based data access
// - SQL-compatible document modeling patterns through QueryLeaf integration

Understanding MongoDB Document Relationships

Embedding vs Referencing Decision Framework

Choose the optimal modeling strategy based on data characteristics and access patterns:

// Advanced decision framework for embedding vs referencing
class DocumentModelingDecisionEngine {
  constructor() {
    this.decisionRules = new Map();
    this.performanceProfiles = new Map();
    this.scalabilityThresholds = new Map();
  }

  async analyzeRelationshipPattern(relationshipData) {
    console.log('Analyzing relationship pattern for optimal modeling strategy...');

    const analysis = {
      relationship: relationshipData,
      characteristics: this.analyzeDataCharacteristics(relationshipData),
      accessPatterns: this.analyzeAccessPatterns(relationshipData),
      scalabilityFactors: this.analyzeScalabilityFactors(relationshipData),
      recommendation: null
    };

    // Apply decision framework
    analysis.recommendation = this.generateModelingRecommendation(analysis);

    return analysis;
  }

  analyzeDataCharacteristics(relationshipData) {
    return {
      cardinality: this.determineCardinality(relationshipData),
      dataVolume: this.assessDataVolume(relationshipData),
      dataStability: this.assessDataStability(relationshipData),
      documentComplexity: this.assessDocumentComplexity(relationshipData)
    };
  }

  determineCardinality(relationshipData) {
    const { parentCollection, childCollection, relationshipType } = relationshipData;

    if (relationshipType === 'one-to-one') {
      return {
        type: 'one-to-one',
        recommendation: 'embed',
        confidence: 'high',
        reasoning: 'One-to-one relationships benefit from embedding for atomic operations'
      };
    } else if (relationshipType === 'one-to-few') {
      return {
        type: 'one-to-few',
        recommendation: 'embed',
        confidence: 'high',
        reasoning: 'Small collections (< 100 items) should be embedded for performance'
      };
    } else if (relationshipType === 'one-to-many') {
      return {
        type: 'one-to-many',
        recommendation: 'reference',
        confidence: 'medium',
        reasoning: 'Large collections may exceed document size limits if embedded'
      };
    } else if (relationshipType === 'many-to-many') {
      return {
        type: 'many-to-many',
        recommendation: 'reference',
        confidence: 'high',
        reasoning: 'Many-to-many relationships require referencing to avoid duplication'
      };
    }

    return { type: 'unknown', recommendation: 'analyze_further' };
  }

  assessDataVolume(relationshipData) {
    const { estimatedChildDocuments, averageChildSize, maxChildSize } = relationshipData;

    const estimatedEmbeddedSize = estimatedChildDocuments * averageChildSize;
    const maxEmbeddedSize = estimatedChildDocuments * maxChildSize;

    // MongoDB 16MB document limit
    const documentSizeLimit = 16 * 1024 * 1024; // 16MB
    const practicalLimit = documentSizeLimit * 0.8; // 80% of limit for safety

    return {
      estimatedSize: estimatedEmbeddedSize,
      maxPotentialSize: maxEmbeddedSize,
      exceedsLimit: maxEmbeddedSize > practicalLimit,
      volumeRecommendation: maxEmbeddedSize > practicalLimit ? 'reference' : 'embed',
      sizingDetails: {
        documentLimit: documentSizeLimit,
        practicalLimit: practicalLimit,
        utilizationPercent: (estimatedEmbeddedSize / practicalLimit) * 100
      }
    };
  }

  assessDataStability(relationshipData) {
    const { updateFrequency, childDocumentMutability, parentDocumentMutability } = relationshipData;

    if (updateFrequency === 'high' && childDocumentMutability === 'high') {
      return {
        stability: 'low',
        recommendation: 'reference',
        reasoning: 'High update frequency on embedded arrays can cause performance issues'
      };
    } else if (updateFrequency === 'low' && childDocumentMutability === 'low') {
      return {
        stability: 'high',
        recommendation: 'embed',
        reasoning: 'Stable data benefits from embedding for read performance'
      };
    } else {
      return {
        stability: 'medium',
        recommendation: 'hybrid',
        reasoning: 'Mixed stability patterns may benefit from selective embedding'
      };
    }
  }

  assessDocumentComplexity(relationshipData) {
    const { childDocumentStructure, nestingLevels, arrayComplexity } = relationshipData;

    const complexityScore = 
      (nestingLevels * 2) + 
      (arrayComplexity === 'high' ? 3 : arrayComplexity === 'medium' ? 2 : 1) +
      (Object.keys(childDocumentStructure).length * 0.1);

    return {
      complexityScore: complexityScore,
      level: complexityScore > 10 ? 'high' : complexityScore > 5 ? 'medium' : 'low',
      recommendation: complexityScore > 10 ? 'reference' : 'embed',
      reasoning: complexityScore > 10 
        ? 'High complexity documents should be referenced to maintain manageable parent documents'
        : 'Low to medium complexity allows for efficient embedding'
    };
  }

  analyzeAccessPatterns(relationshipData) {
    const { 
      parentReadFrequency, 
      childReadFrequency, 
      jointReadFrequency,
      independentChildQueries,
      bulkChildOperations 
    } = relationshipData.accessPatterns;

    return {
      primaryPattern: jointReadFrequency > (parentReadFrequency + childReadFrequency) / 2 
        ? 'joint_access' : 'independent_access',
      jointAccessRatio: jointReadFrequency / (parentReadFrequency + childReadFrequency),
      independentQueriesFrequent: independentChildQueries === 'high',
      bulkOperationsFrequent: bulkChildOperations === 'high',

      accessRecommendation: this.generateAccessBasedRecommendation(relationshipData.accessPatterns)
    };
  }

  generateAccessBasedRecommendation(accessPatterns) {
    const { jointReadFrequency, independentChildQueries, bulkChildOperations } = accessPatterns;

    if (jointReadFrequency === 'high' && independentChildQueries === 'low') {
      return {
        strategy: 'embed',
        confidence: 'high',
        reasoning: 'High joint read frequency with low independent queries favors embedding'
      };
    } else if (independentChildQueries === 'high' || bulkChildOperations === 'high') {
      return {
        strategy: 'reference',
        confidence: 'high',
        reasoning: 'High independent child operations favor referencing for query flexibility'
      };
    } else {
      return {
        strategy: 'hybrid',
        confidence: 'medium',
        reasoning: 'Mixed access patterns may benefit from hybrid approach with summary embedding'
      };
    }
  }

  analyzeScalabilityFactors(relationshipData) {
    const { growthProjections, performanceRequirements, maintenanceComplexity } = relationshipData;

    return {
      projectedGrowth: growthProjections,
      performanceTargets: performanceRequirements,
      maintenanceBurden: maintenanceComplexity,

      scalabilityRecommendation: this.generateScalabilityRecommendation(relationshipData)
    };
  }

  generateScalabilityRecommendation(relationshipData) {
    const { growthProjections, performanceRequirements } = relationshipData;

    if (growthProjections.childDocuments === 'exponential') {
      return {
        strategy: 'reference',
        confidence: 'high',
        reasoning: 'Exponential growth requires referencing to prevent document size issues',
        scalingConsiderations: [
          'Implement pagination for large result sets',
          'Consider sharding strategies for high-volume collections',
          'Monitor document size growth patterns'
        ]
      };
    } else if (performanceRequirements.readLatency === 'critical') {
      return {
        strategy: 'embed',
        confidence: 'high',
        reasoning: 'Critical read performance requirements favor embedding for single-query retrieval',
        scalingConsiderations: [
          'Monitor embedded array growth',
          'Implement size-based migration to referencing',
          'Consider read replicas for scaling reads'
        ]
      };
    } else {
      return {
        strategy: 'hybrid',
        confidence: 'medium',
        reasoning: 'Balanced growth and performance requirements suit hybrid approach',
        scalingConsiderations: [
          'Start with embedding, migrate to referencing as data grows',
          'Implement adaptive strategies based on document size',
          'Monitor access patterns for optimization opportunities'
        ]
      };
    }
  }

  generateModelingRecommendation(analysis) {
    const recommendations = {
      embedding: 0,
      referencing: 0,
      hybrid: 0
    };

    // Weight different factors
    const factors = [
      { factor: analysis.characteristics.cardinality, weight: 3 },
      { factor: analysis.characteristics.dataVolume, weight: 4 },
      { factor: analysis.characteristics.dataStability, weight: 2 },
      { factor: analysis.accessPatterns.accessRecommendation, weight: 3 },
      { factor: analysis.scalabilityFactors.scalabilityRecommendation, weight: 2 }
    ];

    // Score each strategy based on factor recommendations
    factors.forEach(({ factor, weight }) => {
      const strategy = factor.strategy || factor.recommendation;
      if (strategy && recommendations.hasOwnProperty(strategy.replace('embed', 'embedding').replace('reference', 'referencing'))) {
        const strategyKey = strategy.replace('embed', 'embedding').replace('reference', 'referencing');
        recommendations[strategyKey] += weight;
      }
    });

    // Find highest scoring strategy
    const recommendedStrategy = Object.entries(recommendations)
      .reduce((best, [strategy, score]) => score > best.score ? { strategy, score } : best, 
              { strategy: 'hybrid', score: 0 });

    return {
      strategy: recommendedStrategy.strategy,
      confidence: this.calculateConfidence(recommendations, recommendedStrategy.score),
      scores: recommendations,
      reasoning: this.generateDetailedReasoning(analysis, recommendedStrategy.strategy),
      implementation: this.generateImplementationGuidance(recommendedStrategy.strategy, analysis),
      monitoring: this.generateMonitoringRecommendations(recommendedStrategy.strategy, analysis)
    };
  }

  calculateConfidence(scores, topScore) {
    const totalScore = Object.values(scores).reduce((sum, score) => sum + score, 0);
    const confidence = topScore / totalScore;

    if (confidence > 0.7) return 'high';
    if (confidence > 0.5) return 'medium';
    return 'low';
  }

  generateDetailedReasoning(analysis, strategy) {
    const reasons = [];

    // Add reasoning based on analysis factors
    if (analysis.characteristics.cardinality.recommendation === strategy.replace('embedding', 'embed').replace('referencing', 'reference')) {
      reasons.push(`Cardinality pattern (${analysis.characteristics.cardinality.type}) supports ${strategy}`);
    }

    if (analysis.characteristics.dataVolume.volumeRecommendation === strategy.replace('embedding', 'embed').replace('referencing', 'reference')) {
      reasons.push(`Data volume analysis supports ${strategy} (${analysis.characteristics.dataVolume.utilizationPercent.toFixed(1)}% of document limit)`);
    }

    if (analysis.accessPatterns.accessRecommendation.strategy === strategy.replace('embedding', 'embed').replace('referencing', 'reference')) {
      reasons.push(`Access patterns favor ${strategy} (${analysis.accessPatterns.primaryPattern})`);
    }

    return reasons;
  }

  generateImplementationGuidance(strategy, analysis) {
    const baseGuidance = {
      embedding: {
        steps: [
          'Design parent document to include embedded child array/object',
          'Implement atomic update operations for parent and children',
          'Create indexes on embedded fields for query performance',
          'Monitor document size growth'
        ],
        patterns: ['embed_small_collections', 'denormalize_frequently_accessed', 'atomic_updates']
      },
      referencing: {
        steps: [
          'Create separate collections for parent and child entities',
          'Establish reference fields (foreign keys)',
          'Implement application-level joins or aggregation pipelines',
          'Use transactions for cross-collection consistency'
        ],
        patterns: ['reference_large_collections', 'independent_querying', 'flexible_relationships']
      },
      hybrid: {
        steps: [
          'Identify frequently accessed child data for embedding',
          'Embed summaries, reference full details',
          'Implement dual-path queries for different use cases',
          'Monitor access patterns and optimize embedding decisions'
        ],
        patterns: ['selective_denormalization', 'summary_embedding', 'adaptive_modeling']
      }
    };

    return baseGuidance[strategy] || baseGuidance.hybrid;
  }

  generateMonitoringRecommendations(strategy, analysis) {
    const monitoring = {
      metrics: [],
      alerts: [],
      optimizations: []
    };

    if (strategy === 'embedding') {
      monitoring.metrics.push(
        'Document size growth rate',
        'Embedded array length distribution',
        'Update operation performance on embedded data'
      );
      monitoring.alerts.push(
        'Document size approaching 80% of 16MB limit',
        'Embedded array length exceeding 1000 elements',
        'Update performance degradation on large embedded arrays'
      );
      monitoring.optimizations.push(
        'Consider referencing if documents exceed size thresholds',
        'Implement array size limits and archiving strategies',
        'Monitor for embedded array update bottlenecks'
      );
    } else if (strategy === 'referencing') {
      monitoring.metrics.push(
        'Query performance for multi-collection operations',
        'Reference integrity maintenance overhead',
        'Aggregation pipeline performance'
      );
      monitoring.alerts.push(
        'High latency on multi-collection queries',
        'Reference consistency violations',
        'Excessive aggregation pipeline complexity'
      );
      monitoring.optimizations.push(
        'Implement caching for frequently accessed references',
        'Consider selective denormalization for hot data paths',
        'Optimize aggregation pipelines and indexing strategies'
      );
    }

    return monitoring;
  }
}

// Export the decision engine
module.exports = { DocumentModelingDecisionEngine };

SQL-Style Document Modeling with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB document modeling operations:

-- QueryLeaf document modeling with SQL-familiar patterns

-- Create embedded document structures
CREATE COLLECTION orders_embedded AS
SELECT 
  order_id,
  order_number,
  order_date,
  status,

  -- Embed customer information
  JSON_OBJECT(
    'customer_id', customer_id,
    'company_name', company_name,
    'email', email,
    'phone', phone,
    'billing_address', JSON_OBJECT(
      'line1', billing_address_line1,
      'line2', billing_address_line2,
      'city', billing_city,
      'state', billing_state,
      'postal_code', billing_postal_code,
      'country', billing_country
    )
  ) as customer,

  -- Embed order items array
  (
    SELECT JSON_ARRAYAGG(
      JSON_OBJECT(
        'item_id', item_id,
        'product_id', product_id,
        'product', JSON_OBJECT(
          'sku', product_sku,
          'name', product_name,
          'category', product_category
        ),
        'quantity', quantity,
        'unit_price', unit_price,
        'line_total', line_total
      )
    )
    FROM order_items oi
    WHERE oi.order_id = o.order_id
  ) as items,

  -- Embed totals object
  JSON_OBJECT(
    'subtotal', subtotal,
    'tax_amount', tax_amount,
    'shipping_amount', shipping_amount,
    'total_amount', total_amount
  ) as totals,

  -- Metadata
  JSON_OBJECT(
    'created_at', created_at,
    'updated_at', updated_at,
    'modeling_strategy', 'embedded'
  ) as metadata

FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days';

-- Query embedded documents with SQL syntax
SELECT 
  order_number,
  status,
  customer->>'company_name' as customer_name,
  customer->'billing_address'->>'city' as billing_city,
  totals->>'total_amount'::DECIMAL as order_total,

  -- Query embedded array elements
  JSON_ARRAY_LENGTH(items) as item_count,

  -- Extract specific item information
  (
    SELECT SUM((item->>'quantity')::INTEGER)
    FROM JSON_ARRAY_ELEMENTS(items) as item
  ) as total_quantity,

  -- Get product categories from embedded items
  (
    SELECT ARRAY_AGG(DISTINCT item->'product'->>'category')
    FROM JSON_ARRAY_ELEMENTS(items) as item
  ) as product_categories

FROM orders_embedded
WHERE status = 'pending'
  AND customer->>'email' LIKE '%@company.com'
ORDER BY totals->>'total_amount'::DECIMAL DESC;

-- Create referenced document structures
CREATE COLLECTION orders_referenced AS
SELECT 
  order_id,
  order_number,
  order_date,
  status,
  customer_id, -- Reference to customers collection

  -- Order totals (calculated fields)
  subtotal,
  tax_amount,
  shipping_amount,
  total_amount,

  JSON_OBJECT(
    'created_at', created_at,
    'updated_at', updated_at,
    'modeling_strategy', 'referenced'
  ) as metadata

FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days';

-- Query referenced documents with joins (QueryLeaf aggregation syntax)
SELECT 
  o.order_number,
  o.status,
  o.total_amount,

  -- Join with customer collection
  c.company_name,
  c.email,

  -- Aggregate order items
  (
    SELECT JSON_ARRAYAGG(
      JSON_OBJECT(
        'product_sku', oi.product_sku,
        'product_name', oi.product_name,
        'quantity', oi.quantity,
        'unit_price', oi.unit_price,
        'line_total', oi.line_total
      )
    )
    FROM order_items_referenced oi
    WHERE oi.order_id = o.order_id
  ) as items,

  -- Get status history
  (
    SELECT JSON_ARRAYAGG(
      JSON_OBJECT(
        'status', new_status,
        'changed_at', status_changed_at,
        'reason', status_change_reason
      ) ORDER BY status_changed_at DESC
    )
    FROM order_status_history osh
    WHERE osh.order_id = o.order_id
  ) as status_history

FROM orders_referenced o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status IN ('pending', 'processing')
ORDER BY o.order_date DESC;

-- Hybrid document modeling approach
CREATE COLLECTION orders_hybrid AS
SELECT 
  order_id,
  order_number,
  order_date,
  status,

  -- Hybrid customer: embed summary, reference full details
  JSON_OBJECT(
    'customer_id', customer_id,
    'company_name', company_name,
    'email', email,
    'customer_ref', JSON_OBJECT(
      'collection', 'customers',
      'id', customer_id
    )
  ) as customer,

  -- Hybrid items: embed summary for quick access
  JSON_OBJECT(
    'total_items', (SELECT COUNT(*) FROM order_items WHERE order_id = o.order_id),
    'unique_products', (SELECT COUNT(DISTINCT product_id) FROM order_items WHERE order_id = o.order_id),
    'total_quantity', (SELECT SUM(quantity) FROM order_items WHERE order_id = o.order_id),

    -- Embed small item previews
    'quick_view', (
      SELECT JSON_ARRAYAGG(
        JSON_OBJECT(
          'product_sku', product_sku,
          'product_name', product_name,
          'quantity', quantity,
          'line_total', line_total
        )
      )
      FROM (
        SELECT * FROM order_items 
        WHERE order_id = o.order_id 
        ORDER BY line_total DESC 
        LIMIT 3
      ) top_items
    ),

    -- Reference for complete items if needed
    'items_collection', CASE 
      WHEN (SELECT COUNT(*) FROM order_items WHERE order_id = o.order_id) > 10 
      THEN JSON_OBJECT('collection', 'order_items_detailed', 'order_id', order_id)
      ELSE NULL
    END
  ) as items_summary,

  -- Embed current status, reference full history
  JSON_OBJECT(
    'current_status', status,
    'status_updated_at', updated_at,
    'status_history_ref', JSON_OBJECT(
      'collection', 'order_status_history',
      'order_id', order_id
    )
  ) as status_info,

  -- Totals
  JSON_OBJECT(
    'subtotal', subtotal,
    'tax_amount', tax_amount,
    'shipping_amount', shipping_amount,
    'total_amount', total_amount
  ) as totals,

  -- Metadata with modeling decisions
  JSON_OBJECT(
    'created_at', created_at,
    'updated_at', updated_at,
    'modeling_strategy', 'hybrid',
    'embedding_decisions', JSON_OBJECT(
      'customer', 'partial_denormalization',
      'items', CASE 
        WHEN (SELECT COUNT(*) FROM order_items WHERE order_id = o.order_id) <= 10 
        THEN 'embedded' 
        ELSE 'referenced_with_summary' 
      END,
      'status', 'current_embedded_history_referenced'
    )
  ) as metadata

FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Performance analysis for document modeling strategies
WITH modeling_performance AS (
  SELECT 
    'embedded' as strategy,
    COUNT(*) as document_count,
    AVG(LENGTH(JSON_SERIALIZE(items))) as avg_items_size,
    AVG(JSON_ARRAY_LENGTH(items)) as avg_item_count,
    MAX(JSON_ARRAY_LENGTH(items)) as max_item_count,

    -- Query performance metrics
    AVG(query_execution_time_ms) as avg_query_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY query_execution_time_ms) as p95_query_time

  FROM orders_embedded_performance_log
  WHERE query_date >= CURRENT_DATE - INTERVAL '7 days'

  UNION ALL

  SELECT 
    'referenced' as strategy,
    COUNT(*) as document_count,
    NULL as avg_items_size,
    AVG((
      SELECT COUNT(*) 
      FROM order_items_referenced oir 
      WHERE oir.order_id = orp.order_id
    )) as avg_item_count,
    MAX((
      SELECT COUNT(*) 
      FROM order_items_referenced oir 
      WHERE oir.order_id = orp.order_id
    )) as max_item_count,

    AVG(query_execution_time_ms) as avg_query_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY query_execution_time_ms) as p95_query_time

  FROM orders_referenced_performance_log orp
  WHERE query_date >= CURRENT_DATE - INTERVAL '7 days'

  UNION ALL

  SELECT 
    'hybrid' as strategy,
    COUNT(*) as document_count,
    AVG(LENGTH(JSON_SERIALIZE(items_summary))) as avg_items_size,
    AVG(items_summary->>'total_items'::INTEGER) as avg_item_count,
    MAX(items_summary->>'total_items'::INTEGER) as max_item_count,

    AVG(query_execution_time_ms) as avg_query_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY query_execution_time_ms) as p95_query_time

  FROM orders_hybrid_performance_log
  WHERE query_date >= CURRENT_DATE - INTERVAL '7 days'
)

SELECT 
  strategy,
  document_count,
  avg_item_count,
  max_item_count,

  -- Performance comparison
  ROUND(avg_query_time::NUMERIC, 2) as avg_query_time_ms,
  ROUND(p95_query_time::NUMERIC, 2) as p95_query_time_ms,

  -- Performance rating
  CASE 
    WHEN avg_query_time <= 50 THEN 'excellent'
    WHEN avg_query_time <= 200 THEN 'good'
    WHEN avg_query_time <= 500 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as performance_rating,

  -- Strategy recommendations
  CASE strategy
    WHEN 'embedded' THEN
      CASE 
        WHEN max_item_count > 100 THEN 'Consider hybrid approach for large orders'
        WHEN avg_query_time > 200 THEN 'Monitor document size and query complexity'
        ELSE 'Strategy performing well for current data patterns'
      END
    WHEN 'referenced' THEN
      CASE 
        WHEN avg_query_time > 500 THEN 'Consider hybrid with summary embedding for performance'
        WHEN p95_query_time > 1000 THEN 'Optimize indexes and aggregation pipelines'
        ELSE 'Strategy suitable for large, complex data relationships'
      END
    WHEN 'hybrid' THEN
      CASE 
        WHEN avg_query_time > avg_query_time * 1.5 THEN 'Review embedding decisions and access patterns'
        ELSE 'Balanced approach providing good performance and flexibility'
      END
  END as strategy_recommendation

FROM modeling_performance
ORDER BY avg_query_time;

-- Document modeling decision support system
CREATE VIEW document_modeling_recommendations AS
WITH relationship_analysis AS (
  SELECT 
    table_name as parent_collection,
    related_table as child_collection,
    relationship_type,

    -- Data volume analysis
    (SELECT COUNT(*) FROM information_schema.tables WHERE table_name = related_table) as child_document_count,
    AVG(pg_column_size(row_to_json(r.*))) as avg_child_size,
    MAX(pg_column_size(row_to_json(r.*))) as max_child_size,

    -- Relationship cardinality
    CASE 
      WHEN relationship_type = 'one_to_one' THEN 1
      WHEN relationship_type = 'one_to_few' THEN 10
      WHEN relationship_type = 'one_to_many' THEN 1000
      ELSE 10000
    END as estimated_child_count,

    -- Update patterns
    (
      SELECT COUNT(*) 
      FROM update_frequency_log ufl 
      WHERE ufl.table_name = related_table 
        AND ufl.log_date >= CURRENT_DATE - INTERVAL '7 days'
    ) as weekly_updates,

    -- Access patterns
    (
      SELECT COUNT(*) 
      FROM query_log ql 
      WHERE ql.query_text LIKE '%JOIN%' 
        AND ql.query_text LIKE '%' || table_name || '%'
        AND ql.query_text LIKE '%' || related_table || '%'
        AND ql.query_date >= CURRENT_DATE - INTERVAL '7 days'
    ) as joint_queries_weekly,

    (
      SELECT COUNT(*) 
      FROM query_log ql 
      WHERE ql.query_text LIKE '%' || related_table || '%'
        AND ql.query_text NOT LIKE '%JOIN%'
        AND ql.query_date >= CURRENT_DATE - INTERVAL '7 days'
    ) as independent_queries_weekly

  FROM relationship_metadata rm
  WHERE rm.target_system = 'mongodb'
)

SELECT 
  parent_collection,
  child_collection,
  relationship_type,

  -- Volume-based recommendation
  CASE 
    WHEN estimated_child_count * avg_child_size > 10485760 THEN 'reference' -- 10MB threshold
    WHEN estimated_child_count <= 100 AND avg_child_size <= 10240 THEN 'embed' -- Small documents
    ELSE 'hybrid'
  END as volume_recommendation,

  -- Access pattern recommendation
  CASE 
    WHEN joint_queries_weekly > independent_queries_weekly * 2 THEN 'embed'
    WHEN independent_queries_weekly > joint_queries_weekly * 2 THEN 'reference'
    ELSE 'hybrid'
  END as access_pattern_recommendation,

  -- Update frequency recommendation
  CASE 
    WHEN weekly_updates > 1000 THEN 'reference'
    WHEN weekly_updates < 100 THEN 'embed'
    ELSE 'hybrid'
  END as update_frequency_recommendation,

  -- Combined recommendation with confidence
  CASE 
    WHEN 
      (CASE WHEN estimated_child_count * avg_child_size > 10485760 THEN 1 ELSE 0 END) +
      (CASE WHEN independent_queries_weekly > joint_queries_weekly * 2 THEN 1 ELSE 0 END) +
      (CASE WHEN weekly_updates > 1000 THEN 1 ELSE 0 END) >= 2
    THEN 'reference'

    WHEN 
      (CASE WHEN estimated_child_count <= 100 AND avg_child_size <= 10240 THEN 1 ELSE 0 END) +
      (CASE WHEN joint_queries_weekly > independent_queries_weekly * 2 THEN 1 ELSE 0 END) +
      (CASE WHEN weekly_updates < 100 THEN 1 ELSE 0 END) >= 2
    THEN 'embed'

    ELSE 'hybrid'
  END as final_recommendation,

  -- Confidence calculation
  CASE 
    WHEN ABS((joint_queries_weekly::DECIMAL / NULLIF(independent_queries_weekly, 0)) - 1) > 2 THEN 'high'
    WHEN ABS((joint_queries_weekly::DECIMAL / NULLIF(independent_queries_weekly, 0)) - 1) > 0.5 THEN 'medium'
    ELSE 'low'
  END as recommendation_confidence,

  -- Implementation guidance
  CASE 
    WHEN relationship_type = 'one_to_one' AND avg_child_size < 5120 
    THEN 'Embed child document directly in parent'

    WHEN relationship_type = 'one_to_few' AND estimated_child_count <= 50
    THEN 'Embed as array in parent document'

    WHEN relationship_type = 'one_to_many' AND joint_queries_weekly > 500
    THEN 'Consider hybrid: embed summary, reference details'

    WHEN relationship_type = 'many_to_many'
    THEN 'Use references with junction collection or arrays of references'

    ELSE 'Analyze specific access patterns and data growth projections'
  END as implementation_guidance,

  -- Monitoring recommendations
  ARRAY[
    CASE WHEN estimated_child_count * avg_child_size > 5242880 THEN 'Monitor document size growth' END,
    CASE WHEN weekly_updates > 500 THEN 'Monitor update performance on embedded arrays' END,
    CASE WHEN independent_queries_weekly > 1000 THEN 'Consider indexing strategies for referenced collections' END,
    'Track query performance after modeling implementation',
    'Monitor data growth patterns and access frequency changes'
  ] as monitoring_checklist

FROM relationship_analysis
ORDER BY parent_collection, child_collection;

-- QueryLeaf provides comprehensive document modeling capabilities:
-- 1. SQL-familiar syntax for creating embedded and referenced document structures
-- 2. Flexible querying of complex nested documents with JSON operators
-- 3. Performance analysis comparing different modeling strategies
-- 4. Automated recommendations based on data characteristics and access patterns
-- 5. Hybrid modeling support with selective embedding and referencing
-- 6. Decision support systems for optimal modeling strategy selection
-- 7. Integration with MongoDB's native document operations and indexing
-- 8. Production-ready patterns for scalable document-based applications
-- 9. Monitoring and optimization guidance for document modeling decisions
-- 10. Enterprise-grade document modeling accessible through familiar SQL constructs

Best Practices for MongoDB Document Modeling

Embedding vs Referencing Decision Guidelines

Essential practices for choosing optimal document modeling strategies:

One-to-Few Relationships: Embed small, related collections (< 100 documents) that are frequently accessed together
One-to-Many Relationships: Use references for large collections that may grow beyond document size limits
Data Update Patterns: Embed stable data, reference frequently updated data to avoid complex array updates
Query Optimization: Embed data that is commonly queried together to minimize database round trips
Document Size Management: Monitor embedded collections to prevent approaching the 16MB document limit
Access Pattern Analysis: Choose modeling strategy based on whether data is primarily accessed jointly or independently

Performance Optimization Strategies

Optimize document models for maximum application performance:

Strategic Denormalization: Embed frequently accessed data even if it creates some duplication
Index Optimization: Create appropriate indexes on embedded fields and reference keys
Hybrid Approaches: Combine embedding and referencing based on data access patterns and volume
Document Structure: Design document schemas that align with application query patterns
Array Management: Limit embedded array sizes and implement archiving for historical data
Caching Strategies: Implement application-level caching for frequently accessed referenced data

Conclusion

MongoDB document modeling provides flexible strategies for structuring data that can dramatically improve application performance and simplify development complexity. The choice between embedding, referencing, and hybrid approaches depends on specific data characteristics, access patterns, and scalability requirements.

Key MongoDB document modeling benefits include:

Flexible Data Structures: Native support for complex nested documents and arrays eliminates rigid relational constraints
Optimized Read Performance: Strategic embedding enables single-query retrieval of complete entity data
Atomic Operations: Document-level atomic updates provide consistency for related data without complex transactions
Scalable Patterns: Hybrid approaches that adapt to data volume and access pattern changes over time
Simplified Application Logic: Natural object mapping reduces impedance mismatch between database and application models
SQL Compatibility: Familiar document modeling patterns accessible through SQL-style operations

Whether you're building e-commerce platforms, content management systems, or IoT data collection applications, MongoDB's document modeling flexibility with QueryLeaf's SQL-familiar interface provides the foundation for scalable data architecture that maintains high performance while adapting to evolving business requirements.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB document modeling while providing SQL-familiar syntax for creating embedded and referenced document structures. Advanced modeling decision support, performance analysis, and hybrid pattern implementation are seamlessly accessible through familiar SQL constructs, making sophisticated document modeling both powerful and approachable for SQL-oriented teams.

The combination of MongoDB's flexible document modeling with familiar SQL-style management makes it an ideal platform for applications that require both sophisticated data relationships and operational simplicity, ensuring your data architecture scales efficiently while maintaining familiar development and operational patterns.

November 18, 2025
20 min read

MongoDB Bulk Write Operations for High-Performance Data Processing: Enterprise-Scale Data Ingestion and Batch Processing with SQL-Compatible Patterns

Enterprise applications frequently need to process large volumes of data efficiently, whether importing CSV files, synchronizing with external systems, or performing batch transformations. Traditional row-by-row database operations create significant performance bottlenecks and resource overhead when processing thousands or millions of records, leading to extended processing times and poor user experiences.

MongoDB Bulk Write Operations provide sophisticated batch processing capabilities that dramatically improve throughput by combining multiple write operations into optimized batch requests. Unlike traditional databases that require complex stored procedures or external ETL tools, MongoDB's native bulk operations integrate seamlessly with application code while delivering enterprise-grade performance and reliability.

The Traditional Batch Processing Challenge

Processing large datasets with conventional database approaches creates significant performance and operational challenges:

-- Traditional PostgreSQL batch processing - inefficient row-by-row operations

-- Product catalog import with individual INSERT statements
-- This approach creates massive performance problems at scale

DO $$
DECLARE
    product_record RECORD;
    import_cursor CURSOR FOR 
        SELECT * FROM product_import_staging;
    total_processed INTEGER := 0;
    batch_size INTEGER := 1000;
    start_time TIMESTAMP;
    current_batch_time TIMESTAMP;

BEGIN
    start_time := CURRENT_TIMESTAMP;

    -- Process each record individually - extremely inefficient
    FOR product_record IN import_cursor LOOP
        -- Individual validation and processing
        BEGIN
            -- Product existence check (N+1 query problem)
            IF EXISTS (
                SELECT 1 FROM products 
                WHERE sku = product_record.sku
            ) THEN
                -- Update existing product
                UPDATE products 
                SET 
                    name = product_record.name,
                    description = product_record.description,
                    price = product_record.price,
                    category_id = (
                        SELECT category_id 
                        FROM categories 
                        WHERE category_name = product_record.category_name
                    ),
                    stock_quantity = product_record.stock_quantity,
                    weight_kg = product_record.weight_kg,
                    dimensions_json = product_record.dimensions_json::JSONB,
                    supplier_id = (
                        SELECT supplier_id 
                        FROM suppliers 
                        WHERE supplier_code = product_record.supplier_code
                    ),

                    -- Pricing and inventory details
                    cost_price = product_record.cost_price,
                    margin_percent = product_record.margin_percent,
                    tax_category = product_record.tax_category,
                    minimum_order_quantity = product_record.minimum_order_quantity,
                    lead_time_days = product_record.lead_time_days,

                    -- Status and lifecycle
                    status = product_record.status,
                    is_active = product_record.is_active,
                    availability_date = product_record.availability_date::DATE,
                    discontinuation_date = product_record.discontinuation_date::DATE,

                    -- SEO and marketing
                    seo_title = product_record.seo_title,
                    seo_description = product_record.seo_description,
                    keywords_array = string_to_array(product_record.keywords, ','),

                    -- Audit fields
                    updated_at = CURRENT_TIMESTAMP,
                    updated_by = 'bulk_import_system'

                WHERE sku = product_record.sku;

            ELSE
                -- Insert new product with complex validation
                INSERT INTO products (
                    sku, name, description, price, category_id,
                    stock_quantity, weight_kg, dimensions_json,
                    supplier_id, cost_price, margin_percent,
                    tax_category, minimum_order_quantity, lead_time_days,
                    status, is_active, availability_date, discontinuation_date,
                    seo_title, seo_description, keywords_array,
                    created_at, updated_at, created_by
                ) VALUES (
                    product_record.sku,
                    product_record.name,
                    product_record.description,
                    product_record.price,
                    (SELECT category_id FROM categories WHERE category_name = product_record.category_name),
                    product_record.stock_quantity,
                    product_record.weight_kg,
                    product_record.dimensions_json::JSONB,
                    (SELECT supplier_id FROM suppliers WHERE supplier_code = product_record.supplier_code),
                    product_record.cost_price,
                    product_record.margin_percent,
                    product_record.tax_category,
                    product_record.minimum_order_quantity,
                    product_record.lead_time_days,
                    product_record.status,
                    product_record.is_active,
                    product_record.availability_date::DATE,
                    product_record.discontinuation_date::DATE,
                    product_record.seo_title,
                    product_record.seo_description,
                    string_to_array(product_record.keywords, ','),
                    CURRENT_TIMESTAMP,
                    CURRENT_TIMESTAMP,
                    'bulk_import_system'
                );

            END IF;

            -- Process product variants (additional N+1 queries)
            IF product_record.variants_json IS NOT NULL THEN
                INSERT INTO product_variants (
                    product_sku,
                    variant_sku,
                    variant_attributes,
                    price_adjustment,
                    stock_quantity,
                    created_at
                )
                SELECT 
                    product_record.sku,
                    variant->>'sku',
                    variant->'attributes',
                    (variant->>'price_adjustment')::DECIMAL,
                    (variant->>'stock_quantity')::INTEGER,
                    CURRENT_TIMESTAMP
                FROM jsonb_array_elements(product_record.variants_json::JSONB) AS variant
                ON CONFLICT (variant_sku) DO UPDATE SET
                    variant_attributes = EXCLUDED.variant_attributes,
                    price_adjustment = EXCLUDED.price_adjustment,
                    stock_quantity = EXCLUDED.stock_quantity,
                    updated_at = CURRENT_TIMESTAMP;
            END IF;

            -- Update inventory tracking
            INSERT INTO inventory_transactions (
                product_sku,
                transaction_type,
                quantity_change,
                new_quantity,
                reason,
                created_at
            ) VALUES (
                product_record.sku,
                'bulk_import',
                product_record.stock_quantity,
                product_record.stock_quantity,
                'Product catalog import',
                CURRENT_TIMESTAMP
            );

            total_processed := total_processed + 1;

            -- Periodic progress reporting (every 1000 records)
            IF total_processed % batch_size = 0 THEN
                current_batch_time := CURRENT_TIMESTAMP;
                RAISE NOTICE 'Processed % records. Current batch time: % seconds', 
                    total_processed, 
                    EXTRACT(EPOCH FROM (current_batch_time - start_time));

                -- Commit intermediate results (but lose atomicity)
                COMMIT;
            END IF;

        EXCEPTION WHEN OTHERS THEN
            -- Log error and continue (poor error handling)
            INSERT INTO import_errors (
                record_data,
                error_message,
                created_at
            ) VALUES (
                row_to_json(product_record)::TEXT,
                SQLERRM,
                CURRENT_TIMESTAMP
            );

            RAISE NOTICE 'Error processing record %: %', product_record.sku, SQLERRM;
            CONTINUE;
        END;

    END LOOP;

    RAISE NOTICE 'Import completed. Total processed: % records in % seconds', 
        total_processed, 
        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - start_time));

END $$;

-- Problems with traditional row-by-row processing:
-- 1. Each operation requires a separate database round-trip
-- 2. N+1 query problems for lookups and validations
-- 3. No atomic batch operations - partial failures leave data inconsistent
-- 4. Poor performance - processing 100K records can take hours
-- 5. Resource intensive - high CPU and memory usage per operation
-- 6. Limited error handling - difficult to handle partial batch failures
-- 7. Complex transaction management across large datasets
-- 8. Manual progress tracking and monitoring implementation
-- 9. No built-in retry logic for transient failures
-- 10. Difficult to optimize - requires stored procedures or external tools

MongoDB provides native bulk operations with intelligent batching and error handling:

// MongoDB Bulk Write Operations - high-performance batch processing
const { MongoClient } = require('mongodb');

// Advanced MongoDB Bulk Operations Manager
class MongoBulkOperationsManager {
  constructor() {
    this.client = null;
    this.db = null;
    this.operationMetrics = new Map();
    this.errorHandlers = new Map();
    this.batchConfigurations = new Map();
  }

  async initialize() {
    console.log('Initializing MongoDB Bulk Operations Manager...');

    // Connect with optimized settings for bulk operations
    this.client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017', {
      // Connection pool optimized for bulk operations
      minPoolSize: 5,
      maxPoolSize: 20,
      maxIdleTimeMS: 30000,

      // Write concern optimized for throughput
      writeConcern: { w: 1, j: false }, // Faster for bulk imports
      readPreference: 'primary',

      // Batch operation optimizations
      maxBsonObjectSize: 16777216, // 16MB BSON limit
      compression: ['zlib'], // Reduce network overhead

      appName: 'BulkOperationsManager'
    });

    await this.client.connect();
    this.db = this.client.db('production');

    // Initialize batch configurations for different operation types
    await this.setupBatchConfigurations();

    console.log('✅ MongoDB Bulk Operations Manager initialized');
  }

  async setupBatchConfigurations() {
    console.log('Setting up batch operation configurations...');

    // Define optimized batch configurations for different scenarios
    const configurations = {
      // High-throughput product catalog imports
      'product_import': {
        batchSize: 1000,          // Optimal balance of memory and performance
        maxBatchSizeBytes: 10485760, // 10MB max batch size
        ordered: false,           // Allow parallel processing
        timeout: 300000,          // 5 minute timeout
        retryAttempts: 3,
        retryDelay: 1000,

        // Validation settings
        bypassDocumentValidation: false,
        validateDocuments: true,

        // Performance monitoring
        trackMetrics: true,
        logProgress: true,
        progressInterval: 5000
      },

      // Real-time order processing updates
      'order_updates': {
        batchSize: 500,
        maxBatchSizeBytes: 5242880, // 5MB max
        ordered: true,            // Maintain order for business logic
        timeout: 60000,           // 1 minute timeout
        retryAttempts: 5,
        retryDelay: 500,

        // Strict validation for financial data
        bypassDocumentValidation: false,
        validateDocuments: true,

        trackMetrics: true,
        logProgress: true,
        progressInterval: 1000
      },

      // Analytics data aggregation
      'analytics_batch': {
        batchSize: 2000,          // Larger batches for analytics
        maxBatchSizeBytes: 15728640, // 15MB max
        ordered: false,
        timeout: 600000,          // 10 minute timeout
        retryAttempts: 2,
        retryDelay: 2000,

        // Relaxed validation for analytical data
        bypassDocumentValidation: true,
        validateDocuments: false,

        trackMetrics: true,
        logProgress: true,
        progressInterval: 10000
      },

      // Log data ingestion (high volume, low latency)
      'log_ingestion': {
        batchSize: 5000,          // Very large batches
        maxBatchSizeBytes: 20971520, // 20MB max
        ordered: false,
        timeout: 120000,          // 2 minute timeout
        retryAttempts: 1,         // Minimal retries for logs
        retryDelay: 100,

        // Minimal validation for high throughput
        bypassDocumentValidation: true,
        validateDocuments: false,

        trackMetrics: false,      // Reduce overhead
        logProgress: false,
        progressInterval: 50000
      }
    };

    for (const [configName, config] of Object.entries(configurations)) {
      this.batchConfigurations.set(configName, config);
    }

    console.log('✅ Batch configurations initialized');
  }

  async performBulkProductImport(productData, options = {}) {
    console.log(`Starting bulk product import for ${productData.length} products...`);

    const config = this.batchConfigurations.get('product_import');
    const collection = this.db.collection('products');

    // Prepare bulk operations with sophisticated error handling
    const bulkOps = [];
    const processingMetrics = {
      startTime: Date.now(),
      totalRecords: productData.length,
      processedRecords: 0,
      successfulOperations: 0,
      failedOperations: 0,
      errors: [],
      batchTimes: []
    };

    try {
      // Prepare bulk write operations
      for (const product of productData) {
        const operation = await this.createProductOperation(product);
        if (operation) {
          bulkOps.push(operation);
        } else {
          processingMetrics.failedOperations++;
        }
      }

      // Execute bulk operations in optimized batches
      const results = await this.executeBulkOperations(
        collection, 
        bulkOps, 
        config, 
        processingMetrics
      );

      const totalTime = Date.now() - processingMetrics.startTime;

      console.log('✅ Bulk product import completed:', {
        totalRecords: processingMetrics.totalRecords,
        processedRecords: processingMetrics.processedRecords,
        successfulOperations: processingMetrics.successfulOperations,
        failedOperations: processingMetrics.failedOperations,
        totalTimeMs: totalTime,
        throughputPerSecond: Math.round((processingMetrics.successfulOperations / totalTime) * 1000),
        errorRate: ((processingMetrics.failedOperations / processingMetrics.totalRecords) * 100).toFixed(2) + '%'
      });

      return {
        success: true,
        results: results,
        metrics: processingMetrics,
        recommendations: this.generatePerformanceRecommendations(processingMetrics)
      };

    } catch (error) {
      console.error('Bulk product import failed:', error);
      return {
        success: false,
        error: error.message,
        metrics: processingMetrics,
        partialResults: processingMetrics.successfulOperations > 0
      };
    }
  }

  async createProductOperation(productData) {
    try {
      // Comprehensive data transformation and validation
      const transformedProduct = {
        // Core product information
        sku: productData.sku,
        name: productData.name,
        description: productData.description,

        // Pricing and financial data
        pricing: {
          basePrice: parseFloat(productData.price) || 0,
          costPrice: parseFloat(productData.cost_price) || 0,
          marginPercent: parseFloat(productData.margin_percent) || 0,
          currency: productData.currency || 'USD',
          taxCategory: productData.tax_category || 'standard'
        },

        // Inventory management
        inventory: {
          stockQuantity: parseInt(productData.stock_quantity) || 0,
          minimumOrderQuantity: parseInt(productData.minimum_order_quantity) || 1,
          leadTimeDays: parseInt(productData.lead_time_days) || 0,
          trackInventory: productData.track_inventory !== 'false'
        },

        // Product specifications
        specifications: {
          weight: {
            value: parseFloat(productData.weight_kg) || 0,
            unit: 'kg'
          },
          dimensions: productData.dimensions_json ? 
            JSON.parse(productData.dimensions_json) : null,
          attributes: productData.attributes_json ? 
            JSON.parse(productData.attributes_json) : {}
        },

        // Categorization and classification
        classification: {
          categoryId: productData.category_id,
          categoryPath: productData.category_path ? 
            productData.category_path.split('/') : [],
          tags: productData.tags ? 
            productData.tags.split(',').map(tag => tag.trim()) : [],
          brand: productData.brand || null
        },

        // Supplier and sourcing information
        supplier: {
          supplierId: productData.supplier_id,
          supplierCode: productData.supplier_code,
          supplierProductCode: productData.supplier_product_code,
          leadTimeFromSupplier: parseInt(productData.supplier_lead_time) || 0
        },

        // Product lifecycle and status
        lifecycle: {
          status: productData.status || 'active',
          isActive: productData.is_active !== 'false',
          availabilityDate: productData.availability_date ? 
            new Date(productData.availability_date) : new Date(),
          discontinuationDate: productData.discontinuation_date ? 
            new Date(productData.discontinuation_date) : null,
          seasonality: productData.seasonality || null
        },

        // SEO and marketing data
        marketing: {
          seoTitle: productData.seo_title || productData.name,
          seoDescription: productData.seo_description || productData.description,
          keywords: productData.keywords ? 
            productData.keywords.split(',').map(kw => kw.trim()) : [],
          promotionalText: productData.promotional_text || null,
          featuredProduct: productData.featured_product === 'true'
        },

        // Product variants and options
        variants: productData.variants_json ? 
          JSON.parse(productData.variants_json).map(variant => ({
            variantId: variant.variant_id || this.generateVariantId(),
            variantSku: variant.sku,
            attributes: variant.attributes || {},
            priceAdjustment: parseFloat(variant.price_adjustment) || 0,
            stockQuantity: parseInt(variant.stock_quantity) || 0,
            isActive: variant.is_active !== 'false'
          })) : [],

        // Media and assets
        media: {
          images: productData.images_json ? 
            JSON.parse(productData.images_json) : [],
          documents: productData.documents_json ? 
            JSON.parse(productData.documents_json) : [],
          videos: productData.videos_json ? 
            JSON.parse(productData.videos_json) : []
        },

        // Compliance and regulatory
        compliance: {
          regulatoryInfo: productData.regulatory_info_json ? 
            JSON.parse(productData.regulatory_info_json) : {},
          certifications: productData.certifications ? 
            productData.certifications.split(',').map(cert => cert.trim()) : [],
          restrictions: productData.restrictions_json ? 
            JSON.parse(productData.restrictions_json) : {}
        },

        // Audit and tracking
        audit: {
          createdAt: new Date(),
          updatedAt: new Date(),
          createdBy: 'bulk_import_system',
          importBatch: options.batchId || this.generateBatchId(),
          dataSource: options.dataSource || 'csv_import',
          version: 1
        }
      };

      // Determine operation type (insert vs update)
      const existingProduct = await this.db.collection('products')
        .findOne({ sku: transformedProduct.sku }, { projection: { _id: 1, audit: 1 } });

      if (existingProduct) {
        // Update existing product
        transformedProduct.audit.updatedAt = new Date();
        transformedProduct.audit.version = (existingProduct.audit?.version || 1) + 1;

        return {
          updateOne: {
            filter: { sku: transformedProduct.sku },
            update: { $set: transformedProduct },
            upsert: false
          }
        };
      } else {
        // Insert new product
        return {
          insertOne: {
            document: transformedProduct
          }
        };
      }

    } catch (error) {
      console.error(`Error creating operation for product ${productData.sku}:`, error);
      return null;
    }
  }

  async executeBulkOperations(collection, operations, config, metrics) {
    const results = [];
    const totalBatches = Math.ceil(operations.length / config.batchSize);

    console.log(`Executing ${operations.length} operations in ${totalBatches} batches...`);

    for (let i = 0; i < operations.length; i += config.batchSize) {
      const batchStart = Date.now();
      const batch = operations.slice(i, i + config.batchSize);
      const batchNumber = Math.floor(i / config.batchSize) + 1;

      try {
        // Execute bulk write with optimized options
        const result = await collection.bulkWrite(batch, {
          ordered: config.ordered,
          bypassDocumentValidation: config.bypassDocumentValidation,
          writeConcern: { w: 1, j: false }, // Optimized for throughput
        });

        // Track successful operations
        metrics.successfulOperations += result.insertedCount + result.modifiedCount + result.upsertedCount;
        metrics.processedRecords += batch.length;

        const batchTime = Date.now() - batchStart;
        metrics.batchTimes.push(batchTime);

        results.push({
          batchNumber: batchNumber,
          batchSize: batch.length,
          result: result,
          processingTimeMs: batchTime,
          throughputPerSecond: Math.round((batch.length / batchTime) * 1000)
        });

        // Progress logging
        if (config.logProgress && batchNumber % Math.ceil(totalBatches / 10) === 0) {
          const progressPercent = ((i + batch.length) / operations.length * 100).toFixed(1);
          console.log(`Batch ${batchNumber}/${totalBatches} completed (${progressPercent}%) - ${Math.round((batch.length / batchTime) * 1000)} ops/sec`);
        }

      } catch (error) {
        console.error(`Batch ${batchNumber} failed:`, error);

        metrics.failedOperations += batch.length;
        metrics.errors.push({
          batchNumber: batchNumber,
          error: error.message,
          batchSize: batch.length,
          timestamp: new Date()
        });

        // Implement retry logic for transient errors
        if (config.retryAttempts > 0 && this.isRetryableError(error)) {
          console.log(`Retrying batch ${batchNumber} in ${config.retryDelay}ms...`);
          await new Promise(resolve => setTimeout(resolve, config.retryDelay));

          try {
            const retryResult = await collection.bulkWrite(batch, {
              ordered: config.ordered,
              bypassDocumentValidation: config.bypassDocumentValidation,
              writeConcern: { w: 1, j: false }
            });

            metrics.successfulOperations += retryResult.insertedCount + retryResult.modifiedCount + retryResult.upsertedCount;
            metrics.processedRecords += batch.length;

            results.push({
              batchNumber: batchNumber,
              batchSize: batch.length,
              result: retryResult,
              processingTimeMs: Date.now() - batchStart,
              retryAttempt: true
            });

            console.log(`✅ Retry successful for batch ${batchNumber}`);

          } catch (retryError) {
            console.error(`Retry failed for batch ${batchNumber}:`, retryError);
            metrics.errors.push({
              batchNumber: batchNumber,
              error: `Retry failed: ${retryError.message}`,
              batchSize: batch.length,
              timestamp: new Date()
            });
          }
        }
      }
    }

    return results;
  }

  isRetryableError(error) {
    // Define retryable error conditions
    const retryableErrors = [
      'network timeout',
      'connection pool timeout',
      'temporary failure',
      'server selection timeout',
      'connection interrupted'
    ];

    return retryableErrors.some(retryableError => 
      error.message.toLowerCase().includes(retryableError)
    );
  }

  generateVariantId() {
    return 'var_' + Date.now() + '_' + Math.random().toString(36).substr(2, 9);
  }

  generateBatchId() {
    return 'batch_' + new Date().toISOString().replace(/[:.]/g, '') + '_' + Math.random().toString(36).substr(2, 6);
  }

  generatePerformanceRecommendations(metrics) {
    const recommendations = [];
    const totalTime = Date.now() - metrics.startTime;
    const overallThroughput = (metrics.successfulOperations / totalTime) * 1000;

    // Throughput analysis
    if (overallThroughput < 100) {
      recommendations.push({
        type: 'performance',
        priority: 'high',
        message: 'Low throughput detected. Consider increasing batch size or optimizing document structure.',
        currentValue: Math.round(overallThroughput),
        targetValue: 500
      });
    }

    // Error rate analysis
    const errorRate = (metrics.failedOperations / metrics.totalRecords) * 100;
    if (errorRate > 5) {
      recommendations.push({
        type: 'reliability',
        priority: 'high',
        message: 'High error rate. Review data quality and validation rules.',
        currentValue: errorRate.toFixed(2) + '%',
        targetValue: '< 1%'
      });
    }

    // Batch timing analysis
    if (metrics.batchTimes.length > 0) {
      const avgBatchTime = metrics.batchTimes.reduce((a, b) => a + b, 0) / metrics.batchTimes.length;
      const maxBatchTime = Math.max(...metrics.batchTimes);

      if (maxBatchTime > avgBatchTime * 3) {
        recommendations.push({
          type: 'optimization',
          priority: 'medium',
          message: 'Inconsistent batch processing times. Consider connection pool tuning.',
          currentValue: `${Math.round(maxBatchTime)}ms max, ${Math.round(avgBatchTime)}ms avg`
        });
      }
    }

    return recommendations.length > 0 ? recommendations : [
      { type: 'status', priority: 'info', message: 'Bulk operations performing optimally.' }
    ];
  }

  async performBulkOrderUpdates(orderUpdates) {
    console.log(`Processing ${orderUpdates.length} order updates...`);

    const config = this.batchConfigurations.get('order_updates');
    const collection = this.db.collection('orders');
    const bulkOps = [];

    // Create update operations for orders
    for (const update of orderUpdates) {
      const operation = {
        updateOne: {
          filter: { orderId: update.orderId },
          update: {
            $set: {
              status: update.status,
              updatedAt: new Date(),
              updatedBy: update.updatedBy || 'system',

              // Track status history
              $push: {
                statusHistory: {
                  status: update.status,
                  timestamp: new Date(),
                  updatedBy: update.updatedBy || 'system',
                  reason: update.reason || 'bulk_update'
                }
              }
            }
          },
          upsert: false
        }
      };

      // Add conditional updates based on status
      if (update.status === 'shipped') {
        operation.updateOne.update.$set.shippedAt = new Date();
        operation.updateOne.update.$set.trackingNumber = update.trackingNumber;
        operation.updateOne.update.$set.carrier = update.carrier;
      } else if (update.status === 'delivered') {
        operation.updateOne.update.$set.deliveredAt = new Date();
        operation.updateOne.update.$set.deliveryConfirmation = update.deliveryConfirmation;
      }

      bulkOps.push(operation);
    }

    return await this.executeBulkOperations(collection, bulkOps, config, {
      startTime: Date.now(),
      totalRecords: orderUpdates.length,
      processedRecords: 0,
      successfulOperations: 0,
      failedOperations: 0,
      errors: [],
      batchTimes: []
    });
  }

  async performAnalyticsBatchProcessing(analyticsData) {
    console.log(`Processing ${analyticsData.length} analytics records...`);

    const config = this.batchConfigurations.get('analytics_batch');
    const collection = this.db.collection('analytics_events');

    // Transform analytics data for bulk insert
    const bulkOps = analyticsData.map(event => ({
      insertOne: {
        document: {
          eventType: event.type,
          userId: event.userId,
          sessionId: event.sessionId,
          timestamp: new Date(event.timestamp),

          // Event properties
          properties: event.properties || {},

          // User context
          userAgent: event.userAgent,
          ipAddress: event.ipAddress,

          // Page/app context
          page: {
            url: event.pageUrl,
            title: event.pageTitle,
            referrer: event.referrer
          },

          // Device and browser info
          device: event.device || {},
          browser: event.browser || {},

          // Geographic data
          geo: event.geo || {},

          // Processing metadata
          processedAt: new Date(),
          batchId: this.generateBatchId()
        }
      }
    }));

    return await this.executeBulkOperations(collection, bulkOps, config, {
      startTime: Date.now(),
      totalRecords: analyticsData.length,
      processedRecords: 0,
      successfulOperations: 0,
      failedOperations: 0,
      errors: [],
      batchTimes: []
    });
  }

  async getBulkOperationStats() {
    const collections = ['products', 'orders', 'analytics_events'];
    const stats = {};

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      // Get collection statistics
      const collStats = await this.db.command({ collStats: collectionName });

      // Get recent bulk operation metrics
      const recentBulkOps = await collection.aggregate([
        {
          $match: {
            'audit.importBatch': { $exists: true },
            'audit.createdAt': { 
              $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) // Last 24 hours
            }
          }
        },
        {
          $group: {
            _id: '$audit.importBatch',
            count: { $sum: 1 },
            earliestRecord: { $min: '$audit.createdAt' },
            latestRecord: { $max: '$audit.createdAt' },
            dataSource: { $first: '$audit.dataSource' }
          }
        },
        { $sort: { latestRecord: -1 } },
        { $limit: 10 }
      ]).toArray();

      stats[collectionName] = {
        documentCount: collStats.count,
        storageSize: collStats.storageSize,
        indexSize: collStats.totalIndexSize,
        avgDocumentSize: collStats.avgObjSize,
        recentBulkOperations: recentBulkOps
      };
    }

    return {
      timestamp: new Date(),
      collections: stats,
      summary: this.generateStatsummary(stats)
    };
  }

  generateStatsummary(stats) {
    let totalDocuments = 0;
    let totalStorageSize = 0;
    let recentBulkOperations = 0;

    for (const [collectionName, collectionStats] of Object.entries(stats)) {
      totalDocuments += collectionStats.documentCount;
      totalStorageSize += collectionStats.storageSize;
      recentBulkOperations += collectionStats.recentBulkOperations.length;
    }

    return {
      totalDocuments,
      totalStorageSize,
      recentBulkOperations,
      averageDocumentSize: totalDocuments > 0 ? Math.round(totalStorageSize / totalDocuments) : 0,
      performanceIndicators: {
        storageEfficiency: totalStorageSize < (totalDocuments * 1000) ? 'excellent' : 'review',
        bulkOperationActivity: recentBulkOperations > 5 ? 'high' : 'normal'
      }
    };
  }

  async shutdown() {
    console.log('Shutting down MongoDB Bulk Operations Manager...');

    if (this.client) {
      await this.client.close();
      console.log('✅ MongoDB connection closed');
    }

    this.operationMetrics.clear();
    this.errorHandlers.clear();
    this.batchConfigurations.clear();
  }
}

// Export the bulk operations manager
module.exports = { MongoBulkOperationsManager };

// Benefits of MongoDB Bulk Operations:
// - Native batch processing eliminates individual operation overhead
// - Intelligent batching with configurable size and ordering options
// - Built-in error handling with detailed failure reporting and retry logic
// - High-performance throughput optimization for large-scale data processing
// - Atomic batch operations with comprehensive transaction support
// - Advanced monitoring and metrics tracking for performance analysis
// - Flexible operation types supporting mixed insert/update/delete operations
// - Production-ready error recovery and partial failure handling
// - Comprehensive performance recommendations and optimization guidance
// - SQL-compatible batch processing patterns through QueryLeaf integration

Understanding MongoDB Bulk Operations Architecture

Advanced Bulk Processing Patterns

Implement sophisticated bulk operation strategies for different data processing scenarios:

// Advanced bulk operations patterns for enterprise data processing
class EnterpriseBulkProcessor {
  constructor() {
    this.processingQueues = new Map();
    this.operationStrategies = new Map();
    this.performanceProfiler = new Map();
    this.errorRecoveryHandlers = new Map();
  }

  async initializeProcessingStrategies() {
    console.log('Initializing enterprise bulk processing strategies...');

    // Define processing strategies for different data types
    const strategies = {
      // High-frequency transactional updates
      'financial_transactions': {
        processingMode: 'ordered_atomic',
        batchSize: 100,
        maxConcurrency: 1,
        errorTolerance: 'zero',
        consistencyLevel: 'strong',

        validation: {
          strict: true,
          schemaValidation: true,
          businessRules: true,
          auditLogging: true
        },

        performance: {
          priorityLevel: 'critical',
          maxLatencyMs: 1000,
          throughputTarget: 500
        }
      },

      // Bulk inventory synchronization
      'inventory_sync': {
        processingMode: 'unordered_parallel',
        batchSize: 1000,
        maxConcurrency: 5,
        errorTolerance: 'partial',
        consistencyLevel: 'eventual',

        validation: {
          strict: false,
          schemaValidation: true,
          businessRules: false,
          auditLogging: false
        },

        performance: {
          priorityLevel: 'high',
          maxLatencyMs: 5000,
          throughputTarget: 2000
        }
      },

      // Analytics data ingestion
      'analytics_ingestion': {
        processingMode: 'unordered_parallel',
        batchSize: 5000,
        maxConcurrency: 10,
        errorTolerance: 'high',
        consistencyLevel: 'relaxed',

        validation: {
          strict: false,
          schemaValidation: false,
          businessRules: false,
          auditLogging: false
        },

        performance: {
          priorityLevel: 'normal',
          maxLatencyMs: 30000,
          throughputTarget: 10000
        }
      },

      // Customer data migration
      'customer_migration': {
        processingMode: 'ordered_atomic',
        batchSize: 200,
        maxConcurrency: 2,
        errorTolerance: 'low',
        consistencyLevel: 'strong',

        validation: {
          strict: true,
          schemaValidation: true,
          businessRules: true,
          auditLogging: true
        },

        performance: {
          priorityLevel: 'high',
          maxLatencyMs: 10000,
          throughputTarget: 100
        }
      }
    };

    for (const [strategyName, strategy] of Object.entries(strategies)) {
      this.operationStrategies.set(strategyName, strategy);
    }

    console.log('✅ Processing strategies initialized');
  }

  async processDataWithStrategy(strategyName, data, options = {}) {
    const strategy = this.operationStrategies.get(strategyName);
    if (!strategy) {
      throw new Error(`Unknown processing strategy: ${strategyName}`);
    }

    console.log(`Processing ${data.length} records with ${strategyName} strategy...`);

    const processingContext = {
      strategyName,
      strategy,
      startTime: Date.now(),
      totalRecords: data.length,
      processedRecords: 0,
      successfulRecords: 0,
      failedRecords: 0,
      errors: [],
      performanceMetrics: {
        batchTimes: [],
        throughputMeasurements: [],
        memoryUsage: []
      }
    };

    try {
      // Apply strategy-specific processing
      switch (strategy.processingMode) {
        case 'ordered_atomic':
          return await this.processOrderedAtomic(data, strategy, processingContext);
        case 'unordered_parallel':
          return await this.processUnorderedParallel(data, strategy, processingContext);
        default:
          throw new Error(`Unknown processing mode: ${strategy.processingMode}`);
      }

    } catch (error) {
      console.error(`Processing failed for strategy ${strategyName}:`, error);
      return {
        success: false,
        error: error.message,
        context: processingContext
      };
    }
  }

  async processOrderedAtomic(data, strategy, context) {
    console.log('Processing with ordered atomic mode...');

    const collection = this.db.collection(context.collectionName || 'bulk_operations');
    const results = [];

    // Process in sequential batches to maintain order
    for (let i = 0; i < data.length; i += strategy.batchSize) {
      const batchStart = Date.now();
      const batch = data.slice(i, i + strategy.batchSize);

      try {
        // Start transaction for atomic processing
        const session = this.client.startSession();

        await session.withTransaction(async () => {
          const bulkOps = batch.map(record => this.createBulkOperation(record, strategy));

          const result = await collection.bulkWrite(bulkOps, {
            ordered: true,
            session: session,
            writeConcern: { w: 'majority', j: true } // Strong consistency
          });

          context.successfulRecords += result.insertedCount + result.modifiedCount;
          context.processedRecords += batch.length;

          results.push({
            batchIndex: Math.floor(i / strategy.batchSize),
            result: result,
            processingTimeMs: Date.now() - batchStart
          });
        });

        await session.endSession();

        // Performance monitoring for ordered processing
        const batchTime = Date.now() - batchStart;
        context.performanceMetrics.batchTimes.push(batchTime);

        if (batchTime > strategy.performance.maxLatencyMs) {
          console.warn(`Batch latency ${batchTime}ms exceeds target ${strategy.performance.maxLatencyMs}ms`);
        }

      } catch (error) {
        console.error(`Ordered batch failed at index ${i}:`, error);
        context.failedRecords += batch.length;
        context.errors.push({
          batchIndex: Math.floor(i / strategy.batchSize),
          error: error.message,
          recordCount: batch.length
        });

        // For zero error tolerance, stop processing
        if (strategy.errorTolerance === 'zero') {
          throw new Error(`Processing stopped due to zero error tolerance: ${error.message}`);
        }
      }
    }

    return {
      success: true,
      results: results,
      context: context
    };
  }

  async processUnorderedParallel(data, strategy, context) {
    console.log(`Processing with unordered parallel mode (${strategy.maxConcurrency} concurrent batches)...`);

    const collection = this.db.collection(context.collectionName || 'bulk_operations');
    const results = [];
    const concurrentPromises = [];

    // Create batches for parallel processing
    const batches = [];
    for (let i = 0; i < data.length; i += strategy.batchSize) {
      batches.push({
        index: Math.floor(i / strategy.batchSize),
        data: data.slice(i, i + strategy.batchSize)
      });
    }

    // Process batches with controlled concurrency
    for (let i = 0; i < batches.length; i += strategy.maxConcurrency) {
      const concurrentBatches = batches.slice(i, i + strategy.maxConcurrency);

      const batchPromises = concurrentBatches.map(async (batch) => {
        const batchStart = Date.now();

        try {
          const bulkOps = batch.data.map(record => this.createBulkOperation(record, strategy));

          const result = await collection.bulkWrite(bulkOps, {
            ordered: false, // Allow parallel processing within batch
            writeConcern: { w: 1, j: false } // Optimized for throughput
          });

          context.successfulRecords += result.insertedCount + result.modifiedCount;
          context.processedRecords += batch.data.length;

          return {
            batchIndex: batch.index,
            result: result,
            processingTimeMs: Date.now() - batchStart,
            throughputPerSecond: Math.round((batch.data.length / (Date.now() - batchStart)) * 1000)
          };

        } catch (error) {
          context.failedRecords += batch.data.length;
          context.errors.push({
            batchIndex: batch.index,
            error: error.message,
            recordCount: batch.data.length
          });

          return {
            batchIndex: batch.index,
            error: error.message,
            recordCount: batch.data.length
          };
        }
      });

      // Wait for current batch of concurrent operations
      const batchResults = await Promise.all(batchPromises);
      results.push(...batchResults);

      // Track performance metrics
      const successfulBatches = batchResults.filter(r => !r.error);
      if (successfulBatches.length > 0) {
        const avgThroughput = successfulBatches.reduce((sum, r) => sum + r.throughputPerSecond, 0) / successfulBatches.length;
        context.performanceMetrics.throughputMeasurements.push(avgThroughput);
      }
    }

    return {
      success: true,
      results: results,
      context: context
    };
  }

  createBulkOperation(record, strategy) {
    // Apply validation based on strategy
    if (strategy.validation.strict) {
      this.validateRecord(record, strategy);
    }

    // Transform record based on operation type
    const transformedRecord = this.transformRecord(record, strategy);

    // Return appropriate bulk operation
    if (record.operationType === 'update') {
      return {
        updateOne: {
          filter: { _id: record._id },
          update: { $set: transformedRecord },
          upsert: record.upsert || false
        }
      };
    } else {
      return {
        insertOne: {
          document: transformedRecord
        }
      };
    }
  }

  validateRecord(record, strategy) {
    // Implement validation logic based on strategy
    if (strategy.validation.schemaValidation) {
      // Schema validation logic
    }

    if (strategy.validation.businessRules) {
      // Business rules validation
    }

    return true;
  }

  transformRecord(record, strategy) {
    // Apply transformations based on strategy
    const transformed = { ...record };

    // Add audit fields based on strategy requirements
    if (strategy.validation.auditLogging) {
      transformed.audit = {
        processedAt: new Date(),
        strategy: strategy,
        version: 1
      };
    }

    return transformed;
  }

  async getProcessingMetrics() {
    const metrics = {
      timestamp: new Date(),
      strategies: {}
    };

    for (const [strategyName, strategy] of this.operationStrategies) {
      const performanceProfile = this.performanceProfiler.get(strategyName);

      metrics.strategies[strategyName] = {
        configuration: strategy,
        performance: performanceProfile || {
          avgThroughput: 0,
          avgLatency: 0,
          errorRate: 0,
          lastUsed: null
        }
      };
    }

    return metrics;
  }
}

// Export the enterprise bulk processor
module.exports = { EnterpriseBulkProcessor };

SQL-Style Bulk Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB bulk operations and batch processing:

-- QueryLeaf bulk operations with SQL-familiar syntax

-- Bulk insert operations with performance optimization
BULK INSERT INTO products 
SELECT 
  sku,
  name,
  description,

  -- Pricing structure
  price as "pricing.basePrice",
  cost_price as "pricing.costPrice",
  margin_percent as "pricing.marginPercent",
  currency as "pricing.currency",

  -- Inventory management
  stock_quantity as "inventory.stockQuantity",
  minimum_order_quantity as "inventory.minimumOrderQuantity",
  track_inventory as "inventory.trackInventory",

  -- Product specifications
  weight_kg as "specifications.weight.value",
  'kg' as "specifications.weight.unit",
  JSON_PARSE(dimensions_json) as "specifications.dimensions",

  -- Classification
  category_id as "classification.categoryId",
  STRING_SPLIT(category_path, '/') as "classification.categoryPath",
  STRING_SPLIT(tags, ',') as "classification.tags",

  -- Audit information
  CURRENT_TIMESTAMP as "audit.createdAt",
  'bulk_import_system' as "audit.createdBy",
  CONCAT('batch_', DATE_FORMAT(NOW(), '%Y%m%d_%H%i%s')) as "audit.importBatch"

FROM product_import_staging
WHERE validation_status = 'passed'
WITH BULK_OPTIONS (
  batch_size = 1000,
  ordered = false,
  timeout_seconds = 300,
  retry_attempts = 3
);

-- Bulk update operations with conditional logic
BULK UPDATE orders 
SET 
  status = staging.new_status,
  updated_at = CURRENT_TIMESTAMP,
  updated_by = 'bulk_update_system',

  -- Conditional updates based on status
  shipped_at = CASE 
    WHEN staging.new_status = 'shipped' THEN CURRENT_TIMESTAMP 
    ELSE shipped_at 
  END,

  tracking_number = CASE 
    WHEN staging.new_status = 'shipped' THEN staging.tracking_number 
    ELSE tracking_number 
  END,

  delivered_at = CASE 
    WHEN staging.new_status = 'delivered' THEN CURRENT_TIMESTAMP 
    ELSE delivered_at 
  END,

  -- Array operations for status history
  $PUSH = {
    "status_history": {
      "status": staging.new_status,
      "timestamp": CURRENT_TIMESTAMP,
      "updated_by": 'bulk_update_system',
      "reason": COALESCE(staging.reason, 'bulk_update')
    }
  }

FROM order_status_updates staging
WHERE orders.order_id = staging.order_id
  AND orders.status != staging.new_status
WITH BULK_OPTIONS (
  batch_size = 500,
  ordered = true,
  upsert = false
);

-- Bulk upsert operations (insert or update)
BULK UPSERT INTO customer_profiles
SELECT 
  customer_id,
  email,
  first_name,
  last_name,

  -- Contact information
  phone,
  JSON_OBJECT(
    'street', street_address,
    'city', city,
    'state', state,
    'postal_code', postal_code,
    'country', country
  ) as address,

  -- Preferences and segmentation
  marketing_preferences,
  customer_segment,
  lifetime_value,

  -- Behavioral data
  last_purchase_date,
  total_orders,
  average_order_value,

  -- Timestamps
  COALESCE(existing_created_at, CURRENT_TIMESTAMP) as created_at,
  CURRENT_TIMESTAMP as updated_at

FROM customer_data_import cdi
LEFT JOIN customer_profiles existing 
  ON existing.customer_id = cdi.customer_id
WITH BULK_OPTIONS (
  batch_size = 750,
  ordered = false,
  match_fields = ['customer_id'],
  upsert = true
);

-- Bulk delete operations with conditions
BULK DELETE FROM expired_sessions
WHERE last_activity < DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 30 DAY)
  AND session_status = 'inactive'
WITH BULK_OPTIONS (
  batch_size = 2000,
  ordered = false,
  confirm_delete = true
);

-- Advanced bulk operations with aggregation
WITH aggregated_metrics AS (
  SELECT 
    user_id,
    DATE_TRUNC('day', event_timestamp) as event_date,

    -- Event aggregations
    COUNT(*) as total_events,
    COUNT(DISTINCT session_id) as unique_sessions,
    SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchase_events,
    SUM(CASE WHEN event_type = 'page_view' THEN 1 ELSE 0 END) as page_view_events,

    -- Value calculations
    SUM(COALESCE(event_value, 0)) as total_event_value,
    AVG(COALESCE(session_duration_seconds, 0)) as avg_session_duration,

    -- Behavioral insights
    MAX(event_timestamp) as last_activity,
    MIN(event_timestamp) as first_activity,

    -- Engagement scoring
    CASE 
      WHEN COUNT(*) > 100 THEN 'high_engagement'
      WHEN COUNT(*) > 20 THEN 'medium_engagement'
      ELSE 'low_engagement'
    END as engagement_level

  FROM user_events
  WHERE event_timestamp >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
  GROUP BY user_id, DATE_TRUNC('day', event_timestamp)
)

BULK INSERT INTO daily_user_metrics
SELECT 
  user_id,
  event_date,
  total_events,
  unique_sessions,
  purchase_events,
  page_view_events,
  total_event_value,
  avg_session_duration,
  last_activity,
  first_activity,
  engagement_level,

  -- Computed metrics
  ROUND(total_event_value / NULLIF(total_events, 0), 2) as avg_event_value,
  ROUND(page_view_events::DECIMAL / NULLIF(unique_sessions, 0), 2) as pages_per_session,

  -- Processing metadata
  CURRENT_TIMESTAMP as computed_at,
  'daily_aggregation' as metric_source

FROM aggregated_metrics
WITH BULK_OPTIONS (
  batch_size = 1500,
  ordered = false,
  on_conflict = 'replace'
);

-- Bulk operations performance monitoring
SELECT 
  operation_type,
  collection_name,
  batch_size,

  -- Performance metrics
  COUNT(*) as total_operations,
  AVG(records_processed) as avg_records_per_operation,
  AVG(processing_time_ms) as avg_processing_time_ms,

  -- Throughput calculations
  SUM(records_processed) as total_records_processed,
  SUM(processing_time_ms) as total_processing_time_ms,
  ROUND(
    SUM(records_processed)::DECIMAL / (SUM(processing_time_ms) / 1000),
    2
  ) as overall_throughput_per_second,

  -- Success rates
  AVG(success_rate) as avg_success_rate,
  COUNT(*) FILTER (WHERE success_rate = 100) as fully_successful_operations,
  COUNT(*) FILTER (WHERE success_rate < 95) as problematic_operations,

  -- Error analysis
  AVG(error_count) as avg_errors_per_operation,
  SUM(error_count) as total_errors,

  -- Resource utilization
  AVG(memory_usage_mb) as avg_memory_usage_mb,
  MAX(memory_usage_mb) as peak_memory_usage_mb,

  -- Time-based analysis
  MIN(operation_start_time) as earliest_operation,
  MAX(operation_end_time) as latest_operation,

  -- Performance assessment
  CASE 
    WHEN AVG(processing_time_ms) < 1000 AND AVG(success_rate) >= 99 THEN 'excellent'
    WHEN AVG(processing_time_ms) < 5000 AND AVG(success_rate) >= 95 THEN 'good'
    WHEN AVG(processing_time_ms) < 10000 AND AVG(success_rate) >= 90 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as performance_rating,

  -- Optimization recommendations
  CASE 
    WHEN AVG(processing_time_ms) > 10000 THEN 'Consider reducing batch size or optimizing queries'
    WHEN AVG(success_rate) < 95 THEN 'Review error handling and data validation'
    WHEN AVG(memory_usage_mb) > 500 THEN 'Optimize document size or batch processing'
    WHEN overall_throughput_per_second < 100 THEN 'Consider increasing batch size or parallelization'
    ELSE 'Performance within acceptable parameters'
  END as optimization_recommendation

FROM bulk_operation_logs
WHERE operation_start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY operation_type, collection_name, batch_size
ORDER BY total_records_processed DESC;

-- Bulk operation capacity planning
CREATE VIEW bulk_operations_capacity_planning AS
WITH hourly_bulk_metrics AS (
  SELECT 
    DATE_TRUNC('hour', operation_start_time) as hour_bucket,
    operation_type,

    -- Volume metrics
    COUNT(*) as operations_per_hour,
    SUM(records_processed) as records_per_hour,
    AVG(records_processed) as avg_records_per_operation,

    -- Performance metrics
    AVG(processing_time_ms) as avg_processing_time_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms) as p95_processing_time_ms,

    -- Throughput calculations
    ROUND(
      SUM(records_processed)::DECIMAL / (SUM(processing_time_ms) / 1000),
      2
    ) as throughput_per_second,

    -- Resource usage
    AVG(memory_usage_mb) as avg_memory_usage_mb,
    MAX(memory_usage_mb) as peak_memory_usage_mb,

    -- Success metrics
    AVG(success_rate) as avg_success_rate,
    SUM(error_count) as errors_per_hour

  FROM bulk_operation_logs
  WHERE operation_start_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY DATE_TRUNC('hour', operation_start_time), operation_type
),

capacity_trends AS (
  SELECT 
    *,
    -- Trend analysis
    LAG(throughput_per_second) OVER (PARTITION BY operation_type ORDER BY hour_bucket) as prev_throughput,
    LAG(records_per_hour) OVER (PARTITION BY operation_type ORDER BY hour_bucket) as prev_records_per_hour,

    -- Growth calculations
    CASE 
      WHEN LAG(throughput_per_second) OVER (PARTITION BY operation_type ORDER BY hour_bucket) IS NOT NULL
      THEN ROUND(
        (throughput_per_second - LAG(throughput_per_second) OVER (PARTITION BY operation_type ORDER BY hour_bucket)) / 
        LAG(throughput_per_second) OVER (PARTITION BY operation_type ORDER BY hour_bucket) * 100,
        2
      )
      ELSE 0
    END as throughput_change_percent

  FROM hourly_bulk_metrics
)

SELECT 
  operation_type,
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as analysis_hour,

  -- Volume analysis
  operations_per_hour,
  records_per_hour,
  ROUND(avg_records_per_operation::NUMERIC, 0) as avg_records_per_operation,

  -- Performance analysis
  ROUND(avg_processing_time_ms::NUMERIC, 0) as avg_processing_time_ms,
  ROUND(p95_processing_time_ms::NUMERIC, 0) as p95_processing_time_ms,
  throughput_per_second,

  -- Resource analysis
  ROUND(avg_memory_usage_mb::NUMERIC, 1) as avg_memory_usage_mb,
  ROUND(peak_memory_usage_mb::NUMERIC, 1) as peak_memory_usage_mb,

  -- Quality metrics
  ROUND(avg_success_rate::NUMERIC, 2) as avg_success_rate_pct,
  errors_per_hour,

  -- Trend analysis
  throughput_change_percent,

  -- Capacity assessment
  CASE 
    WHEN throughput_per_second > 1000 AND avg_success_rate >= 99 THEN 'optimal_capacity'
    WHEN throughput_per_second > 500 AND avg_success_rate >= 95 THEN 'good_capacity'
    WHEN throughput_per_second > 100 AND avg_success_rate >= 90 THEN 'adequate_capacity'
    ELSE 'capacity_constraints'
  END as capacity_status,

  -- Scaling recommendations
  CASE 
    WHEN throughput_change_percent > 50 THEN 'Monitor for sustained growth - consider scaling'
    WHEN throughput_change_percent < -25 THEN 'Throughput declining - investigate performance'
    WHEN peak_memory_usage_mb > 1000 THEN 'High memory usage - optimize batch sizes'
    WHEN errors_per_hour > 100 THEN 'High error rate - review data quality'
    ELSE 'Capacity planning within normal parameters'
  END as scaling_recommendation

FROM capacity_trends
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY operation_type, hour_bucket DESC;

-- QueryLeaf provides comprehensive bulk operations capabilities:
-- 1. SQL-familiar syntax for MongoDB bulk insert, update, upsert, and delete operations
-- 2. Advanced batch processing with configurable performance options
-- 3. Real-time performance monitoring with throughput and success rate tracking
-- 4. Intelligent error handling with retry logic and partial failure recovery
-- 5. Capacity planning with trend analysis and scaling recommendations
-- 6. Resource utilization monitoring with memory and processing time analysis
-- 7. Flexible operation strategies for different data processing scenarios
-- 8. Production-ready bulk processing with comprehensive audit and logging
-- 9. Integration with MongoDB's native bulk write capabilities
-- 10. Enterprise-grade batch operations accessible through familiar SQL constructs

Best Practices for MongoDB Bulk Operations Implementation

Bulk Operation Optimization Strategies

Essential practices for maximizing bulk operation performance:

Batch Size Optimization: Configure optimal batch sizes based on document size and available memory
Operation Ordering: Choose ordered vs unordered operations based on consistency requirements
Write Concern Tuning: Balance durability requirements with throughput needs
Error Handling Strategy: Implement comprehensive error handling with retry logic
Memory Management: Monitor memory usage and optimize document structures
Performance Monitoring: Track throughput, latency, and error rates continuously

Production Deployment Considerations

Key factors for enterprise bulk operation deployments:

Concurrency Control: Implement proper concurrency limits to prevent resource exhaustion
Progress Tracking: Provide comprehensive progress reporting for long-running operations
Atomic Transactions: Use transactions for operations requiring strong consistency
Failover Handling: Design operations to handle replica set failovers gracefully
Resource Scaling: Plan for dynamic resource scaling based on processing demands
Data Validation: Implement multi-layer validation to ensure data quality

Conclusion

MongoDB Bulk Write Operations provide enterprise-grade batch processing capabilities that dramatically improve throughput and efficiency for large-scale data operations. The combination of intelligent batching, comprehensive error handling, and performance optimization enables applications to process millions of records efficiently while maintaining data consistency and reliability.

Key MongoDB bulk operations benefits include:

High-Performance Processing: Native bulk operations eliminate individual operation overhead for massive throughput improvements
Intelligent Batching: Configurable batch sizes and ordering options optimized for different processing scenarios
Comprehensive Error Handling: Advanced error recovery with detailed failure reporting and retry logic
Atomic Operations: Transaction support for operations requiring strong consistency guarantees
Performance Monitoring: Real-time metrics and recommendations for continuous optimization
SQL Compatibility: Familiar bulk processing patterns accessible through SQL-style operations

Whether you're importing large datasets, synchronizing with external systems, or performing batch transformations, MongoDB bulk operations with QueryLeaf's SQL-familiar interface provide the foundation for scalable data processing that maintains high performance while simplifying operational complexity.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB bulk operations while providing SQL-familiar syntax for batch insert, update, upsert, and delete operations. Advanced error handling, performance monitoring, and capacity planning are seamlessly accessible through familiar SQL constructs, making sophisticated bulk data processing both powerful and approachable for SQL-oriented teams.

The combination of MongoDB's high-performance bulk operations with familiar SQL-style management makes it an ideal platform for applications that require both massive data processing capabilities and operational simplicity, ensuring your batch processing infrastructure scales efficiently while maintaining familiar development and operational patterns.