2025

November 17, 2025
21 min read

MongoDB Connection Pooling and Performance Optimization: Production-Scale Database Connection Management and High-Performance Application Design

High-performance applications require efficient database connection management to handle concurrent user requests, maintain low latency, and scale effectively under varying load conditions. Poor connection management leads to resource exhaustion, application timeouts, and degraded user experience, particularly in microservices architectures where multiple services compete for database connections.

MongoDB connection pooling provides sophisticated connection lifecycle management with intelligent pool sizing, connection health monitoring, and automatic failover capabilities. Unlike traditional databases that require complex external connection pooling solutions, MongoDB drivers include built-in connection pool management that integrates seamlessly with MongoDB's replica sets and sharded clusters.

The Traditional Connection Management Challenge

Managing database connections efficiently in traditional environments requires complex infrastructure and careful tuning:

-- Traditional PostgreSQL connection management - complex and resource-intensive

-- Connection monitoring and management requires external tools
-- PostgreSQL connection stats (requires pg_stat_activity monitoring)
SELECT 
    pid,
    usename as username,
    application_name,
    client_addr,
    state,
    state_change,

    -- Connection timing
    backend_start,
    query_start,
    EXTRACT(EPOCH FROM (NOW() - backend_start)) as connection_age_seconds,
    EXTRACT(EPOCH FROM (NOW() - query_start)) as query_duration_seconds,

    -- Current activity
    query,
    wait_event_type,
    wait_event,

    -- Resource usage estimation
    CASE 
        WHEN state = 'active' THEN 'high'
        WHEN state = 'idle in transaction' THEN 'medium'
        WHEN state = 'idle' THEN 'low'
        ELSE 'unknown'
    END as resource_impact

FROM pg_stat_activity
WHERE pid != pg_backend_pid() -- Exclude current monitoring connection
ORDER BY backend_start DESC;

-- PostgreSQL connection limits and configuration
-- Must be configured at database server level in postgresql.conf:
-- max_connections = 200          -- Hard limit on concurrent connections
-- shared_buffers = 256MB         -- Memory allocation affects connection overhead
-- effective_cache_size = 4GB     -- Query planning memory estimates
-- work_mem = 4MB                 -- Per-operation memory limit
-- maintenance_work_mem = 256MB   -- Maintenance operation memory

-- Application-level connection pooling setup (using PgBouncer configuration)
-- /etc/pgbouncer/pgbouncer.ini configuration file:
-- [databases]
-- production_db = host=localhost port=5432 dbname=production user=app_user
-- 
-- [pgbouncer]  
-- pool_mode = transaction        -- Connection reuse strategy
-- max_client_conn = 1000         -- Maximum client connections
-- default_pool_size = 20         -- Connections per database/user
-- min_pool_size = 5              -- Minimum maintained connections
-- reserve_pool_size = 5          -- Emergency connection reserve
-- max_db_connections = 50        -- Maximum per database
-- 
-- # Connection lifecycle settings
-- server_lifetime = 3600         -- Server connection max age (seconds)
-- server_idle_timeout = 600      -- Idle server connection timeout
-- client_idle_timeout = 0        -- Client idle timeout (0 = disabled)
-- 
-- # Performance tuning
-- listen_backlog = 128           -- TCP listen queue size
-- server_connect_timeout = 15    -- Server connection timeout
-- server_login_retry = 15        -- Login retry interval

-- Complex connection pool monitoring query
WITH connection_stats AS (
    SELECT 
        datname as database_name,
        usename as username,
        application_name,
        client_addr,
        state,

        -- Connection lifecycle analysis
        CASE 
            WHEN backend_start > NOW() - INTERVAL '1 minute' THEN 'new'
            WHEN backend_start > NOW() - INTERVAL '5 minutes' THEN 'recent'
            WHEN backend_start > NOW() - INTERVAL '30 minutes' THEN 'established'
            ELSE 'long_running'
        END as connection_age_category,

        -- Query activity analysis
        CASE 
            WHEN query_start IS NULL THEN 'no_query_executed'
            WHEN query_start > NOW() - INTERVAL '1 second' THEN 'active_query'
            WHEN query_start > NOW() - INTERVAL '30 seconds' THEN 'recent_query'
            ELSE 'old_query'
        END as query_activity_category,

        -- Resource usage indicators
        CASE 
            WHEN wait_event_type = 'Lock' THEN 'blocking'
            WHEN wait_event_type = 'IO' THEN 'io_intensive'
            WHEN state = 'idle in transaction' THEN 'transaction_holding'
            WHEN state = 'active' THEN 'cpu_intensive'
            ELSE 'idle'
        END as resource_usage_pattern,

        EXTRACT(EPOCH FROM (NOW() - backend_start)) as connection_duration_seconds,
        EXTRACT(EPOCH FROM (NOW() - query_start)) as current_query_seconds

    FROM pg_stat_activity
    WHERE pid != pg_backend_pid()
),

connection_summary AS (
    SELECT 
        database_name,
        state,
        connection_age_category,
        query_activity_category,
        resource_usage_pattern,

        -- Connection counts by category
        COUNT(*) as connection_count,
        AVG(connection_duration_seconds) as avg_connection_age,
        MAX(connection_duration_seconds) as max_connection_age,
        AVG(current_query_seconds) as avg_query_duration,
        MAX(current_query_seconds) as max_query_duration,

        -- Resource impact assessment
        COUNT(*) FILTER (WHERE resource_usage_pattern = 'blocking') as blocking_connections,
        COUNT(*) FILTER (WHERE resource_usage_pattern = 'io_intensive') as io_intensive_connections,
        COUNT(*) FILTER (WHERE resource_usage_pattern = 'transaction_holding') as transaction_holding_connections,
        COUNT(*) FILTER (WHERE state = 'idle') as idle_connections,
        COUNT(*) FILTER (WHERE state = 'active') as active_connections

    FROM connection_stats
    GROUP BY database_name, state, connection_age_category, query_activity_category, resource_usage_pattern
),

database_connection_health AS (
    SELECT 
        database_name,
        SUM(connection_count) as total_connections,
        SUM(active_connections) as total_active,
        SUM(idle_connections) as total_idle,
        SUM(blocking_connections) as total_blocking,

        -- Health indicators
        ROUND(AVG(avg_connection_age)::numeric, 2) as avg_connection_age_seconds,
        ROUND(MAX(max_connection_age)::numeric, 2) as oldest_connection_seconds,
        ROUND(AVG(avg_query_duration)::numeric, 2) as avg_query_duration_seconds,

        -- Connection efficiency ratio
        ROUND(
            (SUM(active_connections)::decimal / NULLIF(SUM(connection_count), 0) * 100)::numeric, 
            2
        ) as connection_utilization_percent,

        -- Performance assessment
        CASE 
            WHEN SUM(blocking_connections) > 5 THEN 'critical_blocking'
            WHEN SUM(connection_count) > 150 THEN 'high_connection_count'
            WHEN AVG(avg_query_duration) > 30 THEN 'slow_queries'
            WHEN SUM(idle_connections)::decimal / SUM(connection_count) > 0.7 THEN 'excessive_idle'
            ELSE 'healthy'
        END as connection_health_status

    FROM connection_summary
    GROUP BY database_name
)

SELECT 
    database_name,
    total_connections,
    total_active,
    total_idle,
    connection_utilization_percent,

    -- Performance indicators
    avg_connection_age_seconds,
    avg_query_duration_seconds,
    connection_health_status,

    -- Recommendations
    CASE connection_health_status
        WHEN 'critical_blocking' THEN 'Investigate blocking queries and deadlocks'
        WHEN 'high_connection_count' THEN 'Consider connection pooling or scaling'
        WHEN 'slow_queries' THEN 'Optimize query performance and indexing'
        WHEN 'excessive_idle' THEN 'Tune connection pool idle timeout settings'
        ELSE 'Connection pool operating within normal parameters'
    END as optimization_recommendation,

    -- Connection pool sizing recommendations
    CASE 
        WHEN connection_utilization_percent > 90 THEN 'Increase pool size'
        WHEN connection_utilization_percent < 30 THEN 'Reduce pool size'
        ELSE 'Pool size appropriate'
    END as pool_sizing_recommendation

FROM database_connection_health
ORDER BY total_connections DESC;

-- Problems with traditional connection management:
-- 1. External connection pooler configuration and maintenance complexity
-- 2. Limited visibility into connection pool health and performance metrics
-- 3. Manual tuning of pool sizes and timeout configurations
-- 4. Complex failover and high availability connection management
-- 5. Difficulty coordinating connection pools across multiple application instances
-- 6. Limited integration with database cluster topology changes
-- 7. Resource overhead from maintaining separate connection pooling infrastructure
-- 8. Complex monitoring and alerting setup for connection pool health
-- 9. Manual configuration management across different environments
-- 10. Poor integration with modern microservices and containerized deployments

MongoDB provides native connection pooling with intelligent management and monitoring:

// MongoDB Connection Pooling - native high-performance connection management
const { MongoClient } = require('mongodb');

// Advanced MongoDB Connection Pool Manager
class MongoConnectionPoolManager {
  constructor() {
    this.clients = new Map();
    this.poolConfigurations = new Map();
    this.connectionMetrics = new Map();
    this.healthCheckIntervals = new Map();
  }

  async createOptimizedConnectionPools() {
    console.log('Creating optimized MongoDB connection pools...');

    // Production connection pool configuration
    const productionPoolConfig = {
      // Core pool sizing
      minPoolSize: 5,           // Minimum connections maintained
      maxPoolSize: 50,          // Maximum concurrent connections
      maxIdleTimeMS: 30000,     // 30 seconds idle timeout

      // Connection lifecycle management
      maxConnecting: 5,         // Maximum concurrent connection attempts
      serverSelectionTimeoutMS: 30000,  // Server selection timeout
      socketTimeoutMS: 45000,   // Socket operation timeout
      connectTimeoutMS: 10000,  // Initial connection timeout

      // Health monitoring
      heartbeatFrequencyMS: 10000,  // Heartbeat interval
      serverMonitoringMode: 'auto', // Automatic server monitoring

      // Performance optimization
      compressors: ['zlib'],    // Enable compression
      zlibCompressionLevel: 6,  // Compression efficiency

      // Error handling and retries
      retryWrites: true,        // Enable retryable writes
      retryReads: true,         // Enable retryable reads
      maxWriteRetries: 3,       // Maximum write retry attempts

      // Connection security
      tls: true,               // Enable TLS
      tlsInsecure: false,      // Require certificate validation

      // Application metadata
      appName: 'ProductionApp', // Application identifier
      driverInfo: {
        name: 'MongoDB Connection Pool Manager',
        version: '1.0.0'
      }
    };

    // Read-only replica connection pool (for analytics)
    const analyticsPoolConfig = {
      ...productionPoolConfig,
      minPoolSize: 2,
      maxPoolSize: 20,
      readPreference: 'secondary', // Prefer secondary for reads
      readConcern: { level: 'available' }, // Relaxed read concern
      appName: 'AnalyticsApp'
    };

    // High-throughput batch processing pool
    const batchProcessingConfig = {
      ...productionPoolConfig,
      minPoolSize: 10,
      maxPoolSize: 100,         // Higher concurrency for batch jobs
      maxIdleTimeMS: 60000,     // Longer idle timeout
      maxConnecting: 10,        // More concurrent connections
      writeConcern: { w: 1, j: false }, // Optimized write concern
      appName: 'BatchProcessor'
    };

    try {
      // Create primary production client
      const productionClient = new MongoClient(
        process.env.MONGODB_PRODUCTION_URI || 'mongodb://localhost:27017/production',
        productionPoolConfig
      );
      await productionClient.connect();
      this.clients.set('production', productionClient);
      this.poolConfigurations.set('production', productionPoolConfig);

      // Create analytics replica client
      const analyticsClient = new MongoClient(
        process.env.MONGODB_REPLICA_URI || 'mongodb://localhost:27017/production',
        analyticsPoolConfig
      );
      await analyticsClient.connect();
      this.clients.set('analytics', analyticsClient);
      this.poolConfigurations.set('analytics', analyticsPoolConfig);

      // Create batch processing client
      const batchClient = new MongoClient(
        process.env.MONGODB_BATCH_URI || 'mongodb://localhost:27017/production',
        batchProcessingConfig
      );
      await batchClient.connect();
      this.clients.set('batch', batchClient);
      this.poolConfigurations.set('batch', batchProcessingConfig);

      // Initialize connection pool monitoring
      await this.initializePoolMonitoring();

      console.log('✅ MongoDB connection pools created successfully');
      return {
        success: true,
        pools: ['production', 'analytics', 'batch'],
        configurations: Object.fromEntries(this.poolConfigurations)
      };

    } catch (error) {
      console.error('Error creating connection pools:', error);
      return { success: false, error: error.message };
    }
  }

  async initializePoolMonitoring() {
    console.log('Initializing connection pool monitoring...');

    for (const [poolName, client] of this.clients) {
      // Initialize metrics tracking
      this.connectionMetrics.set(poolName, {
        connectionEvents: [],
        performanceMetrics: {
          totalConnections: 0,
          activeConnections: 0,
          availableConnections: 0,
          connectionsCreated: 0,
          connectionsDestroyed: 0,
          operationTime: [],
          errorRate: 0
        },
        healthStatus: 'healthy'
      });

      // Set up connection pool event monitoring
      client.on('connectionPoolCreated', (event) => {
        console.log(`Pool created: ${poolName}`, event);
        this.recordPoolEvent(poolName, 'pool_created', event);
      });

      client.on('connectionCreated', (event) => {
        console.log(`Connection created: ${poolName}`, event.connectionId);
        this.recordPoolEvent(poolName, 'connection_created', event);
        this.updatePoolMetrics(poolName, 'connection_created');
      });

      client.on('connectionReady', (event) => {
        this.recordPoolEvent(poolName, 'connection_ready', event);
      });

      client.on('connectionClosed', (event) => {
        console.log(`Connection closed: ${poolName}`, event.connectionId, event.reason);
        this.recordPoolEvent(poolName, 'connection_closed', event);
        this.updatePoolMetrics(poolName, 'connection_closed');
      });

      client.on('connectionCheckOutStarted', (event) => {
        this.recordPoolEvent(poolName, 'checkout_started', event);
      });

      client.on('connectionCheckedOut', (event) => {
        this.recordPoolEvent(poolName, 'checkout_completed', event);
        this.updatePoolMetrics(poolName, 'checkout_completed');
      });

      client.on('connectionCheckedIn', (event) => {
        this.recordPoolEvent(poolName, 'checkin_completed', event);
        this.updatePoolMetrics(poolName, 'checkin_completed');
      });

      client.on('connectionCheckOutFailed', (event) => {
        console.warn(`Connection checkout failed: ${poolName}`, event.reason);
        this.recordPoolEvent(poolName, 'checkout_failed', event);
        this.updatePoolMetrics(poolName, 'checkout_failed');
      });

      // Server monitoring events
      client.on('serverOpening', (event) => {
        console.log(`Server opening: ${poolName}`, event.address);
        this.recordPoolEvent(poolName, 'server_opening', event);
      });

      client.on('serverClosed', (event) => {
        console.log(`Server closed: ${poolName}`, event.address);
        this.recordPoolEvent(poolName, 'server_closed', event);
      });

      client.on('serverDescriptionChanged', (event) => {
        this.recordPoolEvent(poolName, 'server_description_changed', event);
        this.assessPoolHealth(poolName, event);
      });

      // Set up periodic health checks
      const healthCheckInterval = setInterval(() => {
        this.performPoolHealthCheck(poolName, client);
      }, 30000); // Every 30 seconds

      this.healthCheckIntervals.set(poolName, healthCheckInterval);
    }

    console.log('✅ Connection pool monitoring initialized');
  }

  recordPoolEvent(poolName, eventType, eventData) {
    const metrics = this.connectionMetrics.get(poolName);
    if (metrics) {
      metrics.connectionEvents.push({
        timestamp: new Date(),
        type: eventType,
        data: eventData
      });

      // Keep only last 1000 events to prevent memory leaks
      if (metrics.connectionEvents.length > 1000) {
        metrics.connectionEvents = metrics.connectionEvents.slice(-1000);
      }
    }
  }

  updatePoolMetrics(poolName, eventType) {
    const metrics = this.connectionMetrics.get(poolName);
    if (!metrics) return;

    const performance = metrics.performanceMetrics;

    switch (eventType) {
      case 'connection_created':
        performance.connectionsCreated++;
        performance.totalConnections++;
        break;
      case 'connection_closed':
        performance.connectionsDestroyed++;
        performance.totalConnections = Math.max(0, performance.totalConnections - 1);
        break;
      case 'checkout_completed':
        performance.activeConnections++;
        break;
      case 'checkin_completed':
        performance.activeConnections = Math.max(0, performance.activeConnections - 1);
        break;
      case 'checkout_failed':
        performance.errorRate = (performance.errorRate * 0.95) + 0.05; // Exponential moving average
        break;
    }

    // Calculate available connections
    const poolConfig = this.poolConfigurations.get(poolName);
    if (poolConfig) {
      performance.availableConnections = Math.max(0, 
        Math.min(poolConfig.maxPoolSize, performance.totalConnections) - performance.activeConnections
      );
    }
  }

  async performPoolHealthCheck(poolName, client) {
    try {
      const startTime = Date.now();

      // Perform a lightweight operation to test connectivity
      const db = client.db('admin');
      const result = await db.command({ ping: 1 });

      const operationTime = Date.now() - startTime;

      // Record operation time for performance tracking
      const metrics = this.connectionMetrics.get(poolName);
      if (metrics) {
        metrics.performanceMetrics.operationTime.push({
          timestamp: new Date(),
          duration: operationTime
        });

        // Keep only last 100 operation times
        if (metrics.performanceMetrics.operationTime.length > 100) {
          metrics.performanceMetrics.operationTime = metrics.performanceMetrics.operationTime.slice(-100);
        }

        // Update health status based on performance
        if (operationTime > 5000) {
          metrics.healthStatus = 'degraded';
        } else if (operationTime > 1000) {
          metrics.healthStatus = 'warning';
        } else {
          metrics.healthStatus = 'healthy';
        }
      }

    } catch (error) {
      console.error(`Health check failed for pool ${poolName}:`, error);
      const metrics = this.connectionMetrics.get(poolName);
      if (metrics) {
        metrics.healthStatus = 'unhealthy';
        metrics.performanceMetrics.errorRate = Math.min(1.0, metrics.performanceMetrics.errorRate + 0.1);
      }
    }
  }

  assessPoolHealth(poolName, serverEvent) {
    const metrics = this.connectionMetrics.get(poolName);
    if (!metrics) return;

    const { newDescription } = serverEvent;

    // Assess server health based on description
    if (newDescription.type === 'Unknown' || newDescription.error) {
      metrics.healthStatus = 'unhealthy';
    } else if (newDescription.type === 'RSSecondary' || newDescription.type === 'RSPrimary') {
      metrics.healthStatus = 'healthy';
    }
  }

  async getPoolMetrics(poolName) {
    const metrics = this.connectionMetrics.get(poolName);
    const config = this.poolConfigurations.get(poolName);

    if (!metrics || !config) {
      return { error: `Pool ${poolName} not found` };
    }

    const recentEvents = metrics.connectionEvents.slice(-10); // Last 10 events
    const recentOperationTimes = metrics.performanceMetrics.operationTime.slice(-20); // Last 20 operations

    return {
      poolName: poolName,
      healthStatus: metrics.healthStatus,

      // Connection metrics
      connections: {
        total: metrics.performanceMetrics.totalConnections,
        active: metrics.performanceMetrics.activeConnections,
        available: metrics.performanceMetrics.availableConnections,
        created: metrics.performanceMetrics.connectionsCreated,
        destroyed: metrics.performanceMetrics.connectionsDestroyed,

        // Pool configuration
        minPoolSize: config.minPoolSize,
        maxPoolSize: config.maxPoolSize,
        utilization: metrics.performanceMetrics.totalConnections / config.maxPoolSize
      },

      // Performance metrics
      performance: {
        averageOperationTime: recentOperationTimes.length > 0 ?
          recentOperationTimes.reduce((sum, op) => sum + op.duration, 0) / recentOperationTimes.length : 0,
        errorRate: metrics.performanceMetrics.errorRate,
        recentOperationTimes: recentOperationTimes
      },

      // Recent events
      recentActivity: recentEvents,

      // Health recommendations
      recommendations: this.generatePoolRecommendations(metrics, config)
    };
  }

  generatePoolRecommendations(metrics, config) {
    const recommendations = [];
    const performance = metrics.performanceMetrics;

    // Pool size recommendations
    const utilization = performance.totalConnections / config.maxPoolSize;
    if (utilization > 0.9) {
      recommendations.push({
        type: 'pool_sizing',
        priority: 'high',
        message: 'Pool utilization > 90%. Consider increasing maxPoolSize.',
        suggestedValue: Math.ceil(config.maxPoolSize * 1.5)
      });
    } else if (utilization < 0.3) {
      recommendations.push({
        type: 'pool_sizing',
        priority: 'low',
        message: 'Pool utilization < 30%. Consider reducing maxPoolSize for resource efficiency.',
        suggestedValue: Math.ceil(config.maxPoolSize * 0.7)
      });
    }

    // Performance recommendations
    const avgOpTime = metrics.performanceMetrics.operationTime.length > 0 ?
      metrics.performanceMetrics.operationTime.reduce((sum, op) => sum + op.duration, 0) / 
      metrics.performanceMetrics.operationTime.length : 0;

    if (avgOpTime > 2000) {
      recommendations.push({
        type: 'performance',
        priority: 'high',
        message: 'Average operation time > 2 seconds. Check network latency and server performance.',
        currentValue: Math.round(avgOpTime)
      });
    }

    // Error rate recommendations
    if (performance.errorRate > 0.1) {
      recommendations.push({
        type: 'reliability',
        priority: 'high',
        message: 'High error rate detected. Check server connectivity and timeouts.',
        currentValue: Math.round(performance.errorRate * 100)
      });
    }

    // Health status recommendations
    if (metrics.healthStatus === 'unhealthy') {
      recommendations.push({
        type: 'health',
        priority: 'critical',
        message: 'Pool health is unhealthy. Immediate investigation required.'
      });
    }

    return recommendations.length > 0 ? recommendations : [
      { type: 'status', priority: 'info', message: 'Pool operating within normal parameters.' }
    ];
  }

  async performConnectionLoadTest(poolName, options = {}) {
    console.log(`Performing connection load test for pool: ${poolName}`);

    const {
      concurrentOperations = 20,
      operationDuration = 60000, // 1 minute
      operationType = 'ping'
    } = options;

    const client = this.clients.get(poolName);
    if (!client) {
      return { error: `Pool ${poolName} not found` };
    }

    const testStartTime = Date.now();
    const operationResults = [];
    const activeOperations = [];

    // Create concurrent operations
    for (let i = 0; i < concurrentOperations; i++) {
      const operation = this.performSingleOperation(client, operationType, i)
        .then(result => {
          operationResults.push(result);
        })
        .catch(error => {
          operationResults.push({ 
            operationId: i, 
            error: error.message, 
            success: false,
            timestamp: new Date()
          });
        });

      activeOperations.push(operation);
    }

    // Wait for all operations to complete or timeout
    try {
      await Promise.all(activeOperations);
    } catch (error) {
      console.warn('Some operations failed during load test:', error);
    }

    const testDuration = Date.now() - testStartTime;

    // Analyze results
    const successfulOperations = operationResults.filter(r => r.success);
    const failedOperations = operationResults.filter(r => !r.success);

    const loadTestResults = {
      poolName: poolName,
      testConfiguration: {
        concurrentOperations,
        operationDuration,
        operationType
      },
      results: {
        totalOperations: operationResults.length,
        successfulOperations: successfulOperations.length,
        failedOperations: failedOperations.length,
        successRate: (successfulOperations.length / operationResults.length) * 100,

        // Performance metrics
        averageResponseTime: successfulOperations.length > 0 ?
          successfulOperations.reduce((sum, op) => sum + op.responseTime, 0) / successfulOperations.length : 0,
        minResponseTime: successfulOperations.length > 0 ?
          Math.min(...successfulOperations.map(op => op.responseTime)) : 0,
        maxResponseTime: successfulOperations.length > 0 ?
          Math.max(...successfulOperations.map(op => op.responseTime)) : 0,

        // Test duration
        totalTestDuration: testDuration,
        operationsPerSecond: (operationResults.length / testDuration) * 1000
      },

      // Pool state during test
      poolMetrics: await this.getPoolMetrics(poolName),

      // Recommendations based on load test
      recommendations: this.generateLoadTestRecommendations(operationResults, concurrentOperations)
    };

    console.log(`Load test completed for ${poolName}:`, loadTestResults.results);
    return loadTestResults;
  }

  async performSingleOperation(client, operationType, operationId) {
    const startTime = Date.now();

    try {
      const db = client.db('admin');

      switch (operationType) {
        case 'ping':
          await db.command({ ping: 1 });
          break;
        case 'serverStatus':
          await db.command({ serverStatus: 1 });
          break;
        case 'listCollections':
          await db.listCollections().toArray();
          break;
        default:
          await db.command({ ping: 1 });
      }

      const responseTime = Date.now() - startTime;

      return {
        operationId,
        success: true,
        responseTime,
        timestamp: new Date()
      };

    } catch (error) {
      return {
        operationId,
        success: false,
        error: error.message,
        responseTime: Date.now() - startTime,
        timestamp: new Date()
      };
    }
  }

  generateLoadTestRecommendations(operationResults, concurrency) {
    const recommendations = [];
    const successRate = (operationResults.filter(r => r.success).length / operationResults.length) * 100;
    const avgResponseTime = operationResults
      .filter(r => r.success)
      .reduce((sum, op) => sum + op.responseTime, 0) / Math.max(1, operationResults.filter(r => r.success).length);

    if (successRate < 95) {
      recommendations.push({
        type: 'reliability',
        priority: 'high',
        message: `Success rate ${successRate.toFixed(1)}% is below target. Investigate connection failures.`
      });
    }

    if (avgResponseTime > 1000) {
      recommendations.push({
        type: 'performance', 
        priority: 'medium',
        message: `Average response time ${avgResponseTime.toFixed(0)}ms is high. Check network and server performance.`
      });
    }

    if (successRate > 99 && avgResponseTime < 100) {
      recommendations.push({
        type: 'scaling',
        priority: 'info',
        message: `Pool handles ${concurrency} concurrent operations well. Consider testing higher concurrency.`
      });
    }

    return recommendations;
  }

  async getAllPoolsStatus() {
    const poolsStatus = {};

    for (const poolName of this.clients.keys()) {
      try {
        poolsStatus[poolName] = await this.getPoolMetrics(poolName);
      } catch (error) {
        poolsStatus[poolName] = { error: error.message };
      }
    }

    return {
      timestamp: new Date(),
      pools: poolsStatus,
      summary: this.generateSystemSummary(poolsStatus)
    };
  }

  generateSystemSummary(poolsStatus) {
    const activePools = Object.keys(poolsStatus).length;
    let totalConnections = 0;
    let healthyPools = 0;
    let warnings = [];

    for (const [poolName, status] of Object.entries(poolsStatus)) {
      if (status.error) continue;

      totalConnections += status.connections?.total || 0;

      if (status.healthStatus === 'healthy') {
        healthyPools++;
      } else if (status.healthStatus !== 'healthy') {
        warnings.push(`Pool ${poolName} status: ${status.healthStatus}`);
      }

      // Check for high utilization
      if (status.connections?.utilization > 0.8) {
        warnings.push(`Pool ${poolName} utilization high: ${(status.connections.utilization * 100).toFixed(1)}%`);
      }
    }

    return {
      totalPools: activePools,
      healthyPools,
      totalConnections,
      systemHealth: healthyPools === activePools ? 'healthy' : 'degraded',
      warnings: warnings.length > 0 ? warnings : ['All systems operating normally']
    };
  }

  async shutdown() {
    console.log('Shutting down connection pool manager...');

    // Clear health check intervals
    for (const interval of this.healthCheckIntervals.values()) {
      clearInterval(interval);
    }
    this.healthCheckIntervals.clear();

    // Close all client connections
    for (const [poolName, client] of this.clients) {
      try {
        await client.close();
        console.log(`✅ Closed connection pool: ${poolName}`);
      } catch (error) {
        console.error(`Error closing pool ${poolName}:`, error);
      }
    }

    this.clients.clear();
    this.connectionMetrics.clear();
    console.log('Connection pool manager shutdown completed');
  }
}

// Export the connection pool manager
module.exports = { MongoConnectionPoolManager };

// Benefits of MongoDB Connection Pooling:
// - Native integration with MongoDB drivers eliminates external pooler complexity
// - Automatic connection lifecycle management with intelligent pool sizing
// - Built-in monitoring and health checking with comprehensive event tracking
// - Seamless integration with MongoDB replica sets and sharded clusters
// - Advanced performance optimization with compression and retry logic
// - Real-time connection pool metrics and health assessment
// - Production-ready failover and error handling capabilities
// - Integrated load testing and performance validation tools
// - Comprehensive configuration management across different environments
// - SQL-compatible connection management patterns through QueryLeaf integration

Understanding MongoDB Connection Pool Architecture

Advanced Connection Management Patterns

Implement sophisticated connection pool strategies for production deployments:

// Advanced connection pool patterns for production systems
class ProductionConnectionManager {
  constructor() {
    this.environmentPools = new Map();
    this.serviceRoutingMap = new Map();
    this.performanceProfiles = new Map();
    this.alertingSystem = null;
  }

  async initializeMultiEnvironmentPools() {
    console.log('Initializing multi-environment connection pools...');

    const environments = {
      // Production environment - high availability, strict consistency
      production: {
        primary: {
          uri: process.env.MONGODB_PRODUCTION_PRIMARY,
          options: {
            minPoolSize: 10,
            maxPoolSize: 100,
            maxIdleTimeMS: 30000,
            serverSelectionTimeoutMS: 5000,
            readPreference: 'primary',
            readConcern: { level: 'majority' },
            writeConcern: { w: 'majority', j: true },
            retryWrites: true,
            compressors: ['zlib'],
            appName: 'ProductionPrimary'
          }
        },
        secondary: {
          uri: process.env.MONGODB_PRODUCTION_SECONDARY,
          options: {
            minPoolSize: 5,
            maxPoolSize: 50,
            maxIdleTimeMS: 60000,
            readPreference: 'secondary',
            readConcern: { level: 'available' },
            writeConcern: { w: 1, j: false },
            compressors: ['zlib'],
            appName: 'ProductionSecondary'
          }
        }
      },

      // Staging environment - production-like with relaxed constraints
      staging: {
        primary: {
          uri: process.env.MONGODB_STAGING_URI,
          options: {
            minPoolSize: 3,
            maxPoolSize: 30,
            maxIdleTimeMS: 45000,
            serverSelectionTimeoutMS: 10000,
            readPreference: 'primaryPreferred',
            readConcern: { level: 'local' },
            writeConcern: { w: 'majority', j: false },
            appName: 'StagingEnvironment'
          }
        }
      },

      // Development environment - minimal resources, fast iteration
      development: {
        primary: {
          uri: process.env.MONGODB_DEV_URI || 'mongodb://localhost:27017/development',
          options: {
            minPoolSize: 1,
            maxPoolSize: 10,
            maxIdleTimeMS: 120000,
            serverSelectionTimeoutMS: 15000,
            readPreference: 'primaryPreferred',
            readConcern: { level: 'local' },
            writeConcern: { w: 1, j: false },
            appName: 'DevelopmentEnvironment'
          }
        }
      }
    };

    for (const [envName, envConfig] of Object.entries(environments)) {
      const envPools = new Map();

      for (const [poolType, poolConfig] of Object.entries(envConfig)) {
        try {
          const client = new MongoClient(poolConfig.uri, poolConfig.options);
          await client.connect();

          envPools.set(poolType, {
            client: client,
            config: poolConfig.options,
            healthStatus: 'initializing',
            createdAt: new Date()
          });

          console.log(`✅ Connected to ${envName}.${poolType}`);

        } catch (error) {
          console.error(`Failed to connect to ${envName}.${poolType}:`, error);
          envPools.set(poolType, {
            client: null,
            config: poolConfig.options,
            healthStatus: 'failed',
            error: error.message,
            createdAt: new Date()
          });
        }
      }

      this.environmentPools.set(envName, envPools);
    }

    // Initialize service-specific routing
    await this.setupServiceRouting();

    console.log('Multi-environment connection pools initialized');
    return this.getEnvironmentSummary();
  }

  async setupServiceRouting() {
    console.log('Setting up service-specific connection routing...');

    // Define service routing patterns
    const serviceRoutingConfig = {
      // Write-heavy services use primary connections
      'user-service': { 
        environment: 'production', 
        pool: 'primary',
        profile: 'write-heavy'
      },
      'order-service': { 
        environment: 'production', 
        pool: 'primary',
        profile: 'transactional'
      },

      // Read-heavy services can use secondary connections
      'analytics-service': { 
        environment: 'production', 
        pool: 'secondary',
        profile: 'read-heavy'
      },
      'reporting-service': { 
        environment: 'production', 
        pool: 'secondary',
        profile: 'batch-read'
      },

      // Development services
      'test-service': { 
        environment: 'development', 
        pool: 'primary',
        profile: 'development'
      }
    };

    for (const [serviceName, routing] of Object.entries(serviceRoutingConfig)) {
      this.serviceRoutingMap.set(serviceName, routing);
    }

    // Define performance profiles for different service types
    this.performanceProfiles.set('write-heavy', {
      expectedOperationsPerSecond: 1000,
      maxAcceptableLatency: 50,
      errorThreshold: 0.01,
      poolUtilizationTarget: 0.7
    });

    this.performanceProfiles.set('read-heavy', {
      expectedOperationsPerSecond: 5000,
      maxAcceptableLatency: 20,
      errorThreshold: 0.005,
      poolUtilizationTarget: 0.8
    });

    this.performanceProfiles.set('transactional', {
      expectedOperationsPerSecond: 500,
      maxAcceptableLatency: 100,
      errorThreshold: 0.001,
      poolUtilizationTarget: 0.6
    });

    this.performanceProfiles.set('batch-read', {
      expectedOperationsPerSecond: 100,
      maxAcceptableLatency: 1000,
      errorThreshold: 0.05,
      poolUtilizationTarget: 0.9
    });

    console.log('Service routing configuration completed');
  }

  getConnectionForService(serviceName) {
    const routing = this.serviceRoutingMap.get(serviceName);
    if (!routing) {
      throw new Error(`No routing configuration found for service: ${serviceName}`);
    }

    const envPools = this.environmentPools.get(routing.environment);
    if (!envPools) {
      throw new Error(`Environment ${routing.environment} not found`);
    }

    const poolInfo = envPools.get(routing.pool);
    if (!poolInfo || !poolInfo.client) {
      throw new Error(`Pool ${routing.pool} not available in ${routing.environment}`);
    }

    if (poolInfo.healthStatus !== 'healthy' && poolInfo.healthStatus !== 'initializing') {
      console.warn(`Using potentially unhealthy connection for ${serviceName}: ${poolInfo.healthStatus}`);
    }

    return {
      client: poolInfo.client,
      routing: routing,
      profile: this.performanceProfiles.get(routing.profile),
      poolInfo: poolInfo
    };
  }

  async performComprehensiveHealthCheck() {
    console.log('Performing comprehensive health check across all connection pools...');

    const healthReport = {
      timestamp: new Date(),
      environments: {},
      overallHealth: 'healthy',
      criticalIssues: [],
      warnings: [],
      recommendations: []
    };

    for (const [envName, envPools] of this.environmentPools) {
      const envHealth = {
        pools: {},
        healthStatus: 'healthy',
        totalConnections: 0,
        activeConnections: 0
      };

      for (const [poolType, poolInfo] of envPools) {
        if (!poolInfo.client) {
          envHealth.pools[poolType] = {
            status: 'failed',
            error: poolInfo.error || 'Client not initialized',
            lastChecked: new Date()
          };
          envHealth.healthStatus = 'degraded';
          continue;
        }

        try {
          const startTime = Date.now();
          const db = poolInfo.client.db('admin');

          // Perform health check operations
          await db.command({ ping: 1 });
          const serverStatus = await db.command({ serverStatus: 1 });
          const responseTime = Date.now() - startTime;

          // Extract connection metrics from server status
          const connections = serverStatus.connections || {};
          const opcounters = serverStatus.opcounters || {};

          envHealth.pools[poolType] = {
            status: responseTime < 1000 ? 'healthy' : 'slow',
            responseTime: responseTime,
            connections: {
              current: connections.current || 0,
              available: connections.available || 0,
              totalCreated: connections.totalCreated || 0
            },
            operations: {
              insert: opcounters.insert || 0,
              query: opcounters.query || 0,
              update: opcounters.update || 0,
              delete: opcounters.delete || 0
            },
            serverInfo: {
              version: serverStatus.version,
              uptime: serverStatus.uptime,
              host: serverStatus.host
            },
            lastChecked: new Date()
          };

          envHealth.totalConnections += connections.current || 0;

          if (responseTime > 2000) {
            healthReport.warnings.push(`Slow response time in ${envName}.${poolType}: ${responseTime}ms`);
          }

          if ((connections.available || 0) < 10) {
            healthReport.criticalIssues.push(`Low available connections in ${envName}.${poolType}: ${connections.available}`);
            envHealth.healthStatus = 'critical';
          }

        } catch (error) {
          envHealth.pools[poolType] = {
            status: 'unhealthy',
            error: error.message,
            lastChecked: new Date()
          };
          envHealth.healthStatus = 'degraded';
          healthReport.criticalIssues.push(`Health check failed for ${envName}.${poolType}: ${error.message}`);
        }
      }

      healthReport.environments[envName] = envHealth;
    }

    // Determine overall system health
    const hasCriticalIssues = healthReport.criticalIssues.length > 0;
    const hasWarnings = healthReport.warnings.length > 0;

    if (hasCriticalIssues) {
      healthReport.overallHealth = 'critical';
    } else if (hasWarnings) {
      healthReport.overallHealth = 'warning';
    } else {
      healthReport.overallHealth = 'healthy';
    }

    // Generate recommendations
    healthReport.recommendations = this.generateSystemRecommendations(healthReport);

    console.log(`Health check completed. Overall status: ${healthReport.overallHealth}`);
    return healthReport;
  }

  generateSystemRecommendations(healthReport) {
    const recommendations = [];

    // Check for connection pool sizing issues
    for (const [envName, envHealth] of Object.entries(healthReport.environments)) {
      for (const [poolType, poolHealth] of Object.entries(envHealth.pools)) {
        if (poolHealth.status === 'healthy' && poolHealth.connections) {
          const utilization = poolHealth.connections.current / 
            (poolHealth.connections.current + poolHealth.connections.available);

          if (utilization > 0.9) {
            recommendations.push({
              type: 'scaling',
              priority: 'high',
              environment: envName,
              pool: poolType,
              message: `Pool utilization ${(utilization * 100).toFixed(1)}% is very high. Consider increasing pool size.`
            });
          } else if (utilization < 0.2) {
            recommendations.push({
              type: 'optimization',
              priority: 'low',
              environment: envName,
              pool: poolType,
              message: `Pool utilization ${(utilization * 100).toFixed(1)}% is low. Consider reducing pool size for efficiency.`
            });
          }
        }
      }
    }

    // Performance-based recommendations
    if (healthReport.overallHealth === 'warning') {
      recommendations.push({
        type: 'monitoring',
        priority: 'medium',
        message: 'System has performance warnings. Increase monitoring frequency and consider scaling.'
      });
    }

    if (healthReport.criticalIssues.length > 0) {
      recommendations.push({
        type: 'immediate_action',
        priority: 'critical',
        message: 'Critical issues detected. Immediate investigation and resolution required.'
      });
    }

    return recommendations.length > 0 ? recommendations : [
      { type: 'status', priority: 'info', message: 'All connection pools operating optimally.' }
    ];
  }

  getEnvironmentSummary() {
    const summary = {
      environments: Array.from(this.environmentPools.keys()),
      totalPools: 0,
      healthyPools: 0,
      services: Array.from(this.serviceRoutingMap.keys()),
      profiles: Array.from(this.performanceProfiles.keys())
    };

    for (const envPools of this.environmentPools.values()) {
      for (const poolInfo of envPools.values()) {
        summary.totalPools++;
        if (poolInfo.healthStatus === 'healthy' || poolInfo.healthStatus === 'initializing') {
          summary.healthyPools++;
        }
      }
    }

    return summary;
  }
}

// Export the production connection manager
module.exports = { ProductionConnectionManager };

SQL-Style Connection Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB connection pool configuration and monitoring:

-- QueryLeaf connection pool management with SQL-familiar syntax

-- Configure connection pool settings
SET connection_pool_min_size = 5;
SET connection_pool_max_size = 50;
SET connection_pool_idle_timeout = 30000; -- milliseconds
SET connection_pool_server_selection_timeout = 10000;

-- Connection pool status monitoring
SELECT 
  pool_name,
  environment,

  -- Connection metrics
  total_connections,
  active_connections,
  available_connections,

  -- Pool configuration
  min_pool_size,
  max_pool_size,
  idle_timeout_ms,

  -- Performance metrics
  ROUND(connection_utilization::NUMERIC * 100, 2) as utilization_percent,
  ROUND(avg_operation_time::NUMERIC, 2) as avg_operation_time_ms,

  -- Health status
  health_status,
  last_health_check,

  -- Connection lifecycle
  connections_created,
  connections_destroyed,

  -- Error metrics
  ROUND(error_rate::NUMERIC * 100, 4) as error_rate_percent,
  checkout_failures,

  -- Status classification
  CASE 
    WHEN health_status = 'healthy' AND connection_utilization < 0.8 THEN 'optimal'
    WHEN health_status = 'healthy' AND connection_utilization >= 0.8 THEN 'high_load'
    WHEN health_status = 'warning' THEN 'needs_attention'  
    WHEN health_status = 'unhealthy' THEN 'critical'
    ELSE 'unknown'
  END as pool_status_category

FROM mongodb_connection_pools
ORDER BY environment, pool_name;

-- Connection pool performance analysis over time
WITH pool_metrics_hourly AS (
  SELECT 
    pool_name,
    DATE_TRUNC('hour', metric_timestamp) as hour_bucket,

    -- Aggregated metrics
    AVG(active_connections) as avg_active_connections,
    MAX(active_connections) as peak_active_connections,
    AVG(connection_utilization) as avg_utilization,
    MAX(connection_utilization) as peak_utilization,

    -- Performance indicators
    AVG(avg_operation_time) as avg_operation_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY avg_operation_time) as p95_operation_time,

    -- Error rates
    AVG(error_rate) as avg_error_rate,
    SUM(checkout_failures) as total_checkout_failures,

    -- Connection lifecycle
    SUM(connections_created) as connections_created_hourly,
    SUM(connections_destroyed) as connections_destroyed_hourly

  FROM connection_pool_metrics
  WHERE metric_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY pool_name, DATE_TRUNC('hour', metric_timestamp)
),

performance_trends AS (
  SELECT 
    *,
    -- Calculate hourly trends
    LAG(avg_utilization) OVER (PARTITION BY pool_name ORDER BY hour_bucket) as prev_utilization,
    LAG(avg_operation_time) OVER (PARTITION BY pool_name ORDER BY hour_bucket) as prev_operation_time,

    -- Performance change indicators
    CASE 
      WHEN avg_utilization - LAG(avg_utilization) OVER (PARTITION BY pool_name ORDER BY hour_bucket) > 0.1 
      THEN 'utilization_spike'
      WHEN avg_operation_time - LAG(avg_operation_time) OVER (PARTITION BY pool_name ORDER BY hour_bucket) > 100
      THEN 'latency_spike'
      ELSE 'stable'
    END as performance_change

  FROM pool_metrics_hourly
)

SELECT 
  pool_name,
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as analysis_hour,

  -- Connection utilization
  ROUND(avg_utilization::NUMERIC * 100, 1) as avg_utilization_pct,
  ROUND(peak_utilization::NUMERIC * 100, 1) as peak_utilization_pct,

  -- Connection activity
  ROUND(avg_active_connections::NUMERIC, 1) as avg_active_connections,
  peak_active_connections,

  -- Performance metrics  
  ROUND(avg_operation_time::NUMERIC, 2) as avg_operation_time_ms,
  ROUND(p95_operation_time::NUMERIC, 2) as p95_operation_time_ms,

  -- Reliability metrics
  ROUND(avg_error_rate::NUMERIC * 100, 4) as avg_error_rate_pct,
  total_checkout_failures,

  -- Connection churn
  connections_created_hourly,
  connections_destroyed_hourly,
  connections_created_hourly - connections_destroyed_hourly as net_connection_change,

  -- Performance analysis
  performance_change,

  -- Health assessment
  CASE 
    WHEN avg_error_rate > 0.01 THEN 'high_error_rate'
    WHEN peak_utilization > 0.9 THEN 'utilization_critical'
    WHEN avg_operation_time > 500 THEN 'high_latency'
    WHEN total_checkout_failures > 10 THEN 'checkout_issues'
    ELSE 'healthy'
  END as health_indicator,

  -- Optimization recommendations
  CASE 
    WHEN peak_utilization > 0.9 AND performance_change = 'utilization_spike' 
    THEN 'Increase pool size immediately'
    WHEN avg_operation_time > 1000
    THEN 'Investigate database performance'
    WHEN total_checkout_failures > 50
    THEN 'Review timeout configuration'
    WHEN avg_utilization < 0.2 AND connections_created_hourly < 5
    THEN 'Consider reducing pool size'
    ELSE 'Performance within acceptable ranges'
  END as optimization_recommendation

FROM performance_trends
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY pool_name, hour_bucket DESC;

-- Connection pool load testing and capacity planning
CREATE VIEW connection_pool_load_test AS
WITH load_test_scenarios AS (
  SELECT 
    scenario_name,
    concurrent_connections,
    operation_type,
    test_duration_minutes,
    target_operations_per_second
  FROM (VALUES 
    ('normal_load', 20, 'mixed', 10, 100),
    ('peak_load', 50, 'mixed', 10, 500),
    ('stress_test', 100, 'write_heavy', 5, 1000),
    ('sustained_read', 30, 'read_only', 30, 200)
  ) AS scenarios(scenario_name, concurrent_connections, operation_type, test_duration_minutes, target_operations_per_second)
),

load_test_results AS (
  SELECT 
    ltr.pool_name,
    ltr.scenario_name,
    lts.concurrent_connections,
    lts.operation_type,
    lts.test_duration_minutes,

    -- Performance results
    ltr.actual_operations_per_second,
    ltr.success_rate,
    ltr.avg_response_time_ms,
    ltr.p95_response_time_ms,
    ltr.max_response_time_ms,

    -- Resource utilization during test
    ltr.peak_connections_used,
    ltr.avg_connections_used,
    ltr.connection_utilization_peak,

    -- Error analysis
    ltr.total_errors,
    ltr.timeout_errors,
    ltr.connection_errors,

    -- Performance vs target
    ROUND((ltr.actual_operations_per_second / lts.target_operations_per_second::DECIMAL * 100)::NUMERIC, 1) 
      as performance_vs_target_pct,

    -- Load test assessment
    CASE 
      WHEN ltr.success_rate >= 99.5 AND ltr.avg_response_time_ms <= 100 THEN 'excellent'
      WHEN ltr.success_rate >= 95 AND ltr.avg_response_time_ms <= 500 THEN 'good'
      WHEN ltr.success_rate >= 90 AND ltr.avg_response_time_ms <= 1000 THEN 'acceptable'
      ELSE 'poor'
    END as performance_rating

  FROM load_test_results ltr
  JOIN load_test_scenarios lts ON ltr.scenario_name = lts.scenario_name
)

SELECT 
  pool_name,
  scenario_name,
  concurrent_connections,
  operation_type,

  -- Performance metrics
  actual_operations_per_second,
  performance_vs_target_pct,
  ROUND(success_rate::NUMERIC, 2) as success_rate_pct,

  -- Response time analysis
  avg_response_time_ms,
  p95_response_time_ms,
  max_response_time_ms,

  -- Resource utilization
  peak_connections_used,
  ROUND(connection_utilization_peak::NUMERIC * 100, 1) as peak_utilization_pct,

  -- Error analysis
  total_errors,
  timeout_errors,
  connection_errors,

  -- Performance assessment
  performance_rating,

  -- Capacity recommendations
  CASE performance_rating
    WHEN 'excellent' THEN 
      CONCAT('Pool can handle ', concurrent_connections + 20, ' concurrent connections')
    WHEN 'good' THEN 
      'Current capacity appropriate for this load pattern'
    WHEN 'acceptable' THEN
      'Monitor closely under sustained load'
    ELSE 
      CONCAT('Increase pool size or optimize operations for ', concurrent_connections, ' concurrent users')
  END as capacity_recommendation,

  -- Scaling suggestions
  CASE 
    WHEN connection_utilization_peak > 0.9 AND performance_rating IN ('good', 'excellent')
    THEN 'Pool size optimally configured'
    WHEN connection_utilization_peak > 0.9 AND performance_rating = 'poor'
    THEN 'Increase pool size significantly'
    WHEN connection_utilization_peak < 0.5
    THEN 'Pool may be oversized for this workload'
    ELSE 'Pool sizing appears appropriate'
  END as sizing_recommendation

FROM load_test_results
ORDER BY pool_name, concurrent_connections;

-- Real-time connection pool alerting
CREATE VIEW connection_pool_alerts AS
WITH current_pool_status AS (
  SELECT 
    pool_name,
    environment,
    health_status,
    connection_utilization,
    avg_operation_time,
    error_rate,
    available_connections,
    checkout_failures,
    last_health_check,

    -- Time since last health check
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_health_check)) as seconds_since_health_check

  FROM mongodb_connection_pools
  WHERE enabled = true
),

alert_conditions AS (
  SELECT 
    *,
    -- Define alert conditions
    CASE 
      WHEN health_status = 'unhealthy' THEN 'CRITICAL'
      WHEN connection_utilization > 0.95 THEN 'CRITICAL'
      WHEN available_connections < 2 THEN 'CRITICAL'
      WHEN seconds_since_health_check > 300 THEN 'CRITICAL' -- 5 minutes
      WHEN error_rate > 0.05 THEN 'HIGH'
      WHEN connection_utilization > 0.85 THEN 'HIGH'
      WHEN avg_operation_time > 2000 THEN 'HIGH'
      WHEN checkout_failures > 10 THEN 'MEDIUM'
      WHEN connection_utilization > 0.75 THEN 'MEDIUM'
      WHEN avg_operation_time > 1000 THEN 'MEDIUM'
      ELSE 'LOW'
    END as alert_severity,

    -- Generate alert messages
    ARRAY_REMOVE(ARRAY[
      CASE WHEN health_status = 'unhealthy' THEN 'Pool health check failing' END,
      CASE WHEN connection_utilization > 0.95 THEN 'Connection utilization critical' END,
      CASE WHEN available_connections < 2 THEN 'Available connections critically low' END,
      CASE WHEN error_rate > 0.05 THEN 'High error rate detected' END,
      CASE WHEN avg_operation_time > 2000 THEN 'High operation latency' END,
      CASE WHEN checkout_failures > 10 THEN 'Connection checkout failures' END,
      CASE WHEN seconds_since_health_check > 300 THEN 'Health check timeout' END
    ], NULL) as alert_reasons

  FROM current_pool_status
)

SELECT 
  pool_name,
  environment,
  alert_severity,
  alert_reasons,

  -- Current metrics for context
  ROUND(connection_utilization::NUMERIC * 100, 1) as utilization_pct,
  available_connections,
  ROUND(avg_operation_time::NUMERIC, 0) as avg_operation_time_ms,
  ROUND(error_rate::NUMERIC * 100, 2) as error_rate_pct,
  checkout_failures,

  -- Time context
  TO_CHAR(last_health_check, 'YYYY-MM-DD HH24:MI:SS') as last_health_check,
  ROUND(seconds_since_health_check::NUMERIC, 0) as seconds_since_health_check,

  -- Alert priority for incident management
  CASE alert_severity
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2  
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END as alert_priority,

  -- Immediate action recommendations
  CASE alert_severity
    WHEN 'CRITICAL' THEN 'Immediate investigation required - potential service impact'
    WHEN 'HIGH' THEN 'Investigation needed within 15 minutes'
    WHEN 'MEDIUM' THEN 'Review within 1 hour'
    ELSE 'Monitor - no immediate action required'
  END as action_required

FROM alert_conditions
WHERE alert_severity != 'LOW'
ORDER BY alert_priority, pool_name;

-- QueryLeaf provides comprehensive connection pool management:
-- 1. SQL-familiar configuration syntax for pool sizing and timeouts
-- 2. Real-time monitoring with performance metrics and health indicators
-- 3. Historical analysis with trend detection and capacity planning
-- 4. Load testing capabilities with automated performance assessment
-- 5. Intelligent alerting with severity classification and action recommendations
-- 6. Multi-environment pool management with service routing optimization
-- 7. Production-ready monitoring with comprehensive error tracking
-- 8. Automated recommendations for scaling and optimization decisions
-- 9. Integration with MongoDB's native connection pool capabilities
-- 10. Enterprise-grade connection management with familiar SQL patterns

Best Practices for MongoDB Connection Pool Implementation

Connection Pool Sizing and Configuration

Essential practices for production connection pool deployments:

Right-Size Pool Limits: Configure minPoolSize and maxPoolSize based on actual concurrent load patterns
Timeout Management: Set appropriate timeouts for connection creation, idle time, and server selection
Environment-Specific Tuning: Use different pool configurations for production, staging, and development environments
Monitoring Integration: Implement comprehensive monitoring with health checks and performance metrics
Failover Planning: Configure connection pools to handle replica set failovers gracefully
Resource Optimization: Balance connection pool sizes with available system resources and database capacity

Performance Optimization Strategies

Optimize connection pools for maximum application performance:

Connection Reuse: Design application patterns that maximize connection reuse and minimize churn
Read Preference Strategy: Use appropriate read preferences to distribute load across replica set members
Write Concern Optimization: Configure write concerns that balance durability requirements with performance
Compression Settings: Enable compression for high-latency networks to improve throughput
Application-Level Pooling: Implement service-specific connection routing for optimal resource utilization
Load Testing: Regularly validate pool performance under realistic load conditions

Conclusion

MongoDB connection pooling provides comprehensive database connection management that eliminates the complexity and overhead of external connection pooling solutions. The integration of intelligent pool sizing, automatic health monitoring, and seamless replica set integration enables high-performance applications that scale efficiently with growing user demands.

Key MongoDB connection pooling benefits include:

Native Integration: Built-in connection pool management in MongoDB drivers eliminates external infrastructure
Intelligent Sizing: Automatic pool sizing based on application load with configurable limits and behaviors
Health Monitoring: Real-time connection health tracking with automatic failover and recovery capabilities
Performance Optimization: Advanced features like compression, retry logic, and read preference routing
Production Ready: Enterprise-grade monitoring, alerting, and capacity planning capabilities
SQL Compatibility: Familiar connection management patterns accessible through SQL-style operations

Whether you're building microservices architectures, high-throughput web applications, or data-intensive analytics platforms, MongoDB connection pooling with QueryLeaf's SQL-familiar interface provides the foundation for scalable database connectivity that maintains high performance while simplifying operational complexity.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB connection pools while providing SQL-familiar syntax for pool configuration, monitoring, and optimization. Advanced connection management patterns, load testing capabilities, and production-ready alerting are seamlessly accessible through familiar SQL constructs, making sophisticated database connection management both powerful and approachable for SQL-oriented teams.

The combination of MongoDB's intelligent connection pooling with familiar SQL-style management makes it an ideal platform for applications that require both high-performance database connectivity and operational simplicity, ensuring your database infrastructure scales efficiently while maintaining familiar development and operational patterns.

November 16, 2025
28 min read

MongoDB Schema Validation and Data Quality Management: Enterprise Data Integrity and Governance

Enterprise applications demand rigorous data quality standards to ensure compliance with regulatory requirements, maintain data integrity across distributed systems, and support reliable business intelligence and analytics. Traditional relational databases enforce data quality through rigid schema constraints, foreign key relationships, and check constraints, but these approaches often lack the flexibility required for modern applications dealing with evolving data structures and diverse data sources.

MongoDB Schema Validation provides comprehensive data quality management capabilities that combine flexible document validation rules with sophisticated data governance patterns. Unlike traditional database systems that require extensive schema migrations and rigid constraints, MongoDB's validation framework enables adaptive data quality enforcement that evolves with changing business requirements while maintaining enterprise-grade compliance and governance standards.

The Traditional Data Quality Challenge

Relational database data quality management often involves complex constraint management and limited flexibility:

-- Traditional PostgreSQL data quality management - rigid and maintenance-heavy

-- Customer data with extensive validation rules
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    company_name VARCHAR(500) NOT NULL,
    legal_entity_type VARCHAR(50) NOT NULL,

    -- Contact information with validation
    primary_email VARCHAR(320) NOT NULL,
    secondary_email VARCHAR(320),
    phone_primary VARCHAR(20) NOT NULL,
    phone_secondary VARCHAR(20),

    -- Address validation
    billing_address_line1 VARCHAR(200) NOT NULL,
    billing_address_line2 VARCHAR(200),
    billing_city VARCHAR(100) NOT NULL,
    billing_state VARCHAR(50) NOT NULL,
    billing_postal_code VARCHAR(20) NOT NULL,
    billing_country VARCHAR(3) NOT NULL DEFAULT 'USA',

    shipping_address_line1 VARCHAR(200),
    shipping_address_line2 VARCHAR(200),
    shipping_city VARCHAR(100),
    shipping_state VARCHAR(50),
    shipping_postal_code VARCHAR(20),
    shipping_country VARCHAR(3),

    -- Business information
    tax_id VARCHAR(50),
    business_registration_number VARCHAR(100),
    industry_code VARCHAR(10),
    annual_revenue DECIMAL(15,2),
    employee_count INTEGER,

    -- Account status and compliance
    account_status VARCHAR(20) NOT NULL DEFAULT 'active',
    credit_limit DECIMAL(12,2) DEFAULT 0.00,
    payment_terms INTEGER DEFAULT 30,

    -- Regulatory compliance fields
    gdpr_consent BOOLEAN DEFAULT false,
    gdpr_consent_date TIMESTAMP,
    ccpa_opt_out BOOLEAN DEFAULT false,
    data_retention_category VARCHAR(50) DEFAULT 'standard',

    -- Audit fields
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    created_by UUID NOT NULL,
    updated_by UUID NOT NULL,

    -- Complex constraint validation
    CONSTRAINT chk_email_format 
        CHECK (primary_email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
    CONSTRAINT chk_secondary_email_format 
        CHECK (secondary_email IS NULL OR secondary_email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
    CONSTRAINT chk_phone_format 
        CHECK (phone_primary ~ '^\+?[1-9]\d{1,14}$'),
    CONSTRAINT chk_postal_code_format 
        CHECK (
            (billing_country = 'USA' AND billing_postal_code ~ '^\d{5}(-\d{4})?$') OR
            (billing_country = 'CAN' AND billing_postal_code ~ '^[A-Z]\d[A-Z] ?\d[A-Z]\d$') OR
            (billing_country != 'USA' AND billing_country != 'CAN')
        ),
    CONSTRAINT chk_account_status 
        CHECK (account_status IN ('active', 'suspended', 'closed', 'pending_approval')),
    CONSTRAINT chk_legal_entity_type 
        CHECK (legal_entity_type IN ('corporation', 'llc', 'partnership', 'sole_proprietorship', 'non_profit')),
    CONSTRAINT chk_revenue_positive 
        CHECK (annual_revenue IS NULL OR annual_revenue >= 0),
    CONSTRAINT chk_employee_count_positive 
        CHECK (employee_count IS NULL OR employee_count >= 0),
    CONSTRAINT chk_credit_limit_positive 
        CHECK (credit_limit >= 0),
    CONSTRAINT chk_payment_terms_valid 
        CHECK (payment_terms IN (15, 30, 45, 60, 90)),
    CONSTRAINT chk_gdpr_consent_date 
        CHECK (gdpr_consent = false OR gdpr_consent_date IS NOT NULL),

    -- Foreign key constraints
    CONSTRAINT fk_created_by FOREIGN KEY (created_by) REFERENCES users(user_id),
    CONSTRAINT fk_updated_by FOREIGN KEY (updated_by) REFERENCES users(user_id)
);

-- Additional validation through triggers for complex business rules
CREATE OR REPLACE FUNCTION validate_customer_data()
RETURNS TRIGGER AS $$
BEGIN
    -- Validate business registration requirements
    IF NEW.annual_revenue > 1000000 AND NEW.business_registration_number IS NULL THEN
        RAISE EXCEPTION 'Business registration number required for companies with revenue > $1M';
    END IF;

    -- Validate tax ID requirements
    IF NEW.legal_entity_type IN ('corporation', 'llc') AND NEW.tax_id IS NULL THEN
        RAISE EXCEPTION 'Tax ID required for corporations and LLCs';
    END IF;

    -- Validate shipping address consistency
    IF NEW.shipping_address_line1 IS NOT NULL THEN
        IF NEW.shipping_city IS NULL OR NEW.shipping_state IS NULL OR NEW.shipping_postal_code IS NULL THEN
            RAISE EXCEPTION 'Complete shipping address required when shipping address is provided';
        END IF;
    END IF;

    -- Industry-specific validation
    IF NEW.industry_code IS NOT NULL AND NOT EXISTS (
        SELECT 1 FROM industry_codes WHERE code = NEW.industry_code AND active = true
    ) THEN
        RAISE EXCEPTION 'Invalid or inactive industry code: %', NEW.industry_code;
    END IF;

    -- Credit limit validation based on business tier
    IF NEW.annual_revenue IS NOT NULL THEN
        CASE 
            WHEN NEW.annual_revenue < 100000 AND NEW.credit_limit > 10000 THEN
                RAISE EXCEPTION 'Credit limit too high for small business tier';
            WHEN NEW.annual_revenue < 1000000 AND NEW.credit_limit > 50000 THEN
                RAISE EXCEPTION 'Credit limit too high for medium business tier';
            WHEN NEW.annual_revenue >= 1000000 AND NEW.credit_limit > 500000 THEN
                RAISE EXCEPTION 'Credit limit exceeds maximum allowed';
        END CASE;
    END IF;

    -- Data retention policy validation
    IF NEW.data_retention_category NOT IN ('standard', 'extended', 'permanent', 'gdpr_restricted') THEN
        RAISE EXCEPTION 'Invalid data retention category';
    END IF;

    -- Update audit fields
    NEW.updated_at = CURRENT_TIMESTAMP;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_validation_trigger
    BEFORE INSERT OR UPDATE ON customers
    FOR EACH ROW
    EXECUTE FUNCTION validate_customer_data();

-- Comprehensive data quality monitoring
CREATE VIEW customer_data_quality_report AS
WITH validation_checks AS (
    SELECT 
        customer_id,
        company_name,

        -- Email validation
        CASE WHEN primary_email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' 
             THEN 'Valid' ELSE 'Invalid' END as primary_email_quality,
        CASE WHEN secondary_email IS NULL OR secondary_email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' 
             THEN 'Valid' ELSE 'Invalid' END as secondary_email_quality,

        -- Phone validation
        CASE WHEN phone_primary ~ '^\+?[1-9]\d{1,14}$' 
             THEN 'Valid' ELSE 'Invalid' END as phone_quality,

        -- Address completeness
        CASE WHEN billing_address_line1 IS NOT NULL AND billing_city IS NOT NULL AND 
                  billing_state IS NOT NULL AND billing_postal_code IS NOT NULL 
             THEN 'Complete' ELSE 'Incomplete' END as billing_address_quality,

        -- Business data completeness
        CASE WHEN (legal_entity_type IN ('corporation', 'llc') AND tax_id IS NOT NULL) OR
                  (legal_entity_type NOT IN ('corporation', 'llc')) 
             THEN 'Valid' ELSE 'Missing Tax ID' END as tax_compliance_quality,

        -- GDPR compliance
        CASE WHEN gdpr_consent = true AND gdpr_consent_date IS NOT NULL 
             THEN 'Compliant' ELSE 'Non-Compliant' END as gdpr_compliance_quality,

        -- Data freshness
        CASE WHEN updated_at >= CURRENT_TIMESTAMP - INTERVAL '90 days' 
             THEN 'Fresh' ELSE 'Stale' END as data_freshness_quality
    FROM customers
),
quality_scores AS (
    SELECT *,
        -- Calculate overall quality score (0-100)
        (
            (CASE WHEN primary_email_quality = 'Valid' THEN 15 ELSE 0 END) +
            (CASE WHEN secondary_email_quality = 'Valid' THEN 5 ELSE 0 END) +
            (CASE WHEN phone_quality = 'Valid' THEN 10 ELSE 0 END) +
            (CASE WHEN billing_address_quality = 'Complete' THEN 20 ELSE 0 END) +
            (CASE WHEN tax_compliance_quality = 'Valid' THEN 25 ELSE 0 END) +
            (CASE WHEN gdpr_compliance_quality = 'Compliant' THEN 15 ELSE 0 END) +
            (CASE WHEN data_freshness_quality = 'Fresh' THEN 10 ELSE 0 END)
        ) as overall_quality_score
    FROM validation_checks
)
SELECT 
    customer_id,
    company_name,
    overall_quality_score,

    -- Quality classification
    CASE 
        WHEN overall_quality_score >= 90 THEN 'Excellent'
        WHEN overall_quality_score >= 75 THEN 'Good'
        WHEN overall_quality_score >= 60 THEN 'Fair'
        ELSE 'Poor'
    END as quality_rating,

    -- Specific quality issues
    CASE WHEN primary_email_quality = 'Invalid' THEN 'Fix primary email format' END as primary_issue,
    CASE WHEN billing_address_quality = 'Incomplete' THEN 'Complete billing address' END as address_issue,
    CASE WHEN tax_compliance_quality = 'Missing Tax ID' THEN 'Add required tax ID' END as compliance_issue,
    CASE WHEN gdpr_compliance_quality = 'Non-Compliant' THEN 'Update GDPR consent' END as gdpr_issue,
    CASE WHEN data_freshness_quality = 'Stale' THEN 'Data needs refresh' END as freshness_issue

FROM quality_scores
ORDER BY overall_quality_score ASC;

-- Problems with traditional data quality management:
-- 1. Rigid schema constraints that are difficult to modify as requirements evolve
-- 2. Complex trigger-based validation that is hard to maintain and debug
-- 3. Limited support for nested data structures and dynamic field validation
-- 4. Extensive migration requirements when adding new validation rules
-- 5. Performance overhead from complex constraint checking during writes
-- 6. Difficulty handling semi-structured data with varying field requirements
-- 7. Limited flexibility for different validation rules across data sources
-- 8. Complex reporting and monitoring of data quality across multiple tables
-- 9. Difficulty implementing conditional validation based on document context
-- 10. Expensive maintenance of validation logic across application and database layers

MongoDB provides flexible and comprehensive data quality management:

// MongoDB Schema Validation - flexible and comprehensive data quality management
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_data_platform');

// Advanced Schema Validation and Data Quality Management
class MongoSchemaValidator {
  constructor(db) {
    this.db = db;
    this.validationRules = new Map();
    this.qualityMetrics = new Map();
    this.complianceReports = new Map();
  }

  async createComprehensiveCustomerValidation() {
    console.log('Creating comprehensive customer data validation schema...');

    const customersCollection = db.collection('customers');

    // Define comprehensive validation schema
    const customerValidationSchema = {
      $jsonSchema: {
        bsonType: "object",
        required: [
          "companyName", 
          "legalEntityType", 
          "primaryContact", 
          "billingAddress", 
          "accountStatus",
          "audit"
        ],
        properties: {
          _id: {
            bsonType: "objectId"
          },

          // Company identification
          companyName: {
            bsonType: "string",
            minLength: 2,
            maxLength: 500,
            pattern: "^[A-Za-z0-9\\s\\-.,&'()]+$",
            description: "Company name must be 2-500 characters, alphanumeric with common punctuation"
          },

          legalEntityType: {
            enum: ["corporation", "llc", "partnership", "sole_proprietorship", "non_profit", "government"],
            description: "Must be a valid legal entity type"
          },

          businessRegistrationNumber: {
            bsonType: "string",
            pattern: "^[A-Z0-9\\-]{5,20}$",
            description: "Business registration number format validation"
          },

          taxId: {
            bsonType: "string",
            pattern: "^\\d{2}-\\d{7}$|^\\d{9}$",
            description: "Tax ID must be EIN format (XX-XXXXXXX) or SSN format (XXXXXXXXX)"
          },

          // Contact information with nested validation
          primaryContact: {
            bsonType: "object",
            required: ["firstName", "lastName", "email", "phone"],
            properties: {
              title: {
                bsonType: "string",
                enum: ["Mr", "Mrs", "Ms", "Dr", "Prof"]
              },
              firstName: {
                bsonType: "string",
                minLength: 1,
                maxLength: 50,
                pattern: "^[A-Za-z\\s\\-']+$"
              },
              lastName: {
                bsonType: "string", 
                minLength: 1,
                maxLength: 50,
                pattern: "^[A-Za-z\\s\\-']+$"
              },
              email: {
                bsonType: "string",
                pattern: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$",
                description: "Must be a valid email format"
              },
              phone: {
                bsonType: "string",
                pattern: "^\\+?[1-9]\\d{1,14}$",
                description: "Must be a valid international phone format"
              },
              mobile: {
                bsonType: "string",
                pattern: "^\\+?[1-9]\\d{1,14}$"
              },
              jobTitle: {
                bsonType: "string",
                maxLength: 100
              },
              department: {
                bsonType: "string",
                maxLength: 50
              }
            },
            additionalProperties: false
          },

          // Additional contacts array validation
          additionalContacts: {
            bsonType: "array",
            maxItems: 10,
            items: {
              bsonType: "object",
              required: ["firstName", "lastName", "email", "role"],
              properties: {
                firstName: { bsonType: "string", minLength: 1, maxLength: 50 },
                lastName: { bsonType: "string", minLength: 1, maxLength: 50 },
                email: { bsonType: "string", pattern: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$" },
                phone: { bsonType: "string", pattern: "^\\+?[1-9]\\d{1,14}$" },
                role: {
                  enum: ["billing", "technical", "executive", "procurement", "legal"]
                }
              }
            }
          },

          // Address validation with conditional requirements
          billingAddress: {
            bsonType: "object",
            required: ["street1", "city", "state", "postalCode", "country"],
            properties: {
              street1: {
                bsonType: "string",
                minLength: 5,
                maxLength: 200
              },
              street2: {
                bsonType: "string",
                maxLength: 200
              },
              city: {
                bsonType: "string",
                minLength: 2,
                maxLength: 100,
                pattern: "^[A-Za-z\\s\\-']+$"
              },
              state: {
                bsonType: "string",
                minLength: 2,
                maxLength: 50
              },
              postalCode: {
                bsonType: "string",
                minLength: 3,
                maxLength: 20
              },
              country: {
                bsonType: "string",
                enum: ["USA", "CAN", "MEX", "GBR", "FRA", "DEU", "AUS", "JPN", "IND"],
                description: "Must be a supported country code"
              },
              coordinates: {
                bsonType: "object",
                properties: {
                  latitude: {
                    bsonType: "double",
                    minimum: -90,
                    maximum: 90
                  },
                  longitude: {
                    bsonType: "double", 
                    minimum: -180,
                    maximum: 180
                  }
                }
              }
            },
            additionalProperties: false
          },

          // Optional shipping address with same validation
          shippingAddress: {
            bsonType: "object",
            properties: {
              street1: { bsonType: "string", minLength: 5, maxLength: 200 },
              street2: { bsonType: "string", maxLength: 200 },
              city: { bsonType: "string", minLength: 2, maxLength: 100 },
              state: { bsonType: "string", minLength: 2, maxLength: 50 },
              postalCode: { bsonType: "string", minLength: 3, maxLength: 20 },
              country: { bsonType: "string", enum: ["USA", "CAN", "MEX", "GBR", "FRA", "DEU", "AUS", "JPN", "IND"] }
            }
          },

          // Business metrics with conditional validation
          businessMetrics: {
            bsonType: "object",
            properties: {
              annualRevenue: {
                bsonType: "double",
                minimum: 0,
                maximum: 999999999999.99
              },
              employeeCount: {
                bsonType: "int",
                minimum: 1,
                maximum: 1000000
              },
              industryCode: {
                bsonType: "string",
                pattern: "^[0-9]{4,6}$",
                description: "NAICS industry code format"
              },
              establishedYear: {
                bsonType: "int",
                minimum: 1800,
                maximum: 2025
              },
              publiclyTraded: {
                bsonType: "bool"
              },
              stockSymbol: {
                bsonType: "string",
                pattern: "^[A-Z]{1,5}$"
              }
            }
          },

          // Account management
          accountStatus: {
            enum: ["active", "suspended", "closed", "pending_approval", "under_review"],
            description: "Must be a valid account status"
          },

          creditProfile: {
            bsonType: "object",
            properties: {
              creditLimit: {
                bsonType: "double",
                minimum: 0,
                maximum: 10000000
              },
              paymentTerms: {
                bsonType: "int",
                enum: [15, 30, 45, 60, 90]
              },
              creditRating: {
                bsonType: "string",
                enum: ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"]
              },
              lastCreditReview: {
                bsonType: "date"
              }
            }
          },

          // Compliance and regulatory requirements
          compliance: {
            bsonType: "object",
            properties: {
              gdprConsent: {
                bsonType: "object",
                required: ["hasConsent", "consentDate"],
                properties: {
                  hasConsent: { bsonType: "bool" },
                  consentDate: { bsonType: "date" },
                  consentVersion: { bsonType: "string" },
                  consentMethod: { 
                    enum: ["website", "email", "phone", "written", "implied"] 
                  },
                  dataProcessingPurposes: {
                    bsonType: "array",
                    items: {
                      enum: ["marketing", "analytics", "service_delivery", "legal_compliance", "research"]
                    }
                  }
                }
              },
              ccpaOptOut: {
                bsonType: "bool"
              },
              dataRetentionCategory: {
                enum: ["standard", "extended", "permanent", "gdpr_restricted", "legal_hold"],
                description: "Data retention policy classification"
              },
              piiClassification: {
                enum: ["none", "low", "medium", "high", "restricted"],
                description: "PII sensitivity classification"
              },
              regulatoryJurisdictions: {
                bsonType: "array",
                items: {
                  enum: ["US", "EU", "UK", "CA", "AU", "JP", "IN"]
                }
              }
            }
          },

          // Data quality and audit tracking
          audit: {
            bsonType: "object",
            required: ["createdAt", "updatedAt", "createdBy"],
            properties: {
              createdAt: { bsonType: "date" },
              updatedAt: { bsonType: "date" },
              createdBy: { bsonType: "objectId" },
              updatedBy: { bsonType: "objectId" },
              version: { bsonType: "int", minimum: 1 },
              lastValidated: { bsonType: "date" },
              dataSource: {
                enum: ["manual_entry", "import_csv", "api_integration", "web_form", "migration"]
              },
              validationStatus: {
                enum: ["pending", "validated", "needs_review", "rejected"]
              },
              changeHistory: {
                bsonType: "array",
                items: {
                  bsonType: "object",
                  properties: {
                    field: { bsonType: "string" },
                    oldValue: {},
                    newValue: {},
                    changedAt: { bsonType: "date" },
                    changedBy: { bsonType: "objectId" },
                    reason: { bsonType: "string" }
                  }
                }
              }
            }
          },

          // Integration and system metadata
          systemMetadata: {
            bsonType: "object",
            properties: {
              externalIds: {
                bsonType: "object",
                properties: {
                  crmId: { bsonType: "string" },
                  erpId: { bsonType: "string" }, 
                  accountingId: { bsonType: "string" },
                  legacyId: { bsonType: "string" }
                }
              },
              tags: {
                bsonType: "array",
                maxItems: 20,
                items: {
                  bsonType: "string",
                  pattern: "^[A-Za-z0-9\\-_]+$",
                  maxLength: 50
                }
              },
              customFields: {
                bsonType: "object",
                additionalProperties: true
              }
            }
          }
        },
        additionalProperties: false
      }
    };

    // Apply validation to collection
    await customersCollection.createCollection({
      validator: customerValidationSchema,
      validationLevel: "strict",
      validationAction: "error"
    });

    // Store validation schema for reference
    this.validationRules.set('customers', customerValidationSchema);

    console.log('✅ Comprehensive customer validation schema created');
    return customerValidationSchema;
  }

  async implementConditionalValidation() {
    console.log('Implementing advanced conditional validation rules...');

    // Create validation for different document types with conditional requirements
    const conditionalValidationRules = [
      {
        collectionName: 'customers',
        ruleName: 'corporation_tax_id_requirement',
        condition: {
          $expr: {
            $and: [
              { $in: ["$legalEntityType", ["corporation", "llc"]] },
              { $eq: [{ $type: "$taxId" }, "missing"] }
            ]
          }
        },
        errorMessage: "Tax ID is required for corporations and LLCs"
      },

      {
        collectionName: 'customers', 
        ruleName: 'high_revenue_business_registration',
        condition: {
          $expr: {
            $and: [
              { $gt: ["$businessMetrics.annualRevenue", 1000000] },
              { $eq: [{ $type: "$businessRegistrationNumber" }, "missing"] }
            ]
          }
        },
        errorMessage: "Business registration number required for companies with revenue > $1M"
      },

      {
        collectionName: 'customers',
        ruleName: 'public_company_stock_symbol',
        condition: {
          $expr: {
            $and: [
              { $eq: ["$businessMetrics.publiclyTraded", true] },
              { $eq: [{ $type: "$businessMetrics.stockSymbol" }, "missing"] }
            ]
          }
        },
        errorMessage: "Stock symbol required for publicly traded companies"
      },

      {
        collectionName: 'customers',
        ruleName: 'gdpr_consent_date_requirement',
        condition: {
          $expr: {
            $and: [
              { $in: ["EU", "$compliance.regulatoryJurisdictions"] },
              { $eq: ["$compliance.gdprConsent.hasConsent", true] },
              { $eq: [{ $type: "$compliance.gdprConsent.consentDate" }, "missing"] }
            ]
          }
        },
        errorMessage: "GDPR consent date required for EU jurisdiction customers"
      },

      {
        collectionName: 'customers',
        ruleName: 'high_credit_limit_validation',
        condition: {
          $expr: {
            $or: [
              {
                $and: [
                  { $lt: ["$businessMetrics.annualRevenue", 100000] },
                  { $gt: ["$creditProfile.creditLimit", 10000] }
                ]
              },
              {
                $and: [
                  { $lt: ["$businessMetrics.annualRevenue", 1000000] },
                  { $gt: ["$creditProfile.creditLimit", 50000] }
                ]
              },
              { $gt: ["$creditProfile.creditLimit", 500000] }
            ]
          }
        },
        errorMessage: "Credit limit exceeds allowed amount for business tier"
      }
    ];

    // Implement conditional validation using MongoDB's advanced features
    for (const rule of conditionalValidationRules) {
      try {
        const collection = this.db.collection(rule.collectionName);

        // Create a compound validator that includes the conditional rule
        const existingValidator = await collection.options();
        const currentSchema = existingValidator.validator || {};

        // Add conditional validation using $expr
        const enhancedSchema = {
          $and: [
            currentSchema,
            {
              $expr: {
                $not: rule.condition.$expr
              }
            }
          ]
        };

        await collection.updateOptions({
          validator: enhancedSchema,
          validationLevel: "strict",
          validationAction: "error"
        });

        console.log(`✅ Applied conditional rule: ${rule.ruleName}`);

      } catch (error) {
        console.error(`❌ Failed to apply rule ${rule.ruleName}:`, error.message);
      }
    }

    return conditionalValidationRules;
  }

  async validateDocumentQuality(collection, document) {
    console.log('Performing comprehensive document quality validation...');

    try {
      const qualityChecks = {
        documentId: document._id,
        timestamp: new Date(),
        overallScore: 0,
        checks: {},
        issues: [],
        recommendations: []
      };

      // 1. Schema compliance check
      try {
        const testResult = await collection.insertOne(document, { 
          bypassDocumentValidation: false,
          dryRun: true // MongoDB 5.0+ feature for validation testing
        });
        qualityChecks.checks.schemaCompliance = {
          status: 'PASS',
          score: 25,
          message: 'Document passes schema validation'
        };
        qualityChecks.overallScore += 25;
      } catch (validationError) {
        qualityChecks.checks.schemaCompliance = {
          status: 'FAIL',
          score: 0,
          message: validationError.message,
          details: validationError.errInfo
        };
        qualityChecks.issues.push('Schema validation failed: ' + validationError.message);
      }

      // 2. Data completeness analysis
      const completenessScore = this.analyzeDataCompleteness(document);
      qualityChecks.checks.completeness = completenessScore;
      qualityChecks.overallScore += completenessScore.score;

      // 3. Data consistency validation
      const consistencyScore = this.validateDataConsistency(document);
      qualityChecks.checks.consistency = consistencyScore;
      qualityChecks.overallScore += consistencyScore.score;

      // 4. Business rule validation
      const businessRuleScore = await this.validateBusinessRules(document);
      qualityChecks.checks.businessRules = businessRuleScore;
      qualityChecks.overallScore += businessRuleScore.score;

      // 5. Data freshness analysis
      const freshnessScore = this.analyzeFreshness(document);
      qualityChecks.checks.freshness = freshnessScore;
      qualityChecks.overallScore += freshnessScore.score;

      // Generate quality rating and recommendations
      qualityChecks.qualityRating = this.calculateQualityRating(qualityChecks.overallScore);
      qualityChecks.recommendations = this.generateQualityRecommendations(qualityChecks);

      // Store quality metrics
      await this.recordQualityMetrics(qualityChecks);

      return qualityChecks;

    } catch (error) {
      console.error('Error during quality validation:', error);
      throw error;
    }
  }

  analyzeDataCompleteness(document) {
    const analysis = {
      status: 'PASS',
      score: 0,
      details: {},
      recommendations: []
    };

    // Define critical fields and their weights
    const criticalFields = {
      'companyName': 5,
      'legalEntityType': 3,
      'primaryContact.email': 5,
      'primaryContact.phone': 3,
      'billingAddress.street1': 4,
      'billingAddress.city': 3,
      'billingAddress.state': 3,
      'billingAddress.postalCode': 3,
      'billingAddress.country': 2,
      'accountStatus': 2
    };

    let totalWeight = 0;
    let presentWeight = 0;

    Object.entries(criticalFields).forEach(([fieldPath, weight]) => {
      totalWeight += weight;

      const fieldValue = this.getNestedValue(document, fieldPath);
      if (fieldValue !== undefined && fieldValue !== null && fieldValue !== '') {
        presentWeight += weight;
        analysis.details[fieldPath] = { present: true, weight };
      } else {
        analysis.details[fieldPath] = { present: false, weight };
        analysis.recommendations.push(`Complete missing field: ${fieldPath}`);
      }
    });

    analysis.score = Math.round((presentWeight / totalWeight) * 20); // Max 20 points
    analysis.completenessPercentage = Math.round((presentWeight / totalWeight) * 100);

    if (analysis.completenessPercentage < 80) {
      analysis.status = 'NEEDS_IMPROVEMENT';
    }

    return analysis;
  }

  validateDataConsistency(document) {
    const analysis = {
      status: 'PASS',
      score: 15, // Start with full points, deduct for issues
      issues: [],
      recommendations: []
    };

    // Consistency checks
    const checks = [
      // Email format consistency
      {
        check: () => {
          const emails = [
            document.primaryContact?.email,
            ...(document.additionalContacts || []).map(c => c.email)
          ].filter(email => email);

          const emailPattern = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;
          return emails.every(email => emailPattern.test(email));
        },
        name: 'Email Format Consistency',
        penalty: 3,
        recommendation: 'Fix invalid email formats'
      },

      // Phone format consistency  
      {
        check: () => {
          const phones = [
            document.primaryContact?.phone,
            document.primaryContact?.mobile,
            ...(document.additionalContacts || []).map(c => c.phone)
          ].filter(phone => phone);

          const phonePattern = /^\+?[1-9]\d{1,14}$/;
          return phones.every(phone => phonePattern.test(phone));
        },
        name: 'Phone Format Consistency',
        penalty: 2,
        recommendation: 'Standardize phone number formats'
      },

      // Address consistency
      {
        check: () => {
          if (!document.shippingAddress) return true;

          const billing = document.billingAddress;
          const shipping = document.shippingAddress;

          return billing.country === shipping.country;
        },
        name: 'Address Country Consistency',
        penalty: 2,
        recommendation: 'Verify address country consistency'
      },

      // Legal entity and tax ID consistency
      {
        check: () => {
          const entityType = document.legalEntityType;
          const hasTaxId = !!document.taxId;

          if (['corporation', 'llc'].includes(entityType)) {
            return hasTaxId;
          }
          return true;
        },
        name: 'Tax ID Requirement Consistency',
        penalty: 5,
        recommendation: 'Add required tax ID for legal entity type'
      }
    ];

    checks.forEach(check => {
      try {
        if (!check.check()) {
          analysis.score -= check.penalty;
          analysis.issues.push(check.name);
          analysis.recommendations.push(check.recommendation);
        }
      } catch (error) {
        console.warn(`Consistency check failed: ${check.name}`, error);
      }
    });

    if (analysis.issues.length > 0) {
      analysis.status = 'NEEDS_IMPROVEMENT';
    }

    analysis.score = Math.max(0, analysis.score);
    return analysis;
  }

  async validateBusinessRules(document) {
    const analysis = {
      status: 'PASS',
      score: 20, // Start with full points
      violations: [],
      recommendations: []
    };

    // Business rule validations
    const businessRules = [
      {
        name: 'High Revenue Registration Requirement',
        validate: (doc) => {
          const revenue = doc.businessMetrics?.annualRevenue;
          return !revenue || revenue <= 1000000 || !!doc.businessRegistrationNumber;
        },
        penalty: 8,
        message: 'Companies with >$1M revenue require business registration number'
      },

      {
        name: 'Public Company Stock Symbol',
        validate: (doc) => {
          const isPublic = doc.businessMetrics?.publiclyTraded;
          return !isPublic || !!doc.businessMetrics?.stockSymbol;
        },
        penalty: 3,
        message: 'Publicly traded companies must have stock symbol'
      },

      {
        name: 'Credit Limit Business Tier Validation',
        validate: (doc) => {
          const revenue = doc.businessMetrics?.annualRevenue;
          const creditLimit = doc.creditProfile?.creditLimit;

          if (!revenue || !creditLimit) return true;

          if (revenue < 100000 && creditLimit > 10000) return false;
          if (revenue < 1000000 && creditLimit > 50000) return false;
          if (creditLimit > 500000) return false;

          return true;
        },
        penalty: 5,
        message: 'Credit limit exceeds allowed amount for business tier'
      },

      {
        name: 'GDPR Compliance for EU Customers',
        validate: (doc) => {
          const jurisdictions = doc.compliance?.regulatoryJurisdictions || [];
          const hasGdprConsent = doc.compliance?.gdprConsent?.hasConsent;
          const hasConsentDate = !!doc.compliance?.gdprConsent?.consentDate;

          if (jurisdictions.includes('EU')) {
            return hasGdprConsent && hasConsentDate;
          }
          return true;
        },
        penalty: 7,
        message: 'EU customers require GDPR consent with date'
      }
    ];

    for (const rule of businessRules) {
      try {
        if (!rule.validate(document)) {
          analysis.score -= rule.penalty;
          analysis.violations.push(rule.name);
          analysis.recommendations.push(rule.message);
        }
      } catch (error) {
        console.warn(`Business rule validation failed: ${rule.name}`, error);
      }
    }

    if (analysis.violations.length > 0) {
      analysis.status = 'NEEDS_REVIEW';
    }

    analysis.score = Math.max(0, analysis.score);
    return analysis;
  }

  analyzeFreshness(document) {
    const analysis = {
      status: 'PASS',
      score: 0,
      ageInDays: 0,
      recommendations: []
    };

    const updatedAt = new Date(document.audit?.updatedAt || document.audit?.createdAt);
    const now = new Date();
    const daysDifference = Math.floor((now - updatedAt) / (1000 * 60 * 60 * 24));

    analysis.ageInDays = daysDifference;

    // Freshness scoring
    if (daysDifference <= 30) {
      analysis.score = 20; // Fresh data
      analysis.status = 'FRESH';
    } else if (daysDifference <= 90) {
      analysis.score = 15; // Recent data
      analysis.status = 'RECENT';
    } else if (daysDifference <= 180) {
      analysis.score = 10; // Aging data
      analysis.status = 'AGING';
      analysis.recommendations.push('Consider updating customer information');
    } else if (daysDifference <= 365) {
      analysis.score = 5; // Stale data
      analysis.status = 'STALE';
      analysis.recommendations.push('Customer data needs refresh - over 6 months old');
    } else {
      analysis.score = 0; // Very stale
      analysis.status = 'VERY_STALE';
      analysis.recommendations.push('Critical: Customer data is over 1 year old');
    }

    return analysis;
  }

  calculateQualityRating(overallScore) {
    if (overallScore >= 90) return 'EXCELLENT';
    if (overallScore >= 75) return 'GOOD';
    if (overallScore >= 60) return 'FAIR';
    if (overallScore >= 40) return 'POOR';
    return 'CRITICAL';
  }

  generateQualityRecommendations(qualityChecks) {
    const recommendations = [];

    // Collect recommendations from all checks
    Object.values(qualityChecks.checks).forEach(check => {
      if (check.recommendations) {
        recommendations.push(...check.recommendations);
      }
    });

    // Add overall recommendations based on score
    if (qualityChecks.overallScore < 40) {
      recommendations.unshift('CRITICAL: Immediate data quality improvement required');
    } else if (qualityChecks.overallScore < 60) {
      recommendations.unshift('Multiple data quality issues need addressing');
    } else if (qualityChecks.overallScore < 75) {
      recommendations.unshift('Minor improvements needed for optimal data quality');
    }

    return [...new Set(recommendations)]; // Remove duplicates
  }

  async recordQualityMetrics(qualityChecks) {
    try {
      await this.db.collection('data_quality_metrics').insertOne({
        ...qualityChecks,
        recordedAt: new Date()
      });

      // Update in-memory metrics for reporting
      const key = `${qualityChecks.documentId}_${Date.now()}`;
      this.qualityMetrics.set(key, qualityChecks);

    } catch (error) {
      console.warn('Failed to record quality metrics:', error);
    }
  }

  async generateComplianceReport() {
    console.log('Generating comprehensive compliance and data quality report...');

    try {
      const customersCollection = this.db.collection('customers');

      // Comprehensive compliance analysis pipeline
      const complianceAnalysis = await customersCollection.aggregate([
        // Stage 1: Add computed compliance fields
        {
          $addFields: {
            // GDPR compliance status
            gdprCompliant: {
              $cond: {
                if: { $in: ["EU", "$compliance.regulatoryJurisdictions"] },
                then: {
                  $and: [
                    { $eq: ["$compliance.gdprConsent.hasConsent", true] },
                    { $ne: ["$compliance.gdprConsent.consentDate", null] }
                  ]
                },
                else: true
              }
            },

            // Tax compliance status
            taxCompliant: {
              $cond: {
                if: { $in: ["$legalEntityType", ["corporation", "llc"]] },
                then: { $ne: ["$taxId", null] },
                else: true
              }
            },

            // Data completeness score
            completenessScore: {
              $let: {
                vars: {
                  requiredFields: [
                    { $ne: ["$companyName", null] },
                    { $ne: ["$primaryContact.email", null] },
                    { $ne: ["$primaryContact.phone", null] },
                    { $ne: ["$billingAddress.street1", null] },
                    { $ne: ["$billingAddress.city", null] },
                    { $ne: ["$accountStatus", null] }
                  ]
                },
                in: {
                  $multiply: [
                    { $divide: [
                      { $size: { $filter: {
                        input: "$$requiredFields",
                        cond: { $eq: ["$$this", true] }
                      }}},
                      { $size: "$$requiredFields" }
                    ]},
                    100
                  ]
                }
              }
            },

            // Data freshness
            dataAge: {
              $divide: [
                { $subtract: [new Date(), "$audit.updatedAt"] },
                86400000 // Convert to days
              ]
            }
          }
        },

        // Stage 2: Quality classification
        {
          $addFields: {
            qualityRating: {
              $switch: {
                branches: [
                  { 
                    case: { 
                      $and: [
                        { $gte: ["$completenessScore", 95] },
                        "$gdprCompliant",
                        "$taxCompliant",
                        { $lte: ["$dataAge", 90] }
                      ]
                    }, 
                    then: "EXCELLENT" 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ["$completenessScore", 80] },
                        "$gdprCompliant",
                        "$taxCompliant",
                        { $lte: ["$dataAge", 180] }
                      ]
                    }, 
                    then: "GOOD" 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ["$completenessScore", 60] },
                        { $lte: ["$dataAge", 365] }
                      ]
                    }, 
                    then: "FAIR" 
                  }
                ],
                default: "POOR"
              }
            },

            complianceIssues: {
              $concatArrays: [
                { $cond: [{ $not: "$gdprCompliant" }, ["GDPR_NON_COMPLIANT"], []] },
                { $cond: [{ $not: "$taxCompliant" }, ["MISSING_TAX_ID"], []] },
                { $cond: [{ $lt: ["$completenessScore", 80] }, ["INCOMPLETE_DATA"], []] },
                { $cond: [{ $gt: ["$dataAge", 365] }, ["STALE_DATA"], []] }
              ]
            }
          }
        },

        // Stage 3: Aggregate compliance statistics
        {
          $group: {
            _id: null,

            // Total counts
            totalCustomers: { $sum: 1 },

            // Compliance counts
            gdprCompliantCount: { $sum: { $cond: ["$gdprCompliant", 1, 0] } },
            taxCompliantCount: { $sum: { $cond: ["$taxCompliant", 1, 0] } },

            // Quality distribution
            excellentQuality: { $sum: { $cond: [{ $eq: ["$qualityRating", "EXCELLENT"] }, 1, 0] } },
            goodQuality: { $sum: { $cond: [{ $eq: ["$qualityRating", "GOOD"] }, 1, 0] } },
            fairQuality: { $sum: { $cond: [{ $eq: ["$qualityRating", "FAIR"] }, 1, 0] } },
            poorQuality: { $sum: { $cond: [{ $eq: ["$qualityRating", "POOR"] }, 1, 0] } },

            // Completeness metrics
            avgCompletenessScore: { $avg: "$completenessScore" },
            minCompletenessScore: { $min: "$completenessScore" },

            // Freshness metrics
            avgDataAge: { $avg: "$dataAge" },
            staleDataCount: { $sum: { $cond: [{ $gt: ["$dataAge", 365] }, 1, 0] } },

            // Issue tracking
            allIssues: { $push: "$complianceIssues" },

            // Sample records for detailed analysis
            qualityExamples: {
              $push: {
                $cond: [
                  { $lte: [{ $rand: {} }, 0.1] }, // Sample 10%
                  {
                    customerId: "$_id",
                    companyName: "$companyName",
                    qualityRating: "$qualityRating",
                    completenessScore: "$completenessScore",
                    dataAge: "$dataAge",
                    issues: "$complianceIssues"
                  },
                  null
                ]
              }
            }
          }
        },

        // Stage 4: Calculate percentages and final metrics
        {
          $addFields: {
            // Compliance percentages
            gdprComplianceRate: { $multiply: [{ $divide: ["$gdprCompliantCount", "$totalCustomers"] }, 100] },
            taxComplianceRate: { $multiply: [{ $divide: ["$taxCompliantCount", "$totalCustomers"] }, 100] },

            // Quality distribution percentages
            excellentQualityPct: { $multiply: [{ $divide: ["$excellentQuality", "$totalCustomers"] }, 100] },
            goodQualityPct: { $multiply: [{ $divide: ["$goodQuality", "$totalCustomers"] }, 100] },
            fairQualityPct: { $multiply: [{ $divide: ["$fairQuality", "$totalCustomers"] }, 100] },
            poorQualityPct: { $multiply: [{ $divide: ["$poorQuality", "$totalCustomers"] }, 100] },

            // Data freshness metrics
            staleDataRate: { $multiply: [{ $divide: ["$staleDataCount", "$totalCustomers"] }, 100] },

            // Issue analysis
            issueFrequency: {
              $reduce: {
                input: "$allIssues",
                initialValue: {},
                in: {
                  $mergeObjects: [
                    "$$value",
                    {
                      $arrayToObject: {
                        $map: {
                          input: "$$this",
                          as: "issue",
                          in: {
                            k: "$$issue",
                            v: { $add: [{ $ifNull: [{ $getField: { field: "$$issue", input: "$$value" } }, 0] }, 1] }
                          }
                        }
                      }
                    }
                  ]
                }
              }
            },

            // Filter null examples
            qualityExamples: {
              $filter: {
                input: "$qualityExamples",
                cond: { $ne: ["$$this", null] }
              }
            }
          }
        },

        // Stage 5: Final report structure
        {
          $project: {
            _id: 0,
            reportGenerated: new Date(),
            summary: {
              totalCustomers: "$totalCustomers",
              overallComplianceScore: {
                $round: [
                  { $avg: ["$gdprComplianceRate", "$taxComplianceRate"] }, 
                  1
                ]
              },
              avgDataQuality: {
                $round: ["$avgCompletenessScore", 1]
              },
              avgDataAgedays: {
                $round: ["$avgDataAge", 0]
              }
            },

            compliance: {
              gdpr: {
                compliantCount: "$gdprCompliantCount",
                complianceRate: { $round: ["$gdprComplianceRate", 1] }
              },
              tax: {
                compliantCount: "$taxCompliantCount", 
                complianceRate: { $round: ["$taxComplianceRate", 1] }
              }
            },

            dataQuality: {
              distribution: {
                excellent: { count: "$excellentQuality", percentage: { $round: ["$excellentQualityPct", 1] } },
                good: { count: "$goodQuality", percentage: { $round: ["$goodQualityPct", 1] } },
                fair: { count: "$fairQuality", percentage: { $round: ["$fairQualityPct", 1] } },
                poor: { count: "$poorQuality", percentage: { $round: ["$poorQualityPct", 1] } }
              },
              completeness: {
                average: { $round: ["$avgCompletenessScore", 1] },
                minimum: { $round: ["$minCompletenessScore", 1] }
              }
            },

            dataFreshness: {
              averageAge: { $round: ["$avgDataAge", 0] },
              staleRecords: { count: "$staleDataCount", percentage: { $round: ["$staleDataRate", 1] } }
            },

            topIssues: "$issueFrequency",
            sampleRecords: { $slice: ["$qualityExamples", 10] }
          }
        }
      ]).toArray();

      const report = complianceAnalysis[0] || {};

      // Generate recommendations based on findings
      report.recommendations = this.generateComplianceRecommendations(report);

      // Store report for historical tracking
      await this.db.collection('compliance_reports').insertOne({
        ...report,
        reportType: 'comprehensive_compliance_audit',
        generatedBy: 'schema_validator_system'
      });

      this.complianceReports.set('latest', report);

      console.log('\n📋 Compliance and Data Quality Report Summary:');
      console.log(`Total Customers Analyzed: ${report.summary?.totalCustomers || 0}`);
      console.log(`Overall Compliance Score: ${report.summary?.overallComplianceScore || 0}%`);
      console.log(`Average Data Quality: ${report.summary?.avgDataQuality || 0}%`);
      console.log(`GDPR Compliance Rate: ${report.compliance?.gdpr?.complianceRate || 0}%`);
      console.log(`Tax Compliance Rate: ${report.compliance?.tax?.complianceRate || 0}%`);

      if (report.recommendations?.length > 0) {
        console.log('\n💡 Key Recommendations:');
        report.recommendations.slice(0, 5).forEach(rec => {
          console.log(`  • ${rec}`);
        });
      }

      return report;

    } catch (error) {
      console.error('Error generating compliance report:', error);
      throw error;
    }
  }

  generateComplianceRecommendations(report) {
    const recommendations = [];

    // GDPR compliance recommendations
    if (report.compliance?.gdpr?.complianceRate < 95) {
      recommendations.push('Improve GDPR compliance by ensuring all EU customers have documented consent');
    }

    // Tax compliance recommendations
    if (report.compliance?.tax?.complianceRate < 95) {
      recommendations.push('Add missing tax IDs for corporations and LLCs');
    }

    // Data quality recommendations
    const qualityDist = report.dataQuality?.distribution;
    if (qualityDist?.poor?.percentage > 10) {
      recommendations.push('Critical: Over 10% of customer records have poor data quality');
    }

    if (qualityDist?.excellent?.percentage < 50) {
      recommendations.push('Implement data quality improvement program - less than 50% excellent quality');
    }

    // Data freshness recommendations
    if (report.dataFreshness?.staleRecords?.percentage > 15) {
      recommendations.push('Establish customer data refresh program for stale records');
    }

    // Issue-specific recommendations
    const topIssues = report.topIssues || {};
    if (topIssues.INCOMPLETE_DATA > topIssues.totalCustomers * 0.2) {
      recommendations.push('Implement required field completion workflows');
    }

    return recommendations;
  }

  // Utility methods
  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => {
      return current && current[key] !== undefined ? current[key] : undefined;
    }, obj);
  }
}

// Export the schema validator
module.exports = { MongoSchemaValidator };

// Benefits of MongoDB Schema Validation:
// - Flexible document validation with evolving schema requirements
// - Comprehensive data quality management and automated quality scoring
// - Advanced conditional validation based on document context
// - Enterprise-grade compliance tracking and regulatory reporting
// - Automated data quality monitoring and issue identification
// - Integration with business rules and custom validation logic
// - Real-time validation feedback and quality metrics
// - Support for complex nested document validation
// - Automated compliance reporting and audit trails
// - SQL-compatible data governance patterns through QueryLeaf integration

Understanding MongoDB Schema Validation Architecture

Advanced Validation Patterns

MongoDB's validation system supports sophisticated data governance strategies for enterprise applications:

// Advanced validation patterns and data governance implementation
class EnterpriseDataGovernance {
  constructor(db) {
    this.db = db;
    this.governanceRules = new Map();
    this.qualityDashboards = new Map();
    this.complianceAudits = new Map();
  }

  async implementDataLineageTracking() {
    console.log('Implementing comprehensive data lineage and governance tracking...');

    // Create data lineage collection with validation
    const lineageSchema = {
      $jsonSchema: {
        bsonType: "object",
        required: ["sourceSystem", "targetCollection", "transformationRules", "timestamp", "dataClassification"],
        properties: {
          sourceSystem: {
            bsonType: "string",
            enum: ["crm", "erp", "web_form", "api", "batch_import", "manual_entry"]
          },
          targetCollection: { bsonType: "string" },
          documentId: { bsonType: "objectId" },

          transformationRules: {
            bsonType: "array",
            items: {
              bsonType: "object",
              required: ["field", "operation", "appliedAt"],
              properties: {
                field: { bsonType: "string" },
                operation: {
                  enum: ["validation", "enrichment", "standardization", "encryption", "anonymization"]
                },
                appliedAt: { bsonType: "date" },
                appliedBy: { bsonType: "string" },
                previousValue: {},
                newValue: {},
                validationResult: {
                  bsonType: "object",
                  properties: {
                    passed: { bsonType: "bool" },
                    score: { bsonType: "double", minimum: 0, maximum: 100 },
                    issues: { bsonType: "array", items: { bsonType: "string" } }
                  }
                }
              }
            }
          },

          dataClassification: {
            bsonType: "object",
            required: ["piiLevel", "retentionClass", "accessLevel"],
            properties: {
              piiLevel: {
                enum: ["none", "low", "medium", "high", "restricted"]
              },
              retentionClass: {
                enum: ["standard", "extended", "permanent", "legal_hold", "gdpr_restricted"]
              },
              accessLevel: {
                enum: ["public", "internal", "confidential", "restricted", "top_secret"]
              },
              encryptionRequired: { bsonType: "bool" },
              auditRequired: { bsonType: "bool" }
            }
          },

          qualityMetrics: {
            bsonType: "object",
            properties: {
              completenessScore: { bsonType: "double", minimum: 0, maximum: 100 },
              accuracyScore: { bsonType: "double", minimum: 0, maximum: 100 },
              consistencyScore: { bsonType: "double", minimum: 0, maximum: 100 },
              timelinessScore: { bsonType: "double", minimum: 0, maximum: 100 },
              overallQualityScore: { bsonType: "double", minimum: 0, maximum: 100 }
            }
          },

          complianceChecks: {
            bsonType: "object",
            properties: {
              gdprCompliant: { bsonType: "bool" },
              ccpaCompliant: { bsonType: "bool" },
              hipaaCompliant: { bsonType: "bool" },
              sox404Compliant: { bsonType: "bool" },
              complianceScore: { bsonType: "double", minimum: 0, maximum: 100 },
              lastAuditDate: { bsonType: "date" },
              nextAuditDue: { bsonType: "date" }
            }
          },

          timestamp: { bsonType: "date" },
          processingLatency: { bsonType: "double" },

          audit: {
            bsonType: "object",
            required: ["createdBy", "createdAt"],
            properties: {
              createdBy: { bsonType: "string" },
              createdAt: { bsonType: "date" },
              version: { bsonType: "string" },
              correlationId: { bsonType: "string" }
            }
          }
        }
      }
    };

    await this.db.createCollection('data_lineage', {
      validator: lineageSchema,
      validationLevel: "strict",
      validationAction: "error"
    });

    console.log('✅ Data lineage tracking implemented');
    return lineageSchema;
  }

  async createDataQualityDashboard() {
    console.log('Creating real-time data quality monitoring dashboard...');

    const dashboard = await this.db.collection('customers').aggregate([
      // Stage 1: Real-time quality analysis
      {
        $addFields: {
          qualityChecks: {
            emailValid: {
              $regexMatch: {
                input: "$primaryContact.email",
                regex: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
              }
            },
            phoneValid: {
              $regexMatch: {
                input: "$primaryContact.phone", 
                regex: "^\\+?[1-9]\\d{1,14}$"
              }
            },
            addressComplete: {
              $and: [
                { $ne: ["$billingAddress.street1", null] },
                { $ne: ["$billingAddress.city", null] },
                { $ne: ["$billingAddress.state", null] },
                { $ne: ["$billingAddress.postalCode", null] }
              ]
            },
            taxIdPresent: {
              $cond: {
                if: { $in: ["$legalEntityType", ["corporation", "llc"]] },
                then: { $ne: ["$taxId", null] },
                else: true
              }
            },
            dataFresh: {
              $lt: [
                { $subtract: [new Date(), "$audit.updatedAt"] },
                7776000000 // 90 days in milliseconds
              ]
            }
          }
        }
      },

      // Stage 2: Calculate individual record scores
      {
        $addFields: {
          individualQualityScore: {
            $multiply: [
              {
                $divide: [
                  {
                    $add: [
                      { $cond: ["$qualityChecks.emailValid", 20, 0] },
                      { $cond: ["$qualityChecks.phoneValid", 15, 0] },
                      { $cond: ["$qualityChecks.addressComplete", 25, 0] },
                      { $cond: ["$qualityChecks.taxIdPresent", 25, 0] },
                      { $cond: ["$qualityChecks.dataFresh", 15, 0] }
                    ]
                  },
                  100
                ]
              },
              100
            ]
          }
        }
      },

      // Stage 3: Aggregate dashboard metrics
      {
        $group: {
          _id: null,

          // Volume metrics
          totalRecords: { $sum: 1 },
          recordsProcessedToday: {
            $sum: {
              $cond: [
                { $gte: ["$audit.createdAt", new Date(Date.now() - 86400000)] },
                1, 0
              ]
            }
          },

          // Quality distribution
          excellentQuality: {
            $sum: { $cond: [{ $gte: ["$individualQualityScore", 90] }, 1, 0] }
          },
          goodQuality: {
            $sum: { $cond: [
              { $and: [{ $gte: ["$individualQualityScore", 70] }, { $lt: ["$individualQualityScore", 90] }] },
              1, 0
            ]}
          },
          fairQuality: {
            $sum: { $cond: [
              { $and: [{ $gte: ["$individualQualityScore", 50] }, { $lt: ["$individualQualityScore", 70] }] },
              1, 0
            ]}
          },
          poorQuality: {
            $sum: { $cond: [{ $lt: ["$individualQualityScore", 50] }, 1, 0] }
          },

          // Field-specific quality metrics
          validEmails: { $sum: { $cond: ["$qualityChecks.emailValid", 1, 0] } },
          validPhones: { $sum: { $cond: ["$qualityChecks.phoneValid", 1, 0] } },
          completeAddresses: { $sum: { $cond: ["$qualityChecks.addressComplete", 1, 0] } },
          compliantTaxIds: { $sum: { $cond: ["$qualityChecks.taxIdPresent", 1, 0] } },
          freshData: { $sum: { $cond: ["$qualityChecks.dataFresh", 1, 0] } },

          // Quality score statistics
          avgQualityScore: { $avg: "$individualQualityScore" },
          minQualityScore: { $min: "$individualQualityScore" },
          maxQualityScore: { $max: "$individualQualityScore" },

          // Compliance tracking
          gdprComplianceCount: {
            $sum: {
              $cond: [
                {
                  $and: [
                    { $in: ["EU", { $ifNull: ["$compliance.regulatoryJurisdictions", []] }] },
                    { $eq: ["$compliance.gdprConsent.hasConsent", true] },
                    { $ne: ["$compliance.gdprConsent.consentDate", null] }
                  ]
                },
                1, 0
              ]
            }
          },

          // Data freshness metrics
          staleRecordsCount: {
            $sum: { $cond: [{ $not: "$qualityChecks.dataFresh" }, 1, 0] }
          }
        }
      },

      // Stage 4: Calculate percentages and dashboard KPIs
      {
        $addFields: {
          timestamp: new Date(),

          qualityDistribution: {
            excellent: {
              count: "$excellentQuality",
              percentage: { $round: [{ $multiply: [{ $divide: ["$excellentQuality", "$totalRecords"] }, 100] }, 1] }
            },
            good: {
              count: "$goodQuality", 
              percentage: { $round: [{ $multiply: [{ $divide: ["$goodQuality", "$totalRecords"] }, 100] }, 1] }
            },
            fair: {
              count: "$fairQuality",
              percentage: { $round: [{ $multiply: [{ $divide: ["$fairQuality", "$totalRecords"] }, 100] }, 1] }
            },
            poor: {
              count: "$poorQuality",
              percentage: { $round: [{ $multiply: [{ $divide: ["$poorQuality", "$totalRecords"] }, 100] }, 1] }
            }
          },

          fieldQualityRates: {
            emailValidityRate: { $round: [{ $multiply: [{ $divide: ["$validEmails", "$totalRecords"] }, 100] }, 1] },
            phoneValidityRate: { $round: [{ $multiply: [{ $divide: ["$validPhones", "$totalRecords"] }, 100] }, 1] },
            addressCompletenessRate: { $round: [{ $multiply: [{ $divide: ["$completeAddresses", "$totalRecords"] }, 100] }, 1] },
            taxComplianceRate: { $round: [{ $multiply: [{ $divide: ["$compliantTaxIds", "$totalRecords"] }, 100] }, 1] },
            dataFreshnessRate: { $round: [{ $multiply: [{ $divide: ["$freshData", "$totalRecords"] }, 100] }, 1] }
          },

          overallHealthScore: {
            $round: [
              {
                $avg: [
                  { $multiply: [{ $divide: ["$validEmails", "$totalRecords"] }, 100] },
                  { $multiply: [{ $divide: ["$validPhones", "$totalRecords"] }, 100] },
                  { $multiply: [{ $divide: ["$completeAddresses", "$totalRecords"] }, 100] },
                  { $multiply: [{ $divide: ["$compliantTaxIds", "$totalRecords"] }, 100] },
                  { $multiply: [{ $divide: ["$freshData", "$totalRecords"] }, 100] }
                ]
              },
              1
            ]
          },

          alerts: {
            criticalIssues: { $cond: [{ $gt: ["$poorQuality", { $multiply: ["$totalRecords", 0.1] }] }, "High poor quality rate", null] },
            warningIssues: {
              $switch: {
                branches: [
                  { case: { $lt: [{ $multiply: [{ $divide: ["$validEmails", "$totalRecords"] }, 100] }, 90] }, then: "Email validity below 90%" },
                  { case: { $lt: [{ $multiply: [{ $divide: ["$completeAddresses", "$totalRecords"] }, 100] }, 85] }, then: "Address completeness below 85%" },
                  { case: { $gt: ["$staleRecordsCount", { $multiply: ["$totalRecords", 0.2] }] }, then: "Over 20% stale data" }
                ],
                default: null
              }
            }
          }
        }
      }
    ]).toArray();

    const dashboardData = dashboard[0];
    if (dashboardData) {
      // Store dashboard for historical tracking
      await this.db.collection('quality_dashboards').insertOne(dashboardData);
      this.qualityDashboards.set('current', dashboardData);

      // Display dashboard summary
      console.log('\n📊 Real-Time Data Quality Dashboard:');
      console.log(`Overall Health Score: ${dashboardData.overallHealthScore}%`);
      console.log(`Total Records: ${dashboardData.totalRecords?.toLocaleString()}`);
      console.log(`Records Processed Today: ${dashboardData.recordsProcessedToday?.toLocaleString()}`);
      console.log('\nQuality Distribution:');
      console.log(`  Excellent: ${dashboardData.qualityDistribution?.excellent?.count} (${dashboardData.qualityDistribution?.excellent?.percentage}%)`);
      console.log(`  Good: ${dashboardData.qualityDistribution?.good?.count} (${dashboardData.qualityDistribution?.good?.percentage}%)`);
      console.log(`  Fair: ${dashboardData.qualityDistribution?.fair?.count} (${dashboardData.qualityDistribution?.fair?.percentage}%)`);
      console.log(`  Poor: ${dashboardData.qualityDistribution?.poor?.count} (${dashboardData.qualityDistribution?.poor?.percentage}%)`);

      if (dashboardData.alerts?.criticalIssues) {
        console.log(`\n🚨 Critical Alert: ${dashboardData.alerts.criticalIssues}`);
      }
      if (dashboardData.alerts?.warningIssues) {
        console.log(`\n⚠️ Warning: ${dashboardData.alerts.warningIssues}`);
      }
    }

    return dashboardData;
  }

  async automateDataQualityRemediation() {
    console.log('Implementing automated data quality remediation workflows...');

    const remediationRules = [
      {
        name: 'email_standardization',
        condition: { $not: { $regexMatch: { input: "$primaryContact.email", regex: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$" } } },
        action: 'flag_for_review',
        priority: 'high'
      },

      {
        name: 'phone_formatting',
        condition: { $not: { $regexMatch: { input: "$primaryContact.phone", regex: "^\\+?[1-9]\\d{1,14}$" } } },
        action: 'auto_format',
        priority: 'medium'
      },

      {
        name: 'missing_tax_id',
        condition: {
          $and: [
            { $in: ["$legalEntityType", ["corporation", "llc"]] },
            { $eq: ["$taxId", null] }
          ]
        },
        action: 'request_completion',
        priority: 'high'
      },

      {
        name: 'stale_data_refresh',
        condition: { $gt: [{ $subtract: [new Date(), "$audit.updatedAt"] }, 15552000000] }, // 180 days
        action: 'schedule_refresh',
        priority: 'low'
      }
    ];

    // Execute remediation workflows
    const remediationResults = [];

    for (const rule of remediationRules) {
      try {
        const affectedDocuments = await this.db.collection('customers').find({
          $expr: rule.condition
        }).limit(1000).toArray();

        if (affectedDocuments.length > 0) {
          const remediation = {
            ruleName: rule.name,
            affectedCount: affectedDocuments.length,
            action: rule.action,
            priority: rule.priority,
            processedAt: new Date(),
            results: []
          };

          // Process based on action type
          for (const doc of affectedDocuments) {
            switch (rule.action) {
              case 'flag_for_review':
                await this.flagForReview(doc._id, rule.name);
                remediation.results.push({ documentId: doc._id, status: 'flagged' });
                break;

              case 'auto_format':
                const formatted = await this.autoFormatData(doc, rule.name);
                if (formatted) {
                  remediation.results.push({ documentId: doc._id, status: 'formatted' });
                }
                break;

              case 'request_completion':
                await this.requestDataCompletion(doc._id, rule.name);
                remediation.results.push({ documentId: doc._id, status: 'completion_requested' });
                break;

              case 'schedule_refresh':
                await this.scheduleDataRefresh(doc._id);
                remediation.results.push({ documentId: doc._id, status: 'refresh_scheduled' });
                break;
            }
          }

          remediationResults.push(remediation);
          console.log(`✅ Processed ${rule.name}: ${remediation.results.length} documents`);
        }

      } catch (error) {
        console.error(`❌ Failed to process rule ${rule.name}:`, error.message);
      }
    }

    // Store remediation audit trail
    if (remediationResults.length > 0) {
      await this.db.collection('remediation_audit').insertOne({
        executionTimestamp: new Date(),
        totalRulesExecuted: remediationRules.length,
        rulesWithMatches: remediationResults.length,
        results: remediationResults,
        executedBy: 'automated_quality_system'
      });
    }

    console.log(`Automated remediation completed: ${remediationResults.length} rules processed`);
    return remediationResults;
  }

  // Helper methods for remediation actions
  async flagForReview(documentId, reason) {
    return await this.db.collection('quality_review_queue').insertOne({
      documentId: documentId,
      reason: reason,
      priority: 'high',
      status: 'pending_review',
      flaggedAt: new Date(),
      assignedTo: null
    });
  }

  async autoFormatData(document, ruleName) {
    // Example: Auto-format phone numbers
    if (ruleName === 'phone_formatting' && document.primaryContact?.phone) {
      const phone = document.primaryContact.phone.replace(/\D/g, '');
      if (phone.length === 10) {
        const formatted = `+1${phone}`;

        await this.db.collection('customers').updateOne(
          { _id: document._id },
          { 
            $set: { 
              "primaryContact.phone": formatted,
              "audit.updatedAt": new Date(),
              "audit.lastAutoFormatted": new Date()
            }
          }
        );

        return true;
      }
    }
    return false;
  }

  async requestDataCompletion(documentId, reason) {
    return await this.db.collection('data_completion_requests').insertOne({
      documentId: documentId,
      reason: reason,
      requestedAt: new Date(),
      status: 'pending',
      priority: 'high',
      assignedTo: null,
      dueDate: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000) // 7 days
    });
  }

  async scheduleDataRefresh(documentId) {
    return await this.db.collection('data_refresh_schedule').insertOne({
      documentId: documentId,
      scheduledFor: new Date(Date.now() + 24 * 60 * 60 * 1000), // Next day
      priority: 'low',
      status: 'scheduled',
      refreshType: 'stale_data_update'
    });
  }
}

// Export the enterprise governance class
module.exports = { EnterpriseDataGovernance };

SQL-Style Schema Validation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB schema validation and data quality management:

-- QueryLeaf schema validation with SQL-familiar syntax

-- Create collection with comprehensive validation rules
CREATE TABLE customers (
  _id OBJECTID PRIMARY KEY,
  company_name VARCHAR(500) NOT NULL,
  legal_entity_type VARCHAR(50) NOT NULL,

  -- Contact information with validation
  primary_contact JSON NOT NULL CHECK (
    JSON_VALID(primary_contact) AND
    JSON_EXTRACT(primary_contact, '$.email') REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$' AND
    JSON_EXTRACT(primary_contact, '$.phone') REGEXP '^\\+?[1-9]\\d{1,14}$'
  ),

  -- Address validation
  billing_address JSON NOT NULL CHECK (
    JSON_VALID(billing_address) AND
    JSON_LENGTH(JSON_EXTRACT(billing_address, '$.street1')) >= 5 AND
    JSON_LENGTH(JSON_EXTRACT(billing_address, '$.city')) >= 2 AND
    JSON_EXTRACT(billing_address, '$.country') IN ('USA', 'CAN', 'MEX', 'GBR', 'FRA', 'DEU')
  ),

  -- Business metrics with constraints
  business_metrics JSON CHECK (
    business_metrics IS NULL OR (
      JSON_VALID(business_metrics) AND
      COALESCE(JSON_EXTRACT(business_metrics, '$.annual_revenue'), 0) >= 0 AND
      COALESCE(JSON_EXTRACT(business_metrics, '$.employee_count'), 1) >= 1
    )
  ),

  account_status VARCHAR(20) NOT NULL DEFAULT 'active',

  -- Compliance fields
  compliance JSON CHECK (
    compliance IS NULL OR (
      JSON_VALID(compliance) AND
      JSON_TYPE(JSON_EXTRACT(compliance, '$.gdpr_consent.has_consent')) = 'BOOLEAN'
    )
  ),

  -- Audit fields
  audit JSON NOT NULL CHECK (
    JSON_VALID(audit) AND
    JSON_EXTRACT(audit, '$.created_at') IS NOT NULL AND
    JSON_EXTRACT(audit, '$.updated_at') IS NOT NULL
  ),

  -- Conditional constraints
  CONSTRAINT chk_legal_entity_tax_id CHECK (
    legal_entity_type NOT IN ('corporation', 'llc') OR 
    JSON_EXTRACT(compliance, '$.tax_id') IS NOT NULL
  ),

  CONSTRAINT chk_public_company_stock_symbol CHECK (
    JSON_EXTRACT(business_metrics, '$.publicly_traded') != TRUE OR
    JSON_EXTRACT(business_metrics, '$.stock_symbol') IS NOT NULL
  ),

  CONSTRAINT chk_gdpr_consent_date CHECK (
    'EU' NOT IN (SELECT value FROM JSON_TABLE(
      COALESCE(JSON_EXTRACT(compliance, '$.regulatory_jurisdictions'), '[]'),
      '$[*]' COLUMNS (value VARCHAR(10) PATH '$')
    ) AS jt) OR (
      JSON_EXTRACT(compliance, '$.gdpr_consent.has_consent') = TRUE AND
      JSON_EXTRACT(compliance, '$.gdpr_consent.consent_date') IS NOT NULL
    )
  )
) WITH (
  collection_options = JSON_OBJECT(
    'validation_level', 'strict',
    'validation_action', 'error'
  )
);

-- Data quality analysis with SQL aggregations
WITH data_quality_metrics AS (
  SELECT 
    _id,
    company_name,

    -- Email validation
    CASE 
      WHEN JSON_EXTRACT(primary_contact, '$.email') REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
      THEN 1 ELSE 0 
    END as email_valid,

    -- Phone validation  
    CASE
      WHEN JSON_EXTRACT(primary_contact, '$.phone') REGEXP '^\\+?[1-9]\\d{1,14}$'
      THEN 1 ELSE 0
    END as phone_valid,

    -- Address completeness
    CASE
      WHEN JSON_EXTRACT(billing_address, '$.street1') IS NOT NULL AND
           JSON_EXTRACT(billing_address, '$.city') IS NOT NULL AND
           JSON_EXTRACT(billing_address, '$.state') IS NOT NULL AND
           JSON_EXTRACT(billing_address, '$.postal_code') IS NOT NULL
      THEN 1 ELSE 0
    END as address_complete,

    -- Tax compliance
    CASE
      WHEN legal_entity_type NOT IN ('corporation', 'llc') OR
           JSON_EXTRACT(compliance, '$.tax_id') IS NOT NULL
      THEN 1 ELSE 0
    END as tax_compliant,

    -- Data freshness
    CASE
      WHEN TIMESTAMPDIFF(DAY, 
           STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(audit, '$.updated_at')), '%Y-%m-%dT%H:%i:%s.%fZ'),
           NOW()) <= 90
      THEN 1 ELSE 0
    END as data_fresh,

    -- GDPR compliance for EU customers
    CASE
      WHEN 'EU' NOT IN (
        SELECT value FROM JSON_TABLE(
          COALESCE(JSON_EXTRACT(compliance, '$.regulatory_jurisdictions'), '[]'),
          '$[*]' COLUMNS (value VARCHAR(10) PATH '$')
        ) AS jt
      ) OR (
        JSON_EXTRACT(compliance, '$.gdpr_consent.has_consent') = TRUE AND
        JSON_EXTRACT(compliance, '$.gdpr_consent.consent_date') IS NOT NULL
      )
      THEN 1 ELSE 0
    END as gdpr_compliant

  FROM customers
),
quality_scores AS (
  SELECT *,
    -- Calculate overall quality score (0-100)
    (email_valid * 20 + phone_valid * 15 + address_complete * 25 + 
     tax_compliant * 25 + data_fresh * 15) as overall_quality_score,

    -- Quality rating classification
    CASE 
      WHEN (email_valid * 20 + phone_valid * 15 + address_complete * 25 + 
            tax_compliant * 25 + data_fresh * 15) >= 90 THEN 'EXCELLENT'
      WHEN (email_valid * 20 + phone_valid * 15 + address_complete * 25 + 
            tax_compliant * 25 + data_fresh * 15) >= 75 THEN 'GOOD'  
      WHEN (email_valid * 20 + phone_valid * 15 + address_complete * 25 + 
            tax_compliant * 25 + data_fresh * 15) >= 60 THEN 'FAIR'
      ELSE 'POOR'
    END as quality_rating

  FROM data_quality_metrics
)

SELECT 
  -- Summary statistics
  COUNT(*) as total_customers,
  AVG(overall_quality_score) as avg_quality_score,

  -- Quality distribution
  COUNT(*) FILTER (WHERE quality_rating = 'EXCELLENT') as excellent_count,
  COUNT(*) FILTER (WHERE quality_rating = 'GOOD') as good_count,
  COUNT(*) FILTER (WHERE quality_rating = 'FAIR') as fair_count, 
  COUNT(*) FILTER (WHERE quality_rating = 'POOR') as poor_count,

  -- Quality percentages
  ROUND(COUNT(*) FILTER (WHERE quality_rating = 'EXCELLENT') * 100.0 / COUNT(*), 2) as excellent_pct,
  ROUND(COUNT(*) FILTER (WHERE quality_rating = 'GOOD') * 100.0 / COUNT(*), 2) as good_pct,
  ROUND(COUNT(*) FILTER (WHERE quality_rating = 'FAIR') * 100.0 / COUNT(*), 2) as fair_pct,
  ROUND(COUNT(*) FILTER (WHERE quality_rating = 'POOR') * 100.0 / COUNT(*), 2) as poor_pct,

  -- Field-specific quality metrics
  ROUND(AVG(email_valid) * 100, 2) as email_validity_rate,
  ROUND(AVG(phone_valid) * 100, 2) as phone_validity_rate,
  ROUND(AVG(address_complete) * 100, 2) as address_completeness_rate,
  ROUND(AVG(tax_compliant) * 100, 2) as tax_compliance_rate,
  ROUND(AVG(data_fresh) * 100, 2) as data_freshness_rate,
  ROUND(AVG(gdpr_compliant) * 100, 2) as gdpr_compliance_rate,

  -- Data quality health score
  ROUND((AVG(email_valid) + AVG(phone_valid) + AVG(address_complete) + 
         AVG(tax_compliant) + AVG(data_fresh) + AVG(gdpr_compliant)) / 6 * 100, 2) as overall_health_score

FROM quality_scores;

-- Automated data quality monitoring view
CREATE VIEW data_quality_dashboard AS 
WITH real_time_quality AS (
  SELECT 
    DATE_FORMAT(STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(audit, '$.created_at')), 
                '%Y-%m-%dT%H:%i:%s.%fZ'), '%Y-%m-%d %H:00:00') as hour_bucket,

    -- Quality metrics by hour
    COUNT(*) as records_processed,

    AVG(CASE WHEN JSON_EXTRACT(primary_contact, '$.email') REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$' 
             THEN 1 ELSE 0 END) as email_validity_rate,

    AVG(CASE WHEN JSON_EXTRACT(primary_contact, '$.phone') REGEXP '^\\+?[1-9]\\d{1,14}$'
             THEN 1 ELSE 0 END) as phone_validity_rate,

    AVG(CASE WHEN JSON_EXTRACT(billing_address, '$.street1') IS NOT NULL AND
                   JSON_EXTRACT(billing_address, '$.city') IS NOT NULL
             THEN 1 ELSE 0 END) as address_completeness_rate,

    -- Compliance rates
    AVG(CASE WHEN legal_entity_type NOT IN ('corporation', 'llc') OR 
                   JSON_EXTRACT(compliance, '$.tax_id') IS NOT NULL
             THEN 1 ELSE 0 END) as tax_compliance_rate,

    -- Alert conditions
    COUNT(*) FILTER (WHERE 
      JSON_EXTRACT(primary_contact, '$.email') NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
    ) as invalid_email_count,

    COUNT(*) FILTER (WHERE
      legal_entity_type IN ('corporation', 'llc') AND
      JSON_EXTRACT(compliance, '$.tax_id') IS NULL
    ) as missing_tax_id_count

  FROM customers
  WHERE STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(audit, '$.created_at')), 
                    '%Y-%m-%dT%H:%i:%s.%fZ') >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
  GROUP BY DATE_FORMAT(STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(audit, '$.created_at')), 
                       '%Y-%m-%dT%H:%i:%s.%fZ'), '%Y-%m-%d %H:00:00')
)

SELECT 
  hour_bucket as monitoring_hour,
  records_processed,
  ROUND(email_validity_rate * 100, 2) as email_validity_pct,
  ROUND(phone_validity_rate * 100, 2) as phone_validity_pct,
  ROUND(address_completeness_rate * 100, 2) as address_completeness_pct,
  ROUND(tax_compliance_rate * 100, 2) as tax_compliance_pct,

  -- Overall quality score for the hour
  ROUND((email_validity_rate + phone_validity_rate + address_completeness_rate + tax_compliance_rate) / 4 * 100, 2) as hourly_quality_score,

  -- Issue counts
  invalid_email_count,
  missing_tax_id_count,

  -- Alert status
  CASE 
    WHEN invalid_email_count > records_processed * 0.1 THEN '🔴 High Invalid Email Rate'
    WHEN missing_tax_id_count > 0 THEN '🟠 Missing Tax IDs'
    WHEN (email_validity_rate + phone_validity_rate + address_completeness_rate + tax_compliance_rate) / 4 < 0.8 THEN '🟡 Below Quality Threshold'
    ELSE '🟢 Quality Within Target'
  END as quality_status,

  -- Recommendations
  CASE
    WHEN invalid_email_count > records_processed * 0.05 THEN 'Implement email validation at point of entry'
    WHEN missing_tax_id_count > 5 THEN 'Review tax ID collection process'  
    WHEN address_completeness_rate < 0.9 THEN 'Improve address validation workflow'
    ELSE 'Monitor quality trends'
  END as recommendation

FROM real_time_quality  
ORDER BY hour_bucket DESC;

-- Data quality remediation workflow
WITH quality_issues AS (
  SELECT 
    _id,
    company_name,

    -- Identify specific issues
    CASE WHEN JSON_EXTRACT(primary_contact, '$.email') NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
         THEN 'INVALID_EMAIL' END as email_issue,

    CASE WHEN JSON_EXTRACT(primary_contact, '$.phone') NOT REGEXP '^\\+?[1-9]\\d{1,14}$'
         THEN 'INVALID_PHONE' END as phone_issue,

    CASE WHEN JSON_EXTRACT(billing_address, '$.street1') IS NULL OR 
              JSON_EXTRACT(billing_address, '$.city') IS NULL
         THEN 'INCOMPLETE_ADDRESS' END as address_issue,

    CASE WHEN legal_entity_type IN ('corporation', 'llc') AND
              JSON_EXTRACT(compliance, '$.tax_id') IS NULL
         THEN 'MISSING_TAX_ID' END as tax_issue,

    -- Priority calculation
    CASE 
      WHEN legal_entity_type IN ('corporation', 'llc') AND 
           JSON_EXTRACT(compliance, '$.tax_id') IS NULL THEN 'HIGH'
      WHEN JSON_EXTRACT(primary_contact, '$.email') NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$' THEN 'HIGH'
      WHEN JSON_EXTRACT(billing_address, '$.street1') IS NULL THEN 'MEDIUM'
      ELSE 'LOW'
    END as issue_priority

  FROM customers
  WHERE 
    JSON_EXTRACT(primary_contact, '$.email') NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$' OR
    JSON_EXTRACT(primary_contact, '$.phone') NOT REGEXP '^\\+?[1-9]\\d{1,14}$' OR
    JSON_EXTRACT(billing_address, '$.street1') IS NULL OR
    JSON_EXTRACT(billing_address, '$.city') IS NULL OR
    (legal_entity_type IN ('corporation', 'llc') AND JSON_EXTRACT(compliance, '$.tax_id') IS NULL)
)

SELECT 
  _id as customer_id,
  company_name,

  -- Consolidate issues
  CONCAT_WS(', ', 
    email_issue,
    phone_issue, 
    address_issue,
    tax_issue
  ) as identified_issues,

  issue_priority,

  -- Recommended actions
  CASE issue_priority
    WHEN 'HIGH' THEN 'Immediate manual review and correction required'
    WHEN 'MEDIUM' THEN 'Schedule for data completion workflow'
    WHEN 'LOW' THEN 'Include in next batch quality improvement'
  END as recommended_action,

  -- Auto-remediation possibility
  CASE 
    WHEN phone_issue = 'INVALID_PHONE' AND 
         JSON_EXTRACT(primary_contact, '$.phone') REGEXP '^[0-9]{10}$' THEN 'AUTO_FORMAT_PHONE'
    WHEN address_issue = 'INCOMPLETE_ADDRESS' AND
         JSON_EXTRACT(billing_address, '$.street1') IS NOT NULL THEN 'REQUEST_COMPLETION'
    ELSE 'MANUAL_REVIEW'
  END as remediation_type,

  NOW() as identified_at

FROM quality_issues
ORDER BY 
  CASE issue_priority 
    WHEN 'HIGH' THEN 1 
    WHEN 'MEDIUM' THEN 2 
    ELSE 3 
  END,
  company_name;

-- QueryLeaf provides comprehensive schema validation capabilities:
-- 1. SQL-familiar constraint syntax for MongoDB document validation
-- 2. Advanced JSON validation with nested field constraints
-- 3. Conditional validation rules based on document context
-- 4. Real-time data quality monitoring with SQL aggregations
-- 5. Automated quality scoring and rating classification
-- 6. Data quality dashboard views with trend analysis
-- 7. Compliance reporting with regulatory requirement tracking
-- 8. Quality issue identification and remediation workflows
-- 9. Integration with MongoDB's native validation features
-- 10. Familiar SQL patterns for complex data governance requirements

Best Practices for Schema Validation Implementation

Validation Strategy Design

Essential practices for effective MongoDB schema validation:

Progressive Validation: Start with warning-level validation and gradually enforce strict rules
Conditional Logic: Use document context to apply appropriate validation rules
Business Rule Integration: Align validation rules with actual business requirements
Performance Consideration: Balance validation thoroughness with write performance
Error Messaging: Provide clear, actionable error messages for validation failures
Version Management: Plan for schema evolution and backward compatibility

Data Quality Management

Implement comprehensive data quality monitoring for production environments:

Continuous Monitoring: Track data quality metrics in real-time with automated dashboards
Quality Scoring: Develop standardized quality scores across different document types
Remediation Workflows: Implement automated and manual remediation processes
Compliance Tracking: Monitor regulatory compliance requirements continuously
Historical Analysis: Track data quality trends over time for improvement insights
Integration Patterns: Coordinate validation across multiple data sources and systems

Conclusion

MongoDB Schema Validation provides comprehensive data quality management capabilities that eliminate the complexity and rigidity of traditional database constraint systems. The combination of flexible document validation, sophisticated business rule enforcement, and automated quality monitoring enables enterprise-grade data governance that adapts to evolving requirements while maintaining strict compliance standards.

Key Schema Validation benefits include:

Flexible Validation: Document-based validation that adapts to varying data structures and requirements
Business Logic Integration: Advanced conditional validation based on document context and business rules
Automated Quality Management: Real-time quality monitoring with automated remediation workflows
Compliance Reporting: Comprehensive regulatory compliance tracking and audit capabilities
Performance Optimization: Efficient validation that scales with data volume and complexity
Developer Productivity: SQL-familiar validation patterns that reduce implementation complexity

Whether you're building financial services applications, healthcare systems, e-commerce platforms, or any enterprise application requiring strict data quality standards, MongoDB Schema Validation with QueryLeaf's SQL-familiar interface provides the foundation for robust data governance. This combination enables sophisticated validation strategies while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically translates SQL constraint definitions into MongoDB validation schemas, providing familiar CREATE TABLE syntax with CHECK constraints, conditional validation rules, and data quality monitoring queries. Advanced validation patterns, compliance reporting, and automated remediation workflows are seamlessly accessible through SQL constructs, making enterprise data governance both powerful and approachable for SQL-oriented teams.

The integration of flexible validation capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both strict data quality enforcement and adaptive schema evolution, ensuring your data governance solutions remain both effective and maintainable as requirements evolve and data volumes scale.

November 15, 2025
25 min read

MongoDB Time-Series Collections for IoT Analytics: High-Performance Data Processing and Real-Time Analytics with SQL-Compatible Operations

Modern IoT applications generate massive volumes of time-stamped sensor data, requiring specialized database architectures that can efficiently ingest, store, and analyze temporal data at scale. Traditional relational databases struggle with the unique characteristics of time-series workloads: high write throughput, time-based queries, and analytical operations across large temporal ranges.

MongoDB Time-Series Collections provide native support for temporal data patterns with optimized storage engines, intelligent compression algorithms, and high-performance analytical capabilities specifically designed for IoT, monitoring, and analytics use cases. Unlike generic document collections or traditional time-series databases that require complex sharding strategies, Time-Series Collections automatically optimize storage layout, indexing, and query execution for temporal data patterns.

The Traditional Time-Series Data Challenge

Managing time-series data with conventional database approaches creates significant performance and operational challenges:

-- Traditional PostgreSQL time-series implementation - complex partitioning and maintenance

-- Sensor readings table with time-based partitioning
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL,
    device_id VARCHAR(50) NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    value NUMERIC(10,4) NOT NULL,
    quality_score SMALLINT DEFAULT 100,

    -- Location and device metadata
    device_location VARCHAR(100),
    facility_id VARCHAR(50),
    building_id VARCHAR(50),
    floor_id VARCHAR(50),

    -- Environmental context
    ambient_temperature NUMERIC(5,2),
    humidity_percent NUMERIC(5,2),
    atmospheric_pressure NUMERIC(7,2),

    -- System metadata
    ingestion_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    data_source VARCHAR(50),
    processing_pipeline_version VARCHAR(20),

    -- Constraint for partitioning
    PRIMARY KEY (reading_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Create monthly partitions (requires ongoing maintenance)
CREATE TABLE sensor_readings_2025_01 PARTITION OF sensor_readings
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE sensor_readings_2025_02 PARTITION OF sensor_readings
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
CREATE TABLE sensor_readings_2025_03 PARTITION OF sensor_readings
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');
-- ... (requires creating new partitions monthly)

-- Indexes for time-series query patterns
CREATE INDEX idx_readings_device_time ON sensor_readings (device_id, timestamp DESC);
CREATE INDEX idx_readings_type_time ON sensor_readings (sensor_type, timestamp DESC);
CREATE INDEX idx_readings_facility_time ON sensor_readings (facility_id, timestamp DESC);
CREATE INDEX idx_readings_timestamp ON sensor_readings (timestamp DESC);

-- Sensor metadata table for device information
CREATE TABLE sensor_devices (
    device_id VARCHAR(50) PRIMARY KEY,
    device_name VARCHAR(200) NOT NULL,
    device_type VARCHAR(100) NOT NULL,
    manufacturer VARCHAR(100),
    model VARCHAR(100),
    firmware_version VARCHAR(50),

    -- Installation details
    installation_date DATE NOT NULL,
    location_description TEXT,
    coordinates POINT,

    -- Configuration
    sampling_interval_seconds INTEGER DEFAULT 300,
    measurement_units JSONB,
    calibration_data JSONB,
    alert_thresholds JSONB,

    -- Status tracking
    is_active BOOLEAN DEFAULT true,
    last_communication TIMESTAMPTZ,
    battery_level_percent SMALLINT,
    signal_strength_dbm INTEGER,

    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

-- Complex time-series aggregation query with window functions
WITH hourly_aggregations AS (
    SELECT 
        device_id,
        sensor_type,
        facility_id,
        DATE_TRUNC('hour', timestamp) as hour_bucket,

        -- Statistical aggregations
        COUNT(*) as reading_count,
        AVG(value) as avg_value,
        MIN(value) as min_value,
        MAX(value) as max_value,
        STDDEV(value) as stddev_value,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median_value,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95_value,

        -- Quality metrics
        AVG(quality_score) as avg_quality,
        COUNT(*) FILTER (WHERE quality_score < 80) as poor_quality_count,

        -- Environmental correlations
        AVG(ambient_temperature) as avg_ambient_temp,
        AVG(humidity_percent) as avg_humidity,

        -- Time-based metrics
        MAX(timestamp) as latest_reading,
        MIN(timestamp) as earliest_reading

    FROM sensor_readings
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
        AND timestamp < CURRENT_TIMESTAMP
        AND quality_score >= 50  -- Filter out very poor quality readings
    GROUP BY device_id, sensor_type, facility_id, DATE_TRUNC('hour', timestamp)
),

device_performance_metrics AS (
    SELECT 
        ha.*,
        sd.device_name,
        sd.device_type,
        sd.manufacturer,
        sd.sampling_interval_seconds,
        sd.location_description,

        -- Performance calculations
        CASE 
            WHEN ha.reading_count < (3600 / sd.sampling_interval_seconds) * 0.8 THEN 'Poor'
            WHEN ha.reading_count < (3600 / sd.sampling_interval_seconds) * 0.95 THEN 'Fair'  
            ELSE 'Good'
        END as data_completeness,

        -- Anomaly detection using z-score
        ABS(ha.avg_value - LAG(ha.avg_value, 1) OVER (
            PARTITION BY ha.device_id, ha.sensor_type 
            ORDER BY ha.hour_bucket
        )) / NULLIF(ha.stddev_value, 0) as hour_over_hour_zscore,

        -- Rate of change analysis
        (ha.avg_value - LAG(ha.avg_value, 1) OVER (
            PARTITION BY ha.device_id, ha.sensor_type 
            ORDER BY ha.hour_bucket
        )) as hour_over_hour_change,

        -- Moving averages for trend analysis
        AVG(ha.avg_value) OVER (
            PARTITION BY ha.device_id, ha.sensor_type 
            ORDER BY ha.hour_bucket 
            ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
        ) as moving_avg_24h,

        -- Deviation from baseline
        ABS(ha.avg_value - AVG(ha.avg_value) OVER (
            PARTITION BY ha.device_id, ha.sensor_type, EXTRACT(hour FROM ha.hour_bucket)
        )) as deviation_from_baseline

    FROM hourly_aggregations ha
    JOIN sensor_devices sd ON ha.device_id = sd.device_id
),

alert_analysis AS (
    SELECT 
        dpm.*,

        -- Alert conditions
        CASE 
            WHEN dpm.data_completeness = 'Poor' THEN 'Data Availability Alert'
            WHEN dpm.hour_over_hour_zscore > 3 THEN 'Anomaly Alert'
            WHEN dpm.avg_quality < 70 THEN 'Data Quality Alert'
            WHEN dpm.deviation_from_baseline > dpm.stddev_value * 2 THEN 'Baseline Deviation Alert'
            ELSE NULL
        END as alert_type,

        -- Alert priority
        CASE 
            WHEN dpm.data_completeness = 'Poor' AND dpm.avg_quality < 60 THEN 'Critical'
            WHEN dpm.hour_over_hour_zscore > 4 THEN 'High'
            WHEN dpm.deviation_from_baseline > dpm.stddev_value * 3 THEN 'High'
            WHEN dpm.data_completeness = 'Fair' THEN 'Medium'
            ELSE 'Low'
        END as alert_priority

    FROM device_performance_metrics dpm
)

SELECT 
    device_id,
    device_name,
    sensor_type,
    facility_id,
    TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as analysis_hour,

    -- Core metrics
    reading_count,
    ROUND(avg_value::numeric, 3) as average_value,
    ROUND(min_value::numeric, 3) as minimum_value,
    ROUND(max_value::numeric, 3) as maximum_value,
    ROUND(stddev_value::numeric, 3) as std_deviation,
    ROUND(median_value::numeric, 3) as median_value,

    -- Performance indicators
    data_completeness,
    ROUND(avg_quality::numeric, 1) as average_quality_score,
    poor_quality_count,

    -- Analytical insights
    ROUND(hour_over_hour_change::numeric, 4) as hourly_change,
    ROUND(hour_over_hour_zscore::numeric, 2) as change_zscore,
    ROUND(moving_avg_24h::numeric, 3) as daily_moving_average,
    ROUND(deviation_from_baseline::numeric, 3) as baseline_deviation,

    -- Environmental factors
    ROUND(avg_ambient_temp::numeric, 1) as ambient_temperature,
    ROUND(avg_humidity::numeric, 1) as humidity_percent,

    -- Alerts and notifications
    alert_type,
    alert_priority,

    -- Data freshness
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - latest_reading)) / 60 as minutes_since_last_reading

FROM alert_analysis
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
    facility_id, 
    device_id, 
    sensor_type,
    hour_bucket DESC;

-- Problems with traditional time-series approaches:
-- 1. Complex manual partitioning requiring ongoing maintenance and planning
-- 2. Limited compression and storage optimization for temporal data patterns
-- 3. Expensive analytical queries across large time ranges and multiple partitions
-- 4. Manual index management for various time-based query patterns
-- 5. Difficult schema evolution as IoT requirements change
-- 6. Limited support for hierarchical time-based aggregations
-- 7. Complex data lifecycle management and archival strategies
-- 8. Poor performance for high-frequency data ingestion and concurrent analytics
-- 9. Expensive infrastructure scaling for time-series workloads
-- 10. Limited real-time aggregation capabilities for streaming analytics

MongoDB Time-Series Collections provide optimized temporal data management:

// MongoDB Time-Series Collections - native high-performance temporal data management
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('iot_analytics_platform');

// Advanced Time-Series Collection Management
class MongoTimeSeriesManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.aggregationPipelines = new Map();
    this.realtimeStreams = new Map();
  }

  async createOptimizedTimeSeriesCollections() {
    console.log('Creating optimized time-series collections for IoT analytics...');

    // Primary sensor readings collection
    const sensorReadingsSpec = {
      timeseries: {
        timeField: "timestamp",
        metaField: "device",
        granularity: "minutes"  // Optimizes for minute-level bucketing
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 * 2  // 2 years retention
    };

    await this.db.createCollection('sensor_readings', sensorReadingsSpec);

    // Device heartbeat collection (higher frequency data)
    const heartbeatSpec = {
      timeseries: {
        timeField: "timestamp", 
        metaField: "device",
        granularity: "seconds"  // Optimizes for second-level data
      },
      expireAfterSeconds: 60 * 60 * 24 * 30  // 30 days retention
    };

    await this.db.createCollection('device_heartbeat', heartbeatSpec);

    // Aggregated analytics collection (lower frequency, longer retention)
    const analyticsSpec = {
      timeseries: {
        timeField: "window_start",
        metaField: "aggregation_metadata", 
        granularity: "hours"  // Optimizes for hourly aggregations
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 * 5  // 5 years retention
    };

    await this.db.createCollection('analytics_aggregations', analyticsSpec);

    // Create collections references
    this.collections.set('readings', this.db.collection('sensor_readings'));
    this.collections.set('heartbeat', this.db.collection('device_heartbeat'));
    this.collections.set('analytics', this.db.collection('analytics_aggregations'));

    // Create supporting indexes for efficient queries
    await this.createTimeSeriesIndexes();

    console.log('✅ Time-series collections created with optimal configuration');
    return this.collections;
  }

  async createTimeSeriesIndexes() {
    console.log('Creating optimized indexes for time-series query patterns...');

    const readingsCollection = this.collections.get('readings');

    // Compound indexes for common query patterns
    await readingsCollection.createIndexes([
      {
        key: { "device.device_id": 1, "timestamp": -1 },
        name: "idx_device_time_desc",
        background: true
      },
      {
        key: { "device.sensor_type": 1, "timestamp": -1 }, 
        name: "idx_sensor_type_time",
        background: true
      },
      {
        key: { "device.facility_id": 1, "device.sensor_type": 1, "timestamp": -1 },
        name: "idx_facility_sensor_time",
        background: true
      },
      {
        key: { "measurements.value": 1, "timestamp": -1 },
        name: "idx_value_time_range",
        background: true
      }
    ]);

    console.log('✅ Time-series indexes created');
  }

  async ingestSensorData(sensorReadings) {
    console.log(`Ingesting ${sensorReadings.length} sensor readings...`);

    const readingsCollection = this.collections.get('readings');
    const batchSize = 10000;
    let totalIngested = 0;

    // Process readings in optimized batches
    for (let i = 0; i < sensorReadings.length; i += batchSize) {
      const batch = sensorReadings.slice(i, i + batchSize);

      try {
        // Transform readings to time-series document format
        const timeSeriesDocuments = batch.map(reading => ({
          timestamp: new Date(reading.timestamp),

          // Device metadata (metaField)
          device: {
            device_id: reading.device_id,
            sensor_type: reading.sensor_type,
            facility_id: reading.facility_id,
            building_id: reading.building_id,
            floor_id: reading.floor_id,
            location: reading.location,

            // Device specifications
            manufacturer: reading.manufacturer,
            model: reading.model,
            firmware_version: reading.firmware_version,

            // Operational context
            sampling_interval: reading.sampling_interval,
            calibration_date: reading.calibration_date,
            maintenance_schedule: reading.maintenance_schedule
          },

          // Measurement data (time-varying fields)
          measurements: {
            value: reading.value,
            unit: reading.unit,
            quality_score: reading.quality_score || 100,

            // Multiple sensor values (for multi-sensor devices)
            ...(reading.secondary_values && {
              secondary_measurements: reading.secondary_values
            })
          },

          // Environmental context
          environment: {
            ambient_temperature: reading.ambient_temperature,
            humidity: reading.humidity,
            atmospheric_pressure: reading.atmospheric_pressure,
            light_level: reading.light_level,
            noise_level: reading.noise_level
          },

          // System metadata
          system: {
            ingestion_timestamp: new Date(),
            data_source: reading.data_source || 'iot-gateway',
            processing_pipeline: reading.processing_pipeline || 'v1.0',
            batch_id: reading.batch_id,

            // Quality indicators
            transmission_latency_ms: reading.transmission_latency_ms,
            signal_strength: reading.signal_strength,
            battery_level: reading.battery_level
          },

          // Derived analytics (computed during ingestion)
          analytics: {
            is_anomaly: this.detectSimpleAnomaly(reading),
            trend_direction: this.calculateTrendDirection(reading),
            data_completeness_score: this.calculateCompletenessScore(reading)
          }
        }));

        // Bulk insert with ordered: false for better performance
        const result = await readingsCollection.insertMany(timeSeriesDocuments, {
          ordered: false,
          writeConcern: { w: 1, j: false }  // Optimized for throughput
        });

        totalIngested += result.insertedCount;

        if (i % (batchSize * 10) === 0) {
          console.log(`Ingested ${totalIngested}/${sensorReadings.length} readings...`);
        }

      } catch (error) {
        console.error(`Error ingesting batch starting at index ${i}:`, error);
        continue;
      }
    }

    console.log(`✅ Ingestion completed: ${totalIngested}/${sensorReadings.length} readings`);
    return { totalIngested, totalReceived: sensorReadings.length };
  }

  async performRealTimeAnalytics(deviceId, timeRange = '1h', options = {}) {
    console.log(`Performing real-time analytics for device ${deviceId}...`);

    const {
      aggregationLevel = 'minute',
      includeAnomalyDetection = true,
      calculateTrends = true,
      environmentalCorrelation = true
    } = options;

    const readingsCollection = this.collections.get('readings');
    const endTime = new Date();
    const startTime = new Date(endTime.getTime() - this.parseTimeRange(timeRange));

    try {
      const analyticalPipeline = [
        // Stage 1: Filter by device and time range
        {
          $match: {
            "device.device_id": deviceId,
            "timestamp": {
              $gte: startTime,
              $lte: endTime
            },
            "measurements.quality_score": { $gte: 70 }  // Filter poor quality data
          }
        },

        // Stage 2: Time-based bucketing
        {
          $group: {
            _id: {
              time_bucket: {
                $dateTrunc: {
                  date: "$timestamp",
                  unit: aggregationLevel,
                  ...(aggregationLevel === 'minute' && { binSize: 5 })  // 5-minute buckets
                }
              },
              sensor_type: "$device.sensor_type",
              facility_id: "$device.facility_id"
            },

            // Statistical aggregations
            reading_count: { $sum: 1 },
            avg_value: { $avg: "$measurements.value" },
            min_value: { $min: "$measurements.value" },
            max_value: { $max: "$measurements.value" },
            sum_value: { $sum: "$measurements.value" },

            // Advanced statistical measures
            values: { $push: "$measurements.value" },  // For percentile calculations

            // Quality metrics
            avg_quality: { $avg: "$measurements.quality_score" },
            poor_quality_count: {
              $sum: {
                $cond: [{ $lt: ["$measurements.quality_score", 80] }, 1, 0]
              }
            },

            // Environmental correlations
            avg_ambient_temp: { $avg: "$environment.ambient_temperature" },
            avg_humidity: { $avg: "$environment.humidity" },
            avg_pressure: { $avg: "$environment.atmospheric_pressure" },

            // System health indicators
            avg_signal_strength: { $avg: "$system.signal_strength" },
            avg_battery_level: { $avg: "$system.battery_level" },
            avg_transmission_latency: { $avg: "$system.transmission_latency_ms" },

            // Time boundaries
            first_timestamp: { $min: "$timestamp" },
            last_timestamp: { $max: "$timestamp" },

            // Device metadata (take first occurrence)
            device_metadata: { $first: "$device" }
          }
        },

        // Stage 3: Calculate advanced statistics
        {
          $addFields: {
            // Statistical measures
            value_range: { $subtract: ["$max_value", "$min_value"] },
            data_completeness: {
              $divide: [
                "$reading_count",
                { $divide: [
                  { $subtract: ["$last_timestamp", "$first_timestamp"] },
                  1000 * 60 * (aggregationLevel === 'minute' ? 5 : 1)  // Expected interval
                ]}
              ]
            },

            // Percentile calculations (approximated)
            median_value: {
              $arrayElemAt: [
                { $sortArray: { input: "$values", sortBy: 1 } },
                { $floor: { $multiply: [{ $size: "$values" }, 0.5] } }
              ]
            },
            p95_value: {
              $arrayElemAt: [
                { $sortArray: { input: "$values", sortBy: 1 } },
                { $floor: { $multiply: [{ $size: "$values" }, 0.95] } }
              ]
            },

            // Quality scoring
            quality_score: {
              $multiply: [
                { $divide: ["$avg_quality", 100] },
                { $min: ["$data_completeness", 1] }
              ]
            }
          }
        },

        // Stage 4: Add time-based analytical features
        {
          $setWindowFields: {
            partitionBy: { 
              sensor_type: "$_id.sensor_type",
              facility_id: "$_id.facility_id"
            },
            sortBy: { "_id.time_bucket": 1 },
            output: {
              // Moving averages
              moving_avg_3: {
                $avg: "$avg_value",
                window: { range: [-2, 0], unit: "position" }
              },
              moving_avg_6: {
                $avg: "$avg_value", 
                window: { range: [-5, 0], unit: "position" }
              },

              // Rate of change
              previous_avg: {
                $shift: { 
                  output: "$avg_value", 
                  by: -1 
                }
              },

              // Trend analysis
              trend_slope: {
                $linearFill: "$avg_value"
              }
            }
          }
        },

        // Stage 5: Calculate derived analytics
        {
          $addFields: {
            // Rate of change calculations
            rate_of_change: {
              $cond: {
                if: { $ne: ["$previous_avg", null] },
                then: { $subtract: ["$avg_value", "$previous_avg"] },
                else: 0
              }
            },

            // Anomaly detection (simple z-score based)
            is_potential_anomaly: {
              $gt: [
                { $abs: { $subtract: ["$avg_value", "$moving_avg_6"] } },
                { $multiply: [{ $sqrt: "$value_range" }, 2] }  // Simple threshold
              ]
            },

            // Trend classification
            trend_direction: {
              $switch: {
                branches: [
                  { 
                    case: { $gt: ["$rate_of_change", { $multiply: ["$value_range", 0.05] }] },
                    then: "increasing"
                  },
                  { 
                    case: { $lt: ["$rate_of_change", { $multiply: ["$value_range", -0.05] }] },
                    then: "decreasing" 
                  }
                ],
                default: "stable"
              }
            },

            // Performance classification
            performance_status: {
              $switch: {
                branches: [
                  {
                    case: { 
                      $and: [
                        { $gte: ["$quality_score", 0.9] },
                        { $gte: ["$data_completeness", 0.95] }
                      ]
                    },
                    then: "excellent"
                  },
                  {
                    case: {
                      $and: [
                        { $gte: ["$quality_score", 0.7] },
                        { $gte: ["$data_completeness", 0.8] }
                      ]
                    },
                    then: "good"
                  },
                  {
                    case: {
                      $or: [
                        { $lt: ["$quality_score", 0.5] },
                        { $lt: ["$data_completeness", 0.6] }
                      ]
                    },
                    then: "poor"
                  }
                ],
                default: "fair"
              }
            }
          }
        },

        // Stage 6: Final projection and formatting
        {
          $project: {
            _id: 0,
            time_bucket: "$_id.time_bucket",
            sensor_type: "$_id.sensor_type",
            facility_id: "$_id.facility_id",
            device_id: deviceId,

            // Core metrics
            reading_count: 1,
            avg_value: { $round: ["$avg_value", 3] },
            min_value: { $round: ["$min_value", 3] },
            max_value: { $round: ["$max_value", 3] },
            median_value: { $round: ["$median_value", 3] },
            p95_value: { $round: ["$p95_value", 3] },
            value_range: { $round: ["$value_range", 3] },

            // Quality and completeness
            data_completeness: { $round: ["$data_completeness", 3] },
            quality_score: { $round: ["$quality_score", 3] },
            poor_quality_count: 1,

            // Analytical insights
            moving_avg_3: { $round: ["$moving_avg_3", 3] },
            moving_avg_6: { $round: ["$moving_avg_6", 3] },
            rate_of_change: { $round: ["$rate_of_change", 4] },
            trend_direction: 1,
            is_potential_anomaly: 1,
            performance_status: 1,

            // Environmental factors
            environmental_context: {
              ambient_temperature: { $round: ["$avg_ambient_temp", 1] },
              humidity: { $round: ["$avg_humidity", 1] },
              atmospheric_pressure: { $round: ["$avg_pressure", 1] }
            },

            // System health
            system_health: {
              signal_strength: { $round: ["$avg_signal_strength", 1] },
              battery_level: { $round: ["$avg_battery_level", 1] },
              transmission_latency: { $round: ["$avg_transmission_latency", 1] }
            },

            // Time boundaries
            time_range: {
              start: "$first_timestamp",
              end: "$last_timestamp",
              duration_minutes: {
                $round: [
                  { $divide: [
                    { $subtract: ["$last_timestamp", "$first_timestamp"] },
                    60000
                  ]}, 
                  1
                ]
              }
            },

            // Device context
            device_metadata: "$device_metadata"
          }
        },

        // Stage 7: Sort by time
        {
          $sort: { "time_bucket": 1 }
        }
      ];

      const startAnalysis = Date.now();
      const analyticsResults = await readingsCollection.aggregate(analyticalPipeline, {
        allowDiskUse: true,
        maxTimeMS: 30000  // 30 second timeout
      }).toArray();
      const analysisTime = Date.now() - startAnalysis;

      console.log(`✅ Real-time analytics completed in ${analysisTime}ms`);
      console.log(`Generated ${analyticsResults.length} analytical data points`);

      // Calculate summary statistics
      const summary = this.calculateAnalyticsSummary(analyticsResults);

      return {
        deviceId,
        timeRange,
        analysisTime: analysisTime,
        dataPoints: analyticsResults.length,
        analytics: analyticsResults,
        summary: summary
      };

    } catch (error) {
      console.error('Error performing real-time analytics:', error);
      throw error;
    }
  }

  calculateAnalyticsSummary(analyticsResults) {
    if (analyticsResults.length === 0) return {};

    const summary = {
      totalReadings: analyticsResults.reduce((sum, point) => sum + point.reading_count, 0),
      averageQuality: analyticsResults.reduce((sum, point) => sum + point.quality_score, 0) / analyticsResults.length,
      averageCompleteness: analyticsResults.reduce((sum, point) => sum + point.data_completeness, 0) / analyticsResults.length,

      anomalyCount: analyticsResults.filter(point => point.is_potential_anomaly).length,
      trendDistribution: {
        increasing: analyticsResults.filter(p => p.trend_direction === 'increasing').length,
        decreasing: analyticsResults.filter(p => p.trend_direction === 'decreasing').length, 
        stable: analyticsResults.filter(p => p.trend_direction === 'stable').length
      },

      performanceDistribution: {
        excellent: analyticsResults.filter(p => p.performance_status === 'excellent').length,
        good: analyticsResults.filter(p => p.performance_status === 'good').length,
        fair: analyticsResults.filter(p => p.performance_status === 'fair').length,
        poor: analyticsResults.filter(p => p.performance_status === 'poor').length
      }
    };

    return summary;
  }

  async createRealTimeAggregations() {
    console.log('Setting up real-time aggregation pipelines...');

    const readingsCollection = this.collections.get('readings');
    const analyticsCollection = this.collections.get('analytics');

    // Create change stream for real-time processing
    const changeStream = readingsCollection.watch([
      {
        $match: {
          'fullDocument.measurements.quality_score': { $gte: 80 }
        }
      }
    ], {
      fullDocument: 'updateLookup'
    });

    changeStream.on('change', async (change) => {
      if (change.operationType === 'insert') {
        await this.processRealTimeUpdate(change.fullDocument);
      }
    });

    this.realtimeStreams.set('readings_processor', changeStream);
    console.log('✅ Real-time aggregation pipelines active');
  }

  async processRealTimeUpdate(newReading) {
    // Process individual readings for real-time dashboards
    const deviceId = newReading.device.device_id;
    const sensorType = newReading.device.sensor_type;

    // Update running statistics
    await this.updateRunningStatistics(deviceId, sensorType, newReading);

    // Check for anomalies
    const anomalyCheck = await this.checkForAnomalies(deviceId, newReading);
    if (anomalyCheck.isAnomaly) {
      await this.handleAnomalyAlert(deviceId, anomalyCheck);
    }
  }

  async updateRunningStatistics(deviceId, sensorType, reading) {
    // Update minute-level running statistics for real-time dashboards
    const analyticsCollection = this.collections.get('analytics');
    const currentMinute = new Date();
    currentMinute.setSeconds(0, 0);

    await analyticsCollection.updateOne(
      {
        "aggregation_metadata.device_id": deviceId,
        "aggregation_metadata.sensor_type": sensorType,
        "window_start": currentMinute
      },
      {
        $inc: {
          "metrics.reading_count": 1,
          "metrics.value_sum": reading.measurements.value
        },
        $min: { "metrics.min_value": reading.measurements.value },
        $max: { "metrics.max_value": reading.measurements.value },
        $push: {
          "metrics.recent_values": {
            $each: [reading.measurements.value],
            $slice: -100  // Keep last 100 values for rolling calculations
          }
        },
        $setOnInsert: {
          aggregation_metadata: {
            device_id: deviceId,
            sensor_type: sensorType,
            facility_id: reading.device.facility_id,
            aggregation_type: "real_time_minute"
          },
          window_start: currentMinute,
          created_at: new Date()
        }
      },
      { upsert: true }
    );
  }

  async checkForAnomalies(deviceId, reading) {
    // Simple anomaly detection based on recent history
    const readingsCollection = this.collections.get('readings');
    const lookbackTime = new Date(reading.timestamp.getTime() - (60 * 60 * 1000)); // 1 hour lookback

    const recentStats = await readingsCollection.aggregate([
      {
        $match: {
          "device.device_id": deviceId,
          "device.sensor_type": reading.device.sensor_type,
          "timestamp": { $gte: lookbackTime, $lt: reading.timestamp }
        }
      },
      {
        $group: {
          _id: null,
          avg_value: { $avg: "$measurements.value" },
          stddev_value: { $stdDevPop: "$measurements.value" },
          count: { $sum: 1 }
        }
      }
    ]).toArray();

    if (recentStats.length === 0 || recentStats[0].count < 10) {
      return { isAnomaly: false, reason: 'insufficient_history' };
    }

    const stats = recentStats[0];
    const currentValue = reading.measurements.value;
    const zScore = Math.abs(currentValue - stats.avg_value) / (stats.stddev_value || 1);

    const isAnomaly = zScore > 3;  // 3-sigma threshold

    return {
      isAnomaly,
      zScore,
      currentValue,
      historicalAverage: stats.avg_value,
      historicalStdDev: stats.stddev_value,
      reason: isAnomaly ? 'statistical_outlier' : 'normal_variation'
    };
  }

  async handleAnomalyAlert(deviceId, anomalyDetails) {
    console.log(`🚨 Anomaly detected for device ${deviceId}:`);
    console.log(`  Z-Score: ${anomalyDetails.zScore.toFixed(2)}`);
    console.log(`  Current Value: ${anomalyDetails.currentValue}`);
    console.log(`  Historical Average: ${anomalyDetails.historicalAverage.toFixed(2)}`);

    // Store anomaly record
    await this.db.collection('anomaly_alerts').insertOne({
      device_id: deviceId,
      detection_timestamp: new Date(),
      anomaly_details: anomalyDetails,
      alert_status: 'active',
      severity: anomalyDetails.zScore > 5 ? 'critical' : 'warning'
    });
  }

  // Utility methods
  parseTimeRange(timeRange) {
    const timeMap = {
      '1h': 60 * 60 * 1000,
      '6h': 6 * 60 * 60 * 1000,
      '24h': 24 * 60 * 60 * 1000,
      '7d': 7 * 24 * 60 * 60 * 1000,
      '30d': 30 * 24 * 60 * 60 * 1000
    };
    return timeMap[timeRange] || timeMap['1h'];
  }

  detectSimpleAnomaly(reading) {
    // Placeholder for simple anomaly detection during ingestion
    return false;
  }

  calculateTrendDirection(reading) {
    // Placeholder for trend calculation during ingestion
    return 'stable';
  }

  calculateCompletenessScore(reading) {
    // Calculate data completeness based on expected vs actual fields
    const requiredFields = ['device_id', 'sensor_type', 'value', 'timestamp'];
    const presentFields = requiredFields.filter(field => reading[field] != null);
    return presentFields.length / requiredFields.length;
  }

  async generatePerformanceReport() {
    console.log('Generating time-series performance report...');

    const collections = ['sensor_readings', 'device_heartbeat', 'analytics_aggregations'];
    const report = {
      generated_at: new Date(),
      collections: {}
    };

    for (const collectionName of collections) {
      try {
        const stats = await this.db.runCommand({ collStats: collectionName });
        report.collections[collectionName] = {
          documentCount: stats.count,
          storageSize: stats.storageSize,
          avgObjSize: stats.avgObjSize,
          totalIndexSize: stats.totalIndexSize,
          compressionRatio: stats.storageSize > 0 ? (stats.size / stats.storageSize).toFixed(2) : 0
        };
      } catch (error) {
        report.collections[collectionName] = { error: error.message };
      }
    }

    return report;
  }

  async shutdown() {
    console.log('Shutting down time-series manager...');

    // Close change streams
    for (const [name, stream] of this.realtimeStreams) {
      await stream.close();
      console.log(`✅ Closed stream: ${name}`);
    }

    await this.client.close();
    console.log('Time-series manager shutdown completed');
  }
}

// Export the time-series manager
module.exports = { MongoTimeSeriesManager };

// Benefits of MongoDB Time-Series Collections:
// - Automatic storage optimization and compression for temporal data patterns
// - Native support for time-based bucketing and aggregations without manual partitioning
// - Intelligent indexing strategies optimized for time-series query patterns
// - Built-in data lifecycle management with TTL (time-to-live) capabilities
// - High-performance ingestion with optimized write operations for time-series workloads
// - Advanced analytical capabilities with window functions and statistical aggregations
// - Real-time change streams for immediate processing of incoming sensor data
// - Flexible schema evolution without complex migration strategies
// - Integrated anomaly detection and alerting capabilities
// - SQL-compatible analytical operations through QueryLeaf integration

Understanding MongoDB Time-Series Architecture

Advanced Analytics Patterns for IoT Data

MongoDB Time-Series Collections enable sophisticated analytical patterns for IoT applications:

// Advanced IoT analytics patterns with MongoDB Time-Series Collections
class IoTAnalyticsProcessor {
  constructor(db) {
    this.db = db;
    this.analyticsCache = new Map();
    this.alertThresholds = new Map();
  }

  async implementAdvancedAnalytics() {
    console.log('Implementing advanced IoT analytics patterns...');

    // Pattern 1: Hierarchical time-series aggregations
    await this.createHierarchicalAggregations();

    // Pattern 2: Cross-device correlation analysis
    await this.implementCrossDeviceAnalysis();

    // Pattern 3: Predictive maintenance analytics
    await this.setupPredictiveAnalytics();

    // Pattern 4: Real-time dashboard feeds
    await this.createRealTimeDashboards();

    console.log('Advanced analytics patterns implemented');
  }

  async createHierarchicalAggregations() {
    console.log('Creating hierarchical time-series aggregations...');

    const readingsCollection = this.db.collection('sensor_readings');

    // Multi-level time aggregation pipeline
    const hierarchicalPipeline = [
      {
        $match: {
          "timestamp": {
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) // Last 24 hours
          },
          "measurements.quality_score": { $gte: 70 }
        }
      },

      // Create multiple time bucket levels
      {
        $facet: {
          // Minute-level aggregations
          minutely: [
            {
              $group: {
                _id: {
                  facility: "$device.facility_id",
                  building: "$device.building_id",
                  sensor_type: "$device.sensor_type",
                  minute: {
                    $dateTrunc: { date: "$timestamp", unit: "minute" }
                  }
                },
                avg_value: { $avg: "$measurements.value" },
                min_value: { $min: "$measurements.value" },
                max_value: { $max: "$measurements.value" },
                reading_count: { $sum: 1 },
                quality_avg: { $avg: "$measurements.quality_score" }
              }
            }
          ],

          // Hourly aggregations
          hourly: [
            {
              $group: {
                _id: {
                  facility: "$device.facility_id",
                  sensor_type: "$device.sensor_type",
                  hour: {
                    $dateTrunc: { date: "$timestamp", unit: "hour" }
                  }
                },
                avg_value: { $avg: "$measurements.value" },
                min_value: { $min: "$measurements.value" },
                max_value: { $max: "$measurements.value" },
                reading_count: { $sum: 1 },
                device_count: { $addToSet: "$device.device_id" },
                building_coverage: { $addToSet: "$device.building_id" }
              }
            },
            {
              $addFields: {
                device_count: { $size: "$device_count" },
                building_count: { $size: "$building_coverage" }
              }
            }
          ],

          // Daily aggregations  
          daily: [
            {
              $group: {
                _id: {
                  facility: "$device.facility_id",
                  sensor_type: "$device.sensor_type",
                  day: {
                    $dateTrunc: { date: "$timestamp", unit: "day" }
                  }
                },
                avg_value: { $avg: "$measurements.value" },
                min_value: { $min: "$measurements.value" },
                max_value: { $max: "$measurements.value" },
                reading_count: { $sum: 1 },
                unique_devices: { $addToSet: "$device.device_id" },
                data_coverage_hours: {
                  $addToSet: {
                    $dateTrunc: { date: "$timestamp", unit: "hour" }
                  }
                }
              }
            },
            {
              $addFields: {
                device_count: { $size: "$unique_devices" },
                coverage_hours: { $size: "$data_coverage_hours" },
                coverage_percentage: {
                  $multiply: [
                    { $divide: [{ $size: "$data_coverage_hours" }, 24] },
                    100
                  ]
                }
              }
            }
          ]
        }
      }
    ];

    const hierarchicalResults = await readingsCollection.aggregate(hierarchicalPipeline, {
      allowDiskUse: true
    }).toArray();

    // Store aggregated results
    const analyticsCollection = this.db.collection('analytics_aggregations');

    for (const levelName of ['minutely', 'hourly', 'daily']) {
      const levelData = hierarchicalResults[0][levelName];

      if (levelData && levelData.length > 0) {
        const documents = levelData.map(agg => ({
          window_start: agg._id[levelName === 'minutely' ? 'minute' : levelName === 'hourly' ? 'hour' : 'day'],
          aggregation_metadata: {
            aggregation_level: levelName,
            facility_id: agg._id.facility,
            sensor_type: agg._id.sensor_type,
            building_id: agg._id.building,
            generated_at: new Date()
          },
          metrics: {
            avg_value: agg.avg_value,
            min_value: agg.min_value,
            max_value: agg.max_value,
            reading_count: agg.reading_count,
            device_count: agg.device_count,
            coverage_percentage: agg.coverage_percentage,
            quality_average: agg.quality_avg
          }
        }));

        await analyticsCollection.insertMany(documents, { ordered: false });
      }
    }

    console.log('✅ Hierarchical aggregations completed');
  }

  async implementCrossDeviceAnalysis() {
    console.log('Implementing cross-device correlation analysis...');

    const readingsCollection = this.db.collection('sensor_readings');

    // Cross-device correlation pipeline
    const correlationPipeline = [
      {
        $match: {
          "timestamp": {
            $gte: new Date(Date.now() - 6 * 60 * 60 * 1000) // Last 6 hours
          },
          "device.facility_id": { $exists: true }
        }
      },

      // Group by facility and time windows
      {
        $group: {
          _id: {
            facility: "$device.facility_id",
            time_window: {
              $dateTrunc: { 
                date: "$timestamp", 
                unit: "minute",
                binSize: 15  // 15-minute windows
              }
            }
          },

          // Collect readings by sensor type
          temperature_readings: {
            $push: {
              $cond: [
                { $eq: ["$device.sensor_type", "temperature"] },
                "$measurements.value",
                "$$REMOVE"
              ]
            }
          },
          humidity_readings: {
            $push: {
              $cond: [
                { $eq: ["$device.sensor_type", "humidity"] },
                "$measurements.value", 
                "$$REMOVE"
              ]
            }
          },
          co2_readings: {
            $push: {
              $cond: [
                { $eq: ["$device.sensor_type", "co2"] },
                "$measurements.value",
                "$$REMOVE"
              ]
            }
          },
          air_quality_readings: {
            $push: {
              $cond: [
                { $eq: ["$device.sensor_type", "air_quality"] },
                "$measurements.value",
                "$$REMOVE"
              ]
            }
          },

          total_readings: { $sum: 1 },
          unique_devices: { $addToSet: "$device.device_id" }
        }
      },

      // Calculate correlations and insights
      {
        $addFields: {
          // Calculate averages for each sensor type
          avg_temperature: { $avg: "$temperature_readings" },
          avg_humidity: { $avg: "$humidity_readings" },
          avg_co2: { $avg: "$co2_readings" },
          avg_air_quality: { $avg: "$air_quality_readings" },

          device_count: { $size: "$unique_devices" },

          // Data completeness by sensor type
          temperature_coverage: { $size: "$temperature_readings" },
          humidity_coverage: { $size: "$humidity_readings" },
          co2_coverage: { $size: "$co2_readings" },
          air_quality_coverage: { $size: "$air_quality_readings" }
        }
      },

      // Add correlation analysis
      {
        $addFields: {
          // Environmental comfort index calculation
          comfort_index: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ["$avg_temperature", 20] },
                      { $lte: ["$avg_temperature", 24] },
                      { $gte: ["$avg_humidity", 30] },
                      { $lte: ["$avg_humidity", 60] }
                    ]
                  },
                  then: "optimal"
                },
                {
                  case: {
                    $and: [
                      { $gte: ["$avg_temperature", 18] },
                      { $lte: ["$avg_temperature", 26] },
                      { $gte: ["$avg_humidity", 25] },
                      { $lte: ["$avg_humidity", 70] }
                    ]
                  },
                  then: "good"
                }
              ],
              default: "suboptimal"
            }
          },

          // Air quality assessment
          air_quality_status: {
            $switch: {
              branches: [
                { case: { $lte: ["$avg_co2", 1000] }, then: "excellent" },
                { case: { $lte: ["$avg_co2", 1500] }, then: "good" },
                { case: { $lte: ["$avg_co2", 2000] }, then: "moderate" }
              ],
              default: "poor"
            }
          },

          // Data quality assessment
          data_quality_score: {
            $divide: [
              {
                $add: [
                  "$temperature_coverage",
                  "$humidity_coverage", 
                  "$co2_coverage",
                  "$air_quality_coverage"
                ]
              },
              { $multiply: ["$device_count", 4] }  // Assuming 4 sensor types per device
            ]
          }
        }
      },

      // Filter for meaningful results
      {
        $match: {
          "device_count": { $gte: 2 },  // At least 2 devices
          "total_readings": { $gte: 10 } // At least 10 readings
        }
      },

      // Sort by time window
      {
        $sort: { "_id.time_window": 1 }
      }
    ];

    const correlationResults = await readingsCollection.aggregate(correlationPipeline, {
      allowDiskUse: true
    }).toArray();

    // Store correlation analysis results
    if (correlationResults.length > 0) {
      const correlationDocs = correlationResults.map(result => ({
        window_start: result._id.time_window,
        aggregation_metadata: {
          aggregation_type: "cross_device_correlation",
          facility_id: result._id.facility,
          analysis_timestamp: new Date()
        },
        environmental_metrics: {
          avg_temperature: result.avg_temperature,
          avg_humidity: result.avg_humidity,
          avg_co2: result.avg_co2,
          avg_air_quality: result.avg_air_quality
        },
        assessments: {
          comfort_index: result.comfort_index,
          air_quality_status: result.air_quality_status,
          data_quality_score: result.data_quality_score
        },
        coverage_stats: {
          device_count: result.device_count,
          total_readings: result.total_readings,
          sensor_coverage: {
            temperature: result.temperature_coverage,
            humidity: result.humidity_coverage,
            co2: result.co2_coverage,
            air_quality: result.air_quality_coverage
          }
        }
      }));

      await this.db.collection('analytics_aggregations').insertMany(correlationDocs, {
        ordered: false
      });
    }

    console.log(`✅ Cross-device correlation analysis completed: ${correlationResults.length} facility-time windows analyzed`);
  }

  async setupPredictiveAnalytics() {
    console.log('Setting up predictive maintenance analytics...');

    const readingsCollection = this.db.collection('sensor_readings');

    // Predictive analytics pipeline for device health
    const predictivePipeline = [
      {
        $match: {
          "timestamp": {
            $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) // Last 7 days
          }
        }
      },

      // Group by device and calculate health indicators
      {
        $group: {
          _id: {
            device_id: "$device.device_id",
            sensor_type: "$device.sensor_type"
          },

          // Time-series health metrics
          reading_timestamps: { $push: "$timestamp" },
          quality_scores: { $push: "$measurements.quality_score" },
          values: { $push: "$measurements.value" },

          // System health indicators  
          battery_levels: { $push: "$system.battery_level" },
          signal_strengths: { $push: "$system.signal_strength" },
          transmission_latencies: { $push: "$system.transmission_latency_ms" },

          // Basic statistics
          total_readings: { $sum: 1 },
          avg_value: { $avg: "$measurements.value" },
          avg_quality: { $avg: "$measurements.quality_score" },

          // Device metadata
          device_info: { $first: "$device" },
          latest_timestamp: { $max: "$timestamp" },
          earliest_timestamp: { $min: "$timestamp" }
        }
      },

      // Calculate predictive health indicators
      {
        $addFields: {
          // Expected readings calculation
          time_span_hours: {
            $divide: [
              { $subtract: ["$latest_timestamp", "$earliest_timestamp"] },
              3600000  // Convert to hours
            ]
          },

          expected_readings: {
            $divide: [
              { $multiply: ["$time_span_hours", 3600] },  // Total seconds
              { $ifNull: ["$device_info.sampling_interval", 300] }  // Default 5 min interval
            ]
          }
        }
      },

      {
        $addFields: {
          // Data availability percentage
          data_availability: {
            $multiply: [
              { $divide: ["$total_readings", "$expected_readings"] },
              100
            ]
          },

          // Quality trend analysis
          recent_quality: {
            $avg: {
              $slice: ["$quality_scores", -20]  // Last 20 readings
            }
          },

          historical_quality: {
            $avg: {
              $slice: ["$quality_scores", 0, 20]  // First 20 readings  
            }
          },

          // Battery health trend
          current_battery: {
            $avg: {
              $slice: ["$battery_levels", -10]  // Last 10 readings
            }
          },

          initial_battery: {
            $avg: {
              $slice: ["$battery_levels", 0, 10]  // First 10 readings
            }
          },

          // Signal quality trend
          avg_signal_strength: { $avg: "$signal_strengths" },
          avg_latency: { $avg: "$transmission_latencies" }
        }
      },

      // Calculate health scores and predictions
      {
        $addFields: {
          // Overall device health score (0-100)
          health_score: {
            $min: [
              100,
              {
                $multiply: [
                  {
                    $add: [
                      // Data availability component (40%)
                      { $multiply: [{ $min: ["$data_availability", 100] }, 0.4] },

                      // Quality component (30%)
                      { $multiply: ["$avg_quality", 0.3] },

                      // Battery component (20%)
                      { 
                        $multiply: [
                          { $ifNull: ["$current_battery", 100] },
                          0.2
                        ]
                      },

                      // Signal component (10%)
                      {
                        $multiply: [
                          {
                            $cond: {
                              if: { $gte: ["$avg_signal_strength", -70] },
                              then: 100,
                              else: {
                                $max: [0, { $add: [100, { $multiply: ["$avg_signal_strength", 1.5] }] }]
                              }
                            }
                          },
                          0.1
                        ]
                      }
                    ]
                  }
                ]
              }
            ]
          },

          // Maintenance predictions
          quality_trend: {
            $cond: {
              if: { $gt: ["$recent_quality", "$historical_quality"] },
              then: "improving",
              else: {
                $cond: {
                  if: { $lt: ["$recent_quality", { $multiply: ["$historical_quality", 0.9] }] },
                  then: "degrading",
                  else: "stable"
                }
              }
            }
          },

          battery_trend: {
            $cond: {
              if: { $and: ["$current_battery", "$initial_battery"] },
              then: {
                $cond: {
                  if: { $lt: ["$current_battery", { $multiply: ["$initial_battery", 0.8] }] },
                  then: "declining",
                  else: "stable"
                }
              },
              else: "unknown"
            }
          },

          // Estimated days until maintenance needed
          maintenance_urgency: {
            $switch: {
              branches: [
                {
                  case: { $lt: ["$health_score", 60] },
                  then: "immediate"
                },
                {
                  case: { $lt: ["$health_score", 75] },
                  then: "within_week"
                },
                {
                  case: { $lt: ["$health_score", 85] },
                  then: "within_month"
                }
              ],
              default: "routine"
            }
          }
        }
      },

      // Filter devices that need attention
      {
        $match: {
          $or: [
            { "health_score": { $lt: 90 } },
            { "quality_trend": "degrading" },
            { "battery_trend": "declining" },
            { "data_availability": { $lt: 90 } }
          ]
        }
      },

      // Sort by health score (worst first)
      {
        $sort: { "health_score": 1 }
      }
    ];

    const predictiveResults = await readingsCollection.aggregate(predictivePipeline, {
      allowDiskUse: true
    }).toArray();

    // Store predictive analytics results
    if (predictiveResults.length > 0) {
      const maintenanceDocs = predictiveResults.map(result => ({
        window_start: new Date(),
        aggregation_metadata: {
          aggregation_type: "predictive_maintenance",
          device_id: result._id.device_id,
          sensor_type: result._id.sensor_type,
          analysis_timestamp: new Date()
        },
        health_assessment: {
          overall_health_score: Math.round(result.health_score * 100) / 100,
          data_availability: Math.round(result.data_availability * 100) / 100,
          quality_trend: result.quality_trend,
          battery_trend: result.battery_trend,
          maintenance_urgency: result.maintenance_urgency
        },
        metrics: {
          total_readings: result.total_readings,
          avg_quality: Math.round(result.avg_quality * 100) / 100,
          avg_signal_strength: result.avg_signal_strength,
          avg_latency: result.avg_latency,
          current_battery_level: result.current_battery
        },
        recommendations: this.generateMaintenanceRecommendations(result)
      }));

      await this.db.collection('maintenance_predictions').insertMany(maintenanceDocs, {
        ordered: false
      });
    }

    console.log(`✅ Predictive analytics completed: ${predictiveResults.length} devices analyzed`);
    return predictiveResults;
  }

  generateMaintenanceRecommendations(deviceAnalysis) {
    const recommendations = [];

    if (deviceAnalysis.health_score < 60) {
      recommendations.push('Immediate inspection required - device health critical');
    }

    if (deviceAnalysis.data_availability < 80) {
      recommendations.push('Check connectivity and power supply');
    }

    if (deviceAnalysis.quality_trend === 'degrading') {
      recommendations.push('Sensor calibration may be needed');
    }

    if (deviceAnalysis.battery_trend === 'declining') {
      recommendations.push('Schedule battery replacement');
    }

    if (deviceAnalysis.avg_signal_strength < -80) {
      recommendations.push('Improve network coverage or relocate device');
    }

    return recommendations.length > 0 ? recommendations : ['Continue routine monitoring'];
  }
}

// Export the analytics processor
module.exports = { IoTAnalyticsProcessor };

SQL-Style Time-Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Time-Series Collections operations:

-- QueryLeaf time-series operations with SQL-familiar syntax

-- Create time-series collection with SQL DDL syntax
CREATE TABLE sensor_readings (
  timestamp TIMESTAMPTZ NOT NULL,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  value NUMERIC(10,4) NOT NULL,
  quality_score INTEGER DEFAULT 100,

  -- Device metadata (metaField in MongoDB)
  facility_id VARCHAR(50),
  building_id VARCHAR(50), 
  location VARCHAR(200),

  -- Environmental context
  ambient_temperature NUMERIC(5,2),
  humidity NUMERIC(5,2),
  atmospheric_pressure NUMERIC(7,2)
) WITH (
  collection_type = 'timeseries',
  time_field = 'timestamp',
  meta_field = 'device_metadata',
  granularity = 'minutes',
  expire_after_seconds = 63072000  -- 2 years
);

-- Time-series data ingestion with SQL INSERT
INSERT INTO sensor_readings (
  timestamp, device_id, sensor_type, value, quality_score,
  facility_id, building_id, location,
  ambient_temperature, humidity, atmospheric_pressure
) VALUES 
  ('2025-11-15 10:00:00'::TIMESTAMPTZ, 'TEMP-001', 'temperature', 22.5, 98, 'FAC-A', 'BLDG-1', 'Conference Room A', 22.5, 45.2, 1013.25),
  ('2025-11-15 10:00:00'::TIMESTAMPTZ, 'HUM-001', 'humidity', 45.2, 95, 'FAC-A', 'BLDG-1', 'Conference Room A', 22.5, 45.2, 1013.25),
  ('2025-11-15 10:00:00'::TIMESTAMPTZ, 'CO2-001', 'co2', 850, 92, 'FAC-A', 'BLDG-1', 'Conference Room A', 22.5, 45.2, 1013.25);

-- Time-series analytical queries with window functions
WITH hourly_sensor_analytics AS (
  SELECT 
    device_id,
    sensor_type,
    facility_id,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Statistical aggregations
    COUNT(*) as reading_count,
    AVG(value) as avg_value,
    MIN(value) as min_value,  
    MAX(value) as max_value,
    STDDEV(value) as stddev_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95_value,

    -- Quality metrics
    AVG(quality_score) as avg_quality,
    COUNT(*) FILTER (WHERE quality_score < 80) as poor_quality_count,

    -- Environmental correlations
    AVG(ambient_temperature) as avg_ambient_temp,
    AVG(humidity) as avg_humidity,
    CORR(value, ambient_temperature) as temp_correlation,
    CORR(value, humidity) as humidity_correlation,

    -- Data completeness assessment
    COUNT(*) * 100.0 / 60 as data_completeness_percent  -- Expected: 60 readings per hour

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND timestamp < CURRENT_TIMESTAMP
    AND quality_score >= 50  -- Filter poor quality data
  GROUP BY device_id, sensor_type, facility_id, DATE_TRUNC('hour', timestamp)
),

time_series_insights AS (
  SELECT 
    hsa.*,

    -- Time-based analytical functions
    LAG(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket
    ) as previous_hour_avg,

    -- Moving averages for trend analysis
    AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type
      ORDER BY hour_bucket
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as moving_avg_6h,

    AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type
      ORDER BY hour_bucket  
      ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
    ) as moving_avg_24h,

    -- Anomaly detection using z-score
    (avg_value - AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type, EXTRACT(hour FROM hour_bucket)
    )) / NULLIF(STDDEV(avg_value) OVER (
      PARTITION BY device_id, sensor_type, EXTRACT(hour FROM hour_bucket)
    ), 0) as hourly_zscore,

    -- Rate of change calculations
    CASE 
      WHEN LAG(avg_value) OVER (PARTITION BY device_id, sensor_type ORDER BY hour_bucket) IS NOT NULL
      THEN (avg_value - LAG(avg_value) OVER (PARTITION BY device_id, sensor_type ORDER BY hour_bucket)) 
           / NULLIF(LAG(avg_value) OVER (PARTITION BY device_id, sensor_type ORDER BY hour_bucket), 0) * 100
      ELSE 0
    END as hourly_change_percent

  FROM hourly_sensor_analytics hsa
),

anomaly_detection AS (
  SELECT 
    tsi.*,

    -- Anomaly classification
    CASE 
      WHEN ABS(hourly_zscore) > 3 THEN 'statistical_anomaly'
      WHEN ABS(hourly_change_percent) > 50 AND moving_avg_6h IS NOT NULL THEN 'rapid_change'
      WHEN data_completeness_percent < 70 THEN 'data_availability_issue'
      WHEN avg_quality < 70 THEN 'data_quality_issue'
      ELSE 'normal'
    END as anomaly_type,

    -- Alert priority
    CASE 
      WHEN ABS(hourly_zscore) > 4 OR ABS(hourly_change_percent) > 75 THEN 'critical'
      WHEN ABS(hourly_zscore) > 3 OR ABS(hourly_change_percent) > 50 THEN 'high'
      WHEN data_completeness_percent < 70 OR avg_quality < 70 THEN 'medium'
      ELSE 'low'
    END as alert_priority,

    -- Performance classification  
    CASE 
      WHEN data_completeness_percent >= 95 AND avg_quality >= 90 THEN 'excellent'
      WHEN data_completeness_percent >= 85 AND avg_quality >= 80 THEN 'good'
      WHEN data_completeness_percent >= 70 AND avg_quality >= 70 THEN 'fair'
      ELSE 'poor'
    END as performance_rating

  FROM time_series_insights tsi
)

SELECT 
  device_id,
  sensor_type,
  facility_id,
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as analysis_hour,

  -- Core time-series metrics
  reading_count,
  ROUND(avg_value::NUMERIC, 3) as average_value,
  ROUND(min_value::NUMERIC, 3) as minimum_value,
  ROUND(max_value::NUMERIC, 3) as maximum_value,
  ROUND(stddev_value::NUMERIC, 3) as std_deviation,
  ROUND(median_value::NUMERIC, 3) as median_value,
  ROUND(p95_value::NUMERIC, 3) as p95_value,

  -- Trend analysis
  ROUND(hourly_change_percent::NUMERIC, 2) as hourly_change_pct,
  ROUND(moving_avg_6h::NUMERIC, 3) as six_hour_moving_avg,
  ROUND(moving_avg_24h::NUMERIC, 3) as daily_moving_avg,

  -- Anomaly detection
  ROUND(hourly_zscore::NUMERIC, 3) as anomaly_zscore,
  anomaly_type,
  alert_priority,

  -- Quality and performance
  ROUND(data_completeness_percent::NUMERIC, 1) as data_completeness_pct,
  ROUND(avg_quality::NUMERIC, 1) as average_quality_score,
  poor_quality_count,
  performance_rating,

  -- Environmental correlations
  ROUND(temp_correlation::NUMERIC, 3) as temperature_correlation,
  ROUND(humidity_correlation::NUMERIC, 3) as humidity_correlation,
  ROUND(avg_ambient_temp::NUMERIC, 1) as avg_ambient_temperature,
  ROUND(avg_humidity::NUMERIC, 1) as avg_humidity_percent,

  -- Alert conditions
  CASE 
    WHEN anomaly_type != 'normal' THEN 
      CONCAT('Alert: ', anomaly_type, ' detected with ', alert_priority, ' priority')
    WHEN performance_rating IN ('poor', 'fair') THEN
      CONCAT('Performance issue: ', performance_rating, ' quality detected')
    ELSE 'Normal operation'
  END as status_message

FROM anomaly_detection
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
  facility_id,
  device_id,
  sensor_type,
  hour_bucket DESC;

-- Cross-device environmental correlation analysis
WITH facility_environmental_data AS (
  SELECT 
    facility_id,
    building_id,
    DATE_TRUNC('minute', timestamp, 15) as time_window,  -- 15-minute buckets

    -- Aggregate by sensor type
    AVG(CASE WHEN sensor_type = 'temperature' THEN value END) as avg_temperature,
    AVG(CASE WHEN sensor_type = 'humidity' THEN value END) as avg_humidity,
    AVG(CASE WHEN sensor_type = 'co2' THEN value END) as avg_co2,
    AVG(CASE WHEN sensor_type = 'air_quality' THEN value END) as avg_air_quality,

    -- Count devices by type
    COUNT(DISTINCT CASE WHEN sensor_type = 'temperature' THEN device_id END) as temp_devices,
    COUNT(DISTINCT CASE WHEN sensor_type = 'humidity' THEN device_id END) as humidity_devices,
    COUNT(DISTINCT CASE WHEN sensor_type = 'co2' THEN device_id END) as co2_devices,

    -- Overall data quality
    AVG(quality_score) as avg_quality,
    COUNT(*) as total_readings

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
    AND facility_id IS NOT NULL
    AND quality_score >= 70
  GROUP BY facility_id, building_id, DATE_TRUNC('minute', timestamp, 15)
  HAVING COUNT(*) >= 5  -- Minimum readings threshold
),

environmental_assessment AS (
  SELECT 
    fed.*,

    -- Environmental comfort calculations
    CASE 
      WHEN avg_temperature BETWEEN 20 AND 24 AND avg_humidity BETWEEN 30 AND 60 THEN 'optimal'
      WHEN avg_temperature BETWEEN 18 AND 26 AND avg_humidity BETWEEN 25 AND 70 THEN 'comfortable'
      WHEN avg_temperature BETWEEN 16 AND 28 AND avg_humidity BETWEEN 20 AND 80 THEN 'acceptable'
      ELSE 'uncomfortable'
    END as comfort_level,

    -- Air quality assessment
    CASE 
      WHEN avg_co2 <= 1000 THEN 'excellent'
      WHEN avg_co2 <= 1500 THEN 'good'  
      WHEN avg_co2 <= 2000 THEN 'moderate'
      WHEN avg_co2 <= 5000 THEN 'poor'
      ELSE 'hazardous'
    END as air_quality_level,

    -- Data coverage assessment
    CASE 
      WHEN temp_devices >= 2 AND humidity_devices >= 2 AND co2_devices >= 1 THEN 'comprehensive'
      WHEN temp_devices >= 1 AND humidity_devices >= 1 THEN 'basic'
      ELSE 'limited'
    END as sensor_coverage,

    -- Environmental health score (0-100)
    (
      CASE 
        WHEN avg_temperature BETWEEN 20 AND 24 THEN 25
        WHEN avg_temperature BETWEEN 18 AND 26 THEN 20
        WHEN avg_temperature BETWEEN 16 AND 28 THEN 15
        ELSE 5
      END +
      CASE 
        WHEN avg_humidity BETWEEN 40 AND 50 THEN 25
        WHEN avg_humidity BETWEEN 30 AND 60 THEN 20
        WHEN avg_humidity BETWEEN 25 AND 70 THEN 15
        ELSE 5
      END +
      CASE 
        WHEN avg_co2 <= 800 THEN 25
        WHEN avg_co2 <= 1000 THEN 20
        WHEN avg_co2 <= 1500 THEN 15
        WHEN avg_co2 <= 2000 THEN 10
        ELSE 0
      END +
      CASE 
        WHEN avg_air_quality >= 80 THEN 25
        WHEN avg_air_quality >= 60 THEN 20
        WHEN avg_air_quality >= 40 THEN 15
        ELSE 5
      END
    ) as environmental_health_score

  FROM facility_environmental_data fed
)

SELECT 
  facility_id,
  building_id,
  TO_CHAR(time_window, 'YYYY-MM-DD HH24:MI') as measurement_time,

  -- Environmental measurements
  ROUND(avg_temperature::NUMERIC, 1) as temperature_c,
  ROUND(avg_humidity::NUMERIC, 1) as humidity_percent,
  ROUND(avg_co2::NUMERIC, 0) as co2_ppm,
  ROUND(avg_air_quality::NUMERIC, 1) as air_quality_index,

  -- Assessment results
  comfort_level,
  air_quality_level,
  sensor_coverage,
  environmental_health_score,

  -- Device coverage
  temp_devices,
  humidity_devices,  
  co2_devices,
  total_readings,

  -- Data quality
  ROUND(avg_quality::NUMERIC, 1) as average_data_quality,

  -- Recommendations
  CASE 
    WHEN environmental_health_score >= 90 THEN 'Optimal environmental conditions'
    WHEN environmental_health_score >= 75 THEN 'Good environmental conditions'
    WHEN comfort_level = 'uncomfortable' THEN 'Adjust HVAC settings for comfort'
    WHEN air_quality_level IN ('poor', 'hazardous') THEN 'Improve ventilation immediately'
    WHEN sensor_coverage = 'limited' THEN 'Add more environmental sensors'
    ELSE 'Monitor conditions closely'
  END as recommendation,

  -- Alert conditions
  CASE 
    WHEN avg_co2 > 2000 THEN 'HIGH CO2 ALERT'
    WHEN avg_temperature > 28 OR avg_temperature < 16 THEN 'TEMPERATURE ALERT'
    WHEN avg_humidity > 80 OR avg_humidity < 20 THEN 'HUMIDITY ALERT'
    WHEN environmental_health_score < 50 THEN 'ENVIRONMENTAL QUALITY ALERT'
    ELSE NULL
  END as alert_status

FROM environmental_assessment
WHERE time_window >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
ORDER BY 
  facility_id,
  building_id,
  time_window DESC;

-- Predictive maintenance analytics with time-series data
CREATE VIEW device_health_predictions AS
WITH device_performance_history AS (
  SELECT 
    device_id,
    sensor_type,
    facility_id,

    -- Performance metrics over time
    COUNT(*) as total_readings_7d,
    AVG(quality_score) as avg_quality_7d,
    STDDEV(quality_score) as quality_stability,

    -- Expected vs actual readings
    COUNT(*) * 100.0 / (7 * 24 * 12) as data_availability_percent,  -- Expected: 5min intervals

    -- Value stability analysis
    STDDEV(value) as value_volatility,
    AVG(value) as avg_value_7d,

    -- Trend analysis using linear regression
    REGR_SLOPE(quality_score, EXTRACT(EPOCH FROM timestamp)) as quality_trend_slope,
    REGR_SLOPE(value, EXTRACT(EPOCH FROM timestamp)) as value_trend_slope,

    -- Time coverage
    MAX(timestamp) as last_reading_time,
    MIN(timestamp) as first_reading_time,

    -- Recent performance (last 24h vs historical)
    AVG(CASE WHEN timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
         THEN quality_score END) as recent_quality_24h,
    AVG(CASE WHEN timestamp < CURRENT_TIMESTAMP - INTERVAL '24 hours' 
         THEN quality_score END) as historical_quality

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY device_id, sensor_type, facility_id
  HAVING COUNT(*) >= 100  -- Minimum data threshold for analysis
),

health_scoring AS (
  SELECT 
    dph.*,

    -- Overall device health score (0-100)
    (
      -- Data availability component (40%)
      (LEAST(data_availability_percent, 100) * 0.4) +

      -- Quality component (30%)  
      (avg_quality_7d * 0.3) +

      -- Stability component (20%)
      (GREATEST(0, 100 - quality_stability) * 0.2) +

      -- Recency component (10%)
      (CASE 
        WHEN last_reading_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour' THEN 10
        WHEN last_reading_time >= CURRENT_TIMESTAMP - INTERVAL '6 hours' THEN 8
        WHEN last_reading_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN 5
        ELSE 0
      END)
    ) as device_health_score,

    -- Maintenance predictions
    CASE 
      WHEN data_availability_percent < 70 THEN 'connectivity_issue'
      WHEN avg_quality_7d < 70 THEN 'sensor_degradation'
      WHEN quality_trend_slope < -0.1 THEN 'declining_quality'
      WHEN quality_stability > 15 THEN 'unstable_readings'
      WHEN last_reading_time < CURRENT_TIMESTAMP - INTERVAL '6 hours' THEN 'communication_failure'
      ELSE 'normal_operation'
    END as maintenance_issue,

    -- Urgency assessment
    CASE 
      WHEN data_availability_percent < 50 OR avg_quality_7d < 50 THEN 'immediate'
      WHEN data_availability_percent < 80 OR avg_quality_7d < 75 THEN 'within_week'
      WHEN quality_trend_slope < -0.05 OR quality_stability > 10 THEN 'within_month'
      ELSE 'routine'
    END as maintenance_urgency,

    -- Performance trend
    CASE 
      WHEN recent_quality_24h > historical_quality * 1.1 THEN 'improving'
      WHEN recent_quality_24h < historical_quality * 0.9 THEN 'degrading'
      ELSE 'stable'
    END as performance_trend

  FROM device_performance_history dph
)

SELECT 
  device_id,
  sensor_type,
  facility_id,

  -- Health metrics
  ROUND(device_health_score::NUMERIC, 1) as health_score,
  ROUND(data_availability_percent::NUMERIC, 1) as data_availability_pct,
  ROUND(avg_quality_7d::NUMERIC, 1) as avg_quality_score,
  ROUND(quality_stability::NUMERIC, 2) as quality_std_dev,

  -- Performance indicators
  performance_trend,
  maintenance_issue,
  maintenance_urgency,

  -- Trend analysis
  CASE 
    WHEN quality_trend_slope > 0.1 THEN 'Quality Improving'
    WHEN quality_trend_slope < -0.1 THEN 'Quality Declining'
    ELSE 'Quality Stable'
  END as quality_trend,

  -- Data freshness
  ROUND(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_reading_time)) / 3600::NUMERIC, 1) as hours_since_last_reading,

  -- Maintenance recommendations
  CASE maintenance_issue
    WHEN 'connectivity_issue' THEN 'Check network connectivity and power supply'
    WHEN 'sensor_degradation' THEN 'Schedule sensor calibration or replacement'
    WHEN 'declining_quality' THEN 'Investigate environmental factors affecting sensor'
    WHEN 'unstable_readings' THEN 'Check sensor mounting and interference sources'
    WHEN 'communication_failure' THEN 'Immediate device inspection required'
    ELSE 'Continue routine monitoring'
  END as maintenance_recommendation,

  -- Priority ranking
  CASE maintenance_urgency
    WHEN 'immediate' THEN 1
    WHEN 'within_week' THEN 2  
    WHEN 'within_month' THEN 3
    ELSE 4
  END as priority_rank

FROM health_scoring
ORDER BY 
  CASE maintenance_urgency
    WHEN 'immediate' THEN 1
    WHEN 'within_week' THEN 2
    WHEN 'within_month' THEN 3
    ELSE 4
  END,
  device_health_score ASC;

-- QueryLeaf provides comprehensive time-series capabilities:
-- 1. SQL-familiar CREATE TABLE syntax for time-series collections
-- 2. Advanced window functions and time-based aggregations
-- 3. Built-in anomaly detection with statistical analysis
-- 4. Cross-device correlation analysis and environmental assessments
-- 5. Predictive maintenance analytics with health scoring
-- 6. Real-time monitoring and alerting with SQL queries
-- 7. Hierarchical time aggregations (minute/hour/day levels)
-- 8. Performance trend analysis and maintenance recommendations
-- 9. Native integration with MongoDB time-series optimizations
-- 10. Familiar SQL patterns for complex IoT analytics requirements

Best Practices for Time-Series Implementation

Collection Design and Optimization

Essential practices for production time-series deployments:

Granularity Selection: Choose appropriate granularity (seconds/minutes/hours) based on data frequency
MetaField Strategy: Design metaField schemas that optimize for common query patterns
TTL Management: Implement time-based data lifecycle policies for storage optimization
Index Planning: Create indexes that align with time-based and metadata query patterns
Compression Benefits: Leverage MongoDB's automatic compression for time-series data
Schema Evolution: Design flexible schemas that accommodate IoT device changes

Performance and Scalability

Optimize time-series collections for high-throughput IoT workloads:

Batch Ingestion: Use bulk operations for high-frequency sensor data ingestion
Write Concern: Balance durability and performance with appropriate write concerns
Read Optimization: Use aggregation pipelines for efficient analytical queries
Real-time Processing: Implement change streams for immediate data processing
Memory Management: Monitor working set size and configure appropriate caching
Sharding Strategy: Plan horizontal scaling for very high-volume deployments

Conclusion

MongoDB Time-Series Collections provide comprehensive IoT data management capabilities that eliminate the complexity and overhead of traditional time-series database approaches. The combination of automatic storage optimization, intelligent indexing, and sophisticated analytical capabilities enables high-performance IoT applications that scale efficiently with growing sensor deployments.

Key Time-Series Collection benefits include:

Automatic Optimization: Native storage compression and intelligent bucketing for temporal data patterns
Simplified Operations: No manual partitioning or complex maintenance procedures required
High-Performance Analytics: Built-in support for statistical aggregations and window functions
Real-time Processing: Change streams enable immediate response to incoming sensor data
Flexible Schema: Easy accommodation of evolving IoT device capabilities and data structures
SQL Compatibility: Familiar query patterns for complex time-series analytical operations

Whether you're building smart building systems, industrial monitoring platforms, environmental sensors networks, or any IoT application requiring temporal data analysis, MongoDB Time-Series Collections with QueryLeaf's SQL-familiar interface provides the foundation for modern IoT analytics that scales efficiently while maintaining familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Time-Series Collections while providing SQL-familiar syntax for time-series operations, statistical analysis, and IoT analytics. Advanced time-based aggregations, anomaly detection, and predictive maintenance patterns are seamlessly accessible through familiar SQL constructs, making sophisticated IoT development both powerful and approachable for SQL-oriented teams.

The integration of optimized time-series capabilities with SQL-style operations makes MongoDB an ideal platform for IoT applications that require both high-performance temporal data processing and familiar analytical query patterns, ensuring your IoT solutions remain both effective and maintainable as they scale and evolve.

November 14, 2025
23 min read

MongoDB Change Streams for Event-Driven Microservices: Real-Time Architecture and Reactive Data Processing

Modern distributed applications require real-time responsiveness to data changes, enabling immediate updates across microservices, cache invalidation, data synchronization, and user notification systems. Traditional polling-based approaches create unnecessary load, introduce latency, and fail to scale with growing data volumes and user expectations for instant updates.

MongoDB Change Streams provide native change data capture (CDC) capabilities that enable real-time event-driven architectures without the complexity of external message queues or polling mechanisms. Unlike traditional database triggers that operate at the database level with limited scalability, Change Streams offer application-level event processing with comprehensive filtering, transformation, and distributed processing capabilities.

The Traditional Event Processing Challenge

Building real-time event-driven systems with traditional databases requires complex infrastructure and polling mechanisms:

-- Traditional PostgreSQL event processing - complex and inefficient

-- Event log table for change tracking
CREATE TABLE event_log (
    event_id BIGSERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    operation_type VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
    record_id TEXT NOT NULL,
    old_data JSONB,
    new_data JSONB,
    event_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE,

    -- Event routing information
    event_type VARCHAR(50),
    service_name VARCHAR(50),
    correlation_id UUID,

    -- Processing metadata
    retry_count INTEGER DEFAULT 0,
    last_retry_at TIMESTAMP,
    error_message TEXT,

    -- Partitioning for performance
    created_date DATE GENERATED ALWAYS AS (DATE(event_timestamp)) STORED
);

-- Partition by date for performance
CREATE TABLE event_log_2025_11 PARTITION OF event_log
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');

-- Indexes for event processing
CREATE INDEX idx_event_log_unprocessed ON event_log(processed, event_timestamp) 
    WHERE processed = FALSE;
CREATE INDEX idx_event_log_correlation ON event_log(correlation_id);
CREATE INDEX idx_event_log_service ON event_log(service_name, event_timestamp);

-- Product catalog table with change tracking
CREATE TABLE products (
    product_id BIGSERIAL PRIMARY KEY,
    sku VARCHAR(50) UNIQUE NOT NULL,
    name VARCHAR(200) NOT NULL,
    description TEXT,
    price DECIMAL(12,2) NOT NULL,
    category_id BIGINT,
    inventory_count INTEGER DEFAULT 0,
    status VARCHAR(20) DEFAULT 'active',

    -- Metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    version INTEGER DEFAULT 1
);

-- Trigger function for change tracking
CREATE OR REPLACE FUNCTION log_product_changes() 
RETURNS TRIGGER AS $$
DECLARE
    event_data JSONB;
    operation_type TEXT;
BEGIN
    -- Determine operation type
    IF TG_OP = 'DELETE' THEN
        operation_type := 'DELETE';
        event_data := to_jsonb(OLD);
    ELSIF TG_OP = 'UPDATE' THEN
        operation_type := 'UPDATE';
        event_data := jsonb_build_object(
            'old', to_jsonb(OLD),
            'new', to_jsonb(NEW)
        );
    ELSIF TG_OP = 'INSERT' THEN
        operation_type := 'INSERT';
        event_data := to_jsonb(NEW);
    END IF;

    -- Insert event log entry
    INSERT INTO event_log (
        table_name,
        operation_type, 
        record_id,
        old_data,
        new_data,
        event_type,
        correlation_id
    ) VALUES (
        TG_TABLE_NAME,
        operation_type,
        CASE 
            WHEN TG_OP = 'DELETE' THEN OLD.product_id::TEXT
            ELSE NEW.product_id::TEXT
        END,
        CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) ELSE NULL END,
        CASE WHEN TG_OP IN ('UPDATE', 'INSERT') THEN to_jsonb(NEW) ELSE NULL END,
        'product_change',
        gen_random_uuid()
    );

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Create triggers for change tracking
CREATE TRIGGER product_change_trigger
    AFTER INSERT OR UPDATE OR DELETE ON products
    FOR EACH ROW EXECUTE FUNCTION log_product_changes();

-- Complex polling-based event processing
WITH unprocessed_events AS (
    SELECT 
        event_id,
        table_name,
        operation_type,
        record_id,
        old_data,
        new_data,
        event_timestamp,
        event_type,
        correlation_id,

        -- Determine event priority
        CASE 
            WHEN event_type = 'product_change' AND operation_type = 'UPDATE' THEN
                CASE 
                    WHEN (new_data->>'status') != (old_data->>'status') THEN 1 -- Status changes are critical
                    WHEN (new_data->>'price')::NUMERIC != (old_data->>'price')::NUMERIC THEN 2 -- Price changes
                    WHEN (new_data->>'inventory_count')::INTEGER != (old_data->>'inventory_count')::INTEGER THEN 3 -- Inventory
                    ELSE 4 -- Other changes
                END
            WHEN operation_type = 'INSERT' THEN 2
            WHEN operation_type = 'DELETE' THEN 1
            ELSE 5
        END as priority,

        -- Calculate processing delay
        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - event_timestamp)) as delay_seconds

    FROM event_log
    WHERE processed = FALSE
        AND retry_count < 3 -- Limit retry attempts
        AND (last_retry_at IS NULL OR last_retry_at < CURRENT_TIMESTAMP - INTERVAL '5 minutes')
    ORDER BY priority ASC, event_timestamp ASC
    LIMIT 100 -- Process in batches
),

event_processing_plan AS (
    SELECT 
        ue.*,

        -- Determine target services based on event type
        CASE 
            WHEN event_type = 'product_change' THEN 
                ARRAY['inventory-service', 'catalog-service', 'search-service', 'cache-service']
            ELSE ARRAY['default-service']
        END as target_services,

        -- Generate event payload
        jsonb_build_object(
            'eventId', event_id,
            'eventType', event_type,
            'operationType', operation_type,
            'timestamp', event_timestamp,
            'correlationId', correlation_id,
            'data', 
                CASE 
                    WHEN operation_type = 'UPDATE' THEN 
                        jsonb_build_object(
                            'before', old_data,
                            'after', new_data,
                            'changes', (
                                SELECT jsonb_object_agg(key, value)
                                FROM jsonb_each(new_data)
                                WHERE value IS DISTINCT FROM (old_data->key)
                            )
                        )
                    WHEN operation_type = 'INSERT' THEN new_data
                    WHEN operation_type = 'DELETE' THEN old_data
                END
        ) as event_payload

    FROM unprocessed_events ue
),

service_notifications AS (
    SELECT 
        epp.event_id,
        epp.correlation_id,
        epp.event_payload,
        unnest(epp.target_services) as service_name,
        epp.priority,

        -- Service-specific payload customization
        CASE 
            WHEN unnest(epp.target_services) = 'inventory-service' THEN
                epp.event_payload || jsonb_build_object(
                    'inventoryData', 
                    jsonb_build_object(
                        'productId', epp.record_id,
                        'currentCount', (epp.event_payload->'data'->'after'->>'inventory_count')::INTEGER,
                        'previousCount', (epp.event_payload->'data'->'before'->>'inventory_count')::INTEGER
                    )
                )
            WHEN unnest(epp.target_services) = 'search-service' THEN
                epp.event_payload || jsonb_build_object(
                    'searchData',
                    jsonb_build_object(
                        'productId', epp.record_id,
                        'name', epp.event_payload->'data'->'after'->>'name',
                        'description', epp.event_payload->'data'->'after'->>'description',
                        'category', epp.event_payload->'data'->'after'->>'category_id',
                        'status', epp.event_payload->'data'->'after'->>'status'
                    )
                )
            ELSE epp.event_payload
        END as service_payload

    FROM event_processing_plan epp
)

SELECT 
    event_id,
    correlation_id,
    service_name,
    priority,
    service_payload,

    -- Generate webhook URLs or message queue topics
    CASE service_name
        WHEN 'inventory-service' THEN 'http://inventory-service/webhook/product-change'
        WHEN 'catalog-service' THEN 'http://catalog-service/api/events'
        WHEN 'search-service' THEN 'kafka://search-updates-topic'
        WHEN 'cache-service' THEN 'redis://cache-invalidation'
        ELSE 'http://default-service/webhook'
    END as target_endpoint,

    -- Event processing metadata
    jsonb_build_object(
        'processingAttempt', 1,
        'maxRetries', 3,
        'timeoutSeconds', 30,
        'exponentialBackoff', true
    ) as processing_config

FROM service_notifications
ORDER BY priority ASC, event_id ASC;

-- Update processed events (requires separate transaction)
UPDATE event_log 
SET processed = TRUE,
    updated_at = CURRENT_TIMESTAMP
WHERE event_id IN (
    SELECT event_id FROM unprocessed_events
);

-- Problems with traditional event processing:
-- 1. Complex trigger-based change tracking with limited filtering capabilities
-- 2. Polling-based processing introduces latency and resource waste
-- 3. Manual event routing and service coordination logic
-- 4. Limited scalability due to database-level trigger overhead
-- 5. Complex retry logic and error handling for failed event processing
-- 6. Difficult to implement real-time filtering and transformation
-- 7. No native support for distributed event processing patterns
-- 8. Complex partitioning and cleanup strategies for event log tables
-- 9. Limited integration with microservices and modern event architectures
-- 10. High operational complexity for maintaining event processing infrastructure

MongoDB Change Streams provide comprehensive real-time event processing capabilities:

// MongoDB Change Streams - native real-time event processing for microservices
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_platform');

// Advanced Change Streams Event Processing System
class MongoChangeStreamManager {
  constructor(db) {
    this.db = db;
    this.changeStreams = new Map();
    this.eventHandlers = new Map();
    this.processingMetrics = new Map();

    // Event routing configuration
    this.eventRoutes = new Map([
      ['products', ['inventory-service', 'catalog-service', 'search-service', 'cache-service']],
      ['orders', ['fulfillment-service', 'payment-service', 'notification-service']],
      ['customers', ['profile-service', 'marketing-service', 'analytics-service']],
      ['inventory', ['warehouse-service', 'alert-service', 'reporting-service']]
    ]);

    this.serviceEndpoints = new Map([
      ['inventory-service', 'http://inventory-service:3001/webhook/events'],
      ['catalog-service', 'http://catalog-service:3002/api/events'],
      ['search-service', 'http://search-service:3003/events/index'],
      ['cache-service', 'redis://cache-cluster:6379/invalidate'],
      ['fulfillment-service', 'http://fulfillment:3004/orders/events'],
      ['payment-service', 'http://payments:3005/webhook/order-events'],
      ['notification-service', 'http://notifications:3006/events/send'],
      ['profile-service', 'http://profiles:3007/customers/events'],
      ['marketing-service', 'http://marketing:3008/events/customer'],
      ['analytics-service', 'kafka://analytics-cluster/customer-events']
    ]);
  }

  async setupComprehensiveChangeStreams() {
    console.log('Setting up comprehensive change streams for microservices architecture...');

    // Product catalog change stream with intelligent filtering
    await this.createProductChangeStream();

    // Order processing change stream
    await this.createOrderChangeStream();

    // Customer data change stream
    await this.createCustomerChangeStream();

    // Inventory management change stream
    await this.createInventoryChangeStream();

    // Cross-collection aggregated events
    await this.createAggregatedChangeStream();

    console.log('Change streams initialized for real-time event-driven architecture');
  }

  async createProductChangeStream() {
    console.log('Creating product catalog change stream...');

    const productsCollection = this.db.collection('products');

    // Comprehensive change stream pipeline for product events
    const pipeline = [
      {
        $match: {
          $and: [
            // Only watch specific operation types
            {
              "operationType": { 
                $in: ["insert", "update", "delete", "replace"] 
              }
            },

            // Filter based on significant changes
            {
              $or: [
                // New products
                { "operationType": "insert" },

                // Product deletions
                { "operationType": "delete" },

                // Critical field updates
                {
                  $and: [
                    { "operationType": "update" },
                    {
                      $or: [
                        { "updateDescription.updatedFields.status": { $exists: true } },
                        { "updateDescription.updatedFields.price": { $exists: true } },
                        { "updateDescription.updatedFields.inventory_count": { $exists: true } },
                        { "updateDescription.updatedFields.name": { $exists: true } },
                        { "updateDescription.updatedFields.category": { $exists: true } },
                        { "updateDescription.updatedFields.availability": { $exists: true } }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },

      // Add computed fields for event processing
      {
        $addFields: {
          // Event classification
          "eventSeverity": {
            $switch: {
              branches: [
                {
                  case: { $eq: ["$operationType", "delete"] },
                  then: "critical"
                },
                {
                  case: {
                    $and: [
                      { $eq: ["$operationType", "update"] },
                      { $ne: ["$updateDescription.updatedFields.status", null] }
                    ]
                  },
                  then: "high"
                },
                {
                  case: {
                    $or: [
                      { $ne: ["$updateDescription.updatedFields.price", null] },
                      { $ne: ["$updateDescription.updatedFields.inventory_count", null] }
                    ]
                  },
                  then: "medium"
                }
              ],
              default: "low"
            }
          },

          // Processing metadata
          "processingMetadata": {
            "streamId": "product-changes",
            "timestamp": "$$NOW",
            "source": "mongodb-change-stream",
            "correlationId": { $toString: "$_id" }
          },

          // Change summary for efficient processing
          "changeSummary": {
            $cond: {
              if: { $eq: ["$operationType", "update"] },
              then: {
                "fieldsChanged": { $objectToArray: "$updateDescription.updatedFields" },
                "fieldsRemoved": "$updateDescription.removedFields",
                "changeCount": { $size: { $objectToArray: "$updateDescription.updatedFields" } }
              },
              else: null
            }
          }
        }
      }
    ];

    const productChangeStream = productsCollection.watch(pipeline, {
      fullDocument: 'updateLookup',
      fullDocumentBeforeChange: 'whenAvailable'
    });

    // Event handler for product changes
    productChangeStream.on('change', async (change) => {
      try {
        await this.handleProductChange(change);
      } catch (error) {
        console.error('Error handling product change:', error);
        await this.handleEventProcessingError('products', change, error);
      }
    });

    productChangeStream.on('error', (error) => {
      console.error('Product change stream error:', error);
      this.handleChangeStreamError('products', error);
    });

    this.changeStreams.set('products', productChangeStream);
    console.log('✅ Product change stream active');
  }

  async handleProductChange(change) {
    console.log(`Processing product change: ${change.operationType} for product ${change.documentKey._id}`);

    const eventPayload = {
      eventId: change._id.toString(),
      eventType: 'product_change',
      operationType: change.operationType,
      timestamp: new Date(),
      correlationId: change.processingMetadata?.correlationId,
      severity: change.eventSeverity,

      // Document data
      documentId: change.documentKey._id,
      fullDocument: change.fullDocument,
      fullDocumentBeforeChange: change.fullDocumentBeforeChange,

      // Change details
      updateDescription: change.updateDescription,
      changeSummary: change.changeSummary,

      // Event-specific data extraction
      productData: this.extractProductEventData(change),

      // Processing metadata
      processingMetadata: {
        ...change.processingMetadata,
        targetServices: this.eventRoutes.get('products') || [],
        retryPolicy: {
          maxRetries: 3,
          backoffMultiplier: 2,
          initialDelayMs: 1000
        }
      }
    };

    // Route event to appropriate microservices
    const targetServices = this.eventRoutes.get('products') || [];
    await this.routeEventToServices(eventPayload, targetServices);

    // Update processing metrics
    this.updateProcessingMetrics('products', change.operationType, 'success');
  }

  extractProductEventData(change) {
    const productData = {
      productId: change.documentKey._id,
      operation: change.operationType
    };

    switch (change.operationType) {
      case 'insert':
        productData.newProduct = {
          sku: change.fullDocument?.sku,
          name: change.fullDocument?.name,
          category: change.fullDocument?.category,
          price: change.fullDocument?.price,
          status: change.fullDocument?.status,
          inventory_count: change.fullDocument?.inventory_count
        };
        break;

      case 'update':
        productData.changes = {};

        // Extract specific field changes
        if (change.updateDescription?.updatedFields) {
          const updatedFields = change.updateDescription.updatedFields;

          if ('price' in updatedFields) {
            productData.changes.priceChange = {
              oldPrice: change.fullDocumentBeforeChange?.price,
              newPrice: updatedFields.price
            };
          }

          if ('inventory_count' in updatedFields) {
            productData.changes.inventoryChange = {
              oldCount: change.fullDocumentBeforeChange?.inventory_count,
              newCount: updatedFields.inventory_count,
              delta: updatedFields.inventory_count - (change.fullDocumentBeforeChange?.inventory_count || 0)
            };
          }

          if ('status' in updatedFields) {
            productData.changes.statusChange = {
              oldStatus: change.fullDocumentBeforeChange?.status,
              newStatus: updatedFields.status,
              isActivation: updatedFields.status === 'active' && change.fullDocumentBeforeChange?.status !== 'active',
              isDeactivation: updatedFields.status !== 'active' && change.fullDocumentBeforeChange?.status === 'active'
            };
          }
        }

        productData.currentState = change.fullDocument;
        break;

      case 'delete':
        productData.deletedProduct = {
          sku: change.fullDocumentBeforeChange?.sku,
          name: change.fullDocumentBeforeChange?.name,
          category: change.fullDocumentBeforeChange?.category
        };
        break;
    }

    return productData;
  }

  async createOrderChangeStream() {
    console.log('Creating order processing change stream...');

    const ordersCollection = this.db.collection('orders');

    const pipeline = [
      {
        $match: {
          $or: [
            // New orders
            { "operationType": "insert" },

            // Order status changes
            {
              $and: [
                { "operationType": "update" },
                { "updateDescription.updatedFields.status": { $exists: true } }
              ]
            },

            // Payment status changes
            {
              $and: [
                { "operationType": "update" },
                { "updateDescription.updatedFields.payment.status": { $exists: true } }
              ]
            },

            // Shipping information updates
            {
              $and: [
                { "operationType": "update" },
                {
                  $or: [
                    { "updateDescription.updatedFields.shipping.trackingNumber": { $exists: true } },
                    { "updateDescription.updatedFields.shipping.status": { $exists: true } },
                    { "updateDescription.updatedFields.shipping.actualDelivery": { $exists: true } }
                  ]
                }
              ]
            }
          ]
        }
      },

      {
        $addFields: {
          "eventType": {
            $switch: {
              branches: [
                { case: { $eq: ["$operationType", "insert"] }, then: "order_created" },
                {
                  case: {
                    $and: [
                      { $eq: ["$operationType", "update"] },
                      { $ne: ["$updateDescription.updatedFields.status", null] }
                    ]
                  },
                  then: "order_status_changed"
                },
                {
                  case: {
                    $ne: ["$updateDescription.updatedFields.payment.status", null]
                  },
                  then: "payment_status_changed"
                },
                {
                  case: {
                    $or: [
                      { $ne: ["$updateDescription.updatedFields.shipping.trackingNumber", null] },
                      { $ne: ["$updateDescription.updatedFields.shipping.status", null] }
                    ]
                  },
                  then: "shipping_updated"
                }
              ],
              default: "order_modified"
            }
          },

          "urgencyLevel": {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: ["$operationType", "update"] },
                      { $eq: ["$updateDescription.updatedFields.status", "cancelled"] }
                    ]
                  },
                  then: "high"
                },
                {
                  case: {
                    $or: [
                      { $eq: ["$updateDescription.updatedFields.payment.status", "failed"] },
                      { $eq: ["$updateDescription.updatedFields.status", "processing"] }
                    ]
                  },
                  then: "medium"
                }
              ],
              default: "normal"
            }
          }
        }
      }
    ];

    const orderChangeStream = ordersCollection.watch(pipeline, {
      fullDocument: 'updateLookup',
      fullDocumentBeforeChange: 'whenAvailable'
    });

    orderChangeStream.on('change', async (change) => {
      try {
        await this.handleOrderChange(change);
      } catch (error) {
        console.error('Error handling order change:', error);
        await this.handleEventProcessingError('orders', change, error);
      }
    });

    this.changeStreams.set('orders', orderChangeStream);
    console.log('✅ Order change stream active');
  }

  async handleOrderChange(change) {
    console.log(`Processing order change: ${change.eventType} for order ${change.documentKey._id}`);

    const eventPayload = {
      eventId: change._id.toString(),
      eventType: change.eventType,
      operationType: change.operationType,
      urgencyLevel: change.urgencyLevel,
      timestamp: new Date(),

      orderId: change.documentKey._id,
      orderData: this.extractOrderEventData(change),

      // Customer information for notifications
      customerInfo: {
        customerId: change.fullDocument?.customer?.customerId,
        email: change.fullDocument?.customer?.email,
        name: change.fullDocument?.customer?.name
      },

      processingMetadata: {
        targetServices: this.determineOrderTargetServices(change),
        correlationId: change.fullDocument?.correlationId || change._id.toString()
      }
    };

    await this.routeEventToServices(eventPayload, eventPayload.processingMetadata.targetServices);
    this.updateProcessingMetrics('orders', change.operationType, 'success');
  }

  extractOrderEventData(change) {
    const orderData = {
      orderId: change.documentKey._id,
      operation: change.operationType,
      eventType: change.eventType
    };

    if (change.operationType === 'insert') {
      orderData.newOrder = {
        orderNumber: change.fullDocument?.orderNumber,
        customerId: change.fullDocument?.customer?.customerId,
        totalAmount: change.fullDocument?.totals?.grandTotal,
        status: change.fullDocument?.status,
        itemCount: change.fullDocument?.items?.length || 0,
        priority: change.fullDocument?.priority
      };
    }

    if (change.operationType === 'update' && change.updateDescription?.updatedFields) {
      orderData.changes = {};
      const fields = change.updateDescription.updatedFields;

      if ('status' in fields) {
        orderData.changes.statusChange = {
          from: change.fullDocumentBeforeChange?.status,
          to: fields.status,
          timestamp: new Date()
        };
      }

      if ('payment.status' in fields || fields['payment.status']) {
        orderData.changes.paymentStatusChange = {
          from: change.fullDocumentBeforeChange?.payment?.status,
          to: fields['payment.status'] || fields.payment?.status,
          paymentMethod: change.fullDocument?.payment?.method
        };
      }
    }

    return orderData;
  }

  determineOrderTargetServices(change) {
    const baseServices = ['fulfillment-service', 'notification-service'];

    if (change.eventType === 'payment_status_changed') {
      baseServices.push('payment-service');
    }

    if (change.eventType === 'shipping_updated') {
      baseServices.push('shipping-service', 'tracking-service');
    }

    if (change.urgencyLevel === 'high') {
      baseServices.push('alert-service');
    }

    return baseServices;
  }

  async createCustomerChangeStream() {
    console.log('Creating customer data change stream...');

    const customersCollection = this.db.collection('customers');

    const pipeline = [
      {
        $match: {
          $or: [
            { "operationType": "insert" },
            {
              $and: [
                { "operationType": "update" },
                {
                  $or: [
                    { "updateDescription.updatedFields.email": { $exists: true } },
                    { "updateDescription.updatedFields.tier": { $exists: true } },
                    { "updateDescription.updatedFields.preferences": { $exists: true } },
                    { "updateDescription.updatedFields.status": { $exists: true } }
                  ]
                }
              ]
            }
          ]
        }
      }
    ];

    const customerChangeStream = customersCollection.watch(pipeline, {
      fullDocument: 'updateLookup'
    });

    customerChangeStream.on('change', async (change) => {
      try {
        await this.handleCustomerChange(change);
      } catch (error) {
        console.error('Error handling customer change:', error);
        await this.handleEventProcessingError('customers', change, error);
      }
    });

    this.changeStreams.set('customers', customerChangeStream);
    console.log('✅ Customer change stream active');
  }

  async handleCustomerChange(change) {
    const eventPayload = {
      eventId: change._id.toString(),
      eventType: 'customer_change',
      operationType: change.operationType,
      timestamp: new Date(),

      customerId: change.documentKey._id,
      customerData: {
        email: change.fullDocument?.email,
        name: change.fullDocument?.name,
        tier: change.fullDocument?.tier,
        status: change.fullDocument?.status
      },

      processingMetadata: {
        targetServices: ['profile-service', 'marketing-service', 'analytics-service'],
        isNewCustomer: change.operationType === 'insert'
      }
    };

    await this.routeEventToServices(eventPayload, eventPayload.processingMetadata.targetServices);
  }

  async routeEventToServices(eventPayload, targetServices) {
    console.log(`Routing event ${eventPayload.eventId} to services: ${targetServices.join(', ')}`);

    const routingPromises = targetServices.map(async (serviceName) => {
      try {
        const endpoint = this.serviceEndpoints.get(serviceName);
        if (!endpoint) {
          console.warn(`No endpoint configured for service: ${serviceName}`);
          return;
        }

        const servicePayload = this.customizePayloadForService(eventPayload, serviceName);
        await this.sendEventToService(serviceName, endpoint, servicePayload);

        console.log(`✅ Event sent to ${serviceName}`);
      } catch (error) {
        console.error(`❌ Failed to send event to ${serviceName}:`, error.message);
        await this.handleServiceDeliveryError(serviceName, eventPayload, error);
      }
    });

    await Promise.allSettled(routingPromises);
  }

  customizePayloadForService(eventPayload, serviceName) {
    // Clone base payload
    const servicePayload = {
      ...eventPayload,
      targetService: serviceName,
      deliveryTimestamp: new Date()
    };

    // Service-specific customization
    switch (serviceName) {
      case 'inventory-service':
        if (eventPayload.productData) {
          servicePayload.inventoryData = {
            productId: eventPayload.productData.productId,
            inventoryChange: eventPayload.productData.changes?.inventoryChange,
            currentCount: eventPayload.fullDocument?.inventory_count,
            lowStockThreshold: eventPayload.fullDocument?.low_stock_threshold
          };
        }
        break;

      case 'search-service':
        if (eventPayload.productData) {
          servicePayload.searchData = {
            productId: eventPayload.productData.productId,
            indexOperation: eventPayload.operationType === 'delete' ? 'remove' : 'upsert',
            document: eventPayload.operationType !== 'delete' ? {
              name: eventPayload.fullDocument?.name,
              description: eventPayload.fullDocument?.description,
              category: eventPayload.fullDocument?.category,
              tags: eventPayload.fullDocument?.tags,
              searchable: eventPayload.fullDocument?.status === 'active'
            } : null
          };
        }
        break;

      case 'notification-service':
        if (eventPayload.customerInfo) {
          servicePayload.notificationData = {
            recipientEmail: eventPayload.customerInfo.email,
            recipientName: eventPayload.customerInfo.name,
            notificationType: this.determineNotificationType(eventPayload),
            priority: eventPayload.urgencyLevel || 'normal',
            templateData: this.buildNotificationTemplateData(eventPayload)
          };
        }
        break;

      case 'cache-service':
        servicePayload.cacheOperations = this.determineCacheOperations(eventPayload);
        break;
    }

    return servicePayload;
  }

  async sendEventToService(serviceName, endpoint, payload) {
    if (endpoint.startsWith('http://') || endpoint.startsWith('https://')) {
      // HTTP webhook delivery
      const response = await fetch(endpoint, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'X-Event-Source': 'mongodb-change-stream',
          'X-Event-ID': payload.eventId,
          'X-Correlation-ID': payload.processingMetadata?.correlationId
        },
        body: JSON.stringify(payload),
        timeout: 10000 // 10 second timeout
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }

    } else if (endpoint.startsWith('kafka://')) {
      // Kafka message delivery (mock implementation)
      await this.sendToKafka(endpoint, payload);

    } else if (endpoint.startsWith('redis://')) {
      // Redis cache operations (mock implementation)
      await this.sendToRedis(endpoint, payload);
    }
  }

  async sendToKafka(endpoint, payload) {
    // Mock Kafka implementation
    console.log(`[KAFKA] Sending to ${endpoint}:`, JSON.stringify(payload, null, 2));
  }

  async sendToRedis(endpoint, payload) {
    // Mock Redis implementation
    console.log(`[REDIS] Cache operation at ${endpoint}:`, JSON.stringify(payload.cacheOperations, null, 2));
  }

  determineNotificationType(eventPayload) {
    switch (eventPayload.eventType) {
      case 'order_created': return 'order_confirmation';
      case 'order_status_changed': 
        if (eventPayload.orderData?.changes?.statusChange?.to === 'shipped') return 'order_shipped';
        if (eventPayload.orderData?.changes?.statusChange?.to === 'delivered') return 'order_delivered';
        return 'order_update';
      case 'payment_status_changed': return 'payment_update';
      default: return 'general_update';
    }
  }

  buildNotificationTemplateData(eventPayload) {
    const templateData = {
      eventType: eventPayload.eventType,
      timestamp: eventPayload.timestamp
    };

    if (eventPayload.orderData) {
      templateData.order = {
        id: eventPayload.orderId,
        number: eventPayload.orderData.newOrder?.orderNumber,
        status: eventPayload.orderData.changes?.statusChange?.to || eventPayload.orderData.newOrder?.status,
        total: eventPayload.orderData.newOrder?.totalAmount
      };
    }

    return templateData;
  }

  determineCacheOperations(eventPayload) {
    const operations = [];

    if (eventPayload.eventType === 'product_change') {
      operations.push({
        operation: 'invalidate',
        keys: [
          `product:${eventPayload.productData?.productId}`,
          `products:category:${eventPayload.fullDocument?.category}`,
          'products:featured',
          'products:search:*'
        ]
      });
    }

    if (eventPayload.eventType === 'order_created' || eventPayload.eventType.includes('order_')) {
      operations.push({
        operation: 'invalidate',
        keys: [
          `customer:${eventPayload.customerInfo?.customerId}:orders`,
          `order:${eventPayload.orderId}`
        ]
      });
    }

    return operations;
  }

  async createAggregatedChangeStream() {
    console.log('Creating aggregated change stream for cross-collection events...');

    // Watch multiple collections for coordinated events
    const aggregatedPipeline = [
      {
        $match: {
          $and: [
            {
              "ns.coll": { $in: ["products", "orders", "inventory"] }
            },
            {
              $or: [
                { "operationType": "insert" },
                { "operationType": "update" },
                { "operationType": "delete" }
              ]
            }
          ]
        }
      },

      {
        $addFields: {
          "crossCollectionEventType": {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "orders"] },
                      { $eq: ["$operationType", "insert"] }
                    ]
                  },
                  then: "new_order_created"
                },
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "inventory"] },
                      { $lt: ["$fullDocument.quantity", 10] }
                    ]
                  },
                  then: "low_stock_alert"
                },
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "products"] },
                      { $eq: ["$updateDescription.updatedFields.status", "discontinued"] }
                    ]
                  },
                  then: "product_discontinued"
                }
              ],
              default: "standard_change"
            }
          }
        }
      }
    ];

    const aggregatedChangeStream = this.db.watch(aggregatedPipeline, {
      fullDocument: 'updateLookup'
    });

    aggregatedChangeStream.on('change', async (change) => {
      try {
        await this.handleCrossCollectionEvent(change);
      } catch (error) {
        console.error('Error handling cross-collection event:', error);
      }
    });

    this.changeStreams.set('aggregated', aggregatedChangeStream);
    console.log('✅ Aggregated change stream active');
  }

  async handleCrossCollectionEvent(change) {
    console.log(`Processing cross-collection event: ${change.crossCollectionEventType}`);

    if (change.crossCollectionEventType === 'new_order_created') {
      // Trigger inventory reservation
      await this.triggerInventoryReservation(change.fullDocument);
    } else if (change.crossCollectionEventType === 'low_stock_alert') {
      // Send low stock notifications
      await this.triggerLowStockAlert(change.fullDocument);
    } else if (change.crossCollectionEventType === 'product_discontinued') {
      // Handle product discontinuation workflow
      await this.handleProductDiscontinuation(change.documentKey._id);
    }
  }

  async triggerInventoryReservation(order) {
    console.log(`Triggering inventory reservation for order ${order._id}`);
    // Implementation would coordinate with inventory service
  }

  async triggerLowStockAlert(inventoryRecord) {
    console.log(`Triggering low stock alert for product ${inventoryRecord.productId}`);
    // Implementation would send alerts to purchasing team
  }

  async handleProductDiscontinuation(productId) {
    console.log(`Handling product discontinuation for ${productId}`);
    // Implementation would update related systems and cancel pending orders
  }

  updateProcessingMetrics(collection, operation, status) {
    const key = `${collection}-${operation}`;
    const current = this.processingMetrics.get(key) || { success: 0, error: 0 };
    current[status]++;
    this.processingMetrics.set(key, current);
  }

  async handleEventProcessingError(collection, change, error) {
    console.error(`Event processing error in ${collection}:`, error);

    // Log error for monitoring
    await this.db.collection('event_processing_errors').insertOne({
      collection,
      changeId: change._id,
      error: error.message,
      timestamp: new Date(),
      changeDetails: {
        operationType: change.operationType,
        documentKey: change.documentKey
      }
    });

    this.updateProcessingMetrics(collection, change.operationType, 'error');
  }

  async handleServiceDeliveryError(serviceName, eventPayload, error) {
    // Implement retry logic
    const retryKey = `${serviceName}-${eventPayload.eventId}`;
    console.warn(`Service delivery failed for ${serviceName}, scheduling retry...`);

    // Store for retry processing (implementation would use a proper queue)
    setTimeout(async () => {
      try {
        const endpoint = this.serviceEndpoints.get(serviceName);
        const servicePayload = this.customizePayloadForService(eventPayload, serviceName);
        await this.sendEventToService(serviceName, endpoint, servicePayload);
        console.log(`✅ Retry successful for ${serviceName}`);
      } catch (retryError) {
        console.error(`❌ Retry failed for ${serviceName}:`, retryError.message);
      }
    }, 5000); // 5 second retry delay
  }

  handleChangeStreamError(streamName, error) {
    console.error(`Change stream error for ${streamName}:`, error);

    // Implement stream recovery logic
    setTimeout(() => {
      console.log(`Attempting to recover change stream: ${streamName}`);
      // Recovery implementation would recreate the stream
    }, 10000);
  }

  async getProcessingMetrics() {
    const metrics = {
      activeStreams: Array.from(this.changeStreams.keys()),
      processingStats: Object.fromEntries(this.processingMetrics),
      timestamp: new Date()
    };

    return metrics;
  }

  async shutdown() {
    console.log('Shutting down change streams...');

    for (const [streamName, stream] of this.changeStreams) {
      await stream.close();
      console.log(`✅ Closed change stream: ${streamName}`);
    }

    this.changeStreams.clear();
    console.log('All change streams closed');
  }
}

// Export the change stream manager
module.exports = { MongoChangeStreamManager };

// Benefits of MongoDB Change Streams for Microservices:
// - Real-time event processing without polling overhead
// - Comprehensive filtering and transformation capabilities at the database level
// - Native support for microservices event routing and coordination
// - Automatic retry and error handling for distributed event processing
// - Cross-collection event aggregation for complex business workflows
// - Integration with existing MongoDB infrastructure without additional components
// - Scalable event processing that grows with your data and application needs
// - Built-in support for event ordering and consistency guarantees
// - Comprehensive monitoring and metrics for event processing pipelines
// - SQL-familiar event processing patterns through QueryLeaf integration

Understanding MongoDB Change Streams Architecture

Real-Time Event Processing Patterns

MongoDB Change Streams enable sophisticated real-time architectures with comprehensive event processing capabilities:

// Advanced event processing patterns for production microservices
class AdvancedEventProcessor {
  constructor(db) {
    this.db = db;
    this.eventProcessors = new Map();
    this.eventFilters = new Map();
    this.businessRules = new Map();
  }

  async setupEventDrivenWorkflows() {
    console.log('Setting up advanced event-driven workflows...');

    // Workflow 1: Order fulfillment coordination
    await this.createOrderFulfillmentWorkflow();

    // Workflow 2: Inventory management automation
    await this.createInventoryManagementWorkflow();

    // Workflow 3: Customer lifecycle events
    await this.createCustomerLifecycleWorkflow();

    // Workflow 4: Real-time analytics triggers
    await this.createAnalyticsTriggerWorkflow();

    console.log('Event-driven workflows active');
  }

  async createOrderFulfillmentWorkflow() {
    console.log('Creating order fulfillment workflow...');

    // Multi-stage fulfillment process triggered by order changes
    const fulfillmentPipeline = [
      {
        $match: {
          $and: [
            { "ns.coll": "orders" },
            {
              $or: [
                // New order created
                { "operationType": "insert" },

                // Order status progression
                {
                  $and: [
                    { "operationType": "update" },
                    {
                      "updateDescription.updatedFields.status": {
                        $in: ["confirmed", "processing", "fulfilling", "shipped"]
                      }
                    }
                  ]
                },

                // Payment confirmation
                {
                  $and: [
                    { "operationType": "update" },
                    { "updateDescription.updatedFields.payment.status": "captured" }
                  ]
                }
              ]
            }
          ]
        }
      },

      {
        $addFields: {
          "workflowStage": {
            $switch: {
              branches: [
                { case: { $eq: ["$operationType", "insert"] }, then: "order_received" },
                { case: { $eq: ["$updateDescription.updatedFields.payment.status", "captured"] }, then: "payment_confirmed" },
                { case: { $eq: ["$updateDescription.updatedFields.status", "confirmed"] }, then: "order_confirmed" },
                { case: { $eq: ["$updateDescription.updatedFields.status", "processing"] }, then: "processing_started" },
                { case: { $eq: ["$updateDescription.updatedFields.status", "fulfilling"] }, then: "fulfillment_started" },
                { case: { $eq: ["$updateDescription.updatedFields.status", "shipped"] }, then: "order_shipped" }
              ],
              default: "unknown_stage"
            }
          },

          "nextActions": {
            $switch: {
              branches: [
                { 
                  case: { $eq: ["$operationType", "insert"] },
                  then: ["validate_inventory", "process_payment", "send_confirmation"]
                },
                { 
                  case: { $eq: ["$updateDescription.updatedFields.payment.status", "captured"] },
                  then: ["reserve_inventory", "generate_pick_list", "notify_warehouse"]
                },
                { 
                  case: { $eq: ["$updateDescription.updatedFields.status", "processing"] },
                  then: ["allocate_warehouse", "schedule_picking", "update_eta"]
                },
                { 
                  case: { $eq: ["$updateDescription.updatedFields.status", "shipped"] },
                  then: ["send_tracking", "schedule_delivery_updates", "prepare_feedback_request"]
                }
              ],
              default: []
            }
          }
        }
      }
    ];

    const fulfillmentStream = this.db.watch(fulfillmentPipeline, {
      fullDocument: 'updateLookup',
      fullDocumentBeforeChange: 'whenAvailable'
    });

    fulfillmentStream.on('change', async (change) => {
      await this.processFulfillmentWorkflow(change);
    });

    this.eventProcessors.set('fulfillment', fulfillmentStream);
  }

  async processFulfillmentWorkflow(change) {
    const workflowContext = {
      orderId: change.documentKey._id,
      stage: change.workflowStage,
      nextActions: change.nextActions,
      orderData: change.fullDocument,
      timestamp: new Date()
    };

    console.log(`Processing fulfillment workflow: ${workflowContext.stage} for order ${workflowContext.orderId}`);

    // Execute next actions based on workflow stage
    for (const action of workflowContext.nextActions) {
      try {
        await this.executeWorkflowAction(action, workflowContext);
      } catch (error) {
        console.error(`Failed to execute workflow action ${action}:`, error);
        await this.handleWorkflowError(workflowContext, action, error);
      }
    }

    // Record workflow progress
    await this.recordWorkflowProgress(workflowContext);
  }

  async executeWorkflowAction(action, context) {
    console.log(`Executing workflow action: ${action}`);

    const actionHandlers = {
      'validate_inventory': () => this.validateInventoryAvailability(context),
      'process_payment': () => this.initiatePaymentProcessing(context),
      'send_confirmation': () => this.sendOrderConfirmation(context),
      'reserve_inventory': () => this.reserveInventoryItems(context),
      'generate_pick_list': () => this.generateWarehousePickList(context),
      'notify_warehouse': () => this.notifyWarehouseSystems(context),
      'allocate_warehouse': () => this.allocateOptimalWarehouse(context),
      'schedule_picking': () => this.schedulePickingSlot(context),
      'update_eta': () => this.updateEstimatedDelivery(context),
      'send_tracking': () => this.sendTrackingInformation(context),
      'schedule_delivery_updates': () => this.scheduleDeliveryNotifications(context),
      'prepare_feedback_request': () => this.prepareFeedbackCollection(context)
    };

    const handler = actionHandlers[action];
    if (handler) {
      await handler();
    } else {
      console.warn(`No handler found for workflow action: ${action}`);
    }
  }

  async createInventoryManagementWorkflow() {
    console.log('Creating inventory management workflow...');

    const inventoryPipeline = [
      {
        $match: {
          $and: [
            {
              $or: [
                { "ns.coll": "products" },
                { "ns.coll": "inventory" },
                { "ns.coll": "orders" }
              ]
            },
            {
              $or: [
                // Product inventory updates
                {
                  $and: [
                    { "ns.coll": "products" },
                    { "updateDescription.updatedFields.inventory_count": { $exists: true } }
                  ]
                },

                // Direct inventory updates  
                {
                  $and: [
                    { "ns.coll": "inventory" },
                    { "operationType": "update" }
                  ]
                },

                // New orders affecting inventory
                {
                  $and: [
                    { "ns.coll": "orders" },
                    { "operationType": "insert" }
                  ]
                }
              ]
            }
          ]
        }
      },

      {
        $addFields: {
          "inventoryEventType": {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "products"] },
                      { $lt: ["$updateDescription.updatedFields.inventory_count", 10] }
                    ]
                  },
                  then: "low_stock_alert"
                },
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "products"] },
                      { $eq: ["$updateDescription.updatedFields.inventory_count", 0] }
                    ]
                  },
                  then: "out_of_stock"
                },
                {
                  case: {
                    $and: [
                      { $eq: ["$ns.coll", "orders"] },
                      { $eq: ["$operationType", "insert"] }
                    ]
                  },
                  then: "inventory_reservation_needed"
                }
              ],
              default: "inventory_change"
            }
          }
        }
      }
    ];

    const inventoryStream = this.db.watch(inventoryPipeline, {
      fullDocument: 'updateLookup'
    });

    inventoryStream.on('change', async (change) => {
      await this.processInventoryWorkflow(change);
    });

    this.eventProcessors.set('inventory', inventoryStream);
  }

  async processInventoryWorkflow(change) {
    const eventType = change.inventoryEventType;

    console.log(`Processing inventory workflow: ${eventType}`);

    switch (eventType) {
      case 'low_stock_alert':
        await this.handleLowStockAlert(change);
        break;

      case 'out_of_stock':
        await this.handleOutOfStock(change);
        break;

      case 'inventory_reservation_needed':
        await this.handleInventoryReservation(change);
        break;

      default:
        await this.handleGeneralInventoryChange(change);
    }
  }

  async handleLowStockAlert(change) {
    const productId = change.documentKey._id;
    const currentCount = change.updateDescription?.updatedFields?.inventory_count;

    console.log(`Low stock alert: Product ${productId} has ${currentCount} units remaining`);

    // Trigger multiple actions
    await Promise.all([
      this.notifyPurchasingTeam(productId, currentCount),
      this.updateProductVisibility(productId, 'low_stock'),
      this.triggerReplenishmentOrder(productId),
      this.notifyCustomersOnWaitlist(productId)
    ]);
  }

  async handleOutOfStock(change) {
    const productId = change.documentKey._id;

    console.log(`Out of stock: Product ${productId}`);

    await Promise.all([
      this.updateProductStatus(productId, 'out_of_stock'),
      this.pauseMarketingCampaigns(productId),
      this.notifyCustomersBackorder(productId),
      this.createEmergencyReplenishment(productId)
    ]);
  }

  async createCustomerLifecycleWorkflow() {
    console.log('Creating customer lifecycle workflow...');

    const customerPipeline = [
      {
        $match: {
          $and: [
            {
              $or: [
                { "ns.coll": "customers" },
                { "ns.coll": "orders" }
              ]
            },
            {
              $or: [
                // New customer registration
                {
                  $and: [
                    { "ns.coll": "customers" },
                    { "operationType": "insert" }
                  ]
                },

                // Customer tier changes
                {
                  $and: [
                    { "ns.coll": "customers" },
                    { "updateDescription.updatedFields.tier": { $exists: true } }
                  ]
                },

                // First order placement
                {
                  $and: [
                    { "ns.coll": "orders" },
                    { "operationType": "insert" }
                  ]
                }
              ]
            }
          ]
        }
      }
    ];

    const customerStream = this.db.watch(customerPipeline, {
      fullDocument: 'updateLookup'
    });

    customerStream.on('change', async (change) => {
      await this.processCustomerLifecycleEvent(change);
    });

    this.eventProcessors.set('customer_lifecycle', customerStream);
  }

  async processCustomerLifecycleEvent(change) {
    if (change.ns.coll === 'customers' && change.operationType === 'insert') {
      await this.handleNewCustomerOnboarding(change.fullDocument);
    } else if (change.ns.coll === 'orders' && change.operationType === 'insert') {
      await this.handleCustomerOrderPlaced(change.fullDocument);
    }
  }

  async handleNewCustomerOnboarding(customer) {
    console.log(`Starting onboarding workflow for new customer: ${customer._id}`);

    const onboardingTasks = [
      { action: 'send_welcome_email', delay: 0 },
      { action: 'create_loyalty_account', delay: 1000 },
      { action: 'suggest_initial_products', delay: 5000 },
      { action: 'schedule_follow_up', delay: 86400000 } // 24 hours
    ];

    for (const task of onboardingTasks) {
      setTimeout(async () => {
        await this.executeCustomerAction(task.action, customer);
      }, task.delay);
    }
  }

  async executeCustomerAction(action, customer) {
    console.log(`Executing customer action: ${action} for customer ${customer._id}`);

    const actionHandlers = {
      'send_welcome_email': () => this.sendWelcomeEmail(customer),
      'create_loyalty_account': () => this.createLoyaltyAccount(customer),
      'suggest_initial_products': () => this.suggestProducts(customer),
      'schedule_follow_up': () => this.scheduleFollowUp(customer)
    };

    const handler = actionHandlers[action];
    if (handler) {
      await handler();
    }
  }

  // Service integration methods (mock implementations)
  async validateInventoryAvailability(context) {
    console.log(`✅ Validating inventory for order ${context.orderId}`);
  }

  async initiatePaymentProcessing(context) {
    console.log(`✅ Initiating payment processing for order ${context.orderId}`);
  }

  async sendOrderConfirmation(context) {
    console.log(`✅ Sending order confirmation for order ${context.orderId}`);
  }

  async notifyPurchasingTeam(productId, currentCount) {
    console.log(`✅ Notifying purchasing team: Product ${productId} has ${currentCount} units`);
  }

  async sendWelcomeEmail(customer) {
    console.log(`✅ Sending welcome email to ${customer.email}`);
  }

  async recordWorkflowProgress(context) {
    await this.db.collection('workflow_progress').insertOne({
      orderId: context.orderId,
      stage: context.stage,
      actions: context.nextActions,
      timestamp: context.timestamp,
      status: 'completed'
    });
  }

  async handleWorkflowError(context, action, error) {
    console.error(`Workflow error in ${action} for order ${context.orderId}:`, error.message);

    await this.db.collection('workflow_errors').insertOne({
      orderId: context.orderId,
      stage: context.stage,
      failedAction: action,
      error: error.message,
      timestamp: new Date(),
      retryCount: 0
    });
  }

  async getWorkflowMetrics() {
    const activeProcessors = Array.from(this.eventProcessors.keys());

    return {
      activeWorkflows: activeProcessors.length,
      processorNames: activeProcessors,
      timestamp: new Date()
    };
  }

  async shutdown() {
    console.log('Shutting down event processors...');

    for (const [name, processor] of this.eventProcessors) {
      await processor.close();
      console.log(`✅ Closed event processor: ${name}`);
    }

    this.eventProcessors.clear();
  }
}

// Export the advanced event processor
module.exports = { AdvancedEventProcessor };

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Change Streams and event processing:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream listener with SQL-style syntax
CREATE CHANGE_STREAM product_changes AS
SELECT 
  operation_type,
  document_key,
  full_document,
  full_document_before_change,
  update_description,

  -- Event classification
  CASE 
    WHEN operation_type = 'delete' THEN 'critical'
    WHEN operation_type = 'update' AND update_description.updated_fields ? 'status' THEN 'high'
    WHEN operation_type = 'update' AND (update_description.updated_fields ? 'price' OR update_description.updated_fields ? 'inventory_count') THEN 'medium'
    ELSE 'low'
  END as event_severity,

  -- Change summary
  CASE 
    WHEN operation_type = 'update' THEN 
      JSON_BUILD_OBJECT(
        'fields_changed', JSON_ARRAY_LENGTH(JSON_KEYS(update_description.updated_fields)),
        'key_changes', ARRAY(
          SELECT key FROM JSON_EACH_TEXT(update_description.updated_fields) WHERE key IN ('status', 'price', 'inventory_count')
        )
      )
    ELSE NULL
  END as change_summary,

  CURRENT_TIMESTAMP as processing_timestamp

FROM CHANGE_STREAM('products')
WHERE 
  operation_type IN ('insert', 'update', 'delete')
  AND (
    operation_type != 'update' OR
    (
      update_description.updated_fields ? 'status' OR
      update_description.updated_fields ? 'price' OR  
      update_description.updated_fields ? 'inventory_count' OR
      update_description.updated_fields ? 'name' OR
      update_description.updated_fields ? 'category'
    )
  );

-- Advanced change stream with business rules
CREATE CHANGE_STREAM order_workflow AS
WITH order_events AS (
  SELECT 
    operation_type,
    document_key.order_id,
    full_document,
    update_description,

    -- Workflow stage determination
    CASE 
      WHEN operation_type = 'insert' THEN 'order_created'
      WHEN operation_type = 'update' AND update_description.updated_fields ? 'status' THEN
        CASE update_description.updated_fields.status
          WHEN 'confirmed' THEN 'order_confirmed'
          WHEN 'processing' THEN 'processing_started'
          WHEN 'shipped' THEN 'order_shipped' 
          WHEN 'delivered' THEN 'order_completed'
          WHEN 'cancelled' THEN 'order_cancelled'
          ELSE 'status_updated'
        END
      WHEN operation_type = 'update' AND update_description.updated_fields ? 'payment.status' THEN 'payment_updated'
      ELSE 'order_modified'
    END as workflow_stage,

    -- Priority level
    CASE 
      WHEN operation_type = 'update' AND update_description.updated_fields.status = 'cancelled' THEN 'urgent'
      WHEN operation_type = 'insert' AND full_document.totals.grand_total > 1000 THEN 'high'
      WHEN operation_type = 'update' AND update_description.updated_fields ? 'payment.status' THEN 'medium'
      ELSE 'normal'
    END as priority_level,

    -- Next actions determination
    CASE 
      WHEN operation_type = 'insert' THEN 
        ARRAY['validate_inventory', 'process_payment', 'send_confirmation']
      WHEN operation_type = 'update' AND update_description.updated_fields.payment.status = 'captured' THEN
        ARRAY['reserve_inventory', 'notify_warehouse', 'update_eta']
      WHEN operation_type = 'update' AND update_description.updated_fields.status = 'processing' THEN
        ARRAY['allocate_warehouse', 'generate_pick_list', 'schedule_picking']
      WHEN operation_type = 'update' AND update_description.updated_fields.status = 'shipped' THEN
        ARRAY['send_tracking', 'schedule_delivery_updates', 'prepare_feedback']
      ELSE ARRAY[]::TEXT[]
    END as next_actions,

    CURRENT_TIMESTAMP as event_timestamp

  FROM CHANGE_STREAM('orders')
  WHERE operation_type IN ('insert', 'update')
),

workflow_routing AS (
  SELECT 
    oe.*,

    -- Determine target services based on workflow stage
    CASE workflow_stage
      WHEN 'order_created' THEN 
        ARRAY['inventory-service', 'payment-service', 'notification-service']
      WHEN 'payment_updated' THEN
        ARRAY['payment-service', 'fulfillment-service', 'accounting-service']
      WHEN 'order_shipped' THEN
        ARRAY['shipping-service', 'tracking-service', 'notification-service']
      WHEN 'order_cancelled' THEN
        ARRAY['inventory-service', 'payment-service', 'notification-service', 'analytics-service']
      ELSE ARRAY['fulfillment-service']
    END as target_services,

    -- Service-specific payloads
    JSON_BUILD_OBJECT(
      'event_id', GENERATE_UUID(),
      'event_type', workflow_stage,
      'priority', priority_level,
      'order_id', order_id,
      'customer_id', full_document.customer.customer_id,
      'order_total', full_document.totals.grand_total,
      'next_actions', next_actions,
      'timestamp', event_timestamp
    ) as event_payload

  FROM order_events oe
)

SELECT 
  order_id,
  workflow_stage,
  priority_level,
  UNNEST(target_services) as service_name,
  event_payload,

  -- Service endpoint routing
  CASE UNNEST(target_services)
    WHEN 'inventory-service' THEN 'http://inventory-service:3001/webhook/orders'
    WHEN 'payment-service' THEN 'http://payment-service:3002/events/orders'
    WHEN 'notification-service' THEN 'http://notification-service:3003/events/order'
    WHEN 'fulfillment-service' THEN 'http://fulfillment-service:3004/orders/events'
    WHEN 'shipping-service' THEN 'http://shipping-service:3005/orders/shipping'
    ELSE 'http://default-service:3000/webhook'
  END as target_endpoint,

  -- Delivery configuration
  JSON_BUILD_OBJECT(
    'timeout_ms', 10000,
    'retry_attempts', 3,
    'retry_backoff', 'exponential'
  ) as delivery_config

FROM workflow_routing
WHERE array_length(target_services, 1) > 0
ORDER BY 
  CASE priority_level
    WHEN 'urgent' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  event_timestamp ASC;

-- Cross-collection change aggregation
CREATE CHANGE_STREAM business_events AS
WITH cross_collection_changes AS (
  SELECT 
    namespace.collection as source_collection,
    operation_type,
    document_key,
    full_document,
    update_description,
    CURRENT_TIMESTAMP as change_timestamp

  FROM CHANGE_STREAM_DATABASE()
  WHERE namespace.collection IN ('products', 'orders', 'customers', 'inventory')
),

business_event_classification AS (
  SELECT 
    ccc.*,

    -- Business event type determination
    CASE 
      WHEN source_collection = 'orders' AND operation_type = 'insert' THEN 'new_sale'
      WHEN source_collection = 'customers' AND operation_type = 'insert' THEN 'customer_acquisition'
      WHEN source_collection = 'products' AND operation_type = 'update' AND 
           update_description.updated_fields ? 'inventory_count' AND 
           (update_description.updated_fields.inventory_count)::INTEGER < 10 THEN 'low_inventory'
      WHEN source_collection = 'orders' AND operation_type = 'update' AND
           update_description.updated_fields.status = 'cancelled' THEN 'order_cancellation'
      ELSE 'standard_change'
    END as business_event_type,

    -- Impact level assessment  
    CASE 
      WHEN source_collection = 'orders' AND full_document.totals.grand_total > 5000 THEN 'high_value'
      WHEN source_collection = 'products' AND update_description.updated_fields.inventory_count = 0 THEN 'critical'
      WHEN source_collection = 'customers' AND full_document.tier = 'enterprise' THEN 'vip'
      ELSE 'standard'
    END as impact_level,

    -- Coordinated actions needed
    CASE business_event_type
      WHEN 'new_sale' THEN ARRAY['update_analytics', 'check_inventory', 'process_loyalty_points']
      WHEN 'customer_acquisition' THEN ARRAY['send_welcome', 'setup_recommendations', 'track_source']
      WHEN 'low_inventory' THEN ARRAY['alert_purchasing', 'update_website', 'notify_subscribers']
      WHEN 'order_cancellation' THEN ARRAY['release_inventory', 'process_refund', 'update_analytics']
      ELSE ARRAY[]::TEXT[]
    END as coordinated_actions

  FROM cross_collection_changes ccc
),

event_aggregation AS (
  SELECT 
    bec.*,

    -- Aggregate related changes within time window
    COUNT(*) OVER (
      PARTITION BY business_event_type, impact_level 
      ORDER BY change_timestamp 
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
    ) as related_events_count,

    -- Time since last similar event
    EXTRACT(EPOCH FROM (
      change_timestamp - LAG(change_timestamp) OVER (
        PARTITION BY business_event_type 
        ORDER BY change_timestamp
      )
    )) as seconds_since_last_similar

  FROM business_event_classification bec
)

SELECT 
  business_event_type,
  impact_level,
  source_collection,
  document_key,
  related_events_count,
  coordinated_actions,

  -- Event batching for efficiency
  CASE 
    WHEN related_events_count > 5 AND seconds_since_last_similar < 300 THEN 'batch_process'
    WHEN impact_level = 'critical' THEN 'immediate_process'
    ELSE 'normal_process'
  END as processing_mode,

  -- Comprehensive event payload
  JSON_BUILD_OBJECT(
    'event_id', GENERATE_UUID(),
    'business_event_type', business_event_type,
    'impact_level', impact_level,
    'source_collection', source_collection,
    'operation_type', operation_type,
    'document_id', document_key,
    'full_document', full_document,
    'coordinated_actions', coordinated_actions,
    'related_events_count', related_events_count,
    'processing_mode', processing_mode,
    'timestamp', change_timestamp
  ) as event_payload,

  change_timestamp

FROM event_aggregation
WHERE business_event_type != 'standard_change'
ORDER BY 
  CASE impact_level 
    WHEN 'critical' THEN 1
    WHEN 'high_value' THEN 2  
    WHEN 'vip' THEN 3
    ELSE 4
  END,
  change_timestamp DESC;

-- Change stream monitoring and analytics
CREATE MATERIALIZED VIEW change_stream_analytics AS
WITH change_stream_metrics AS (
  SELECT 
    DATE_TRUNC('hour', event_timestamp) as hour_bucket,
    source_collection,
    operation_type,
    business_event_type,
    impact_level,

    -- Volume metrics
    COUNT(*) as event_count,
    COUNT(DISTINCT document_key) as unique_documents,

    -- Processing metrics
    AVG(processing_latency_ms) as avg_processing_latency,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_latency_ms) as p95_processing_latency,

    -- Success rate
    COUNT(*) FILTER (WHERE processing_status = 'success') as successful_events,
    COUNT(*) FILTER (WHERE processing_status = 'failed') as failed_events,
    COUNT(*) FILTER (WHERE processing_status = 'retry') as retry_events,

    -- Service delivery metrics
    AVG(service_delivery_time_ms) as avg_service_delivery_time,
    COUNT(*) FILTER (WHERE service_delivery_success = true) as successful_deliveries

  FROM change_stream_events_log
  WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY 
    DATE_TRUNC('hour', event_timestamp),
    source_collection,
    operation_type, 
    business_event_type,
    impact_level
),

performance_analysis AS (
  SELECT 
    csm.*,

    -- Success rates
    ROUND((successful_events::numeric / NULLIF(event_count, 0)) * 100, 2) as success_rate_percent,
    ROUND((successful_deliveries::numeric / NULLIF(event_count, 0)) * 100, 2) as delivery_success_rate_percent,

    -- Performance health score
    CASE 
      WHEN avg_processing_latency <= 100 AND success_rate_percent >= 95 THEN 'excellent'
      WHEN avg_processing_latency <= 500 AND success_rate_percent >= 90 THEN 'good'
      WHEN avg_processing_latency <= 1000 AND success_rate_percent >= 85 THEN 'fair'
      ELSE 'poor'
    END as performance_health,

    -- Trend analysis
    LAG(event_count) OVER (
      PARTITION BY source_collection, business_event_type 
      ORDER BY hour_bucket
    ) as previous_hour_count,

    LAG(avg_processing_latency) OVER (
      PARTITION BY source_collection, business_event_type
      ORDER BY hour_bucket  
    ) as previous_hour_latency

  FROM change_stream_metrics csm
)

SELECT 
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as monitoring_hour,
  source_collection,
  business_event_type,
  impact_level,
  event_count,
  unique_documents,

  -- Performance metrics
  ROUND(avg_processing_latency::numeric, 2) as avg_processing_latency_ms,
  ROUND(p95_processing_latency::numeric, 2) as p95_processing_latency_ms,
  success_rate_percent,
  delivery_success_rate_percent,
  performance_health,

  -- Volume trends
  CASE 
    WHEN previous_hour_count IS NOT NULL THEN
      ROUND(((event_count - previous_hour_count)::numeric / NULLIF(previous_hour_count, 0)) * 100, 1)
    ELSE NULL
  END as volume_change_percent,

  -- Performance trends
  CASE 
    WHEN previous_hour_latency IS NOT NULL THEN
      ROUND(((avg_processing_latency - previous_hour_latency)::numeric / NULLIF(previous_hour_latency, 0)) * 100, 1)
    ELSE NULL
  END as latency_change_percent,

  -- Health indicators
  CASE 
    WHEN performance_health = 'excellent' THEN '🟢 Optimal'
    WHEN performance_health = 'good' THEN '🟡 Good'
    WHEN performance_health = 'fair' THEN '🟠 Attention Needed'
    ELSE '🔴 Critical'
  END as health_indicator,

  -- Recommendations
  CASE 
    WHEN failed_events > event_count * 0.05 THEN 'High failure rate - investigate error causes'
    WHEN avg_processing_latency > 1000 THEN 'High latency - optimize event processing'
    WHEN retry_events > event_count * 0.1 THEN 'High retry rate - check service availability'
    WHEN event_count > previous_hour_count * 2 THEN 'Unusual volume spike - monitor capacity'
    ELSE 'Performance within normal parameters'
  END as recommendation

FROM performance_analysis
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY hour_bucket DESC, event_count DESC;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and management syntax
-- 2. Advanced event filtering and transformation with business logic
-- 3. Cross-collection event aggregation and coordination patterns
-- 4. Real-time workflow orchestration with SQL-style routing
-- 5. Comprehensive monitoring and analytics for event processing
-- 6. Service integration patterns with familiar SQL constructs
-- 7. Event batching and performance optimization strategies
-- 8. Business rule integration for intelligent event processing
-- 9. Error handling and retry logic with SQL-familiar patterns
-- 10. Native integration with MongoDB Change Streams infrastructure

Best Practices for Change Streams Implementation

Event-Driven Architecture Design

Essential practices for building production-ready event-driven systems:

Event Filtering: Design precise change stream filters to minimize processing overhead
Service Decoupling: Use event-driven patterns to maintain loose coupling between microservices
Error Handling: Implement comprehensive retry logic and dead letter patterns
Event Ordering: Consider event ordering requirements for business-critical workflows
Monitoring: Deploy extensive monitoring for event processing pipelines and service health
Scalability: Design event processing to scale horizontally with growing data volumes

Performance Optimization

Optimize change streams for high-throughput production environments:

Pipeline Optimization: Use efficient aggregation pipelines to filter events at the database level
Batch Processing: Group related events for efficient processing where appropriate
Resource Management: Monitor and manage change stream resource consumption
Service Coordination: Implement intelligent routing to avoid overwhelming downstream services
Caching Strategy: Use appropriate caching to reduce redundant processing
Capacity Planning: Plan for peak event volumes and service capacity requirements

Conclusion

MongoDB Change Streams provide comprehensive real-time event processing capabilities that enable sophisticated event-driven microservices architectures without the complexity and overhead of external message queues or polling mechanisms. The combination of native change data capture, intelligent event filtering, and comprehensive service integration patterns makes it ideal for building responsive, scalable distributed systems.

Key Change Streams benefits include:

Real-Time Processing: Native change data capture without polling overhead or latency
Intelligent Filtering: Comprehensive event filtering and transformation at the database level
Service Integration: Built-in patterns for microservices coordination and event routing
Workflow Orchestration: Advanced business logic integration for complex event-driven workflows
Scalable Architecture: Horizontal scaling capabilities that grow with your application needs
Developer Familiarity: SQL-compatible event processing patterns with MongoDB's flexible data model

Whether you're building e-commerce platforms, real-time analytics systems, IoT applications, or any system requiring immediate responsiveness to data changes, MongoDB Change Streams with QueryLeaf's SQL-familiar interface provides the foundation for modern event-driven architectures that scale efficiently while maintaining familiar development patterns.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style change stream operations into MongoDB Change Streams, providing familiar CREATE CHANGE_STREAM syntax, event filtering with SQL WHERE clauses, and comprehensive event routing patterns. Advanced event-driven workflows, business rule integration, and microservices coordination are seamlessly handled through familiar SQL constructs, making sophisticated real-time architecture both powerful and approachable for SQL-oriented development teams.

The integration of comprehensive event processing capabilities with SQL-familiar operations makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven solutions remain both effective and maintainable as they scale and evolve.

November 13, 2025
23 min read

MongoDB GridFS for File Storage and Binary Data Management: Production-Scale File Handling with SQL-Style Binary Operations

Modern applications require sophisticated file storage capabilities that can handle large binary files, streaming operations, and metadata management while maintaining performance and scalability across distributed systems. Traditional approaches to file storage often struggle with large file limitations, database size constraints, and the complexity of managing both structured data and binary content within a unified system.

MongoDB GridFS provides comprehensive file storage capabilities that seamlessly integrate with document databases, enabling applications to store and retrieve large files while maintaining ACID properties, metadata relationships, and query capabilities. Unlike external file systems that require separate infrastructure and complex synchronization, GridFS provides unified data management where files and metadata exist within the same transactional boundary and replication topology.

The Traditional File Storage Limitation Challenge

Conventional approaches to managing large files and binary data face significant architectural and performance limitations:

-- Traditional PostgreSQL BLOB storage - severe limitations with large files and performance

-- Basic file storage table with BYTEA limitations
CREATE TABLE file_storage (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(500) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    file_size BIGINT NOT NULL,
    file_data BYTEA NOT NULL,  -- Limited to ~1GB, causes performance issues
    upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    uploader_id UUID NOT NULL,

    -- Basic metadata
    description TEXT,
    tags TEXT[],
    is_public BOOLEAN DEFAULT false,
    access_count INTEGER DEFAULT 0,
    last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- File organization
    folder_path VARCHAR(1000),
    parent_folder_id UUID,

    -- Storage metadata
    storage_location VARCHAR(200),
    checksum VARCHAR(64),
    compression_type VARCHAR(20),
    original_size BIGINT
);

-- Additional tables for file relationships and versions
CREATE TABLE file_versions (
    version_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_storage(file_id),
    version_number INTEGER NOT NULL,
    file_data BYTEA NOT NULL,  -- Duplicate storage issues
    version_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    version_notes TEXT,
    created_by UUID NOT NULL
);

CREATE TABLE file_shares (
    share_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_storage(file_id),
    shared_with_user UUID NOT NULL,
    permission_level VARCHAR(20) NOT NULL CHECK (permission_level IN ('read', 'write', 'admin')),
    shared_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    share_link VARCHAR(500)
);

-- Attempt to retrieve files with associated metadata (problematic query)
WITH file_analysis AS (
    SELECT 
        f.file_id,
        f.filename,
        f.mime_type,
        f.file_size,
        f.upload_date,
        f.uploader_id,
        f.access_count,
        f.tags,
        f.folder_path,

        -- File data retrieval (major performance bottleneck)
        CASE 
            WHEN f.file_size > 1048576 THEN 'Large file - performance warning'
            WHEN f.file_size > 52428800 THEN 'Very large file - severe performance impact'
            ELSE 'Normal size file'
        END as size_warning,

        -- Version information
        COUNT(fv.version_id) as version_count,
        MAX(fv.version_date) as latest_version_date,
        SUM(LENGTH(fv.file_data)) as total_version_storage, -- Expensive calculation

        -- Sharing information
        COUNT(fs.share_id) as share_count,
        ARRAY_AGG(DISTINCT fs.permission_level) as permission_levels,

        -- Storage efficiency analysis
        f.file_size + COALESCE(SUM(LENGTH(fv.file_data)), 0) as total_storage_used,
        ROUND(
            (f.file_size::float / (f.file_size + COALESCE(SUM(LENGTH(fv.file_data)), 0))) * 100, 
            2
        ) as storage_efficiency_pct

    FROM file_storage f
    LEFT JOIN file_versions fv ON f.file_id = fv.file_id
    LEFT JOIN file_shares fs ON f.file_id = fs.file_id
    WHERE f.upload_date >= CURRENT_DATE - INTERVAL '90 days'
      AND f.file_size > 0
    GROUP BY f.file_id, f.filename, f.mime_type, f.file_size, f.upload_date, 
             f.uploader_id, f.access_count, f.tags, f.folder_path
),
file_performance_metrics AS (
    SELECT 
        fa.*,
        u.username as uploader_name,
        u.email as uploader_email,

        -- Performance classification
        CASE 
            WHEN fa.file_size > 104857600 THEN 'Performance Critical'  -- >100MB
            WHEN fa.file_size > 10485760 THEN 'Performance Impact'    -- >10MB  
            WHEN fa.file_size > 1048576 THEN 'Moderate Impact'        -- >1MB
            ELSE 'Low Impact'
        END as performance_impact,

        -- Storage optimization recommendations  
        CASE
            WHEN fa.version_count > 10 AND fa.storage_efficiency_pct < 20 THEN 
                'High version overhead - implement version cleanup'
            WHEN fa.total_storage_used > 1073741824 THEN  -- >1GB total
                'Large storage footprint - consider external storage'
            WHEN fa.file_size > 52428800 AND fa.access_count < 5 THEN  -- >50MB, low access
                'Large rarely-accessed file - candidate for archival'
            ELSE 'Storage usage acceptable'
        END as optimization_recommendation,

        -- Access patterns analysis
        CASE
            WHEN fa.access_count > 1000 THEN 'High traffic - consider CDN/caching'
            WHEN fa.access_count > 100 THEN 'Moderate traffic - monitor performance'
            WHEN fa.access_count < 10 THEN 'Low traffic - archival candidate'
            ELSE 'Normal traffic pattern'
        END as access_pattern_analysis

    FROM file_analysis fa
    JOIN users u ON fa.uploader_id = u.user_id
),
storage_summary AS (
    SELECT 
        COUNT(*) as total_files,
        SUM(file_size) as total_storage_bytes,
        ROUND(AVG(file_size)::numeric, 0) as avg_file_size,
        MAX(file_size) as largest_file_size,
        COUNT(*) FILTER (WHERE performance_impact = 'Performance Critical') as critical_files,
        COUNT(*) FILTER (WHERE version_count > 5) as high_version_files,

        -- Storage distribution
        ROUND((SUM(file_size)::numeric / 1024 / 1024 / 1024), 2) as total_storage_gb,
        ROUND((SUM(total_storage_used)::numeric / 1024 / 1024 / 1024), 2) as total_with_versions_gb,

        -- Performance impact assessment
        ROUND((
            COUNT(*) FILTER (WHERE performance_impact IN ('Performance Critical', 'Performance Impact'))::float /
            COUNT(*) * 100
        ), 1) as high_impact_files_pct
    FROM file_performance_metrics
)
SELECT 
    -- File details
    fpm.file_id,
    fpm.filename,
    fpm.mime_type,
    ROUND((fpm.file_size::numeric / 1024 / 1024), 2) as file_size_mb,
    fpm.upload_date,
    fpm.uploader_name,
    fpm.access_count,
    fpm.version_count,

    -- Performance and optimization
    fpm.performance_impact,
    fpm.optimization_recommendation,
    fpm.access_pattern_analysis,
    fpm.storage_efficiency_pct,

    -- File organization
    fpm.tags,
    fpm.folder_path,
    fpm.share_count,

    -- Summary statistics (same for all rows)
    ss.total_files,
    ss.total_storage_gb,
    ss.total_with_versions_gb,
    ss.high_impact_files_pct,

    -- Issues and warnings
    CASE 
        WHEN fpm.file_size > 1073741824 THEN 'WARNING: File size exceeds PostgreSQL practical limits'
        WHEN fpm.total_storage_used > 2147483648 THEN 'CRITICAL: Storage usage may cause performance issues'
        WHEN fpm.version_count > 20 THEN 'WARNING: Excessive version history'
        ELSE 'No major issues detected'
    END as system_warnings

FROM file_performance_metrics fpm
CROSS JOIN storage_summary ss
WHERE fpm.performance_impact IN ('Performance Critical', 'Performance Impact', 'Moderate Impact')
ORDER BY fpm.file_size DESC, fpm.access_count DESC
LIMIT 50;

-- Problems with traditional PostgreSQL BLOB approach:
-- 1. BYTEA field size limitations (~1GB practical limit, 1GB theoretical limit)
-- 2. Severe memory consumption during file retrieval and processing
-- 3. Query performance degradation with large binary data in results
-- 4. Backup and replication overhead due to large binary data in tables
-- 5. No built-in streaming capabilities for large file transfers
-- 6. Expensive storage overhead for file versioning and duplication
-- 7. Limited file organization and hierarchical structure support
-- 8. No native file chunk management or partial retrieval capabilities
-- 9. Complex application-level implementation required for file operations
-- 10. Poor integration between file operations and transactional data management

-- Alternative external storage approach (additional complexity)
CREATE TABLE external_file_references (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(500) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    file_size BIGINT NOT NULL,

    -- External storage references
    storage_provider VARCHAR(50) NOT NULL,  -- 'aws_s3', 'azure_blob', 'gcp_storage'
    storage_bucket VARCHAR(100) NOT NULL,
    storage_path VARCHAR(1000) NOT NULL,
    storage_url VARCHAR(2000),

    -- File metadata (separated from binary data)
    upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    uploader_id UUID NOT NULL,
    checksum VARCHAR(64),

    -- Synchronization challenges
    sync_status VARCHAR(20) DEFAULT 'pending',
    last_sync_attempt TIMESTAMP,
    sync_error_message TEXT
);

-- External storage approach problems:
-- 1. Complex synchronization between database metadata and external files
-- 2. Eventual consistency issues between file operations and transactions  
-- 3. Additional infrastructure dependencies and failure points
-- 4. Complex backup and disaster recovery coordination
-- 5. Network latency and bandwidth costs for file operations
-- 6. Security and access control complexity across multiple systems
-- 7. Limited transactional guarantees between file and data operations
-- 8. Vendor lock-in and migration challenges with cloud storage providers
-- 9. Additional cost and complexity for CDN, caching, and performance optimization
-- 10. Difficult monitoring and debugging across distributed storage systems

MongoDB GridFS provides comprehensive file storage with unified data management:

// MongoDB GridFS - comprehensive file storage and binary data management system
const { MongoClient, GridFSBucket, ObjectId } = require('mongodb');
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');
const mime = require('mime-types');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_file_management');

// Advanced GridFS file management system with comprehensive features
class EnterpriseGridFSManager {
  constructor(db, bucketName = 'fs') {
    this.db = db;
    this.bucket = new GridFSBucket(db, { bucketName: bucketName });

    // Configuration for production file management
    this.config = {
      bucketName: bucketName,
      chunkSizeBytes: 261120, // 255KB chunks for optimal streaming
      maxFileSize: 16777216000, // 16GB maximum file size
      allowedMimeTypes: [], // Empty array allows all types
      compressionEnabled: true,
      versioningEnabled: true,
      metadataIndexing: true,
      streamingOptimization: true
    };

    // Collections for enhanced file management
    this.collections = {
      files: db.collection(`${bucketName}.files`),
      chunks: db.collection(`${bucketName}.chunks`),
      metadata: db.collection('file_metadata'),
      versions: db.collection('file_versions'),
      shares: db.collection('file_shares'),
      analytics: db.collection('file_analytics')
    };

    // Performance and analytics tracking
    this.performanceMetrics = {
      uploadStats: new Map(),
      downloadStats: new Map(),
      storageStats: new Map()
    };

    this.initializeGridFSSystem();
  }

  async initializeGridFSSystem() {
    console.log('Initializing enterprise GridFS system...');

    try {
      // Create optimized indexes for GridFS performance
      await this.setupGridFSIndexes();

      // Initialize metadata tracking
      await this.initializeMetadataSystem();

      // Setup file analytics and monitoring
      await this.initializeFileAnalytics();

      // Configure streaming and performance optimization
      await this.configureStreamingOptimization();

      console.log('GridFS system initialized successfully');

    } catch (error) {
      console.error('Error initializing GridFS system:', error);
      throw error;
    }
  }

  async setupGridFSIndexes() {
    console.log('Setting up optimized GridFS indexes...');

    try {
      // Optimized indexes for GridFS files collection
      await this.collections.files.createIndex({ filename: 1, uploadDate: -1 });
      await this.collections.files.createIndex({ 'metadata.contentType': 1, uploadDate: -1 });
      await this.collections.files.createIndex({ 'metadata.uploader': 1, uploadDate: -1 });
      await this.collections.files.createIndex({ 'metadata.tags': 1 });
      await this.collections.files.createIndex({ 'metadata.folder': 1, filename: 1 });
      await this.collections.files.createIndex({ length: -1, uploadDate: -1 }); // Size-based queries

      // Specialized indexes for chunks collection performance  
      await this.collections.chunks.createIndex({ files_id: 1, n: 1 }, { unique: true });

      // Extended metadata indexes
      await this.collections.metadata.createIndex({ fileId: 1 }, { unique: true });
      await this.collections.metadata.createIndex({ 'customMetadata.project': 1 });
      await this.collections.metadata.createIndex({ 'customMetadata.department': 1 });
      await this.collections.metadata.createIndex({ 'permissions.userId': 1 });

      // Version management indexes
      await this.collections.versions.createIndex({ originalFileId: 1, versionNumber: -1 });
      await this.collections.versions.createIndex({ createdAt: -1 });

      // File sharing indexes
      await this.collections.shares.createIndex({ fileId: 1, userId: 1 });
      await this.collections.shares.createIndex({ shareToken: 1 }, { unique: true });
      await this.collections.shares.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });

      console.log('GridFS indexes created successfully');

    } catch (error) {
      console.error('Error creating GridFS indexes:', error);
      throw error;
    }
  }

  async uploadFileWithMetadata(filePath, metadata = {}) {
    console.log(`Uploading file: ${filePath}`);
    const startTime = Date.now();

    try {
      // Validate file and prepare metadata
      const fileStats = await fs.promises.stat(filePath);
      const filename = path.basename(filePath);
      const mimeType = mime.lookup(filePath) || 'application/octet-stream';

      // Validate file size and type
      await this.validateFileUpload(filePath, fileStats, mimeType);

      // Generate comprehensive metadata
      const enhancedMetadata = await this.generateFileMetadata(filePath, fileStats, metadata);

      // Create GridFS upload stream with optimized settings
      const uploadStream = this.bucket.openUploadStream(filename, {
        chunkSizeBytes: this.config.chunkSizeBytes,
        metadata: enhancedMetadata
      });

      // Create read stream and setup progress tracking
      const readStream = fs.createReadStream(filePath);
      let uploadedBytes = 0;

      readStream.on('data', (chunk) => {
        uploadedBytes += chunk.length;
        const progress = (uploadedBytes / fileStats.size * 100).toFixed(1);
        if (uploadedBytes % (1024 * 1024) === 0 || uploadedBytes === fileStats.size) {
          console.log(`Upload progress: ${progress}% (${uploadedBytes}/${fileStats.size} bytes)`);
        }
      });

      // Handle upload completion
      return new Promise((resolve, reject) => {
        uploadStream.on('error', (error) => {
          console.error('Upload error:', error);
          reject(error);
        });

        uploadStream.on('finish', async () => {
          const uploadTime = Date.now() - startTime;
          console.log(`File uploaded successfully in ${uploadTime}ms`);

          try {
            // Store extended metadata
            await this.storeExtendedMetadata(uploadStream.id, enhancedMetadata, filePath);

            // Update analytics
            await this.updateUploadAnalytics(uploadStream.id, fileStats.size, uploadTime);

            // Generate file preview if applicable
            await this.generateFilePreview(uploadStream.id, mimeType, filePath);

            resolve({
              fileId: uploadStream.id,
              filename: filename,
              size: fileStats.size,
              mimeType: mimeType,
              uploadTime: uploadTime,
              metadata: enhancedMetadata,
              success: true
            });

          } catch (metadataError) {
            console.error('Error storing extended metadata:', metadataError);
            // File upload succeeded, but metadata storage failed
            resolve({
              fileId: uploadStream.id,
              filename: filename,
              size: fileStats.size,
              success: true,
              warning: 'Extended metadata storage failed',
              error: metadataError.message
            });
          }
        });

        // Start the upload
        readStream.pipe(uploadStream);
      });

    } catch (error) {
      console.error(`Failed to upload file ${filePath}:`, error);
      throw error;
    }
  }

  async downloadFileStream(fileId, options = {}) {
    console.log(`Creating download stream for file: ${fileId}`);

    try {
      // Validate file exists and get metadata
      const fileInfo = await this.getFileInfo(fileId);
      if (!fileInfo) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Check download permissions
      await this.validateDownloadPermissions(fileId, options.userId);

      // Create optimized download stream
      const downloadStream = this.bucket.openDownloadStream(new ObjectId(fileId), {
        start: options.start || 0,
        end: options.end || undefined
      });

      // Track download analytics
      await this.updateDownloadAnalytics(fileId, options.userId);

      // Setup error handling
      downloadStream.on('error', (error) => {
        console.error(`Download stream error for file ${fileId}:`, error);
      });

      return {
        stream: downloadStream,
        fileInfo: fileInfo,
        contentType: fileInfo.metadata?.contentType || 'application/octet-stream',
        contentLength: fileInfo.length,
        filename: fileInfo.filename
      };

    } catch (error) {
      console.error(`Failed to create download stream for file ${fileId}:`, error);
      throw error;
    }
  }

  async searchFiles(searchCriteria, options = {}) {
    console.log('Performing advanced file search...');

    try {
      const pipeline = [];

      // Build search pipeline based on criteria
      const matchStage = this.buildSearchMatchStage(searchCriteria);
      if (Object.keys(matchStage).length > 0) {
        pipeline.push({ $match: matchStage });
      }

      // Join with extended metadata
      pipeline.push({
        $lookup: {
          from: 'file_metadata',
          localField: '_id',
          foreignField: 'fileId',
          as: 'extendedMetadata'
        }
      });

      // Join with version information
      pipeline.push({
        $lookup: {
          from: 'file_versions',
          localField: '_id', 
          foreignField: 'originalFileId',
          as: 'versions',
          pipeline: [
            { $sort: { versionNumber: -1 } },
            { $limit: 5 }
          ]
        }
      });

      // Join with sharing information
      pipeline.push({
        $lookup: {
          from: 'file_shares',
          localField: '_id',
          foreignField: 'fileId',
          as: 'shares'
        }
      });

      // Add computed fields and analytics
      pipeline.push({
        $addFields: {
          // File size in human readable format
          sizeFormatted: {
            $switch: {
              branches: [
                { case: { $gte: ['$length', 1073741824] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1073741824] }, 2] } }, ' GB'] } },
                { case: { $gte: ['$length', 1048576] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1048576] }, 2] } }, ' MB'] } },
                { case: { $gte: ['$length', 1024] }, then: { $concat: [{ $toString: { $round: [{ $divide: ['$length', 1024] }, 2] } }, ' KB'] } }
              ],
              default: { $concat: [{ $toString: '$length' }, ' bytes'] }
            }
          },

          // Version information
          versionCount: { $size: '$versions' },
          latestVersion: { $arrayElemAt: ['$versions.versionNumber', 0] },

          // Sharing information
          shareCount: { $size: '$shares' },
          isShared: { $gt: [{ $size: '$shares' }, 0] },

          // Extended metadata extraction
          customMetadata: { $arrayElemAt: ['$extendedMetadata.customMetadata', 0] },
          permissions: { $arrayElemAt: ['$extendedMetadata.permissions', 0] },

          // File age calculation
          ageInDays: {
            $round: [{
              $divide: [
                { $subtract: [new Date(), '$uploadDate'] },
                1000 * 60 * 60 * 24
              ]
            }, 0]
          }
        }
      });

      // Apply sorting
      const sortStage = this.buildSortStage(options.sortBy, options.sortOrder);
      pipeline.push({ $sort: sortStage });

      // Apply pagination
      if (options.skip) pipeline.push({ $skip: options.skip });
      if (options.limit) pipeline.push({ $limit: options.limit });

      // Project final results
      pipeline.push({
        $project: {
          _id: 1,
          filename: 1,
          length: 1,
          sizeFormatted: 1,
          uploadDate: 1,
          ageInDays: 1,
          'metadata.contentType': 1,
          'metadata.uploader': 1,
          'metadata.tags': 1,
          'metadata.folder': 1,
          versionCount: 1,
          latestVersion: 1,
          shareCount: 1,
          isShared: 1,
          customMetadata: 1,
          permissions: 1
        }
      });

      // Execute search
      const results = await this.collections.files.aggregate(pipeline).toArray();

      // Get total count for pagination
      const totalCount = await this.getSearchResultCount(searchCriteria);

      return {
        files: results,
        totalCount: totalCount,
        pageSize: options.limit || results.length,
        currentPage: options.skip ? Math.floor(options.skip / (options.limit || 20)) + 1 : 1,
        totalPages: options.limit ? Math.ceil(totalCount / options.limit) : 1,
        searchCriteria: searchCriteria,
        executionTime: Date.now() - (options.startTime || Date.now())
      };

    } catch (error) {
      console.error('File search error:', error);
      throw error;
    }
  }

  async generateFileMetadata(filePath, fileStats, customMetadata) {
    const metadata = {
      // Basic file information
      originalPath: filePath,
      contentType: mime.lookup(filePath) || 'application/octet-stream',
      size: fileStats.size,

      // File timestamps
      createdAt: fileStats.birthtime,
      modifiedAt: fileStats.mtime,
      uploadedAt: new Date(),

      // User and context information
      uploader: customMetadata.uploader || 'system',
      uploaderIP: customMetadata.uploaderIP,
      userAgent: customMetadata.userAgent,

      // File organization
      folder: customMetadata.folder || '/',
      tags: customMetadata.tags || [],
      category: customMetadata.category || 'general',

      // Security and permissions
      permissions: customMetadata.permissions || { public: false, users: [] },
      encryption: customMetadata.encryption || false,

      // File characteristics
      checksum: await this.calculateFileChecksum(filePath),
      compression: this.shouldCompressFile(mime.lookup(filePath)),

      // Custom business metadata
      ...customMetadata
    };

    return metadata;
  }

  async calculateFileChecksum(filePath) {
    return new Promise((resolve, reject) => {
      const hash = crypto.createHash('sha256');
      const stream = fs.createReadStream(filePath);

      stream.on('data', (data) => hash.update(data));
      stream.on('end', () => resolve(hash.digest('hex')));
      stream.on('error', (error) => reject(error));
    });
  }

  async storeExtendedMetadata(fileId, metadata, filePath) {
    const extendedMetadata = {
      fileId: fileId,
      customMetadata: metadata,
      createdAt: new Date(),

      // File analysis results
      analysis: {
        isImage: this.isImageFile(metadata.contentType),
        isVideo: this.isVideoFile(metadata.contentType),
        isAudio: this.isAudioFile(metadata.contentType),
        isDocument: this.isDocumentFile(metadata.contentType),
        isArchive: this.isArchiveFile(metadata.contentType)
      },

      // Storage optimization
      storageOptimization: {
        compressionRecommended: this.shouldCompressFile(metadata.contentType),
        archivalCandidate: false, // Will be updated based on access patterns
        cachingRecommended: this.shouldCacheFile(metadata.contentType)
      },

      // Performance tracking
      performance: {
        uploadDuration: 0, // Will be updated
        averageDownloadTime: null,
        accessCount: 0,
        lastAccessed: null
      }
    };

    await this.collections.metadata.insertOne(extendedMetadata);
  }

  async updateUploadAnalytics(fileId, fileSize, uploadTime) {
    const analytics = {
      fileId: fileId,
      event: 'upload',
      timestamp: new Date(),
      metrics: {
        fileSize: fileSize,
        uploadDuration: uploadTime,
        uploadSpeed: Math.round(fileSize / (uploadTime / 1000)), // bytes per second
      }
    };

    await this.collections.analytics.insertOne(analytics);
  }

  async updateDownloadAnalytics(fileId, userId) {
    const analytics = {
      fileId: new ObjectId(fileId),
      userId: userId,
      event: 'download',
      timestamp: new Date(),
      ipAddress: null, // Would be filled from request context
      userAgent: null  // Would be filled from request context
    };

    await this.collections.analytics.insertOne(analytics);

    // Update access count and last accessed time in metadata
    await this.collections.metadata.updateOne(
      { fileId: new ObjectId(fileId) },
      {
        $inc: { 'performance.accessCount': 1 },
        $set: { 'performance.lastAccessed': new Date() }
      }
    );
  }

  buildSearchMatchStage(criteria) {
    const match = {};

    // Filename search
    if (criteria.filename) {
      match.filename = new RegExp(criteria.filename, 'i');
    }

    // Content type filtering
    if (criteria.contentType) {
      match['metadata.contentType'] = criteria.contentType;
    }

    // Size range filtering
    if (criteria.minSize || criteria.maxSize) {
      match.length = {};
      if (criteria.minSize) match.length.$gte = criteria.minSize;
      if (criteria.maxSize) match.length.$lte = criteria.maxSize;
    }

    // Date range filtering
    if (criteria.dateFrom || criteria.dateTo) {
      match.uploadDate = {};
      if (criteria.dateFrom) match.uploadDate.$gte = new Date(criteria.dateFrom);
      if (criteria.dateTo) match.uploadDate.$lte = new Date(criteria.dateTo);
    }

    // Tag filtering
    if (criteria.tags && criteria.tags.length > 0) {
      match['metadata.tags'] = { $in: criteria.tags };
    }

    // Folder filtering
    if (criteria.folder) {
      match['metadata.folder'] = criteria.folder;
    }

    // Uploader filtering
    if (criteria.uploader) {
      match['metadata.uploader'] = criteria.uploader;
    }

    return match;
  }

  buildSortStage(sortBy = 'uploadDate', sortOrder = 'desc') {
    const sortDirection = sortOrder.toLowerCase() === 'desc' ? -1 : 1;

    switch (sortBy.toLowerCase()) {
      case 'filename':
        return { filename: sortDirection };
      case 'size':
        return { length: sortDirection };
      case 'contenttype':
        return { 'metadata.contentType': sortDirection };
      case 'uploader':
        return { 'metadata.uploader': sortDirection };
      default:
        return { uploadDate: sortDirection };
    }
  }

  async getFileInfo(fileId) {
    try {
      const file = await this.collections.files.findOne({ _id: new ObjectId(fileId) });
      return file;
    } catch (error) {
      console.error(`Error getting file info for ${fileId}:`, error);
      return null;
    }
  }

  async validateFileUpload(filePath, fileStats, mimeType) {
    // Size validation
    if (fileStats.size > this.config.maxFileSize) {
      throw new Error(`File size ${fileStats.size} exceeds maximum allowed size ${this.config.maxFileSize}`);
    }

    // MIME type validation (if restrictions configured)
    if (this.config.allowedMimeTypes.length > 0 && !this.config.allowedMimeTypes.includes(mimeType)) {
      throw new Error(`File type ${mimeType} is not allowed`);
    }

    // File accessibility validation
    try {
      await fs.promises.access(filePath, fs.constants.R_OK);
    } catch (error) {
      throw new Error(`Cannot read file: ${filePath}`);
    }
  }

  async validateDownloadPermissions(fileId, userId) {
    // In a real implementation, this would check user permissions
    // For now, we'll just validate the file exists
    const fileInfo = await this.getFileInfo(fileId);
    if (!fileInfo) {
      throw new Error(`File not found: ${fileId}`);
    }
    return true;
  }

  async getSearchResultCount(searchCriteria) {
    const matchStage = this.buildSearchMatchStage(searchCriteria);
    return await this.collections.files.countDocuments(matchStage);
  }

  // Utility methods for file type detection and optimization

  isImageFile(contentType) {
    return contentType && contentType.startsWith('image/');
  }

  isVideoFile(contentType) {
    return contentType && contentType.startsWith('video/');
  }

  isAudioFile(contentType) {
    return contentType && contentType.startsWith('audio/');
  }

  isDocumentFile(contentType) {
    const documentTypes = [
      'application/pdf',
      'application/msword',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      'application/vnd.ms-excel',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      'text/plain',
      'text/csv'
    ];
    return documentTypes.includes(contentType);
  }

  isArchiveFile(contentType) {
    const archiveTypes = [
      'application/zip',
      'application/x-rar-compressed',
      'application/x-tar',
      'application/gzip'
    ];
    return archiveTypes.includes(contentType);
  }

  shouldCompressFile(contentType) {
    // Don't compress already compressed files
    const noCompressionTypes = [
      'image/jpeg',
      'image/png', 
      'video/',
      'audio/',
      'application/zip',
      'application/x-rar-compressed'
    ];

    return !noCompressionTypes.some(type => contentType && contentType.startsWith(type));
  }

  shouldCacheFile(contentType) {
    // Cache frequently accessed file types
    const cacheableTypes = [
      'image/',
      'text/css',
      'text/javascript',
      'application/javascript'
    ];

    return cacheableTypes.some(type => contentType && contentType.startsWith(type));
  }

  async generateFilePreview(fileId, mimeType, filePath) {
    // Placeholder for preview generation logic
    // Would implement thumbnail generation for images, 
    // text extraction for documents, etc.
    console.log(`Preview generation for ${fileId} (${mimeType}) - placeholder`);
  }

  async initializeMetadataSystem() {
    console.log('Initializing metadata tracking system...');
    // Placeholder for metadata system initialization
  }

  async initializeFileAnalytics() {
    console.log('Initializing file analytics system...');
    // Placeholder for analytics system initialization
  }

  async configureStreamingOptimization() {
    console.log('Configuring streaming optimization...');
    // Placeholder for streaming optimization configuration
  }
}

// Benefits of MongoDB GridFS for File Storage:
// - Native support for files larger than 16MB (BSON document size limit)
// - Automatic chunking and reassembly for efficient streaming operations
// - Built-in metadata storage and indexing capabilities
// - Seamless integration with MongoDB's replication and sharding
// - ACID transaction support for file operations
// - Comprehensive file versioning and relationship management
// - Advanced query capabilities on file metadata and content
// - Automatic load balancing and fault tolerance
// - Integration with MongoDB's authentication and authorization
// - Unified backup and restore with application data

module.exports = {
  EnterpriseGridFSManager
};

Understanding MongoDB GridFS Architecture

Advanced File Management Patterns and Performance Optimization

Implement sophisticated GridFS strategies for production-scale file management:

// Production-ready GridFS with advanced file management and optimization patterns
class ProductionGridFSPlatform extends EnterpriseGridFSManager {
  constructor(db, config = {}) {
    super(db, config.bucketName);

    this.productionConfig = {
      ...config,

      // Advanced GridFS configuration
      replicationFactor: 3,
      shardingStrategy: 'file_size_based',
      compressionAlgorithm: 'zstd',
      encryptionEnabled: true,

      // Performance optimization
      chunkCaching: {
        enabled: true,
        maxCacheSize: '2GB',
        ttl: 3600 // 1 hour
      },

      // Storage tiering
      storageTiers: {
        hot: { accessFrequency: 'daily', compressionLevel: 'fast' },
        warm: { accessFrequency: 'weekly', compressionLevel: 'balanced' },
        cold: { accessFrequency: 'monthly', compressionLevel: 'maximum' },
        archive: { accessFrequency: 'yearly', compressionLevel: 'ultra' }
      },

      // Advanced features
      contentDeduplication: true,
      automaticTiering: true,
      virusScanning: true,
      contentIndexing: true
    };

    this.initializeProductionFeatures();
  }

  async implementAdvancedFileManagement() {
    console.log('Implementing advanced production file management...');

    const managementFeatures = {
      // Automated storage tiering
      storageTiering: await this.implementStorageTiering(),

      // Content deduplication
      deduplication: await this.setupContentDeduplication(),

      // Advanced security features
      security: await this.implementAdvancedSecurity(),

      // Performance monitoring and optimization
      performance: await this.setupPerformanceOptimization(),

      // Disaster recovery and backup
      backup: await this.configureBackupStrategies(),

      // Content delivery optimization
      cdn: await this.setupCDNIntegration()
    };

    return {
      features: managementFeatures,
      monitoring: await this.setupProductionMonitoring(),
      maintenance: await this.configureAutomatedMaintenance()
    };
  }

  async implementStorageTiering() {
    console.log('Implementing automated storage tiering...');

    const tieringStrategy = {
      // Automatic tier migration based on access patterns
      migrationRules: [
        {
          condition: 'accessCount < 5 AND ageInDays > 30',
          action: 'migrate_to_warm',
          compressionIncrease: true
        },
        {
          condition: 'accessCount < 2 AND ageInDays > 90', 
          action: 'migrate_to_cold',
          compressionMaximize: true
        },
        {
          condition: 'accessCount = 0 AND ageInDays > 365',
          action: 'migrate_to_archive',
          compressionUltra: true
        }
      ],

      // Performance optimization per tier
      tierOptimization: {
        hot: { 
          chunkSize: 261120,
          cachePolicy: 'aggressive',
          replicationFactor: 3 
        },
        warm: { 
          chunkSize: 524288,
          cachePolicy: 'moderate', 
          replicationFactor: 2
        },
        cold: { 
          chunkSize: 1048576,
          cachePolicy: 'minimal',
          replicationFactor: 1
        }
      }
    };

    // Implement tiering automation
    await this.setupTieringAutomation(tieringStrategy);

    return tieringStrategy;
  }

  async setupContentDeduplication() {
    console.log('Setting up content deduplication system...');

    const deduplicationSystem = {
      // Hash-based deduplication
      hashingStrategy: {
        algorithm: 'sha256',
        chunkLevel: true,
        fileLevel: true,
        crossBucketDeduplication: true
      },

      // Reference counting for shared chunks
      referenceManagement: {
        chunkReferences: true,
        garbageCollection: true,
        orphanCleanup: true
      },

      // Space savings tracking
      savingsTracking: {
        enabled: true,
        reportingInterval: 'daily',
        alertThresholds: {
          spaceReclaimed: '1GB',
          deduplicationRatio: 0.2
        }
      }
    };

    // Create deduplication indexes and processes
    await this.implementDeduplicationSystem(deduplicationSystem);

    return deduplicationSystem;
  }

  async implementAdvancedSecurity() {
    console.log('Implementing advanced security features...');

    const securityFeatures = {
      // Encryption at rest and in transit
      encryption: {
        atRest: {
          algorithm: 'AES-256-GCM',
          keyRotation: 'quarterly',
          keyManagement: 'vault_integration'
        },
        inTransit: {
          tlsMinVersion: '1.3',
          certificateValidation: 'strict'
        }
      },

      // Access control and auditing
      accessControl: {
        roleBasedAccess: true,
        attributeBasedAccess: true,
        temporaryAccess: true,
        shareLinksWithExpiration: true
      },

      // Content security
      contentSecurity: {
        virusScanning: {
          enabled: true,
          scanOnUpload: true,
          quarantineInfected: true
        },
        contentFiltering: {
          enabled: true,
          malwareDetection: true,
          dataLossPreventionRules: []
        }
      },

      // Audit and compliance
      auditing: {
        accessLogging: true,
        modificationTracking: true,
        retentionPolicies: true,
        complianceReporting: true
      }
    };

    await this.deploySecurityFeatures(securityFeatures);

    return securityFeatures;
  }

  async setupPerformanceOptimization() {
    console.log('Setting up performance optimization system...');

    const performanceOptimization = {
      // Intelligent caching strategies
      caching: {
        chunkCaching: {
          memoryCache: '2GB',
          diskCache: '20GB',
          distributedCache: true
        },
        metadataCache: {
          size: '500MB',
          ttl: '1 hour'
        },
        preloadStrategies: {
          popularFiles: true,
          sequentialAccess: true,
          userPatterns: true
        }
      },

      // Connection and streaming optimization
      streaming: {
        connectionPooling: {
          minConnections: 10,
          maxConnections: 100,
          connectionTimeout: '30s'
        },
        chunkOptimization: {
          adaptiveChunkSize: true,
          parallelStreaming: true,
          compressionOnTheFly: true
        }
      },

      // Load balancing and scaling
      scaling: {
        autoScaling: {
          enabled: true,
          metrics: ['cpu', 'memory', 'io'],
          thresholds: { cpu: 70, memory: 80, io: 75 }
        },
        loadBalancing: {
          algorithm: 'least_connections',
          healthChecks: true,
          failoverTimeout: '5s'
        }
      }
    };

    await this.deployPerformanceOptimization(performanceOptimization);

    return performanceOptimization;
  }

  // Advanced implementation methods

  async setupTieringAutomation(strategy) {
    // Create background job for automated tiering
    const tieringJob = {
      schedule: '0 2 * * *', // Daily at 2 AM
      action: async () => {
        console.log('Running automated storage tiering...');

        // Analyze file access patterns
        const analysisResults = await this.analyzeFileAccessPatterns();

        // Apply tiering rules
        for (const rule of strategy.migrationRules) {
          await this.applyTieringRule(rule, analysisResults);
        }

        // Generate tiering report
        await this.generateTieringReport();
      }
    };

    // Schedule the tiering automation
    await this.scheduleBackgroundJob('storage_tiering', tieringJob);
  }

  async implementDeduplicationSystem(system) {
    // Create deduplication tracking collections
    await this.collections.chunks.createIndex({ 'data': 'hashed' });

    // Setup chunk reference tracking
    const chunkReferences = this.db.collection('chunk_references');
    await chunkReferences.createIndex({ chunkHash: 1, refCount: 1 });

    // Implement deduplication logic in upload process
    this.enableChunkDeduplication = true;
  }

  async deploySecurityFeatures(features) {
    // Setup encryption middleware
    if (features.encryption.atRest.algorithm) {
      await this.setupEncryptionMiddleware(features.encryption);
    }

    // Configure access control
    await this.setupAccessControlSystem(features.accessControl);

    // Enable content security scanning
    if (features.contentSecurity.virusScanning.enabled) {
      await this.setupVirusScanning(features.contentSecurity.virusScanning);
    }
  }

  async deployPerformanceOptimization(optimization) {
    // Configure caching layers
    await this.setupCachingSystem(optimization.caching);

    // Optimize streaming configuration
    await this.configureStreamingOptimization(optimization.streaming);

    // Setup auto-scaling
    if (optimization.scaling.autoScaling.enabled) {
      await this.configureAutoScaling(optimization.scaling.autoScaling);
    }
  }

  // Monitoring and analytics methods

  async generateComprehensiveAnalytics() {
    const analytics = {
      storageAnalytics: await this.generateStorageAnalytics(),
      performanceAnalytics: await this.generatePerformanceAnalytics(),
      usageAnalytics: await this.generateUsageAnalytics(),
      securityAnalytics: await this.generateSecurityAnalytics()
    };

    return analytics;
  }

  async generateStorageAnalytics() {
    const pipeline = [
      {
        $group: {
          _id: null,
          totalFiles: { $sum: 1 },
          totalStorage: { $sum: '$length' },
          avgFileSize: { $avg: '$length' },
          minFileSize: { $min: '$length' },
          maxFileSize: { $max: '$length' },

          // Storage by content type
          imageFiles: { $sum: { $cond: [{ $regexMatch: { input: '$metadata.contentType', regex: '^image/' } }, 1, 0] } },
          videoFiles: { $sum: { $cond: [{ $regexMatch: { input: '$metadata.contentType', regex: '^video/' } }, 1, 0] } },
          documentFiles: { $sum: { $cond: [{ $regexMatch: { input: '$metadata.contentType', regex: '^application/' } }, 1, 0] } },

          // Storage by size category
          smallFiles: { $sum: { $cond: [{ $lt: ['$length', 1048576] }, 1, 0] } }, // < 1MB
          mediumFiles: { $sum: { $cond: [{ $and: [{ $gte: ['$length', 1048576] }, { $lt: ['$length', 104857600] }] }, 1, 0] } }, // 1MB - 100MB
          largeFiles: { $sum: { $cond: [{ $gte: ['$length', 104857600] }, 1, 0] } }, // > 100MB

          // Storage by age
          recentFiles: { $sum: { $cond: [{ $gte: ['$uploadDate', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] }, 1, 0] } }
        }
      }
    ];

    const results = await this.collections.files.aggregate(pipeline).toArray();
    return results[0] || {};
  }

  async setupProductionMonitoring() {
    const monitoring = {
      metrics: [
        'storage_utilization',
        'upload_throughput', 
        'download_throughput',
        'cache_hit_ratio',
        'deduplication_savings',
        'security_events'
      ],

      alerts: [
        { metric: 'storage_utilization', threshold: 85, severity: 'warning' },
        { metric: 'upload_throughput', threshold: 100, severity: 'critical' },
        { metric: 'cache_hit_ratio', threshold: 70, severity: 'warning' }
      ],

      dashboards: [
        'storage_overview',
        'performance_metrics', 
        'security_dashboard',
        'usage_analytics'
      ]
    };

    return monitoring;
  }

  async initializeProductionFeatures() {
    console.log('Initializing production GridFS features...');
    // Placeholder for production feature initialization
  }

  async configureAutomatedMaintenance() {
    return {
      tasks: [
        'chunk_optimization',
        'metadata_cleanup',
        'performance_tuning',
        'security_updates'
      ],
      schedule: 'daily_2am'
    };
  }
}

SQL-Style File Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations and file management:

-- QueryLeaf GridFS file management with SQL-familiar syntax

-- Create virtual tables for file management operations
CREATE FILE_STORAGE TABLE documents_bucket 
USING GRIDFS (
  bucket_name = 'documents',
  chunk_size = 261120,
  compression = true,
  encryption = true
)
WITH STORAGE_OPTIONS (
  auto_tiering = true,
  deduplication = true,
  virus_scanning = true,

  -- Storage tier configuration
  tier_hot = { access_frequency = 'daily', compression_level = 'fast' },
  tier_warm = { access_frequency = 'weekly', compression_level = 'balanced' },
  tier_cold = { access_frequency = 'monthly', compression_level = 'maximum' }
);

-- Upload files with comprehensive metadata
INSERT INTO documents_bucket (
  filename,
  content,
  content_type,
  metadata
) VALUES (
  'project_document.pdf',
  LOAD_FILE('/uploads/project_document.pdf'),
  'application/pdf',
  JSON_OBJECT(
    'uploader', '[email protected]',
    'project_id', 'PROJ-2024-001',
    'department', 'engineering',
    'classification', 'confidential',
    'tags', JSON_ARRAY('project', 'specification', '2024'),
    'folder', '/projects/2024/specifications',
    'permissions', JSON_OBJECT(
      'public', false,
      'users', JSON_ARRAY('john.doe', 'jane.smith', 'team.lead'),
      'roles', JSON_ARRAY('project_manager', 'engineer')
    ),
    'custom_fields', JSON_OBJECT(
      'review_status', 'pending',
      'approval_required', true,
      'retention_years', 7
    )
  )
);

-- Comprehensive file search and management queries
WITH file_analytics AS (
  SELECT 
    file_id,
    filename,
    file_size,
    content_type,
    upload_date,
    metadata->>'$.uploader' as uploader,
    metadata->>'$.department' as department,
    metadata->>'$.folder' as folder_path,
    JSON_EXTRACT(metadata, '$.tags') as tags,

    -- File age calculation
    DATEDIFF(CURRENT_DATE, upload_date) as age_days,

    -- Size categorization
    CASE 
      WHEN file_size < 1048576 THEN 'Small (<1MB)'
      WHEN file_size < 104857600 THEN 'Medium (1-100MB)'  
      WHEN file_size < 1073741824 THEN 'Large (100MB-1GB)'
      ELSE 'Very Large (>1GB)'
    END as size_category,

    -- Content type categorization
    CASE
      WHEN content_type LIKE 'image/%' THEN 'Image'
      WHEN content_type LIKE 'video/%' THEN 'Video'
      WHEN content_type LIKE 'audio/%' THEN 'Audio'
      WHEN content_type IN ('application/pdf', 'application/msword', 
                           'application/vnd.openxmlformats-officedocument.wordprocessingml.document') 
        THEN 'Document'
      WHEN content_type LIKE 'text/%' THEN 'Text'
      ELSE 'Other'
    END as content_category,

    -- Access pattern analysis from analytics collection
    COALESCE(a.access_count, 0) as total_accesses,
    COALESCE(a.last_access_date, upload_date) as last_accessed,

    -- Storage tier recommendation
    CASE
      WHEN COALESCE(a.access_count, 0) = 0 AND DATEDIFF(CURRENT_DATE, upload_date) > 365 
        THEN 'archive'
      WHEN COALESCE(a.access_count, 0) < 5 AND DATEDIFF(CURRENT_DATE, upload_date) > 90 
        THEN 'cold'
      WHEN COALESCE(a.access_count, 0) < 20 AND DATEDIFF(CURRENT_DATE, upload_date) > 30 
        THEN 'warm'  
      ELSE 'hot'
    END as recommended_tier

  FROM documents_bucket fb
  LEFT JOIN (
    SELECT 
      file_id,
      COUNT(*) as access_count,
      MAX(access_date) as last_access_date,
      AVG(download_duration_ms) as avg_download_time
    FROM file_access_log
    WHERE access_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 YEAR)
    GROUP BY file_id
  ) a ON fb.file_id = a.file_id
),

storage_optimization AS (
  SELECT 
    fa.*,

    -- Deduplication analysis
    COUNT(*) OVER (PARTITION BY CHECKSUM(content)) as duplicate_count,
    CASE 
      WHEN COUNT(*) OVER (PARTITION BY CHECKSUM(content)) > 1 
        THEN 'Deduplication opportunity'
      ELSE 'Unique file'
    END as deduplication_status,

    -- Compression potential
    CASE
      WHEN content_category IN ('Text', 'Document') AND file_size > 1048576 
        THEN 'High compression potential'
      WHEN content_category = 'Image' AND content_type NOT IN ('image/jpeg', 'image/png')
        THEN 'Moderate compression potential'
      ELSE 'Low compression potential'  
    END as compression_potential,

    -- Storage cost analysis
    file_size * 
      CASE recommended_tier
        WHEN 'hot' THEN 0.10
        WHEN 'warm' THEN 0.05  
        WHEN 'cold' THEN 0.02
        WHEN 'archive' THEN 0.01
      END as estimated_monthly_storage_cost,

    -- Performance impact assessment
    CASE
      WHEN total_accesses > 100 AND file_size > 104857600 
        THEN 'High performance impact - consider optimization'
      WHEN total_accesses > 50 AND age_days < 30
        THEN 'Frequently accessed - ensure hot tier placement'
      WHEN total_accesses = 0 AND age_days > 30
        THEN 'Unused file - candidate for archival or deletion'
      ELSE 'Normal performance profile'
    END as performance_assessment

  FROM file_analytics fa
),

security_compliance AS (
  SELECT 
    so.*,

    -- Access control validation
    CASE
      WHEN JSON_EXTRACT(metadata, '$.classification') = 'confidential' AND 
           JSON_EXTRACT(metadata, '$.permissions.public') = true
        THEN 'SECURITY RISK: Confidential file marked as public'
      WHEN JSON_LENGTH(JSON_EXTRACT(metadata, '$.permissions.users')) > 10
        THEN 'WARNING: File shared with many users'
      WHEN metadata->>'$.permissions' IS NULL
        THEN 'WARNING: No explicit permissions defined'
      ELSE 'Access control compliant'
    END as security_status,

    -- Retention policy compliance
    CASE
      WHEN metadata->>'$.retention_years' IS NOT NULL AND 
           age_days > (CAST(metadata->>'$.retention_years' AS SIGNED) * 365)
        THEN 'COMPLIANCE: File exceeds retention period - schedule for deletion'
      WHEN metadata->>'$.retention_years' IS NULL
        THEN 'WARNING: No retention policy defined'
      ELSE 'Retention policy compliant'
    END as retention_status,

    -- Data classification validation
    CASE
      WHEN metadata->>'$.classification' IS NULL
        THEN 'WARNING: No data classification assigned'
      WHEN metadata->>'$.classification' = 'confidential' AND department = 'public'
        THEN 'ERROR: Classification mismatch with department'
      ELSE 'Classification appropriate'
    END as classification_status

  FROM storage_optimization so
)

-- Final comprehensive file management report
SELECT 
  -- File identification
  sc.file_id,
  sc.filename,
  sc.folder_path,
  sc.uploader,
  sc.department,

  -- File characteristics
  sc.size_category,
  sc.content_category,
  ROUND(sc.file_size / 1024 / 1024, 2) as size_mb,
  sc.age_days,

  -- Access patterns
  sc.total_accesses,
  DATEDIFF(CURRENT_DATE, sc.last_accessed) as days_since_access,

  -- Storage optimization
  sc.recommended_tier,
  sc.deduplication_status,
  sc.compression_potential,
  ROUND(sc.estimated_monthly_storage_cost, 4) as monthly_cost_usd,

  -- Performance and security
  sc.performance_assessment,
  sc.security_status,
  sc.retention_status,
  sc.classification_status,

  -- Action recommendations
  CASE
    WHEN sc.security_status LIKE 'SECURITY RISK%' THEN 'URGENT: Review security settings'
    WHEN sc.retention_status LIKE 'COMPLIANCE%' THEN 'SCHEDULE: File deletion per retention policy'
    WHEN sc.recommended_tier != 'hot' AND sc.total_accesses > 20 THEN 'OPTIMIZE: Move to hot tier'
    WHEN sc.duplicate_count > 1 THEN 'OPTIMIZE: Implement deduplication'
    WHEN sc.performance_assessment LIKE 'High performance impact%' THEN 'OPTIMIZE: File size or access pattern'
    ELSE 'MAINTAIN: No immediate action required'
  END as recommended_action,

  -- Priority calculation
  CASE
    WHEN sc.security_status LIKE 'SECURITY RISK%' OR sc.security_status LIKE 'ERROR%' THEN 'CRITICAL'
    WHEN sc.retention_status LIKE 'COMPLIANCE%' THEN 'HIGH'
    WHEN sc.performance_assessment LIKE 'High performance impact%' THEN 'HIGH'
    WHEN sc.deduplication_status = 'Deduplication opportunity' AND sc.file_size > 10485760 THEN 'MEDIUM'
    WHEN sc.recommended_tier = 'archive' AND sc.total_accesses = 0 THEN 'MEDIUM'
    ELSE 'LOW'
  END as priority_level

FROM security_compliance sc
WHERE sc.recommended_action != 'MAINTAIN: No immediate action required'
   OR sc.priority_level IN ('CRITICAL', 'HIGH')
ORDER BY 
  CASE priority_level 
    WHEN 'CRITICAL' THEN 1 
    WHEN 'HIGH' THEN 2 
    WHEN 'MEDIUM' THEN 3 
    ELSE 4 
  END,
  sc.file_size DESC;

-- File streaming and download operations with SQL syntax
SELECT 
  file_id,
  filename,
  content_type,
  file_size,

  -- Generate streaming URLs for different access patterns
  CONCAT('/api/files/stream/', file_id) as stream_url,
  CONCAT('/api/files/download/', file_id, '?filename=', URLENCODE(filename)) as download_url,

  -- Generate thumbnail/preview URLs for supported content types  
  CASE
    WHEN content_type LIKE 'image/%' THEN 
      CONCAT('/api/files/thumbnail/', file_id, '?size=200x200')
    WHEN content_type = 'application/pdf' THEN
      CONCAT('/api/files/preview/', file_id, '?page=1&format=image')
    WHEN content_type LIKE 'video/%' THEN
      CONCAT('/api/files/thumbnail/', file_id, '?time=00:00:05')
    ELSE NULL
  END as preview_url,

  -- Generate sharing links with expiration
  GENERATE_SHARE_LINK(file_id, '7 days', 'read') as temporary_share_link,

  -- Content delivery optimization
  CASE
    WHEN total_accesses > 100 THEN 'CDN_RECOMMENDED'
    WHEN file_size > 104857600 THEN 'STREAMING_RECOMMENDED'  
    WHEN content_type LIKE 'image/%' THEN 'CACHE_AGGRESSIVE'
    ELSE 'STANDARD_DELIVERY'
  END as delivery_optimization

FROM file_analytics
WHERE security_status NOT LIKE 'SECURITY RISK%'
  AND (
    total_accesses > 10 OR 
    upload_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
  )
ORDER BY total_accesses DESC, file_size DESC;

-- File versioning and history management
CREATE FILE_VERSIONS TABLE document_versions
USING GRIDFS (
  bucket_name = 'document_versions',
  parent_table = 'documents_bucket'
);

-- Version management operations
WITH version_analysis AS (
  SELECT 
    original_file_id,
    COUNT(*) as version_count,
    MAX(version_number) as latest_version,
    SUM(file_size) as total_version_storage,
    MIN(created_date) as first_version_date,
    MAX(created_date) as latest_version_date,

    -- Calculate storage overhead from versioning
    file_size as original_size,
    SUM(file_size) - file_size as version_overhead_bytes,
    ROUND(((SUM(file_size) - file_size) / file_size * 100), 2) as version_overhead_pct

  FROM document_versions dv
  JOIN documents_bucket db ON dv.original_file_id = db.file_id
  GROUP BY original_file_id, file_size
),
version_optimization AS (
  SELECT 
    va.*,

    -- Version cleanup recommendations
    CASE
      WHEN version_count > 10 AND version_overhead_pct > 300 
        THEN 'Aggressive cleanup recommended - keep last 3 versions'
      WHEN version_count > 5 AND version_overhead_pct > 200
        THEN 'Moderate cleanup recommended - keep last 5 versions'  
      WHEN version_count > 20
        THEN 'Version limit enforcement recommended'
      ELSE 'Version count acceptable'
    END as cleanup_recommendation,

    -- Storage impact assessment
    CASE
      WHEN total_version_storage > 1073741824 -- >1GB
        THEN 'High storage impact - prioritize optimization'
      WHEN total_version_storage > 104857600 -- >100MB  
        THEN 'Moderate storage impact - monitor'
      ELSE 'Low storage impact'
    END as storage_impact

  FROM version_analysis va
)

SELECT 
  original_file_id,
  (SELECT filename FROM documents_bucket WHERE file_id = vo.original_file_id) as filename,
  version_count,
  latest_version,
  ROUND(total_version_storage / 1024 / 1024, 2) as total_storage_mb,
  ROUND(version_overhead_bytes / 1024 / 1024, 2) as overhead_mb, 
  version_overhead_pct,
  cleanup_recommendation,
  storage_impact,

  -- Generate cleanup commands
  CASE cleanup_recommendation
    WHEN 'Aggressive cleanup recommended - keep last 3 versions' THEN
      CONCAT('DELETE FROM document_versions WHERE original_file_id = ''', original_file_id, 
             ''' AND version_number <= ', latest_version - 3)
    WHEN 'Moderate cleanup recommended - keep last 5 versions' THEN  
      CONCAT('DELETE FROM document_versions WHERE original_file_id = ''', original_file_id,
             ''' AND version_number <= ', latest_version - 5)
    ELSE 'No cleanup required'
  END as cleanup_command

FROM version_optimization vo
WHERE version_count > 3
ORDER BY total_version_storage DESC, version_overhead_pct DESC;

-- QueryLeaf GridFS capabilities:
-- 1. SQL-familiar syntax for GridFS file operations and management
-- 2. Advanced file metadata querying and analytics with JSON operations
-- 3. Automated storage tiering and optimization recommendations
-- 4. Comprehensive security and compliance validation
-- 5. File versioning and history management with cleanup automation
-- 6. Performance optimization through intelligent caching and delivery
-- 7. Content deduplication and compression analysis
-- 8. Integration with MongoDB's native GridFS capabilities
-- 9. Real-time file analytics and usage pattern analysis  
-- 10. Production-ready file management with monitoring and alerting

Best Practices for Production GridFS Management

File Storage Architecture and Performance Optimization

Essential principles for effective MongoDB GridFS deployment and management:

Chunk Size Optimization: Configure appropriate chunk sizes (255KB default) based on file types and access patterns for optimal streaming performance
Index Strategy: Implement comprehensive indexing on metadata fields for fast file discovery and management operations
Storage Tiering: Design automated storage tiering strategies based on access frequency and file age for cost optimization
Content Deduplication: Implement hash-based deduplication to reduce storage overhead and improve efficiency
Security Integration: Deploy encryption, access control, and content scanning for enterprise security requirements
Performance Monitoring: Track upload/download throughput, cache hit ratios, and storage utilization continuously

Scalability and Production Optimization

Optimize GridFS deployments for enterprise-scale file management:

Sharding Strategy: Design effective sharding strategies for large file collections based on access patterns and geographic distribution
Replication Configuration: Configure appropriate replication factors based on availability requirements and storage costs
Caching Implementation: Deploy multi-tier caching (memory, disk, distributed) for frequently accessed files
Content Delivery: Integrate with CDN services for global file distribution and performance optimization
Backup Management: Implement comprehensive backup strategies that handle both metadata and binary content efficiently
Resource Management: Monitor and optimize CPU, memory, and storage resources for sustained performance

Conclusion

MongoDB GridFS provides comprehensive file storage capabilities that seamlessly integrate binary data management with document database operations, enabling applications to handle large files while maintaining ACID properties, metadata relationships, and query capabilities. The unified data management approach eliminates the complexity of external file systems while providing enterprise-grade features for security, performance, and scalability.

Key MongoDB GridFS benefits include:

Unified Data Management: Seamless integration of file storage with document data within the same database system
Scalable Architecture: Native support for large files with automatic chunking and streaming capabilities
Advanced Metadata: Comprehensive metadata storage and indexing for powerful file discovery and management
Production Features: Enterprise security, encryption, deduplication, and automated storage tiering
Performance Optimization: Intelligent caching, compression, and content delivery optimization
Operational Simplicity: Unified backup, replication, and monitoring with existing MongoDB infrastructure

Whether you're building content management systems, media platforms, document repositories, or IoT data storage solutions, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for scalable and efficient file management.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB GridFS operations while providing SQL-familiar syntax for file upload, download, search, and management operations. Advanced file management patterns, storage optimization strategies, and production-ready features are seamlessly handled through familiar SQL constructs, making sophisticated file storage both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust GridFS capabilities with SQL-style file operations makes it an ideal platform for applications requiring both advanced file management and familiar database interaction patterns, ensuring your file storage solutions remain performant, secure, and maintainable as they scale.

November 12, 2025
23 min read

MongoDB Atlas Vector Search for AI Applications: Building Semantic Search and Retrieval-Augmented Generation Systems with SQL-Style Operations

Modern AI applications require sophisticated data retrieval capabilities that go beyond traditional text matching to understand semantic meaning, context, and conceptual similarity. Vector search technology enables applications to find relevant information based on meaning rather than exact keyword matches, powering everything from recommendation engines to retrieval-augmented generation (RAG) systems.

MongoDB Atlas Vector Search provides native vector database capabilities integrated directly into MongoDB's document model, enabling developers to build AI applications without managing separate vector databases. Unlike standalone vector databases that require complex data synchronization and additional infrastructure, Atlas Vector Search combines traditional document operations with vector similarity search in a single, scalable platform.

The Traditional Vector Search Infrastructure Challenge

Building AI applications with traditional vector databases often requires complex, fragmented infrastructure:

-- Traditional PostgreSQL with pgvector extension - complex setup and limited scalability

-- Enable vector extension (requires superuser privileges)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table for document storage with vector embeddings
CREATE TABLE document_embeddings (
    document_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    source_url TEXT,
    document_type VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Vector embedding column (limited to 16,000 dimensions in pgvector)
    embedding vector(1536), -- OpenAI embedding dimension

    -- Metadata for filtering
    category VARCHAR(100),
    language VARCHAR(10) DEFAULT 'en',
    author VARCHAR(200),
    tags TEXT[],

    -- Full-text search support
    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(content, '')), 'B')
    ) STORED
);

-- Vector similarity index (limited indexing options)
CREATE INDEX embedding_idx ON document_embeddings 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 1000); -- Requires manual tuning

-- Full-text search index
CREATE INDEX document_search_idx ON document_embeddings USING GIN(search_vector);

-- Compound index for metadata filtering
CREATE INDEX document_metadata_idx ON document_embeddings(category, language, created_at);

-- Complex vector similarity search with metadata filtering
WITH vector_search AS (
  SELECT 
    document_id,
    title,
    content,
    category,
    author,
    created_at,

    -- Cosine similarity calculation
    1 - (embedding <=> $1::vector) as similarity_score,

    -- L2 distance (alternative metric)
    embedding <-> $1::vector as l2_distance,

    -- Inner product similarity  
    (embedding <#> $1::vector) * -1 as inner_product_similarity,

    -- Hybrid scoring combining vector and text search
    ts_rank(search_vector, plainto_tsquery('english', $2)) as text_relevance_score

  FROM document_embeddings
  WHERE 
    -- Metadata filtering (applied before vector search for performance)
    category = ANY($3::text[]) 
    AND language = $4
    AND created_at >= $5::timestamp

    -- Optional full-text pre-filtering
    AND (CASE WHEN $2 IS NOT NULL AND $2 != '' 
         THEN search_vector @@ plainto_tsquery('english', $2)
         ELSE true END)
),

ranked_results AS (
  SELECT *,
    -- Hybrid ranking combining multiple signals
    (0.7 * similarity_score + 0.3 * text_relevance_score) as hybrid_score,

    -- Relevance classification
    CASE 
      WHEN similarity_score >= 0.8 THEN 'highly_relevant'
      WHEN similarity_score >= 0.6 THEN 'relevant'  
      WHEN similarity_score >= 0.4 THEN 'somewhat_relevant'
      ELSE 'low_relevance'
    END as relevance_category,

    -- Diversity scoring (for result diversification)
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY similarity_score DESC) as category_rank

  FROM vector_search
  WHERE similarity_score >= 0.3 -- Similarity threshold
),

diversified_results AS (
  SELECT *,
    -- Result diversification logic
    CASE 
      WHEN category_rank <= 2 THEN hybrid_score -- Top 2 per category get full score
      WHEN category_rank <= 5 THEN hybrid_score * 0.8 -- Next 3 get reduced score
      ELSE hybrid_score * 0.5 -- Others get significantly reduced score
    END as diversified_score

  FROM ranked_results
)

SELECT 
  document_id,
  title,
  LEFT(content, 500) as content_preview, -- Truncate for performance
  category,
  author,
  created_at,
  ROUND(similarity_score::numeric, 4) as similarity,
  ROUND(text_relevance_score::numeric, 4) as text_relevance,
  ROUND(diversified_score::numeric, 4) as final_score,
  relevance_category,

  -- Highlight matching terms (requires additional processing)
  ts_headline('english', content, plainto_tsquery('english', $2), 
              'MaxWords=50, MinWords=20, MaxFragments=3') as highlighted_content

FROM diversified_results
ORDER BY diversified_score DESC, similarity_score DESC
LIMIT $6::int -- Result limit parameter
OFFSET $7::int; -- Pagination offset

-- Problems with traditional vector database approaches:
-- 1. Complex infrastructure requiring separate vector database setup and management
-- 2. Limited integration between vector search and traditional document operations
-- 3. Manual index tuning and maintenance for optimal vector search performance
-- 4. Difficult data synchronization between operational databases and vector stores
-- 5. Limited scalability and high operational complexity for production deployments
-- 6. Fragmented query capabilities requiring multiple systems for comprehensive search
-- 7. Complex hybrid search implementations combining vector and traditional search
-- 8. Limited support for real-time updates and dynamic vector index management
-- 9. Expensive infrastructure costs for separate specialized vector database systems
-- 10. Difficult migration paths and vendor lock-in with specialized vector database solutions

-- Pinecone example (proprietary vector database)
-- Requires separate service, API calls, and complex data synchronization
-- Limited filtering capabilities and expensive for large-scale applications
-- No native SQL interface or familiar query patterns

-- Weaviate/Chroma examples similarly require:
-- - Separate infrastructure and service management  
-- - Complex data pipeline orchestration
-- - Limited integration with existing application databases
-- - Expensive scaling and operational complexity

MongoDB Atlas Vector Search provides integrated vector database capabilities:

// MongoDB Atlas Vector Search - native integration with document operations
const { MongoClient } = require('mongodb');

// Advanced Atlas Vector Search system for AI applications
class AtlasVectorSearchManager {
  constructor(connectionString, databaseName) {
    this.client = new MongoClient(connectionString);
    this.db = this.client.db(databaseName);
    this.collections = {
      documents: this.db.collection('documents'),
      embeddings: this.db.collection('embeddings'), 
      searchLogs: this.db.collection('search_logs'),
      userProfiles: this.db.collection('user_profiles')
    };

    this.embeddingDimensions = 1536; // OpenAI embedding size
    this.searchConfigs = new Map();
    this.performanceMetrics = new Map();
  }

  async createVectorSearchIndexes() {
    console.log('Creating optimized vector search indexes for AI applications...');

    try {
      // Primary vector search index for document embeddings
      await this.collections.documents.createSearchIndex({
        name: "document_vector_index",
        type: "vectorSearch",
        definition: {
          "fields": [
            {
              "type": "vector",
              "path": "embedding",
              "numDimensions": this.embeddingDimensions,
              "similarity": "cosine"
            },
            {
              "type": "filter", 
              "path": "metadata.category"
            },
            {
              "type": "filter",
              "path": "metadata.language" 
            },
            {
              "type": "filter",
              "path": "metadata.source"
            },
            {
              "type": "filter",
              "path": "created_at"
            },
            {
              "type": "filter",
              "path": "metadata.tags"
            }
          ]
        }
      });

      // Hybrid search index combining full-text and vector search
      await this.collections.documents.createSearchIndex({
        name: "hybrid_search_index",
        type: "search",
        definition: {
          "mappings": {
            "dynamic": false,
            "fields": {
              "title": {
                "type": "text",
                "analyzer": "lucene.standard"
              },
              "content": {
                "type": "text", 
                "analyzer": "lucene.english"
              },
              "metadata": {
                "type": "document",
                "fields": {
                  "category": {
                    "type": "string"
                  },
                  "tags": {
                    "type": "stringFacet"
                  },
                  "language": {
                    "type": "string"
                  }
                }
              }
            }
          }
        }
      });

      // User preference vector index for personalized search
      await this.collections.userProfiles.createSearchIndex({
        name: "user_preference_vector_index",
        type: "vectorSearch", 
        definition: {
          "fields": [
            {
              "type": "vector",
              "path": "preference_embedding",
              "numDimensions": this.embeddingDimensions,
              "similarity": "cosine"
            },
            {
              "type": "filter",
              "path": "user_id"
            },
            {
              "type": "filter", 
              "path": "profile_type"
            }
          ]
        }
      });

      console.log('Vector search indexes created successfully');
      return { success: true, indexes: ['document_vector_index', 'hybrid_search_index', 'user_preference_vector_index'] };

    } catch (error) {
      console.error('Error creating vector search indexes:', error);
      return { success: false, error: error.message };
    }
  }

  async ingestDocumentsWithEmbeddings(documents, embeddingFunction) {
    console.log(`Ingesting ${documents.length} documents with vector embeddings...`);

    const batchSize = 100;
    const batches = [];
    let totalIngested = 0;

    // Process documents in batches for optimal performance
    for (let i = 0; i < documents.length; i += batchSize) {
      const batch = documents.slice(i, i + batchSize);
      batches.push(batch);
    }

    for (const [batchIndex, batch] of batches.entries()) {
      console.log(`Processing batch ${batchIndex + 1}/${batches.length}`);

      try {
        // Generate embeddings for batch
        const batchTexts = batch.map(doc => `${doc.title}\n\n${doc.content}`);
        const embeddings = await embeddingFunction(batchTexts);

        // Prepare documents with embeddings and metadata
        const enrichedDocuments = batch.map((doc, index) => ({
          _id: doc._id || new ObjectId(),
          title: doc.title,
          content: doc.content,

          // Vector embedding
          embedding: embeddings[index],

          // Rich metadata for filtering and analytics
          metadata: {
            category: doc.category || 'general',
            subcategory: doc.subcategory,
            language: doc.language || 'en',
            source: doc.source || 'unknown',
            source_url: doc.source_url,
            author: doc.author,
            tags: doc.tags || [],

            // Content analysis metadata
            word_count: this.calculateWordCount(doc.content),
            reading_time_minutes: Math.ceil(this.calculateWordCount(doc.content) / 200),
            content_type: this.inferContentType(doc),
            sentiment_score: doc.sentiment_score,

            // Technical metadata
            extraction_method: doc.extraction_method || 'manual',
            processing_version: '1.0',
            quality_score: this.calculateQualityScore(doc)
          },

          // Timestamps
          created_at: doc.created_at || new Date(),
          updated_at: new Date(),
          indexed_at: new Date(),

          // Search optimization fields
          searchable_text: `${doc.title} ${doc.content} ${(doc.tags || []).join(' ')}`,

          // Embedding metadata
          embedding_model: 'text-embedding-ada-002',
          embedding_dimensions: this.embeddingDimensions,
          embedding_created_at: new Date()
        }));

        // Bulk insert with error handling
        const result = await this.collections.documents.insertMany(enrichedDocuments, {
          ordered: false,
          writeConcern: { w: 'majority' }
        });

        totalIngested += result.insertedCount;
        console.log(`Batch ${batchIndex + 1} completed: ${result.insertedCount} documents ingested`);

      } catch (error) {
        console.error(`Error processing batch ${batchIndex + 1}:`, error);
        continue; // Continue with next batch
      }
    }

    console.log(`Document ingestion completed: ${totalIngested}/${documents.length} documents successfully ingested`);
    return {
      success: true,
      totalIngested,
      totalDocuments: documents.length,
      successRate: (totalIngested / documents.length * 100).toFixed(2)
    };
  }

  async performSemanticSearch(queryEmbedding, options = {}) {
    console.log('Performing semantic vector search...');

    const {
      limit = 10,
      categories = [],
      language = null,
      source = null,
      tags = [],
      dateRange = null,
      similarityThreshold = 0.7,
      includeMetadata = true,
      boostFactors = {},
      userProfile = null
    } = options;

    // Build filter criteria
    const filterCriteria = [];

    if (categories.length > 0) {
      filterCriteria.push({
        "metadata.category": { $in: categories }
      });
    }

    if (language) {
      filterCriteria.push({
        "metadata.language": { $eq: language }
      });
    }

    if (source) {
      filterCriteria.push({
        "metadata.source": { $eq: source }
      });
    }

    if (tags.length > 0) {
      filterCriteria.push({
        "metadata.tags": { $in: tags }
      });
    }

    if (dateRange) {
      filterCriteria.push({
        "created_at": {
          $gte: dateRange.start,
          $lte: dateRange.end
        }
      });
    }

    try {
      // Build aggregation pipeline for vector search
      const pipeline = [
        {
          $vectorSearch: {
            index: "document_vector_index",
            path: "embedding",
            queryVector: queryEmbedding,
            numCandidates: limit * 10, // Search more candidates for better results
            limit: limit * 2, // Get extra results for post-processing
            ...(filterCriteria.length > 0 && {
              filter: {
                $and: filterCriteria
              }
            })
          }
        },

        // Add similarity score
        {
          $addFields: {
            similarity_score: { $meta: "vectorSearchScore" }
          }
        },

        // Filter by similarity threshold
        {
          $match: {
            similarity_score: { $gte: similarityThreshold }
          }
        },

        // Add computed fields for ranking
        {
          $addFields: {
            // Content quality boost
            quality_boost: {
              $multiply: [
                "$metadata.quality_score",
                boostFactors.quality || 1.0
              ]
            },

            // Recency boost
            recency_boost: {
              $multiply: [
                {
                  $divide: [
                    { $subtract: [new Date(), "$created_at"] },
                    86400000 * 365 // Days in milliseconds
                  ]
                },
                boostFactors.recency || 0.1
              ]
            },

            // Source authority boost
            source_boost: {
              $switch: {
                branches: [
                  { case: { $eq: ["$metadata.source", "official"] }, then: boostFactors.official || 1.2 },
                  { case: { $eq: ["$metadata.source", "expert"] }, then: boostFactors.expert || 1.1 }
                ],
                default: 1.0
              }
            }
          }
        },

        // Calculate final ranking score
        {
          $addFields: {
            final_score: {
              $multiply: [
                "$similarity_score",
                {
                  $add: [
                    1.0,
                    "$quality_boost",
                    "$recency_boost", 
                    "$source_boost"
                  ]
                }
              ]
            },

            // Relevance classification
            relevance_category: {
              $switch: {
                branches: [
                  { case: { $gte: ["$similarity_score", 0.9] }, then: "highly_relevant" },
                  { case: { $gte: ["$similarity_score", 0.8] }, then: "relevant" },
                  { case: { $gte: ["$similarity_score", 0.7] }, then: "somewhat_relevant" }
                ],
                default: "marginally_relevant"
              }
            }
          }
        },

        // Add personalization if user profile provided
        ...(userProfile ? [{
          $lookup: {
            from: "user_profiles",
            let: { doc_category: "$metadata.category", doc_tags: "$metadata.tags" },
            pipeline: [
              {
                $match: {
                  user_id: userProfile.user_id,
                  $expr: {
                    $or: [
                      { $in: ["$$doc_category", "$preferred_categories"] },
                      { $gt: [{ $size: { $setIntersection: ["$$doc_tags", "$preferred_tags"] } }, 0] }
                    ]
                  }
                }
              }
            ],
            as: "user_preference_match"
          }
        }, {
          $addFields: {
            personalization_boost: {
              $cond: {
                if: { $gt: [{ $size: "$user_preference_match" }, 0] },
                then: boostFactors.personalization || 1.15,
                else: 1.0
              }
            },
            final_score: {
              $multiply: ["$final_score", "$personalization_boost"]
            }
          }
        }] : []),

        // Sort by final score
        {
          $sort: { final_score: -1, similarity_score: -1 }
        },

        // Limit results
        {
          $limit: limit
        },

        // Project final fields
        {
          $project: {
            _id: 1,
            title: 1,
            content: 1,
            ...(includeMetadata && { metadata: 1 }),
            similarity_score: { $round: ["$similarity_score", 4] },
            final_score: { $round: ["$final_score", 4] },
            relevance_category: 1,
            created_at: 1,

            // Generate content snippet
            content_snippet: {
              $substr: ["$content", 0, 300]
            },

            // Search result metadata
            search_metadata: {
              embedding_model: "$embedding_model",
              indexed_at: "$indexed_at",
              quality_score: "$metadata.quality_score"
            }
          }
        }
      ];

      const startTime = Date.now();
      const results = await this.collections.documents.aggregate(pipeline).toArray();
      const searchTime = Date.now() - startTime;

      // Log search performance
      this.recordSearchMetrics({
        query_type: 'semantic_vector_search',
        results_count: results.length,
        search_time_ms: searchTime,
        similarity_threshold: similarityThreshold,
        filters_applied: filterCriteria.length,
        timestamp: new Date()
      });

      console.log(`Semantic search completed: ${results.length} results in ${searchTime}ms`);

      return {
        success: true,
        results: results,
        search_metadata: {
          query_type: 'semantic',
          results_count: results.length,
          search_time_ms: searchTime,
          similarity_threshold: similarityThreshold,
          filters_applied: filterCriteria.length,
          personalized: !!userProfile
        }
      };

    } catch (error) {
      console.error('Semantic search error:', error);
      return {
        success: false,
        error: error.message,
        results: []
      };
    }
  }

  async performHybridSearch(query, queryEmbedding, options = {}) {
    console.log('Performing hybrid search combining text and vector similarity...');

    const {
      limit = 10,
      textWeight = 0.3,
      vectorWeight = 0.7,
      categories = [],
      language = 'en'
    } = options;

    try {
      // Execute vector search
      const vectorResults = await this.performSemanticSearch(queryEmbedding, {
        ...options,
        limit: limit * 2 // Get more results for hybrid ranking
      });

      // Execute text search using Atlas Search
      const textSearchPipeline = [
        {
          $search: {
            index: "hybrid_search_index",
            compound: {
              must: [
                {
                  text: {
                    query: query,
                    path: ["title", "content"],
                    fuzzy: {
                      maxEdits: 2,
                      prefixLength: 3
                    }
                  }
                }
              ],
              ...(categories.length > 0 && {
                filter: [
                  {
                    text: {
                      query: categories,
                      path: "metadata.category"
                    }
                  }
                ]
              })
            },
            highlight: {
              path: "content",
              maxCharsToExamine: 1000,
              maxNumPassages: 3
            }
          }
        },
        {
          $addFields: {
            text_score: { $meta: "searchScore" },
            highlights: { $meta: "searchHighlights" }
          }
        },
        {
          $limit: limit * 2
        }
      ];

      const textResults = await this.collections.documents.aggregate(textSearchPipeline).toArray();

      // Combine and rank results using hybrid scoring
      const combinedResults = this.combineHybridResults(
        vectorResults.results || [], 
        textResults,
        textWeight,
        vectorWeight
      );

      // Sort by hybrid score and limit
      combinedResults.sort((a, b) => b.hybrid_score - a.hybrid_score);
      const finalResults = combinedResults.slice(0, limit);

      return {
        success: true,
        results: finalResults,
        search_metadata: {
          query_type: 'hybrid',
          text_results_count: textResults.length,
          vector_results_count: vectorResults.results?.length || 0,
          combined_results_count: combinedResults.length,
          final_results_count: finalResults.length,
          text_weight: textWeight,
          vector_weight: vectorWeight
        }
      };

    } catch (error) {
      console.error('Hybrid search error:', error);
      return {
        success: false,
        error: error.message,
        results: []
      };
    }
  }

  combineHybridResults(vectorResults, textResults, textWeight, vectorWeight) {
    const resultMap = new Map();

    // Normalize scores to 0-1 range
    const maxVectorScore = Math.max(...vectorResults.map(r => r.similarity_score || 0));
    const maxTextScore = Math.max(...textResults.map(r => r.text_score || 0));

    // Process vector results
    vectorResults.forEach(result => {
      const normalizedVectorScore = maxVectorScore > 0 ? result.similarity_score / maxVectorScore : 0;
      resultMap.set(result._id.toString(), {
        ...result,
        normalized_vector_score: normalizedVectorScore,
        normalized_text_score: 0,
        hybrid_score: normalizedVectorScore * vectorWeight
      });
    });

    // Process text results and combine
    textResults.forEach(result => {
      const normalizedTextScore = maxTextScore > 0 ? result.text_score / maxTextScore : 0;
      const docId = result._id.toString();

      if (resultMap.has(docId)) {
        // Document found in both searches - combine scores
        const existing = resultMap.get(docId);
        existing.normalized_text_score = normalizedTextScore;
        existing.hybrid_score = (existing.normalized_vector_score * vectorWeight) + 
                               (normalizedTextScore * textWeight);
        existing.highlights = result.highlights;
        existing.search_type = 'both';
      } else {
        // Document only found in text search
        resultMap.set(docId, {
          ...result,
          normalized_vector_score: 0,
          normalized_text_score: normalizedTextScore,
          hybrid_score: normalizedTextScore * textWeight,
          search_type: 'text_only',
          similarity_score: 0,
          relevance_category: 'text_match'
        });
      }
    });

    return Array.from(resultMap.values());
  }

  async buildRAGPipeline(query, options = {}) {
    console.log('Building Retrieval-Augmented Generation pipeline...');

    const {
      contextLimit = 5,
      maxContextLength = 4000,
      embeddingFunction,
      llmFunction,
      temperature = 0.7,
      includeSourceCitations = true
    } = options;

    try {
      // Step 1: Generate query embedding
      const queryEmbedding = await embeddingFunction([query]);

      // Step 2: Retrieve relevant context using semantic search
      const searchResults = await this.performSemanticSearch(queryEmbedding[0], {
        limit: contextLimit * 2, // Get extra results for context selection
        similarityThreshold: 0.6
      });

      if (!searchResults.success || searchResults.results.length === 0) {
        return {
          success: false,
          error: 'No relevant context found',
          query: query
        };
      }

      // Step 3: Select and rank context documents
      const contextDocuments = this.selectOptimalContext(
        searchResults.results,
        maxContextLength
      );

      // Step 4: Build context string with source tracking
      const contextString = contextDocuments.map((doc, index) => {
        const sourceId = `[${index + 1}]`;
        return `${sourceId} ${doc.title}\n${doc.content_snippet || doc.content.substring(0, 500)}...`;
      }).join('\n\n');

      // Step 5: Create RAG prompt
      const ragPrompt = this.buildRAGPrompt(query, contextString, includeSourceCitations);

      // Step 6: Generate response using LLM
      const llmResponse = await llmFunction(ragPrompt, {
        temperature,
        max_tokens: 1000,
        stop: ["[END]"]
      });

      // Step 7: Extract citations and build response
      const response = {
        success: true,
        query: query,
        answer: llmResponse.text || llmResponse,
        context_used: contextDocuments.length,
        sources: contextDocuments.map((doc, index) => ({
          id: index + 1,
          title: doc.title,
          similarity_score: doc.similarity_score,
          source: doc.metadata?.source,
          url: doc.metadata?.source_url
        })),
        search_metadata: searchResults.search_metadata,
        generation_metadata: {
          model: llmResponse.model || 'unknown',
          temperature: temperature,
          context_length: contextString.length,
          response_tokens: llmResponse.usage?.total_tokens || 0
        }
      };

      // Log RAG pipeline usage
      await this.logRAGUsage({
        query: query,
        context_documents: contextDocuments.length,
        response_length: response.answer.length,
        sources_cited: response.sources.length,
        timestamp: new Date()
      });

      return response;

    } catch (error) {
      console.error('RAG pipeline error:', error);
      return {
        success: false,
        error: error.message,
        query: query
      };
    }
  }

  selectOptimalContext(searchResults, maxLength) {
    let totalLength = 0;
    const selectedDocs = [];

    // Sort by relevance and diversity
    const rankedResults = searchResults.sort((a, b) => {
      // Primary sort by similarity score
      if (b.similarity_score !== a.similarity_score) {
        return b.similarity_score - a.similarity_score;
      }
      // Secondary sort by content quality
      return (b.metadata?.quality_score || 0) - (a.metadata?.quality_score || 0);
    });

    for (const doc of rankedResults) {
      const docLength = (doc.content_snippet || doc.content || '').length;

      if (totalLength + docLength <= maxLength) {
        selectedDocs.push(doc);
        totalLength += docLength;
      }

      if (selectedDocs.length >= 5) break; // Limit to top 5 documents
    }

    return selectedDocs;
  }

  buildRAGPrompt(query, context, includeCitations) {
    return `You are a helpful assistant that answers questions based on the provided context. Use the context information to provide accurate and comprehensive answers.

Context Information:
${context}

Question: ${query}

Instructions:
- Answer based solely on the information provided in the context
- If the context doesn't contain enough information to answer fully, state what information is missing
- Be comprehensive but concise
${includeCitations ? '- Include source citations using the [number] format from the context' : ''}
- If no relevant information is found, clearly state that the context doesn't contain the answer

Answer:`;
  }

  recordSearchMetrics(metrics) {
    const key = `${metrics.query_type}_${Date.now()}`;
    this.performanceMetrics.set(key, metrics);

    // Keep only last 1000 metrics
    if (this.performanceMetrics.size > 1000) {
      const oldestKey = this.performanceMetrics.keys().next().value;
      this.performanceMetrics.delete(oldestKey);
    }
  }

  async logRAGUsage(usage) {
    try {
      await this.collections.searchLogs.insertOne({
        ...usage,
        type: 'rag_pipeline'
      });
    } catch (error) {
      console.warn('Failed to log RAG usage:', error);
    }
  }

  calculateWordCount(text) {
    return (text || '').split(/\s+/).filter(word => word.length > 0).length;
  }

  inferContentType(doc) {
    if (doc.content && doc.content.includes('```')) return 'technical';
    if (doc.title && doc.title.includes('Tutorial')) return 'tutorial';
    if (doc.content && doc.content.length > 2000) return 'long_form';
    return 'standard';
  }

  calculateQualityScore(doc) {
    let score = 0.5; // Base score

    if (doc.title && doc.title.length > 10) score += 0.1;
    if (doc.content && doc.content.length > 500) score += 0.2;
    if (doc.author) score += 0.1;
    if (doc.tags && doc.tags.length > 0) score += 0.1;

    return Math.min(1.0, score);
  }
}

// Benefits of MongoDB Atlas Vector Search:
// - Native integration with MongoDB document model and operations
// - Automatic scaling and management without separate vector database infrastructure  
// - Advanced filtering capabilities combined with vector similarity search
// - Hybrid search combining full-text and vector search capabilities
// - Built-in indexing optimization for high-performance vector operations
// - Integrated analytics and monitoring for vector search performance
// - Real-time updates and dynamic index management
// - Cost-effective scaling with MongoDB Atlas infrastructure
// - Comprehensive security and compliance features
// - SQL-compatible vector operations through QueryLeaf integration

module.exports = {
  AtlasVectorSearchManager
};

Understanding MongoDB Atlas Vector Search Architecture

Advanced Vector Search Patterns for AI Applications

Implement sophisticated vector search patterns for production AI applications:

// Advanced vector search patterns and AI application integration
class ProductionVectorSearchSystem {
  constructor(atlasConfig) {
    this.atlasManager = new AtlasVectorSearchManager(
      atlasConfig.connectionString, 
      atlasConfig.database
    );
    this.embeddingCache = new Map();
    this.searchCache = new Map();
    this.analyticsCollector = new Map();
  }

  async buildIntelligentDocumentProcessor(documents, processingOptions = {}) {
    console.log('Building intelligent document processing pipeline...');

    const {
      chunkSize = 1000,
      chunkOverlap = 200,
      embeddingModel = 'text-embedding-ada-002',
      enableSemanticChunking = true,
      extractKeywords = true,
      analyzeSentiment = true
    } = processingOptions;

    const processedDocuments = [];

    for (const doc of documents) {
      try {
        // Step 1: Intelligent document chunking
        const chunks = enableSemanticChunking ? 
          await this.performSemanticChunking(doc.content, chunkSize, chunkOverlap) :
          this.performFixedChunking(doc.content, chunkSize, chunkOverlap);

        // Step 2: Process each chunk
        for (const [chunkIndex, chunk] of chunks.entries()) {
          const chunkDoc = {
            _id: new ObjectId(),
            parent_document_id: doc._id,
            title: `${doc.title} - Part ${chunkIndex + 1}`,
            content: chunk.text,
            chunk_index: chunkIndex,

            // Chunk metadata
            chunk_metadata: {
              word_count: chunk.word_count,
              sentence_count: chunk.sentence_count,
              start_position: chunk.start_position,
              end_position: chunk.end_position,
              semantic_density: chunk.semantic_density || 0
            },

            // Enhanced metadata processing
            metadata: {
              ...doc.metadata,
              // Keyword extraction
              ...(extractKeywords && {
                keywords: await this.extractKeywords(chunk.text),
                entities: await this.extractEntities(chunk.text)
              }),

              // Sentiment analysis  
              ...(analyzeSentiment && {
                sentiment: await this.analyzeSentiment(chunk.text)
              }),

              // Document structure analysis
              structure_type: this.analyzeDocumentStructure(chunk.text),
              information_density: this.calculateInformationDensity(chunk.text)
            },

            created_at: doc.created_at,
            updated_at: new Date(),
            processing_version: '2.0'
          };

          processedDocuments.push(chunkDoc);
        }

      } catch (error) {
        console.error(`Error processing document ${doc._id}:`, error);
        continue;
      }
    }

    console.log(`Document processing completed: ${processedDocuments.length} chunks created from ${documents.length} documents`);
    return processedDocuments;
  }

  async performSemanticChunking(text, targetSize, overlap) {
    // Implement semantic-aware chunking that preserves meaning
    const sentences = this.splitIntoSentences(text);
    const chunks = [];
    let currentChunk = '';
    let currentWordCount = 0;
    let startPosition = 0;

    for (const sentence of sentences) {
      const sentenceWordCount = sentence.split(/\s+/).length;

      if (currentWordCount + sentenceWordCount > targetSize && currentChunk.length > 0) {
        // Create chunk with semantic coherence
        chunks.push({
          text: currentChunk.trim(),
          word_count: currentWordCount,
          sentence_count: currentChunk.split(/[.!?]+/).length - 1,
          start_position: startPosition,
          end_position: startPosition + currentChunk.length,
          semantic_density: await this.calculateSemanticDensity(currentChunk)
        });

        // Start new chunk with overlap
        const overlapText = this.extractOverlapText(currentChunk, overlap);
        currentChunk = overlapText + ' ' + sentence;
        currentWordCount = this.countWords(currentChunk);
        startPosition += currentChunk.length - overlapText.length;
      } else {
        currentChunk += (currentChunk ? ' ' : '') + sentence;
        currentWordCount += sentenceWordCount;
      }
    }

    // Add final chunk
    if (currentChunk.trim().length > 0) {
      chunks.push({
        text: currentChunk.trim(),
        word_count: currentWordCount,
        sentence_count: currentChunk.split(/[.!?]+/).length - 1,
        start_position: startPosition,
        end_position: startPosition + currentChunk.length,
        semantic_density: await this.calculateSemanticDensity(currentChunk)
      });
    }

    return chunks;
  }

  async buildConversationalRAG(conversationHistory, currentQuery, options = {}) {
    console.log('Building conversational RAG system...');

    const {
      contextWindow = 5,
      includeConversationContext = true,
      personalizeResponse = true,
      userId = null
    } = options;

    try {
      // Step 1: Build conversational context
      let enhancedQuery = currentQuery;

      if (includeConversationContext && conversationHistory.length > 0) {
        const recentContext = conversationHistory.slice(-contextWindow);
        const contextSummary = recentContext.map(turn => 
          `${turn.role}: ${turn.content}`
        ).join('\n');

        enhancedQuery = `Previous conversation context:\n${contextSummary}\n\nCurrent question: ${currentQuery}`;
      }

      // Step 2: Generate enhanced query embedding
      const queryEmbedding = await this.generateEmbedding(enhancedQuery);

      // Step 3: Personalized retrieval if user profile available
      let userProfile = null;
      if (personalizeResponse && userId) {
        userProfile = await this.getUserProfile(userId);
      }

      // Step 4: Perform contextual search
      const searchResults = await this.atlasManager.performSemanticSearch(queryEmbedding, {
        limit: 8,
        userProfile: userProfile,
        boostFactors: {
          recency: 0.2,
          quality: 0.3,
          personalization: 0.2
        }
      });

      // Step 5: Build conversational RAG response
      const ragResponse = await this.atlasManager.buildRAGPipeline(enhancedQuery, {
        contextLimit: 6,
        maxContextLength: 5000,
        embeddingFunction: (texts) => Promise.resolve([queryEmbedding]),
        llmFunction: this.createConversationalLLMFunction(conversationHistory),
        includeSourceCitations: true
      });

      // Step 6: Post-process for conversation continuity
      if (ragResponse.success) {
        ragResponse.conversation_metadata = {
          context_turns_used: Math.min(contextWindow, conversationHistory.length),
          personalized: !!userProfile,
          query_enhanced: includeConversationContext,
          user_id: userId
        };
      }

      return ragResponse;

    } catch (error) {
      console.error('Conversational RAG error:', error);
      return {
        success: false,
        error: error.message,
        query: currentQuery
      };
    }
  }

  createConversationalLLMFunction(conversationHistory) {
    return async (prompt, options = {}) => {
      // Add conversation-aware instructions
      const conversationalPrompt = `You are a helpful assistant engaged in an ongoing conversation. 

Previous conversation context has been provided. Use this context to:
- Maintain conversation continuity
- Reference previous topics when relevant
- Provide contextually appropriate responses
- Acknowledge when building on previous answers

${prompt}

Remember to be conversational and reference the ongoing dialogue when appropriate.`;

      // This would integrate with your preferred LLM service
      return await this.callLLMService(conversationalPrompt, options);
    };
  }

  async implementRecommendationSystem(userId, options = {}) {
    console.log(`Building recommendation system for user ${userId}...`);

    const {
      recommendationType = 'content',
      diversityFactor = 0.3,
      noveltyBoost = 0.2,
      limit = 10
    } = options;

    try {
      // Step 1: Get user profile and interaction history
      const userProfile = await this.getUserProfile(userId);
      const interactionHistory = await this.getUserInteractions(userId);

      // Step 2: Build user preference embedding
      const userPreferenceEmbedding = await this.buildUserPreferenceEmbedding(
        userProfile, 
        interactionHistory
      );

      // Step 3: Find similar content
      const candidateResults = await this.atlasManager.performSemanticSearch(
        userPreferenceEmbedding,
        {
          limit: limit * 3, // Get more candidates for diversity
          similarityThreshold: 0.4
        }
      );

      // Step 4: Apply diversity and novelty filtering
      const diversifiedResults = this.applyDiversityFiltering(
        candidateResults.results,
        interactionHistory,
        diversityFactor,
        noveltyBoost
      );

      // Step 5: Rank final recommendations
      const finalRecommendations = diversifiedResults.slice(0, limit).map((rec, index) => ({
        ...rec,
        recommendation_rank: index + 1,
        recommendation_score: rec.final_score,
        recommendation_reasons: this.generateRecommendationReasons(rec, userProfile)
      }));

      return {
        success: true,
        user_id: userId,
        recommendations: finalRecommendations,
        recommendation_metadata: {
          algorithm: 'vector_similarity_with_diversity',
          diversity_factor: diversityFactor,
          novelty_boost: noveltyBoost,
          candidates_evaluated: candidateResults.results?.length || 0,
          final_count: finalRecommendations.length
        }
      };

    } catch (error) {
      console.error('Recommendation system error:', error);
      return {
        success: false,
        error: error.message,
        user_id: userId
      };
    }
  }

  applyDiversityFiltering(candidates, userHistory, diversityFactor, noveltyBoost) {
    // Track categories and topics to ensure diversity
    const categoryCount = new Map();
    const diversifiedResults = [];

    // Get user's previously interacted content for novelty scoring
    const previouslyViewed = new Set(
      userHistory.map(interaction => interaction.document_id?.toString())
    );

    for (const candidate of candidates) {
      const category = candidate.metadata?.category || 'unknown';
      const currentCategoryCount = categoryCount.get(category) || 0;

      // Calculate diversity penalty (more items in category = higher penalty)
      const diversityPenalty = currentCategoryCount * diversityFactor;

      // Calculate novelty boost (unseen content gets boost)
      const noveltyScore = previouslyViewed.has(candidate._id.toString()) ? 0 : noveltyBoost;

      // Apply adjustments to final score
      candidate.final_score = (candidate.final_score || candidate.similarity_score) - diversityPenalty + noveltyScore;
      candidate.diversity_penalty = diversityPenalty;
      candidate.novelty_boost = noveltyScore;

      diversifiedResults.push(candidate);
      categoryCount.set(category, currentCategoryCount + 1);
    }

    return diversifiedResults.sort((a, b) => b.final_score - a.final_score);
  }

  generateRecommendationReasons(recommendation, userProfile) {
    const reasons = [];

    if (userProfile.preferred_categories?.includes(recommendation.metadata?.category)) {
      reasons.push(`Matches your interest in ${recommendation.metadata.category}`);
    }

    if (recommendation.similarity_score > 0.8) {
      reasons.push('Highly relevant to your preferences');
    }

    if (recommendation.novelty_boost > 0) {
      reasons.push('New content you haven\'t seen');
    }

    if (recommendation.metadata?.quality_score > 0.8) {
      reasons.push('High-quality content');
    }

    return reasons.length > 0 ? reasons : ['Recommended based on your profile'];
  }

  // Utility methods
  splitIntoSentences(text) {
    return text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  }

  extractOverlapText(text, overlapSize) {
    const words = text.split(/\s+/);
    return words.slice(-overlapSize).join(' ');
  }

  countWords(text) {
    return text.split(/\s+/).filter(word => word.length > 0).length;
  }

  async calculateSemanticDensity(text) {
    // Simplified semantic density calculation
    const sentences = this.splitIntoSentences(text);
    const avgSentenceLength = text.length / sentences.length;
    const wordCount = this.countWords(text);

    // Higher density = more information per word
    return Math.min(1.0, (avgSentenceLength / 100) * (wordCount / 500));
  }

  analyzeDocumentStructure(text) {
    if (text.includes('```') || text.includes('function') || text.includes('class')) return 'code';
    if (text.match(/^\d+\./m) || text.includes('Step')) return 'procedural';
    if (text.includes('?') && text.split('?').length > 2) return 'faq';
    return 'narrative';
  }

  calculateInformationDensity(text) {
    const uniqueWords = new Set(text.toLowerCase().match(/\b\w+\b/g) || []);
    const totalWords = this.countWords(text);
    return totalWords > 0 ? uniqueWords.size / totalWords : 0;
  }
}

SQL-Style Vector Search Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Atlas Vector Search operations:

-- QueryLeaf vector search operations with SQL-familiar syntax

-- Create vector search enabled collection
CREATE COLLECTION documents_with_vectors (
  _id OBJECTID PRIMARY KEY,
  title VARCHAR(500) NOT NULL,
  content TEXT NOT NULL,

  -- Vector embedding field
  embedding VECTOR(1536) NOT NULL, -- OpenAI embedding dimensions

  -- Metadata for filtering
  category VARCHAR(100),
  language VARCHAR(10) DEFAULT 'en',
  source VARCHAR(100),
  tags VARCHAR[] DEFAULT ARRAY[]::VARCHAR[],
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- Document analysis fields
  word_count INTEGER,
  reading_time_minutes INTEGER,
  quality_score DECIMAL(3,2) DEFAULT 0.5,

  -- Full-text search support
  searchable_text TEXT GENERATED ALWAYS AS (title || ' ' || content) STORED
);

-- Create Atlas Vector Search index
CREATE VECTOR INDEX document_semantic_search ON documents_with_vectors (
  embedding USING cosine_similarity
  WITH FILTER FIELDS (category, language, source, created_at, tags)
);

-- Create hybrid search index for text + vector
CREATE SEARCH INDEX document_hybrid_search ON documents_with_vectors (
  title WITH lucene_analyzer('standard'),
  content WITH lucene_analyzer('english'),
  category WITH string_facet(),
  tags WITH string_facet()
);

-- Semantic vector search with SQL syntax
SELECT 
  _id,
  title,
  LEFT(content, 300) as content_preview,
  category,
  source,
  created_at,

  -- Vector similarity score
  VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as similarity_score,

  -- Relevance classification
  CASE 
    WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.9 THEN 'highly_relevant'
    WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.8 THEN 'relevant'
    WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.7 THEN 'somewhat_relevant'
    ELSE 'marginally_relevant'
  END as relevance_category,

  -- Quality-adjusted ranking score
  VECTOR_SIMILARITY(embedding, $1, 'cosine') * (1 + quality_score * 0.2) as final_score

FROM documents_with_vectors
WHERE 
  -- Vector similarity threshold
  VECTOR_SIMILARITY(embedding, $1, 'cosine') >= $2::DECIMAL -- similarity threshold parameter

  -- Optional metadata filtering
  AND ($3::VARCHAR[] IS NULL OR category = ANY($3)) -- categories filter
  AND ($4::VARCHAR IS NULL OR language = $4) -- language filter  
  AND ($5::VARCHAR IS NULL OR source = $5) -- source filter
  AND ($6::VARCHAR[] IS NULL OR tags && $6) -- tags overlap filter
  AND ($7::TIMESTAMP IS NULL OR created_at >= $7) -- date filter

ORDER BY final_score DESC, similarity_score DESC
LIMIT $8::INTEGER; -- result limit

-- Advanced hybrid search combining vector and text similarity
WITH vector_search AS (
  SELECT 
    _id, title, content, category, source, created_at,
    VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as vector_score
  FROM documents_with_vectors
  WHERE VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.6
  ORDER BY vector_score DESC
  LIMIT 20
),

text_search AS (
  SELECT 
    _id, title, content, category, source, created_at,
    SEARCH_SCORE() as text_score,
    SEARCH_HIGHLIGHTS('content', 3) as highlighted_content
  FROM documents_with_vectors
  WHERE MATCH(searchable_text, $2::TEXT) -- text query parameter
    WITH search_options(
      fuzzy_max_edits = 2,
      fuzzy_prefix_length = 3,
      highlight_max_chars = 1000
    )
  ORDER BY text_score DESC
  LIMIT 20
),

hybrid_results AS (
  SELECT 
    COALESCE(vs._id, ts._id) as _id,
    COALESCE(vs.title, ts.title) as title,
    COALESCE(vs.content, ts.content) as content,
    COALESCE(vs.category, ts.category) as category,
    COALESCE(vs.source, ts.source) as source,
    COALESCE(vs.created_at, ts.created_at) as created_at,

    -- Normalize scores to 0-1 range
    COALESCE(vs.vector_score, 0) / (SELECT MAX(vector_score) FROM vector_search) as normalized_vector_score,
    COALESCE(ts.text_score, 0) / (SELECT MAX(text_score) FROM text_search) as normalized_text_score,

    -- Hybrid scoring with configurable weights
    ($3::DECIMAL * COALESCE(vs.vector_score, 0) / (SELECT MAX(vector_score) FROM vector_search)) + 
    ($4::DECIMAL * COALESCE(ts.text_score, 0) / (SELECT MAX(text_score) FROM text_search)) as hybrid_score,

    ts.highlighted_content,

    -- Search type classification
    CASE 
      WHEN vs._id IS NOT NULL AND ts._id IS NOT NULL THEN 'both'
      WHEN vs._id IS NOT NULL THEN 'vector_only'
      ELSE 'text_only'
    END as search_type

  FROM vector_search vs
  FULL OUTER JOIN text_search ts ON vs._id = ts._id
)

SELECT 
  _id,
  title,
  LEFT(content, 400) as content_preview,
  category,
  source,
  created_at,

  -- Scores
  ROUND(normalized_vector_score::NUMERIC, 4) as vector_similarity,
  ROUND(normalized_text_score::NUMERIC, 4) as text_relevance, 
  ROUND(hybrid_score::NUMERIC, 4) as final_score,

  search_type,
  highlighted_content,

  -- Content insights
  CASE 
    WHEN hybrid_score >= 0.8 THEN 'excellent_match'
    WHEN hybrid_score >= 0.6 THEN 'good_match' 
    WHEN hybrid_score >= 0.4 THEN 'fair_match'
    ELSE 'weak_match'
  END as match_quality

FROM hybrid_results
ORDER BY hybrid_score DESC, normalized_vector_score DESC
LIMIT $5::INTEGER; -- final result limit

-- Retrieval-Augmented Generation (RAG) pipeline with QueryLeaf
WITH context_retrieval AS (
  SELECT 
    _id,
    title,
    content,
    category,
    VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as relevance_score
  FROM documents_with_vectors
  WHERE VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.7
  ORDER BY relevance_score DESC
  LIMIT 5
),

context_preparation AS (
  SELECT 
    STRING_AGG(
      '[' || ROW_NUMBER() OVER (ORDER BY relevance_score DESC) || '] ' || 
      title || E'\n' || LEFT(content, 500) || '...',
      E'\n\n'
      ORDER BY relevance_score DESC
    ) as context_string,

    COUNT(*) as context_documents,
    AVG(relevance_score) as avg_relevance,

    JSON_AGG(
      JSON_BUILD_OBJECT(
        'id', ROW_NUMBER() OVER (ORDER BY relevance_score DESC),
        'title', title,
        'category', category,
        'relevance', ROUND(relevance_score::NUMERIC, 4)
      ) ORDER BY relevance_score DESC
    ) as source_citations

  FROM context_retrieval
)

SELECT 
  context_string,
  context_documents,
  ROUND(avg_relevance::NUMERIC, 4) as average_context_relevance,
  source_citations,

  -- RAG prompt construction
  'You are a helpful assistant that answers questions based on provided context. ' ||
  'Use the following context information to provide accurate answers.' || E'\n\n' ||
  'Context Information:' || E'\n' || context_string || E'\n\n' ||
  'Question: ' || $2::TEXT || E'\n\n' ||
  'Instructions:' || E'\n' ||
  '- Answer based solely on the provided context' || E'\n' ||  
  '- Include source citations using [number] format' || E'\n' ||
  '- If context is insufficient, clearly state what information is missing' || E'\n\n' ||
  'Answer:' as rag_prompt,

  -- Query metadata
  $2::TEXT as original_query,
  CURRENT_TIMESTAMP as generated_at

FROM context_preparation;

-- User preference-based semantic search and recommendations  
WITH user_profile AS (
  SELECT 
    user_id,
    preference_embedding,
    preferred_categories,
    preferred_languages,
    interaction_history,
    last_active
  FROM user_profiles
  WHERE user_id = $1::UUID
),

personalized_search AS (
  SELECT 
    d._id,
    d.title,
    d.content,
    d.category,
    d.source,
    d.created_at,
    d.quality_score,

    -- Semantic similarity to user preferences
    VECTOR_SIMILARITY(d.embedding, up.preference_embedding, 'cosine') as preference_similarity,

    -- Category preference boost
    CASE 
      WHEN d.category = ANY(up.preferred_categories) THEN 1.2
      ELSE 1.0
    END as category_boost,

    -- Novelty boost (content user hasn't seen)
    CASE 
      WHEN d._id = ANY(up.interaction_history) THEN 0.8 -- Reduce score for seen content
      ELSE 1.1 -- Boost novel content
    END as novelty_boost,

    -- Recency factor
    CASE 
      WHEN d.created_at >= CURRENT_DATE - INTERVAL '7 days' THEN 1.1
      WHEN d.created_at >= CURRENT_DATE - INTERVAL '30 days' THEN 1.05
      ELSE 1.0  
    END as recency_boost

  FROM documents_with_vectors d
  CROSS JOIN user_profile up
  WHERE VECTOR_SIMILARITY(d.embedding, up.preference_embedding, 'cosine') >= 0.5
    AND (up.preferred_languages IS NULL OR d.language = ANY(up.preferred_languages))
),

ranked_recommendations AS (
  SELECT *,
    -- Calculate final personalized score
    preference_similarity * category_boost * novelty_boost * recency_boost * (1 + quality_score * 0.3) as personalized_score,

    -- Diversity scoring to avoid over-concentration in single category
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY preference_similarity DESC) as category_rank

  FROM personalized_search
),

diversified_recommendations AS (
  SELECT *,
    -- Apply diversity penalty for category concentration
    CASE 
      WHEN category_rank <= 2 THEN personalized_score
      WHEN category_rank <= 4 THEN personalized_score * 0.9
      ELSE personalized_score * 0.7
    END as final_recommendation_score

  FROM ranked_recommendations
)

SELECT 
  _id,
  title,
  LEFT(content, 300) as content_preview,
  category,
  source,
  created_at,

  -- Recommendation scores
  ROUND(preference_similarity::NUMERIC, 4) as user_preference_match,
  ROUND(personalized_score::NUMERIC, 4) as personalized_relevance,
  ROUND(final_recommendation_score::NUMERIC, 4) as recommendation_score,

  -- Recommendation explanations
  CASE 
    WHEN category_boost > 1.0 AND novelty_boost > 1.0 THEN 'New content in your preferred categories'
    WHEN category_boost > 1.0 THEN 'Matches your category preferences'
    WHEN novelty_boost > 1.0 THEN 'New content you might find interesting'
    WHEN recency_boost > 1.0 THEN 'Recently published content'
    ELSE 'Recommended based on your preferences'
  END as recommendation_reason,

  -- Quality indicators
  CASE 
    WHEN quality_score >= 0.8 AND preference_similarity >= 0.8 THEN 'high_confidence'
    WHEN quality_score >= 0.6 AND preference_similarity >= 0.6 THEN 'medium_confidence'
    ELSE 'exploratory'
  END as confidence_level

FROM diversified_recommendations
ORDER BY final_recommendation_score DESC, preference_similarity DESC  
LIMIT $2::INTEGER; -- recommendation count limit

-- Real-time vector search analytics and performance monitoring
CREATE MATERIALIZED VIEW vector_search_analytics AS
WITH search_performance AS (
  SELECT 
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,
    search_type, -- 'vector', 'text', 'hybrid'

    -- Performance metrics
    COUNT(*) as search_count,
    AVG(search_duration_ms) as avg_search_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY search_duration_ms) as p95_search_time,
    AVG(result_count) as avg_results_returned,

    -- Quality metrics  
    AVG(avg_similarity_score) as avg_result_relevance,
    COUNT(*) FILTER (WHERE avg_similarity_score >= 0.8) as high_relevance_searches,
    COUNT(*) FILTER (WHERE result_count = 0) as zero_result_searches,

    -- User interaction metrics
    COUNT(DISTINCT user_id) as unique_users,
    AVG(user_interaction_score) as avg_user_satisfaction

  FROM search_logs
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),

embedding_performance AS (
  SELECT 
    DATE_TRUNC('hour', created_at) as hour_bucket,
    embedding_model,

    -- Embedding metrics
    COUNT(*) as embeddings_generated,
    AVG(embedding_generation_time_ms) as avg_embedding_time,
    AVG(ARRAY_LENGTH(embedding, 1)) as avg_dimensions -- Vector dimension validation

  FROM documents_with_vectors
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', created_at), embedding_model
)

SELECT 
  sp.hour_bucket,
  sp.search_type,

  -- Volume metrics
  sp.search_count,
  sp.unique_users,
  ROUND((sp.search_count::DECIMAL / sp.unique_users)::NUMERIC, 2) as searches_per_user,

  -- Performance metrics
  ROUND(sp.avg_search_time::NUMERIC, 2) as avg_search_time_ms,
  ROUND(sp.p95_search_time::NUMERIC, 2) as p95_search_time_ms,
  sp.avg_results_returned,

  -- Quality metrics
  ROUND(sp.avg_result_relevance::NUMERIC, 3) as avg_relevance_score,
  ROUND((sp.high_relevance_searches::DECIMAL / sp.search_count * 100)::NUMERIC, 1) as high_relevance_rate_pct,
  ROUND((sp.zero_result_searches::DECIMAL / sp.search_count * 100)::NUMERIC, 1) as zero_results_rate_pct,

  -- User satisfaction
  ROUND(sp.avg_user_satisfaction::NUMERIC, 2) as user_satisfaction_score,

  -- Embedding performance (when available)
  ep.embeddings_generated,
  ep.avg_embedding_time,

  -- Health indicators
  CASE 
    WHEN sp.avg_search_time <= 100 AND sp.avg_result_relevance >= 0.7 THEN 'healthy'
    WHEN sp.avg_search_time <= 500 AND sp.avg_result_relevance >= 0.5 THEN 'acceptable'
    ELSE 'needs_attention'
  END as system_health_status,

  -- Recommendations
  CASE 
    WHEN sp.zero_result_searches::DECIMAL / sp.search_count > 0.1 THEN 'Improve embedding coverage'
    WHEN sp.avg_search_time > 1000 THEN 'Optimize vector indexes'
    WHEN sp.avg_result_relevance < 0.6 THEN 'Review similarity thresholds'
    ELSE 'Performance within targets'
  END as optimization_recommendation

FROM search_performance sp
LEFT JOIN embedding_performance ep ON sp.hour_bucket = ep.hour_bucket
ORDER BY sp.hour_bucket DESC, sp.search_type;

-- QueryLeaf provides comprehensive Atlas Vector Search capabilities:
-- 1. SQL-familiar vector search syntax with similarity functions
-- 2. Advanced hybrid search combining vector and full-text capabilities  
-- 3. Built-in RAG pipeline construction with context retrieval and ranking
-- 4. Personalized recommendation systems with user preference integration
-- 5. Real-time analytics and performance monitoring for vector operations
-- 6. Automatic embedding management and vector index optimization
-- 7. Conversational AI support with context-aware search capabilities
-- 8. Production-scale vector search with filtering and metadata integration
-- 9. Comprehensive search quality metrics and optimization recommendations
-- 10. Native integration with MongoDB Atlas Vector Search infrastructure

Best Practices for Atlas Vector Search Implementation

Vector Index Design and Optimization

Essential practices for production Atlas Vector Search deployments:

Vector Dimensionality: Choose embedding dimensions based on model requirements and performance constraints
Similarity Metrics: Select appropriate similarity functions (cosine, euclidean, dot product) for your use case
Index Configuration: Configure vector indexes with optimal numCandidates and filter field selections
Metadata Strategy: Design metadata schemas that enable efficient filtering during vector search
Embedding Quality: Implement embedding generation strategies that capture semantic meaning effectively
Performance Monitoring: Deploy comprehensive monitoring for search latency, accuracy, and user satisfaction

Production AI Application Patterns

Optimize Atlas Vector Search for real-world AI applications:

Hybrid Search: Combine vector similarity with traditional search for comprehensive results
RAG Optimization: Implement context selection strategies that balance relevance and diversity
Real-time Updates: Design pipelines for incremental embedding updates and index maintenance
Personalization: Build user preference models that enhance search relevance
Cost Management: Optimize embedding generation and storage costs through intelligent caching
Security Integration: Implement proper authentication and access controls for vector data

Conclusion

MongoDB Atlas Vector Search provides a comprehensive platform for building modern AI applications that require sophisticated semantic search capabilities. By integrating vector search directly into MongoDB's document model, developers can build powerful AI systems without the complexity of managing separate vector databases.

Key Atlas Vector Search benefits include:

Native Integration: Seamless combination of document operations and vector search in a single platform
Scalable Architecture: Built on MongoDB Atlas infrastructure with automatic scaling and management
Hybrid Capabilities: Advanced search patterns combining vector similarity with traditional text search
AI-Ready Features: Built-in support for RAG pipelines, personalization, and conversational AI
Production Optimized: Enterprise-grade security, monitoring, and performance optimization
Developer Friendly: Familiar MongoDB query patterns extended with vector search capabilities

Whether you're building recommendation systems, semantic search engines, RAG-powered chatbots, or other AI applications, MongoDB Atlas Vector Search with QueryLeaf's SQL-familiar interface provides the foundation for modern AI-powered applications that scale efficiently and maintain high performance.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Atlas Vector Search operations while providing SQL-familiar syntax for semantic search, hybrid search patterns, and RAG pipeline construction. Advanced vector search capabilities, personalization systems, and AI application patterns are seamlessly accessible through familiar SQL constructs, making sophisticated AI development both powerful and approachable for SQL-oriented teams.

The combination of MongoDB's flexible document model with advanced vector search capabilities makes it an ideal platform for AI applications that require both semantic understanding and operational flexibility, ensuring your AI systems can evolve with advancing technology while maintaining familiar development patterns.

November 11, 2025
26 min read

MongoDB Index Optimization and Query Performance Tuning: Advanced Database Performance Engineering

Modern enterprise applications demand exceptional database performance to support millions of users, complex queries, and real-time analytics workloads. Traditional approaches to database performance optimization often rely on rigid indexing strategies, manual query tuning, and reactive performance monitoring that fails to scale with growing data volumes and evolving access patterns.

MongoDB's flexible indexing system provides comprehensive performance optimization capabilities that combine intelligent index selection, advanced compound indexing strategies, and sophisticated query execution analysis. Unlike traditional database systems that require extensive manual tuning, MongoDB's index optimization features enable proactive performance management with automated recommendations, flexible indexing patterns, and detailed performance analytics.

The Traditional Database Performance Challenge

Relational database performance optimization has significant complexity and maintenance overhead:

-- Traditional PostgreSQL performance optimization - complex and manual

-- Customer orders table with performance challenges
CREATE TABLE customer_orders (
    order_id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    order_date TIMESTAMP NOT NULL,
    order_status VARCHAR(20) NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    shipping_address_id BIGINT,
    billing_address_id BIGINT,
    payment_method VARCHAR(50),
    shipping_method VARCHAR(50),
    order_priority VARCHAR(20) DEFAULT 'standard',
    sales_rep_id BIGINT,

    -- Additional fields for complex queries
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP,
    cancelled_at TIMESTAMP,

    -- Foreign key constraints
    CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    CONSTRAINT fk_shipping_address FOREIGN KEY (shipping_address_id) REFERENCES addresses(address_id),
    CONSTRAINT fk_billing_address FOREIGN KEY (billing_address_id) REFERENCES addresses(address_id),
    CONSTRAINT fk_sales_rep FOREIGN KEY (sales_rep_id) REFERENCES employees(employee_id)
);

-- Order items table for line-level details
CREATE TABLE order_items (
    item_id BIGSERIAL PRIMARY KEY,
    order_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0.00,
    tax_amount DECIMAL(10,2) NOT NULL,

    CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES customer_orders(order_id),
    CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(product_id)
);

-- Manual index creation - requires extensive analysis and planning
-- Basic indexes for common queries
CREATE INDEX idx_orders_customer_id ON customer_orders(customer_id);
CREATE INDEX idx_orders_order_date ON customer_orders(order_date);
CREATE INDEX idx_orders_status ON customer_orders(order_status);

-- Compound indexes for complex query patterns
CREATE INDEX idx_orders_customer_date ON customer_orders(customer_id, order_date DESC);
CREATE INDEX idx_orders_status_date ON customer_orders(order_status, order_date DESC);
CREATE INDEX idx_orders_rep_status ON customer_orders(sales_rep_id, order_status, order_date DESC);

-- Partial indexes for selective filtering
CREATE INDEX idx_orders_completed_recent ON customer_orders(completed_at, total_amount) 
    WHERE order_status = 'completed' AND completed_at >= CURRENT_DATE - INTERVAL '90 days';

-- Covering indexes for query optimization (include columns)
CREATE INDEX idx_orders_customer_covering ON customer_orders(customer_id, order_date DESC) 
    INCLUDE (order_status, total_amount, payment_method);

-- Complex multi-table query requiring careful index planning
SELECT 
    o.order_id,
    o.order_date,
    o.total_amount,
    o.order_status,
    c.customer_name,
    c.customer_email,

    -- Aggregated order items (expensive without proper indexes)
    COUNT(oi.item_id) as item_count,
    SUM(oi.quantity * oi.unit_price) as items_subtotal,
    SUM(oi.discount_amount) as total_discount,
    SUM(oi.tax_amount) as total_tax,

    -- Product information (requires additional joins)
    array_agg(DISTINCT p.product_name) as product_names,
    array_agg(DISTINCT p.category) as product_categories,

    -- Address information (more joins)
    sa.street_address as shipping_street,
    sa.city as shipping_city,
    sa.state as shipping_state,

    -- Employee information
    e.first_name || ' ' || e.last_name as sales_rep_name

FROM customer_orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
LEFT JOIN addresses sa ON o.shipping_address_id = sa.address_id
LEFT JOIN employees e ON o.sales_rep_id = e.employee_id

WHERE 
    o.order_date >= CURRENT_DATE - INTERVAL '30 days'
    AND o.order_status IN ('processing', 'shipped', 'delivered')
    AND o.total_amount >= 100.00
    AND c.customer_tier IN ('premium', 'enterprise')

GROUP BY 
    o.order_id, o.order_date, o.total_amount, o.order_status,
    c.customer_name, c.customer_email,
    sa.street_address, sa.city, sa.state,
    e.first_name, e.last_name

HAVING COUNT(oi.item_id) >= 2

ORDER BY o.order_date DESC, o.total_amount DESC
LIMIT 100;

-- Analyze query performance (complex interpretation required)
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) 
SELECT o.order_id, o.total_amount, c.customer_name
FROM customer_orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '7 days'
    AND o.order_status = 'completed'
    AND c.customer_tier = 'premium'
ORDER BY o.total_amount DESC
LIMIT 50;

-- Performance monitoring queries (complex and manual)
SELECT 
    schemaname,
    tablename,
    attname as column_name,
    n_distinct,
    correlation,
    most_common_vals,
    most_common_freqs
FROM pg_stats 
WHERE schemaname = 'public' 
    AND tablename IN ('customer_orders', 'order_items')
ORDER BY tablename, attname;

-- Index usage statistics
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_tup_read,
    idx_tup_fetch,
    idx_scan,

    -- Index efficiency calculation
    CASE 
        WHEN idx_scan > 0 THEN ROUND((idx_tup_fetch::numeric / idx_scan), 2)
        ELSE 0 
    END as avg_tuples_per_scan,

    -- Index selectivity (estimated)
    CASE 
        WHEN idx_tup_read > 0 THEN ROUND((idx_tup_fetch::numeric / idx_tup_read) * 100, 2)
        ELSE 0 
    END as selectivity_percent

FROM pg_stat_user_indexes 
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

-- Problems with traditional PostgreSQL performance optimization:
-- 1. Manual index design requires deep expertise and continuous maintenance
-- 2. Query plan analysis is complex and difficult to interpret
-- 3. Index maintenance overhead grows with data volume
-- 4. Limited support for dynamic query patterns and evolving schemas
-- 5. Difficult to optimize across multiple tables and complex joins
-- 6. Performance monitoring requires custom scripts and manual interpretation
-- 7. Index selection strategies are static and don't adapt to changing workloads
-- 8. Covering index management is complex and error-prone
-- 9. Partial index design requires detailed knowledge of data distribution
-- 10. Limited automated recommendations for performance improvements

MongoDB provides comprehensive performance optimization with intelligent indexing:

// MongoDB Index Optimization - intelligent and automated performance tuning
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_platform');

// Advanced Index Management and Optimization
class MongoDBIndexOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexRecommendations = new Map();
  }

  async createOptimizedCollections() {
    console.log('Creating optimized collections with intelligent indexing...');

    // Orders collection with comprehensive document structure
    const ordersCollection = db.collection('orders');

    // Sample order document structure for index planning
    const sampleOrder = {
      _id: new ObjectId(),
      orderNumber: "ORD-2025-001234",

      // Customer information (embedded for performance)
      customer: {
        customerId: new ObjectId("64a1b2c3d4e5f6789012345a"),
        name: "John Doe",
        email: "[email protected]",
        tier: "premium", // standard, premium, enterprise
        accountType: "individual" // individual, business
      },

      // Order details
      orderDate: new Date("2025-11-11T10:30:00Z"),
      status: "processing", // pending, processing, shipped, delivered, cancelled
      priority: "standard", // standard, expedited, overnight

      // Financial information
      totals: {
        subtotal: 299.97,
        tax: 24.00,
        shipping: 12.99,
        discount: 15.00,
        grandTotal: 321.96,
        currency: "USD"
      },

      // Items array (embedded for query performance)
      items: [
        {
          productId: new ObjectId("64b2c3d4e5f6789012345b1a"),
          sku: "WIDGET-001",
          name: "Premium Widget",
          category: "electronics",
          subcategory: "gadgets",
          quantity: 2,
          unitPrice: 99.99,
          totalPrice: 199.98,

          // Product attributes for filtering
          attributes: {
            brand: "TechCorp",
            model: "WG-2024",
            color: "black",
            size: null,
            weight: 1.2
          }
        },
        {
          productId: new ObjectId("64b2c3d4e5f6789012345b1b"),
          sku: "ACCESSORY-001", 
          name: "Widget Accessory",
          category: "electronics",
          subcategory: "accessories",
          quantity: 1,
          unitPrice: 99.99,
          totalPrice: 99.99,

          attributes: {
            brand: "TechCorp",
            model: "AC-2024",
            color: "silver",
            compatibility: ["WG-2024", "WG-2023"]
          }
        }
      ],

      // Address information
      addresses: {
        shipping: {
          name: "John Doe",
          street: "123 Main Street",
          city: "San Francisco",
          state: "CA",
          postalCode: "94105",
          country: "US",
          coordinates: {
            latitude: 37.7749,
            longitude: -122.4194
          }
        },

        billing: {
          name: "John Doe",
          street: "123 Main Street", 
          city: "San Francisco",
          state: "CA",
          postalCode: "94105",
          country: "US"
        }
      },

      // Payment information
      payment: {
        method: "credit_card", // credit_card, debit_card, paypal, etc.
        provider: "stripe",
        transactionId: "txn_1234567890",
        status: "captured" // pending, authorized, captured, failed
      },

      // Shipping information
      shipping: {
        method: "standard", // standard, expedited, overnight
        carrier: "UPS",
        trackingNumber: "1Z12345E1234567890",
        estimatedDelivery: new Date("2025-11-15T18:00:00Z"),
        actualDelivery: null
      },

      // Sales and marketing
      salesInfo: {
        salesRepId: new ObjectId("64c3d4e5f67890123456c2a"),
        salesRepName: "Jane Smith",
        channel: "online", // online, phone, in_store
        source: "organic", // organic, paid_search, social, email
        campaign: "holiday_2025"
      },

      // Operational metadata
      fulfillment: {
        warehouseId: "WH-SF-001",
        pickingStarted: null,
        pickingCompleted: null,
        packingStarted: null,
        packingCompleted: null,
        shippedAt: null
      },

      // Analytics and business intelligence
      analytics: {
        customerLifetimeValue: 1250.00,
        orderFrequency: "monthly",
        seasonality: "Q4",
        profitMargin: 0.35,
        riskScore: 12 // fraud risk score 0-100
      },

      // Audit trail
      audit: {
        createdAt: new Date("2025-11-11T10:30:00Z"),
        updatedAt: new Date("2025-11-11T14:45:00Z"),
        createdBy: "system",
        updatedBy: "user_12345",
        version: 2,

        // Change history for critical fields
        statusHistory: [
          {
            status: "pending",
            timestamp: new Date("2025-11-11T10:30:00Z"),
            userId: "customer_67890"
          },
          {
            status: "processing", 
            timestamp: new Date("2025-11-11T14:45:00Z"),
            userId: "system"
          }
        ]
      }
    };

    // Insert sample data for index testing
    await ordersCollection.insertOne(sampleOrder);

    // Create comprehensive index strategy
    await this.createIntelligentIndexes(ordersCollection);

    return ordersCollection;
  }

  async createIntelligentIndexes(collection) {
    console.log('Creating intelligent index strategy...');

    try {
      // 1. Primary query patterns - single field indexes
      await collection.createIndexes([

        // Customer-based queries (most common pattern)
        {
          key: { "customer.customerId": 1 },
          name: "idx_customer_id",
          background: true
        },

        // Date-based queries for reporting
        {
          key: { "orderDate": -1 },
          name: "idx_order_date_desc", 
          background: true
        },

        // Status queries for operational workflows
        {
          key: { "status": 1 },
          name: "idx_status",
          background: true
        },

        // Order number lookups (unique)
        {
          key: { "orderNumber": 1 },
          name: "idx_order_number",
          unique: true,
          background: true
        }
      ]);

      // 2. Compound indexes for complex query patterns
      await collection.createIndexes([

        // Customer order history (most frequent compound query)
        {
          key: { 
            "customer.customerId": 1, 
            "orderDate": -1,
            "status": 1
          },
          name: "idx_customer_date_status",
          background: true
        },

        // Order fulfillment workflow
        {
          key: {
            "status": 1,
            "priority": 1,
            "orderDate": 1
          },
          name: "idx_fulfillment_workflow",
          background: true
        },

        // Financial reporting and analytics
        {
          key: {
            "orderDate": -1,
            "totals.grandTotal": -1,
            "customer.tier": 1
          },
          name: "idx_financial_reporting",
          background: true
        },

        // Sales rep performance tracking
        {
          key: {
            "salesInfo.salesRepId": 1,
            "orderDate": -1,
            "status": 1
          },
          name: "idx_sales_rep_performance",
          background: true
        },

        // Geographic analysis
        {
          key: {
            "addresses.shipping.state": 1,
            "addresses.shipping.city": 1,
            "orderDate": -1
          },
          name: "idx_geographic_analysis",
          background: true
        }
      ]);

      // 3. Specialized indexes for advanced query patterns
      await collection.createIndexes([

        // Text search across multiple fields
        {
          key: {
            "customer.name": "text",
            "customer.email": "text", 
            "orderNumber": "text",
            "items.name": "text",
            "items.sku": "text"
          },
          name: "idx_text_search",
          background: true
        },

        // Geospatial index for location-based queries
        {
          key: { "addresses.shipping.coordinates": "2dsphere" },
          name: "idx_shipping_location",
          background: true
        },

        // Sparse index for optional tracking numbers
        {
          key: { "shipping.trackingNumber": 1 },
          name: "idx_tracking_number",
          sparse: true,
          background: true
        },

        // Partial index for recent high-value orders
        {
          key: { 
            "orderDate": -1,
            "totals.grandTotal": -1 
          },
          name: "idx_recent_high_value",
          partialFilterExpression: {
            "orderDate": { $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) },
            "totals.grandTotal": { $gte: 500 }
          },
          background: true
        }
      ]);

      // 4. Array indexing for embedded documents
      await collection.createIndexes([

        // Product-based queries on order items
        {
          key: { "items.productId": 1 },
          name: "idx_product_id",
          background: true
        },

        // SKU lookups
        {
          key: { "items.sku": 1 },
          name: "idx_item_sku",
          background: true
        },

        // Category-based analytics
        {
          key: { 
            "items.category": 1,
            "items.subcategory": 1,
            "orderDate": -1
          },
          name: "idx_category_analytics",
          background: true
        },

        // Brand analysis
        {
          key: { "items.attributes.brand": 1 },
          name: "idx_brand_analysis",
          background: true
        }
      ]);

      // 5. TTL index for data lifecycle management
      await collection.createIndex(
        { "audit.createdAt": 1 },
        { 
          name: "idx_ttl_cleanup",
          expireAfterSeconds: 60 * 60 * 24 * 365 * 7, // 7 years retention
          background: true
        }
      );

      console.log('Intelligent indexing strategy implemented successfully');

    } catch (error) {
      console.error('Error creating indexes:', error);
      throw error;
    }
  }

  async analyzeQueryPerformance(collection, queryPattern, options = {}) {
    console.log('Analyzing query performance with advanced explain plans...');

    try {
      // Sample query patterns for analysis
      const queryPatterns = {
        customerOrders: {
          filter: { "customer.customerId": new ObjectId("64a1b2c3d4e5f6789012345a") },
          sort: { "orderDate": -1 },
          limit: 20
        },

        recentHighValue: {
          filter: {
            "orderDate": { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
            "totals.grandTotal": { $gte: 100 },
            "status": { $in: ["processing", "shipped", "delivered"] }
          },
          sort: { "totals.grandTotal": -1 },
          limit: 50
        },

        fulfillmentQueue: {
          filter: {
            "status": "processing",
            "priority": { $in: ["expedited", "overnight"] }
          },
          sort: { "orderDate": 1 },
          limit: 100
        },

        salesAnalytics: {
          filter: {
            "salesInfo.salesRepId": new ObjectId("64c3d4e5f67890123456c2a"),
            "orderDate": { 
              $gte: new Date("2025-11-01"),
              $lt: new Date("2025-12-01")
            }
          },
          sort: { "orderDate": -1 }
        }
      };

      const selectedQuery = queryPatterns[queryPattern] || queryPatterns.customerOrders;

      // Execute explain plan with detailed analysis
      const explainResult = await collection.find(selectedQuery.filter)
        .sort(selectedQuery.sort || {})
        .limit(selectedQuery.limit || 1000)
        .explain("executionStats");

      // Analyze execution statistics
      const executionStats = explainResult.executionStats;
      const winningPlan = explainResult.queryPlanner.winningPlan;

      const performanceAnalysis = {
        queryPattern: queryPattern,
        executionTime: executionStats.executionTimeMillis,
        documentsExamined: executionStats.totalDocsExamined,
        documentsReturned: executionStats.totalDocsReturned,
        indexesUsed: this.extractIndexesUsed(winningPlan),

        // Performance efficiency metrics
        selectivityRatio: executionStats.totalDocsReturned / Math.max(executionStats.totalDocsExamined, 1),
        indexEfficiency: this.calculateIndexEfficiency(executionStats),

        // Performance classification
        performanceRating: this.classifyPerformance(executionStats),

        // Optimization recommendations
        recommendations: this.generateOptimizationRecommendations(explainResult),

        // Detailed execution breakdown
        executionBreakdown: this.analyzeExecutionStages(winningPlan),

        queryDetails: {
          filter: selectedQuery.filter,
          sort: selectedQuery.sort,
          limit: selectedQuery.limit
        },

        timestamp: new Date()
      };

      // Store performance metrics for trending
      this.performanceMetrics.set(queryPattern, performanceAnalysis);

      console.log(`Query Performance Analysis for ${queryPattern}:`);
      console.log(`  Execution Time: ${performanceAnalysis.executionTime}ms`);
      console.log(`  Documents Examined: ${performanceAnalysis.documentsExamined}`);
      console.log(`  Documents Returned: ${performanceAnalysis.documentsReturned}`);
      console.log(`  Selectivity Ratio: ${performanceAnalysis.selectivityRatio.toFixed(4)}`);
      console.log(`  Performance Rating: ${performanceAnalysis.performanceRating}`);
      console.log(`  Indexes Used: ${JSON.stringify(performanceAnalysis.indexesUsed)}`);

      if (performanceAnalysis.recommendations.length > 0) {
        console.log('  Optimization Recommendations:');
        performanceAnalysis.recommendations.forEach(rec => {
          console.log(`    - ${rec}`);
        });
      }

      return performanceAnalysis;

    } catch (error) {
      console.error('Error analyzing query performance:', error);
      throw error;
    }
  }

  extractIndexesUsed(winningPlan) {
    const indexes = [];

    const extractFromStage = (stage) => {
      if (stage.indexName) {
        indexes.push(stage.indexName);
      }

      if (stage.inputStage) {
        extractFromStage(stage.inputStage);
      }

      if (stage.inputStages) {
        stage.inputStages.forEach(inputStage => {
          extractFromStage(inputStage);
        });
      }
    };

    extractFromStage(winningPlan);
    return [...new Set(indexes)]; // Remove duplicates
  }

  calculateIndexEfficiency(executionStats) {
    // Index efficiency = (docs returned / docs examined) * (1 / execution time factor)
    const selectivity = executionStats.totalDocsReturned / Math.max(executionStats.totalDocsExamined, 1);
    const timeFactor = Math.min(executionStats.executionTimeMillis / 100, 1); // Normalize execution time

    return selectivity * (1 - timeFactor);
  }

  classifyPerformance(executionStats) {
    const { executionTimeMillis, totalDocsExamined, totalDocsReturned } = executionStats;
    const selectivity = totalDocsReturned / Math.max(totalDocsExamined, 1);

    if (executionTimeMillis < 10 && selectivity > 0.1) return 'Excellent';
    if (executionTimeMillis < 50 && selectivity > 0.01) return 'Good';
    if (executionTimeMillis < 100 && selectivity > 0.001) return 'Fair';
    return 'Poor';
  }

  generateOptimizationRecommendations(explainResult) {
    const recommendations = [];
    const executionStats = explainResult.executionStats;
    const winningPlan = explainResult.queryPlanner.winningPlan;

    // High execution time
    if (executionStats.executionTimeMillis > 100) {
      recommendations.push('Consider adding compound indexes for better query selectivity');
    }

    // Low selectivity (examining many documents vs returning few)
    const selectivity = executionStats.totalDocsReturned / Math.max(executionStats.totalDocsExamined, 1);
    if (selectivity < 0.01) {
      recommendations.push('Improve query selectivity with more specific filtering criteria');
    }

    // Collection scan detected
    if (winningPlan.stage === 'COLLSCAN') {
      recommendations.push('Critical: Query is performing collection scan - add appropriate indexes');
    }

    // Sort not using index
    if (this.findStageInPlan(winningPlan, 'SORT') && !this.findStageInPlan(winningPlan, 'IXSCAN')) {
      recommendations.push('Sort operation not using index - consider compound index with sort fields');
    }

    // High key examination
    if (executionStats.totalKeysExamined > executionStats.totalDocsReturned * 10) {
      recommendations.push('High key examination ratio - consider more selective compound indexes');
    }

    return recommendations;
  }

  findStageInPlan(plan, stageName) {
    if (plan.stage === stageName) return true;

    if (plan.inputStage && this.findStageInPlan(plan.inputStage, stageName)) return true;

    if (plan.inputStages) {
      return plan.inputStages.some(stage => this.findStageInPlan(stage, stageName));
    }

    return false;
  }

  analyzeExecutionStages(winningPlan) {
    const stages = [];

    const extractStages = (stage) => {
      stages.push({
        stage: stage.stage,
        indexName: stage.indexName || null,
        direction: stage.direction || null,
        keysExamined: stage.keysExamined || null,
        docsExamined: stage.docsExamined || null,
        executionTimeMillis: stage.executionTimeMillisEstimate || null
      });

      if (stage.inputStage) {
        extractStages(stage.inputStage);
      }

      if (stage.inputStages) {
        stage.inputStages.forEach(inputStage => {
          extractStages(inputStage);
        });
      }
    };

    extractStages(winningPlan);
    return stages;
  }

  async performComprehensiveIndexAnalysis(collection) {
    console.log('Performing comprehensive index analysis...');

    try {
      // Get index statistics
      const indexStats = await collection.aggregate([
        { $indexStats: {} }
      ]).toArray();

      // Get collection statistics
      const collectionStats = await db.runCommand({ collStats: collection.collectionName });

      // Analyze index usage patterns
      const indexAnalysis = indexStats.map(index => {
        const usageStats = index.accesses;
        const indexSize = index.size || 0;
        const indexName = index.name;

        return {
          name: indexName,

          // Usage metrics
          accessCount: usageStats.ops || 0,
          lastAccessed: usageStats.since || null,

          // Size metrics
          sizeBytes: indexSize,
          sizeMB: (indexSize / 1024 / 1024).toFixed(2),

          // Efficiency analysis
          accessFrequency: this.calculateAccessFrequency(usageStats),
          utilizationScore: this.calculateUtilizationScore(usageStats, indexSize),

          // Recommendations
          recommendation: this.analyzeIndexRecommendation(indexName, usageStats, indexSize)
        };
      });

      // Collection-level analysis
      const collectionAnalysis = {
        totalDocuments: collectionStats.count,
        totalSize: collectionStats.size,
        averageDocumentSize: collectionStats.avgObjSize,
        totalIndexSize: collectionStats.totalIndexSize,
        indexToDataRatio: (collectionStats.totalIndexSize / collectionStats.size).toFixed(2),

        // Index efficiency summary
        totalIndexes: indexStats.length,
        activeIndexes: indexStats.filter(idx => idx.accesses.ops > 0).length,
        unusedIndexes: indexStats.filter(idx => idx.accesses.ops === 0).length,

        // Performance indicators
        indexOverhead: ((collectionStats.totalIndexSize / collectionStats.size) * 100).toFixed(1) + '%',

        recommendations: this.generateCollectionRecommendations(indexAnalysis, collectionStats)
      };

      const analysis = {
        collection: collection.collectionName,
        analyzedAt: new Date(),
        collectionMetrics: collectionAnalysis,
        indexDetails: indexAnalysis,

        // Summary classifications
        performanceStatus: this.classifyCollectionPerformance(collectionAnalysis),
        optimizationPriority: this.determineOptimizationPriority(indexAnalysis),

        // Action items
        actionItems: this.generateActionItems(indexAnalysis, collectionAnalysis)
      };

      console.log('Index Analysis Summary:');
      console.log(`  Total Indexes: ${collectionAnalysis.totalIndexes}`);
      console.log(`  Active Indexes: ${collectionAnalysis.activeIndexes}`);  
      console.log(`  Unused Indexes: ${collectionAnalysis.unusedIndexes}`);
      console.log(`  Index Overhead: ${collectionAnalysis.indexOverhead}`);
      console.log(`  Performance Status: ${analysis.performanceStatus}`);

      return analysis;

    } catch (error) {
      console.error('Error performing index analysis:', error);
      throw error;
    }
  }

  calculateAccessFrequency(usageStats) {
    if (!usageStats.since || usageStats.ops === 0) return 'Never';

    const daysSince = (Date.now() - usageStats.since.getTime()) / (1000 * 60 * 60 * 24);
    const accessesPerDay = usageStats.ops / Math.max(daysSince, 1);

    if (accessesPerDay > 1000) return 'Very High';
    if (accessesPerDay > 100) return 'High';
    if (accessesPerDay > 10) return 'Moderate';
    if (accessesPerDay > 1) return 'Low';
    return 'Very Low';
  }

  calculateUtilizationScore(usageStats, indexSize) {
    // Score based on access frequency vs storage cost
    const accessCount = usageStats.ops || 0;
    const sizeCost = indexSize / (1024 * 1024); // Size in MB

    if (accessCount === 0) return 0;

    // Higher score for more accesses per MB of storage
    return Math.min((accessCount / Math.max(sizeCost, 1)) / 1000, 10);
  }

  analyzeIndexRecommendation(indexName, usageStats, indexSize) {
    if (indexName === '_id_') return 'System index - always keep';

    if (usageStats.ops === 0) {
      return 'Consider dropping - unused index consuming storage';
    }

    if (usageStats.ops < 10 && indexSize > 10 * 1024 * 1024) { // < 10 uses and > 10MB
      return 'Low utilization - evaluate if index is necessary';
    }

    if (usageStats.ops > 10000) {
      return 'High utilization - keep and monitor performance';
    }

    return 'Normal utilization - maintain current index';
  }

  generateCollectionRecommendations(indexAnalysis, collectionStats) {
    const recommendations = [];

    // Check for unused indexes
    const unusedIndexes = indexAnalysis.filter(idx => idx.accessCount === 0 && idx.name !== '_id_');
    if (unusedIndexes.length > 0) {
      recommendations.push(`Drop ${unusedIndexes.length} unused indexes to reduce storage overhead`);
    }

    // Check index-to-data ratio
    const indexRatio = collectionStats.totalIndexSize / collectionStats.size;
    if (indexRatio > 1.5) {
      recommendations.push('High index overhead - review index necessity and consider consolidation');
    }

    // Check for very large indexes with low utilization
    const inefficientIndexes = indexAnalysis.filter(idx => 
      idx.sizeBytes > 100 * 1024 * 1024 && idx.utilizationScore < 1
    );
    if (inefficientIndexes.length > 0) {
      recommendations.push('Large indexes with low utilization detected - consider optimization');
    }

    return recommendations;
  }

  classifyCollectionPerformance(collectionAnalysis) {
    const unusedRatio = collectionAnalysis.unusedIndexes / collectionAnalysis.totalIndexes;
    const indexOverheadPercent = parseFloat(collectionAnalysis.indexOverhead);

    if (unusedRatio > 0.3 || indexOverheadPercent > 200) return 'Poor';
    if (unusedRatio > 0.2 || indexOverheadPercent > 150) return 'Fair';
    if (unusedRatio > 0.1 || indexOverheadPercent > 100) return 'Good';
    return 'Excellent';
  }

  determineOptimizationPriority(indexAnalysis) {
    const unusedCount = indexAnalysis.filter(idx => idx.accessCount === 0).length;
    const lowUtilizationCount = indexAnalysis.filter(idx => idx.utilizationScore < 1).length;

    if (unusedCount > 3 || lowUtilizationCount > 5) return 'High';
    if (unusedCount > 1 || lowUtilizationCount > 2) return 'Medium';
    return 'Low';
  }

  generateActionItems(indexAnalysis, collectionAnalysis) {
    const actions = [];

    // Unused index cleanup
    const unusedIndexes = indexAnalysis.filter(idx => idx.accessCount === 0 && idx.name !== '_id_');
    unusedIndexes.forEach(idx => {
      actions.push({
        type: 'DROP_INDEX',
        indexName: idx.name,
        reason: 'Unused index consuming storage',
        priority: 'Medium',
        estimatedSavings: `${idx.sizeMB}MB storage`
      });
    });

    // Low utilization optimization
    const lowUtilizationIndexes = indexAnalysis.filter(idx => 
      idx.utilizationScore < 1 && idx.accessCount > 0 && idx.sizeBytes > 10 * 1024 * 1024
    );
    lowUtilizationIndexes.forEach(idx => {
      actions.push({
        type: 'REVIEW_INDEX',
        indexName: idx.name,
        reason: 'Low utilization for large index',
        priority: 'Low',
        recommendation: 'Evaluate query patterns and consider consolidation'
      });
    });

    return actions;
  }

  async demonstrateAdvancedQuerying(collection) {
    console.log('Demonstrating advanced querying with performance optimization...');

    const queryExamples = [
      {
        name: 'Customer Order History with Analytics',
        query: async () => {
          return await collection.find({
            "customer.customerId": new ObjectId("64a1b2c3d4e5f6789012345a"),
            "orderDate": { $gte: new Date("2025-01-01") }
          })
          .sort({ "orderDate": -1 })
          .limit(20)
          .explain("executionStats");
        }
      },

      {
        name: 'High-Value Recent Orders',
        query: async () => {
          return await collection.find({
            "orderDate": { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
            "totals.grandTotal": { $gte: 500 },
            "status": { $in: ["processing", "shipped", "delivered"] }
          })
          .sort({ "totals.grandTotal": -1 })
          .limit(50)
          .explain("executionStats");
        }
      },

      {
        name: 'Geographic Sales Analysis',
        query: async () => {
          return await collection.find({
            "addresses.shipping.state": "CA",
            "orderDate": { 
              $gte: new Date("2025-11-01"),
              $lt: new Date("2025-12-01")
            }
          })
          .sort({ "orderDate": -1 })
          .explain("executionStats");
        }
      },

      {
        name: 'Product Category Performance',
        query: async () => {
          return await collection.find({
            "items.category": "electronics",
            "orderDate": { $gte: new Date("2025-11-01") }
          })
          .sort({ "totals.grandTotal": -1 })
          .explain("executionStats");
        }
      }
    ];

    const results = {};

    for (const example of queryExamples) {
      try {
        console.log(`\nTesting: ${example.name}`);
        const result = await example.query();

        const stats = result.executionStats;
        const performance = {
          executionTime: stats.executionTimeMillis,
          documentsExamined: stats.totalDocsExamined,
          documentsReturned: stats.totalDocsReturned,
          indexesUsed: this.extractIndexesUsed(result.queryPlanner.winningPlan),
          efficiency: (stats.totalDocsReturned / Math.max(stats.totalDocsExamined, 1)).toFixed(4)
        };

        console.log(`  Execution Time: ${performance.executionTime}ms`);
        console.log(`  Efficiency Ratio: ${performance.efficiency}`);
        console.log(`  Indexes Used: ${JSON.stringify(performance.indexesUsed)}`);

        results[example.name] = performance;

      } catch (error) {
        console.error(`Error testing ${example.name}:`, error);
        results[example.name] = { error: error.message };
      }
    }

    return results;
  }
}

// Export optimization class
module.exports = { MongoDBIndexOptimizer };

// Benefits of MongoDB Index Optimization:
// - Intelligent compound indexing for complex query patterns
// - Automated performance analysis and recommendations
// - Flexible indexing strategies for evolving schemas
// - Advanced query execution analysis with detailed metrics
// - Comprehensive index utilization monitoring
// - Automated optimization suggestions based on usage patterns
// - Support for specialized indexes (geospatial, text, sparse, partial)
// - Integration with existing MongoDB ecosystem and tooling
// - Real-time performance monitoring and alerting capabilities
// - Cost-effective storage optimization through intelligent index management

Understanding MongoDB Index Architecture

Compound Index Design Patterns

MongoDB's compound indexing system supports sophisticated query optimization strategies:

// Advanced compound indexing patterns for enterprise applications
class CompoundIndexStrategist {
  constructor(db) {
    this.db = db;
    this.indexStrategies = new Map();
    this.queryPatterns = new Map();
  }

  async analyzeQueryPatternsAndCreateIndexes() {
    console.log('Analyzing query patterns and creating optimized compound indexes...');

    // Pattern 1: ESR (Equality, Sort, Range) Index Design
    const esrPattern = {
      description: "Equality-Sort-Range compound index optimization",

      // Customer order queries: customer (equality) + date (sort) + status (range)
      index: {
        "customer.customerId": 1,  // Equality first
        "orderDate": -1,           // Sort second  
        "status": 1                // Range/filter third
      },

      queryExamples: [
        {
          filter: { 
            "customer.customerId": "specific_customer_id",
            "status": { $in: ["processing", "shipped"] }
          },
          sort: { "orderDate": -1 },
          description: "Customer order history with status filtering"
        }
      ],

      performance: "Optimal - follows ESR pattern for maximum efficiency"
    };

    // Pattern 2: Multi-dimensional Analytics Index
    const analyticsPattern = {
      description: "Multi-dimensional analytics with hierarchical grouping",

      index: {
        "orderDate": -1,           // Time dimension (most selective)
        "customer.tier": 1,        // Customer segment
        "items.category": 1,       // Product category
        "totals.grandTotal": -1    // Value dimension
      },

      queryExamples: [
        {
          pipeline: [
            { 
              $match: {
                "orderDate": { $gte: new Date("2025-01-01") },
                "customer.tier": "premium"
              }
            },
            {
              $group: {
                _id: {
                  month: { $dateToString: { format: "%Y-%m", date: "$orderDate" } },
                  category: "$items.category"
                },
                totalRevenue: { $sum: "$totals.grandTotal" },
                orderCount: { $sum: 1 }
              }
            }
          ],
          description: "Monthly revenue by customer tier and product category"
        }
      ]
    };

    // Pattern 3: Geospatial + Business Logic Index
    const geospatialPattern = {
      description: "Geospatial queries combined with business filters",

      index: {
        "addresses.shipping.coordinates": "2dsphere",  // Geospatial first
        "status": 1,                                    // Business filter
        "orderDate": -1                                 // Time component
      },

      queryExamples: [
        {
          filter: {
            "addresses.shipping.coordinates": {
              $near: {
                $geometry: { type: "Point", coordinates: [-122.4194, 37.7749] },
                $maxDistance: 10000 // 10km radius
              }
            },
            "status": "processing",
            "orderDate": { $gte: new Date("2025-11-01") }
          },
          description: "Recent processing orders within geographic radius"
        }
      ]
    };

    // Pattern 4: Text Search + Faceted Filtering
    const textSearchPattern = {
      description: "Full-text search with multiple filter dimensions",

      textIndex: {
        "customer.name": "text",
        "items.name": "text", 
        "items.sku": "text",
        "orderNumber": "text"
      },

      supportingIndexes: [
        {
          "customer.tier": 1,
          "orderDate": -1
        },
        {
          "items.category": 1,
          "totals.grandTotal": -1
        }
      ],

      queryExamples: [
        {
          filter: {
            $text: { $search: "premium widget" },
            "customer.tier": "enterprise",
            "orderDate": { $gte: new Date("2025-10-01") }
          },
          sort: { score: { $meta: "textScore" } },
          description: "Text search with customer tier and date filtering"
        }
      ]
    };

    // Create indexes based on patterns
    const ordersCollection = this.db.collection('orders');

    await this.implementIndexStrategy(ordersCollection, 'ESR_Pattern', esrPattern.index);
    await this.implementIndexStrategy(ordersCollection, 'Analytics_Pattern', analyticsPattern.index);  
    await this.implementIndexStrategy(ordersCollection, 'Geospatial_Pattern', geospatialPattern.index);
    await this.implementTextSearchStrategy(ordersCollection, textSearchPattern);

    // Store strategies for analysis
    this.indexStrategies.set('esr', esrPattern);
    this.indexStrategies.set('analytics', analyticsPattern);
    this.indexStrategies.set('geospatial', geospatialPattern);
    this.indexStrategies.set('textSearch', textSearchPattern);

    console.log('Advanced compound index strategies implemented');
    return this.indexStrategies;
  }

  async implementIndexStrategy(collection, strategyName, indexSpec) {
    try {
      await collection.createIndex(indexSpec, {
        name: `idx_${strategyName.toLowerCase()}`,
        background: true
      });
      console.log(`✅ Created index strategy: ${strategyName}`);
    } catch (error) {
      console.error(`❌ Failed to create ${strategyName}:`, error.message);
    }
  }

  async implementTextSearchStrategy(collection, textPattern) {
    try {
      // Create text index
      await collection.createIndex(textPattern.textIndex, {
        name: "idx_text_search_comprehensive",
        background: true
      });

      // Create supporting indexes for faceted filtering
      for (let i = 0; i < textPattern.supportingIndexes.length; i++) {
        await collection.createIndex(textPattern.supportingIndexes[i], {
          name: `idx_text_support_${i + 1}`,
          background: true
        });
      }

      console.log('✅ Created text search strategy with supporting indexes');
    } catch (error) {
      console.error('❌ Failed to create text search strategy:', error.message);
    }
  }

  async optimizeExistingIndexes(collection) {
    console.log('Optimizing existing indexes based on query patterns...');

    try {
      // Get current indexes
      const currentIndexes = await collection.listIndexes().toArray();

      // Analyze index effectiveness
      const indexAnalysis = await this.analyzeIndexEffectiveness(collection, currentIndexes);

      // Generate optimization plan
      const optimizationPlan = this.createOptimizationPlan(indexAnalysis);

      // Execute optimization (with safety checks)
      await this.executeOptimizationPlan(collection, optimizationPlan);

      return optimizationPlan;

    } catch (error) {
      console.error('Error optimizing indexes:', error);
      throw error;
    }
  }

  async analyzeIndexEffectiveness(collection, indexes) {
    const analysis = [];

    for (const index of indexes) {
      if (index.name === '_id_') continue; // Skip default index

      try {
        // Get index statistics
        const stats = await collection.aggregate([
          { $indexStats: {} },
          { $match: { name: index.name } }
        ]).toArray();

        const indexStat = stats[0];
        if (!indexStat) continue;

        // Analyze index composition
        const indexComposition = this.analyzeIndexComposition(index.key);

        // Calculate efficiency metrics
        const efficiency = {
          usageCount: indexStat.accesses?.ops || 0,
          lastUsed: indexStat.accesses?.since || null,
          sizeBytes: indexStat.size || 0,

          // Index pattern analysis
          composition: indexComposition,
          followsESRPattern: this.checkESRPattern(index.key),
          hasRedundancy: await this.checkIndexRedundancy(collection, index),

          // Performance classification
          utilizationScore: this.calculateUtilizationScore(indexStat),
          efficiencyRating: this.rateIndexEfficiency(indexStat, indexComposition)
        };

        analysis.push({
          name: index.name,
          keyPattern: index.key,
          ...efficiency
        });

      } catch (error) {
        console.warn(`Could not analyze index ${index.name}:`, error.message);
      }
    }

    return analysis;
  }

  analyzeIndexComposition(keyPattern) {
    const keys = Object.keys(keyPattern);
    const composition = {
      fieldCount: keys.length,
      hasEquality: false,
      hasSort: false,
      hasRange: false,
      hasGeospatial: false,
      hasText: false
    };

    keys.forEach((key, index) => {
      const value = keyPattern[key];

      // Detect index type based on value and position
      if (value === 1 || value === -1) {
        if (index === 0) composition.hasEquality = true;
        if (index === 1) composition.hasSort = true;
        if (index > 1) composition.hasRange = true;
      }

      if (value === '2dsphere' || value === '2d') composition.hasGeospatial = true;
      if (value === 'text') composition.hasText = true;
    });

    return composition;
  }

  checkESRPattern(keyPattern) {
    const keys = Object.keys(keyPattern);
    if (keys.length < 3) return false;

    // ESR: First field equality, second sort, third range
    const values = Object.values(keyPattern);
    return (values[0] === 1 || values[0] === -1) &&
           (values[1] === 1 || values[1] === -1) &&
           (values[2] === 1 || values[2] === -1);
  }

  async checkIndexRedundancy(collection, targetIndex) {
    // Check if this index is redundant with other indexes
    const allIndexes = await collection.listIndexes().toArray();
    const targetKeys = Object.keys(targetIndex.key);

    for (const otherIndex of allIndexes) {
      if (otherIndex.name === targetIndex.name || otherIndex.name === '_id_') continue;

      const otherKeys = Object.keys(otherIndex.key);

      // Check if targetIndex is a prefix of otherIndex (redundant)
      if (targetKeys.length <= otherKeys.length) {
        const isPrefix = targetKeys.every((key, index) => 
          otherKeys[index] === key && 
          targetIndex.key[key] === otherIndex.key[key]
        );

        if (isPrefix) return otherIndex.name;
      }
    }

    return false;
  }

  calculateUtilizationScore(indexStat) {
    const usage = indexStat.accesses?.ops || 0;
    const size = indexStat.size || 0;

    if (usage === 0) return 0;
    if (size === 0) return 10; // System indexes

    // Score based on usage per MB
    const sizeMB = size / (1024 * 1024);
    return Math.min((usage / sizeMB) / 100, 10);
  }

  rateIndexEfficiency(indexStat, composition) {
    let score = 5; // Base score

    // Usage factor
    const usage = indexStat.accesses?.ops || 0;
    if (usage > 10000) score += 2;
    else if (usage > 1000) score += 1;
    else if (usage === 0) score -= 3;

    // Composition factor
    if (composition.followsESRPattern) score += 2;
    if (composition.hasGeospatial || composition.hasText) score += 1;
    if (composition.fieldCount > 5) score -= 1; // Too many fields

    // Size factor (prefer smaller indexes for same functionality)
    const sizeMB = (indexStat.size || 0) / (1024 * 1024);
    if (sizeMB > 100) score -= 1;

    return Math.max(Math.min(score, 10), 0);
  }

  createOptimizationPlan(indexAnalysis) {
    const plan = {
      actions: [],
      expectedBenefits: [],
      risks: [],
      estimatedImpact: {}
    };

    // Identify unused indexes
    const unusedIndexes = indexAnalysis.filter(idx => idx.usageCount === 0);
    unusedIndexes.forEach(idx => {
      plan.actions.push({
        type: 'DROP',
        indexName: idx.name,
        reason: 'Unused index consuming storage',
        impact: `Save ${(idx.sizeBytes / 1024 / 1024).toFixed(2)}MB storage`,
        priority: 'HIGH'
      });
    });

    // Identify redundant indexes
    const redundantIndexes = indexAnalysis.filter(idx => idx.hasRedundancy);
    redundantIndexes.forEach(idx => {
      plan.actions.push({
        type: 'DROP',
        indexName: idx.name,
        reason: `Redundant with ${idx.hasRedundancy}`,
        impact: 'Reduce index maintenance overhead',
        priority: 'MEDIUM'
      });
    });

    // Suggest compound index improvements
    const inefficientIndexes = indexAnalysis.filter(idx => 
      idx.efficiencyRating < 5 && idx.usageCount > 0
    );
    inefficientIndexes.forEach(idx => {
      if (!idx.composition.followsESRPattern) {
        plan.actions.push({
          type: 'REBUILD',
          indexName: idx.name,
          reason: 'Does not follow ESR pattern',
          suggestion: 'Reorder fields: Equality, Sort, Range',
          impact: 'Improve query performance',
          priority: 'MEDIUM'
        });
      }
    });

    // Calculate expected benefits
    const storageSavings = unusedIndexes.reduce((sum, idx) => sum + idx.sizeBytes, 0);
    plan.estimatedImpact.storageSavings = `${(storageSavings / 1024 / 1024).toFixed(2)}MB`;
    plan.estimatedImpact.maintenanceReduction = `${unusedIndexes.length + redundantIndexes.length} fewer indexes`;

    return plan;
  }

  async executeOptimizationPlan(collection, plan) {
    console.log('Executing index optimization plan...');

    for (const action of plan.actions) {
      try {
        if (action.type === 'DROP' && action.priority === 'HIGH') {
          // Only auto-execute high-priority drops (unused indexes)
          console.log(`Dropping unused index: ${action.indexName}`);
          await collection.dropIndex(action.indexName);
          console.log(`✅ Successfully dropped index: ${action.indexName}`);
        } else {
          console.log(`📋 Recommended action: ${action.type} ${action.indexName} - ${action.reason}`);
        }
      } catch (error) {
        console.error(`❌ Failed to execute action on ${action.indexName}:`, error.message);
      }
    }

    console.log('Index optimization plan execution completed');
  }

  async generatePerformanceReport(collection) {
    console.log('Generating comprehensive performance report...');

    try {
      // Get collection statistics
      const stats = await this.db.runCommand({ collStats: collection.collectionName });

      // Get index usage statistics
      const indexStats = await collection.aggregate([
        { $indexStats: {} }
      ]).toArray();

      // Analyze recent query performance
      const performanceMetrics = Array.from(this.performanceMetrics.values());

      // Generate comprehensive report
      const report = {
        collectionName: collection.collectionName,
        generatedAt: new Date(),

        // Collection overview
        overview: {
          totalDocuments: stats.count,
          totalSizeGB: (stats.size / 1024 / 1024 / 1024).toFixed(2),
          averageDocumentSizeKB: (stats.avgObjSize / 1024).toFixed(2),
          totalIndexes: indexStats.length,
          totalIndexSizeGB: (stats.totalIndexSize / 1024 / 1024 / 1024).toFixed(2),
          indexToDataRatio: (stats.totalIndexSize / stats.size).toFixed(2)
        },

        // Index performance summary
        indexPerformance: {
          activeIndexes: indexStats.filter(idx => idx.accesses?.ops > 0).length,
          unusedIndexes: indexStats.filter(idx => idx.accesses?.ops === 0).length - 1, // Exclude _id_
          highUtilizationIndexes: indexStats.filter(idx => idx.accesses?.ops > 10000).length,

          // Top performing indexes
          topIndexes: indexStats
            .filter(idx => idx.name !== '_id_' && idx.accesses?.ops > 0)
            .sort((a, b) => (b.accesses?.ops || 0) - (a.accesses?.ops || 0))
            .slice(0, 5)
            .map(idx => ({
              name: idx.name,
              accessCount: idx.accesses?.ops || 0,
              sizeMB: ((idx.size || 0) / 1024 / 1024).toFixed(2)
            }))
        },

        // Query performance analysis
        queryPerformance: {
          totalQueriesAnalyzed: performanceMetrics.length,
          averageExecutionTime: performanceMetrics.length > 0 ? 
            (performanceMetrics.reduce((sum, metric) => sum + metric.executionTime, 0) / performanceMetrics.length).toFixed(2) : 0,
          excellentQueries: performanceMetrics.filter(m => m.performanceRating === 'Excellent').length,
          poorQueries: performanceMetrics.filter(m => m.performanceRating === 'Poor').length,

          // Query patterns
          commonPatterns: this.identifyCommonQueryPatterns(performanceMetrics)
        },

        // Recommendations
        recommendations: this.generatePerformanceRecommendations(stats, indexStats, performanceMetrics),

        // Health score
        healthScore: this.calculateHealthScore(stats, indexStats, performanceMetrics)
      };

      // Display report summary
      console.log('\n📊 Performance Report Summary:');
      console.log(`Collection: ${report.collectionName}`);
      console.log(`Documents: ${report.overview.totalDocuments.toLocaleString()}`);
      console.log(`Data Size: ${report.overview.totalSizeGB}GB`);
      console.log(`Index Size: ${report.overview.totalIndexSizeGB}GB`);
      console.log(`Active Indexes: ${report.indexPerformance.activeIndexes}/${report.overview.totalIndexes}`);
      console.log(`Health Score: ${report.healthScore}/100`);

      if (report.recommendations.length > 0) {
        console.log('\n💡 Top Recommendations:');
        report.recommendations.slice(0, 3).forEach(rec => {
          console.log(`  • ${rec}`);
        });
      }

      return report;

    } catch (error) {
      console.error('Error generating performance report:', error);
      throw error;
    }
  }

  identifyCommonQueryPatterns(performanceMetrics) {
    // Analyze query patterns to identify common access patterns
    const patterns = new Map();

    performanceMetrics.forEach(metric => {
      const pattern = metric.queryPattern || 'unknown';
      if (patterns.has(pattern)) {
        patterns.set(pattern, patterns.get(pattern) + 1);
      } else {
        patterns.set(pattern, 1);
      }
    });

    return Array.from(patterns.entries())
      .sort(([,a], [,b]) => b - a)
      .slice(0, 5)
      .map(([pattern, count]) => ({ pattern, count }));
  }

  generatePerformanceRecommendations(collectionStats, indexStats, queryMetrics) {
    const recommendations = [];

    // Index optimization recommendations
    const unusedCount = indexStats.filter(idx => idx.name !== '_id_' && idx.accesses?.ops === 0).length;
    if (unusedCount > 0) {
      recommendations.push(`Remove ${unusedCount} unused indexes to reduce storage and maintenance overhead`);
    }

    // Size recommendations
    const indexRatio = collectionStats.totalIndexSize / collectionStats.size;
    if (indexRatio > 1.5) {
      recommendations.push('High index-to-data ratio detected - review index necessity');
    }

    // Query performance recommendations
    const poorQueries = queryMetrics.filter(m => m.performanceRating === 'Poor').length;
    if (poorQueries > 0) {
      recommendations.push(`Optimize ${poorQueries} poorly performing query patterns`);
    }

    // Compound index recommendations
    const singleFieldIndexes = indexStats.filter(idx => 
      Object.keys(idx.key || {}).length === 1 && idx.name !== '_id_'
    ).length;
    if (singleFieldIndexes > 5) {
      recommendations.push('Consider consolidating single-field indexes into compound indexes');
    }

    return recommendations;
  }

  calculateHealthScore(collectionStats, indexStats, queryMetrics) {
    let score = 100;

    // Index efficiency penalty
    const unusedIndexes = indexStats.filter(idx => idx.name !== '_id_' && idx.accesses?.ops === 0).length;
    const totalIndexes = indexStats.length - 1; // Exclude _id_
    const unusedRatio = unusedIndexes / Math.max(totalIndexes, 1);
    score -= unusedRatio * 30; // Up to 30 points penalty

    // Size efficiency penalty
    const indexRatio = collectionStats.totalIndexSize / collectionStats.size;
    if (indexRatio > 2) score -= 20;
    else if (indexRatio > 1.5) score -= 10;

    // Query performance penalty
    const poorQueryRatio = queryMetrics.filter(m => m.performanceRating === 'Poor').length / Math.max(queryMetrics.length, 1);
    score -= poorQueryRatio * 25; // Up to 25 points penalty

    // Average execution time penalty
    const avgExecutionTime = queryMetrics.length > 0 ? 
      queryMetrics.reduce((sum, metric) => sum + metric.executionTime, 0) / queryMetrics.length : 0;
    if (avgExecutionTime > 100) score -= 15;
    else if (avgExecutionTime > 50) score -= 8;

    return Math.max(Math.round(score), 0);
  }
}

// Export the compound index strategist
module.exports = { CompoundIndexStrategist };

SQL-Style Index Optimization with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB index optimization and performance tuning:

-- QueryLeaf index optimization with SQL-familiar syntax

-- Create optimized indexes using SQL DDL syntax
CREATE INDEX idx_customer_order_history ON orders (
  customer.customer_id ASC,
  order_date DESC,
  status ASC
) WITH (
  background = true,
  name = 'idx_customer_order_history'
);

-- Create compound indexes following ESR (Equality, Sort, Range) pattern
CREATE INDEX idx_sales_analytics ON orders (
  sales_info.sales_rep_id ASC,     -- Equality filter (most selective)
  order_date DESC,                 -- Sort operation
  totals.grand_total DESC          -- Range filter
) WITH (
  background = true,
  partial_filter = 'status IN (''completed'', ''delivered'')'
);

-- Create geospatial index for location-based queries
CREATE INDEX idx_shipping_location ON orders 
USING GEOSPHERE (addresses.shipping.coordinates)
WITH (background = true);

-- Create text index for search functionality
CREATE INDEX idx_full_text_search ON orders 
USING TEXT (
  customer.name,
  customer.email, 
  order_number,
  items.name,
  items.sku
) WITH (
  default_language = 'english',
  background = true
);

-- Analyze query performance with SQL EXPLAIN
EXPLAIN (ANALYZE true, BUFFERS true) 
SELECT 
  order_number,
  customer.name,
  order_date,
  totals.grand_total,
  status
FROM orders 
WHERE customer.customer_id = ObjectId('64a1b2c3d4e5f6789012345a')
  AND order_date >= CURRENT_DATE - INTERVAL '90 days'
  AND status IN ('processing', 'shipped', 'delivered')
ORDER BY order_date DESC
LIMIT 20;

-- Index usage analysis and optimization recommendations
WITH index_usage_stats AS (
  SELECT 
    index_name,
    access_count,
    last_accessed,
    size_bytes,
    size_mb,

    -- Calculate utilization metrics
    CASE 
      WHEN access_count = 0 THEN 'Unused'
      WHEN access_count < 100 THEN 'Low'
      WHEN access_count < 10000 THEN 'Moderate'
      ELSE 'High'
    END as usage_level,

    -- Calculate efficiency score
    CASE 
      WHEN access_count = 0 THEN 0
      ELSE ROUND((access_count::numeric / (size_mb + 1)) * 100, 2)
    END as efficiency_score

  FROM mongodb_index_statistics('orders')
  WHERE index_name != '_id_'
),
index_recommendations AS (
  SELECT 
    index_name,
    usage_level,
    efficiency_score,
    size_mb,

    -- Generate recommendations based on usage patterns
    CASE 
      WHEN usage_level = 'Unused' THEN 'DROP - Unused index consuming storage'
      WHEN usage_level = 'Low' AND size_mb > 10 THEN 'REVIEW - Low usage for large index'
      WHEN efficiency_score > 1000 THEN 'MAINTAIN - High efficiency index'
      WHEN efficiency_score < 50 THEN 'OPTIMIZE - Poor efficiency ratio'
      ELSE 'MONITOR - Normal usage pattern'
    END as recommendation,

    -- Priority for action
    CASE 
      WHEN usage_level = 'Unused' THEN 'HIGH'
      WHEN usage_level = 'Low' AND size_mb > 50 THEN 'MEDIUM'
      WHEN efficiency_score < 25 THEN 'MEDIUM'
      ELSE 'LOW'
    END as priority

  FROM index_usage_stats
)
SELECT 
  index_name,
  usage_level,
  ROUND(efficiency_score, 2) as efficiency_score,
  ROUND(size_mb, 2) as size_mb,
  recommendation,
  priority,

  -- Estimated impact
  CASE 
    WHEN recommendation LIKE 'DROP%' THEN CONCAT('Save ', ROUND(size_mb, 1), 'MB storage')
    WHEN recommendation LIKE 'OPTIMIZE%' THEN 'Improve query performance'
    ELSE 'Monitor performance'
  END as estimated_impact

FROM index_recommendations
ORDER BY 
  CASE priority 
    WHEN 'HIGH' THEN 1 
    WHEN 'MEDIUM' THEN 2 
    ELSE 3 
  END,
  efficiency_score ASC;

-- Compound index optimization analysis
WITH query_pattern_analysis AS (
  SELECT 
    collection_name,
    query_pattern,
    avg_execution_time_ms,
    avg_docs_examined,
    avg_docs_returned,

    -- Calculate selectivity ratio
    CASE 
      WHEN avg_docs_examined > 0 THEN 
        ROUND((avg_docs_returned::numeric / avg_docs_examined) * 100, 2)
      ELSE 0 
    END as selectivity_percent,

    -- Identify query pattern type
    CASE 
      WHEN query_pattern LIKE '%customer_id%' AND query_pattern LIKE '%order_date%' THEN 'customer_history'
      WHEN query_pattern LIKE '%status%' AND query_pattern LIKE '%priority%' THEN 'fulfillment'
      WHEN query_pattern LIKE '%sales_rep%' THEN 'sales_analytics'
      WHEN query_pattern LIKE '%location%' THEN 'geographic'
      ELSE 'other'
    END as pattern_type

  FROM mongodb_query_performance_log
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND collection_name = 'orders'
),
index_optimization_opportunities AS (
  SELECT 
    pattern_type,
    COUNT(*) as query_count,
    AVG(avg_execution_time_ms) as avg_execution_time,
    AVG(selectivity_percent) as avg_selectivity,

    -- Performance classification
    CASE 
      WHEN AVG(avg_execution_time_ms) > 100 THEN 'Poor'
      WHEN AVG(avg_execution_time_ms) > 50 THEN 'Fair'
      WHEN AVG(avg_execution_time_ms) > 10 THEN 'Good'
      ELSE 'Excellent'
    END as performance_rating,

    -- Optimization recommendations
    CASE pattern_type
      WHEN 'customer_history' THEN 'Compound index: customer_id + order_date + status'
      WHEN 'fulfillment' THEN 'Compound index: status + priority + order_date'
      WHEN 'sales_analytics' THEN 'Compound index: sales_rep_id + order_date + total_amount'
      WHEN 'geographic' THEN 'Geospatial index: shipping_coordinates + status + date'
      ELSE 'Analyze query patterns for custom compound index'
    END as index_recommendation

  FROM query_pattern_analysis
  GROUP BY pattern_type
  HAVING COUNT(*) >= 10  -- Only analyze patterns with sufficient volume
)
SELECT 
  pattern_type,
  query_count,
  ROUND(avg_execution_time, 2) as avg_execution_time_ms,
  ROUND(avg_selectivity, 2) as avg_selectivity_percent,
  performance_rating,
  index_recommendation,

  -- Optimization priority
  CASE 
    WHEN performance_rating = 'Poor' AND query_count > 1000 THEN 'CRITICAL'
    WHEN performance_rating IN ('Poor', 'Fair') AND query_count > 100 THEN 'HIGH'
    WHEN performance_rating = 'Fair' THEN 'MEDIUM'
    ELSE 'LOW'
  END as optimization_priority

FROM index_optimization_opportunities
ORDER BY 
  CASE optimization_priority
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2 
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END,
  query_count DESC;

-- Performance monitoring dashboard
WITH performance_metrics AS (
  SELECT 
    DATE_TRUNC('hour', timestamp) as hour_bucket,
    collection_name,

    -- Query performance metrics
    COUNT(*) as total_queries,
    AVG(execution_time_ms) as avg_execution_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time,
    MAX(execution_time_ms) as max_execution_time,

    -- Index usage metrics
    AVG(docs_examined::numeric / GREATEST(docs_returned, 1)) as avg_docs_per_result,
    AVG(CASE WHEN index_used THEN 1.0 ELSE 0.0 END) as index_usage_ratio,

    -- Query efficiency
    AVG(CASE WHEN docs_examined > 0 THEN docs_returned::numeric / docs_examined ELSE 1 END) as avg_selectivity,

    -- Performance classification
    COUNT(*) FILTER (WHERE execution_time_ms <= 10) as excellent_queries,
    COUNT(*) FILTER (WHERE execution_time_ms <= 50) as good_queries,
    COUNT(*) FILTER (WHERE execution_time_ms <= 100) as fair_queries,
    COUNT(*) FILTER (WHERE execution_time_ms > 100) as poor_queries

  FROM mongodb_query_performance_log
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND collection_name = 'orders'
  GROUP BY DATE_TRUNC('hour', timestamp), collection_name
),
performance_trends AS (
  SELECT *,
    -- Calculate performance trends
    LAG(avg_execution_time) OVER (ORDER BY hour_bucket) as prev_hour_avg_time,
    LAG(index_usage_ratio) OVER (ORDER BY hour_bucket) as prev_hour_index_usage,

    -- Performance health score (0-100)
    ROUND(
      (excellent_queries::numeric / total_queries * 40) +
      (good_queries::numeric / total_queries * 30) +
      (fair_queries::numeric / total_queries * 20) +
      (index_usage_ratio * 10),
      0
    ) as performance_health_score

  FROM performance_metrics
)
SELECT 
  TO_CHAR(hour_bucket, 'YYYY-MM-DD HH24:00') as monitoring_hour,
  total_queries,
  ROUND(avg_execution_time::numeric, 2) as avg_execution_time_ms,
  ROUND(p95_execution_time::numeric, 2) as p95_execution_time_ms,
  ROUND((index_usage_ratio * 100)::numeric, 1) as index_usage_percent,
  ROUND((avg_selectivity * 100)::numeric, 2) as avg_selectivity_percent,
  performance_health_score,

  -- Performance distribution
  CONCAT(
    excellent_queries, ' excellent, ',
    good_queries, ' good, ', 
    fair_queries, ' fair, ',
    poor_queries, ' poor'
  ) as query_distribution,

  -- Trend indicators
  CASE 
    WHEN avg_execution_time > prev_hour_avg_time * 1.2 THEN '📈 Degrading'
    WHEN avg_execution_time < prev_hour_avg_time * 0.8 THEN '📉 Improving' 
    ELSE '➡️ Stable'
  END as performance_trend,

  -- Health status
  CASE 
    WHEN performance_health_score >= 90 THEN '🟢 Excellent'
    WHEN performance_health_score >= 75 THEN '🟡 Good'
    WHEN performance_health_score >= 60 THEN '🟠 Fair'
    ELSE '🔴 Poor'
  END as health_status,

  -- Recommendations
  CASE 
    WHEN performance_health_score < 60 THEN 'Immediate optimization required'
    WHEN index_usage_ratio < 0.8 THEN 'Review query patterns and add missing indexes'
    WHEN avg_selectivity < 0.1 THEN 'Improve query selectivity with better filtering'
    WHEN poor_queries > total_queries * 0.1 THEN 'Optimize slow query patterns'
    ELSE 'Performance within acceptable range'
  END as recommendation

FROM performance_trends
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY hour_bucket DESC;

-- Index maintenance automation
CREATE PROCEDURE optimize_collection_indexes(
  collection_name VARCHAR(100),
  maintenance_mode VARCHAR(20) DEFAULT 'conservative'
) AS
BEGIN
  -- Analyze current index usage
  WITH index_analysis AS (
    SELECT 
      index_name,
      access_count,
      size_bytes,
      last_accessed,
      CASE 
        WHEN access_count = 0 THEN 'unused'
        WHEN access_count < 10 AND size_bytes > 10 * 1024 * 1024 THEN 'underutilized'
        WHEN access_count > 50000 THEN 'high_usage'
        ELSE 'normal'
      END as usage_category
    FROM mongodb_index_statistics(collection_name)
    WHERE index_name != '_id_'
  )
  SELECT 
    COUNT(*) FILTER (WHERE usage_category = 'unused') as unused_count,
    COUNT(*) FILTER (WHERE usage_category = 'underutilized') as underutilized_count,
    SUM(size_bytes) FILTER (WHERE usage_category = 'unused') as unused_size_bytes
  INTO TEMPORARY TABLE maintenance_summary;

  -- Execute maintenance based on mode
  CASE maintenance_mode
    WHEN 'aggressive' THEN
      -- Drop unused and underutilized indexes
      CALL mongodb_drop_unused_indexes(collection_name);
      CALL mongodb_review_underutilized_indexes(collection_name);

    WHEN 'conservative' THEN 
      -- Only drop clearly unused indexes (0 access, older than 30 days)
      CALL mongodb_drop_unused_indexes(collection_name, min_age_days => 30);

    WHEN 'analyze_only' THEN
      -- Generate report without making changes
      CALL mongodb_generate_index_report(collection_name);
  END CASE;

  -- Log maintenance activity
  INSERT INTO index_maintenance_log (
    collection_name,
    maintenance_mode,
    maintenance_timestamp,
    unused_indexes_dropped,
    storage_saved_bytes
  ) 
  SELECT 
    collection_name,
    maintenance_mode,
    CURRENT_TIMESTAMP,
    (SELECT unused_count FROM maintenance_summary),
    (SELECT unused_size_bytes FROM maintenance_summary);

  COMMIT;
END;

-- QueryLeaf provides comprehensive index optimization capabilities:
-- 1. SQL-familiar index creation and management syntax
-- 2. Advanced compound index strategies with ESR pattern optimization
-- 3. Automated query performance analysis and explain plan interpretation
-- 4. Index usage monitoring and utilization tracking
-- 5. Performance trend analysis and health scoring
-- 6. Automated optimization recommendations based on usage patterns
-- 7. Maintenance procedures for index lifecycle management
-- 8. Integration with MongoDB's native indexing and performance features
-- 9. Real-time performance monitoring with alerting capabilities
-- 10. Familiar SQL patterns for complex index optimization requirements

Best Practices for Index Optimization

Index Design Principles

Essential practices for effective MongoDB index optimization:

ESR Pattern: Design compound indexes following Equality-Sort-Range order
Query-First Design: Create indexes based on actual query patterns, not theoretical needs
Selectivity Optimization: Place most selective fields first in compound indexes
Index Intersection: Leverage MongoDB's ability to use multiple indexes for complex queries
Covering Indexes: Include frequently accessed fields to avoid document lookups
Maintenance Balance: Balance query performance with write performance and storage costs

Performance Monitoring

Implement comprehensive performance monitoring for production environments:

Continuous Analysis: Monitor query performance patterns and execution statistics
Usage Tracking: Track index utilization to identify unused or underutilized indexes
Trend Analysis: Identify performance degradation trends before they impact users
Automated Alerting: Set up alerts for slow queries and index efficiency metrics
Regular Optimization: Schedule periodic index analysis and optimization cycles
Capacity Planning: Monitor index growth and plan for scaling requirements

Conclusion

MongoDB Index Optimization provides comprehensive query performance tuning capabilities that eliminate the complexity and manual overhead of traditional database optimization approaches. The combination of intelligent compound indexing, automated performance analysis, and sophisticated query execution monitoring enables proactive performance management that scales with growing data volumes and evolving access patterns.

Key Index Optimization benefits include:

Intelligent Compound Indexing: Advanced ESR pattern optimization for maximum query efficiency
Automated Performance Analysis: Comprehensive query execution analysis with actionable recommendations
Usage-Based Optimization: Index recommendations based on actual utilization patterns
Comprehensive Monitoring: Real-time performance tracking with trend analysis and alerting
Maintenance Automation: Automated cleanup of unused indexes and optimization suggestions
Developer Familiarity: SQL-style optimization patterns with MongoDB's flexible indexing system

Whether you're building high-traffic web applications, analytics platforms, real-time systems, or any application requiring exceptional database performance, MongoDB Index Optimization with QueryLeaf's familiar SQL interface provides the foundation for enterprise-grade performance engineering. This combination enables you to implement sophisticated optimization strategies while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically translates SQL index operations into MongoDB index management, providing SQL-familiar CREATE INDEX syntax, EXPLAIN plan analysis, and performance monitoring queries. Advanced optimization strategies, compound index design, and automated maintenance are seamlessly handled through familiar SQL patterns, making enterprise performance optimization both powerful and accessible.

The integration of comprehensive optimization capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both exceptional performance and familiar database optimization patterns, ensuring your performance solutions remain both effective and maintainable as they scale and evolve.

November 10, 2025
29 min read

MongoDB Replica Sets for High Availability and Data Resilience: SQL-Compatible Distributed Database Architecture

Modern enterprise applications require database systems that can maintain continuous availability, handle hardware failures gracefully, and provide data redundancy without sacrificing performance or consistency. Traditional single-server database deployments create critical points of failure that can result in extended downtime, data loss, and significant business disruption when servers crash, networks fail, or maintenance windows require database restarts.

MongoDB Replica Sets provide comprehensive high availability and data resilience through automatic replication, intelligent failover, and distributed consensus mechanisms. Unlike traditional master-slave replication that requires manual intervention during failures, MongoDB Replica Sets automatically elect new primary nodes, maintain data consistency across multiple servers, and provide configurable read preferences for optimal performance and availability.

The Traditional High Availability Challenge

Conventional database high availability solutions have significant complexity and operational overhead:

-- Traditional PostgreSQL high availability setup - complex and operationally intensive

-- Primary server configuration with write-ahead logging
CREATE TABLE critical_business_data (
    transaction_id BIGSERIAL PRIMARY KEY,
    account_id BIGINT NOT NULL,
    transaction_type VARCHAR(50) NOT NULL,
    amount DECIMAL(15,2) NOT NULL,
    transaction_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Business logic fields
    from_account BIGINT,
    to_account BIGINT, 
    description TEXT,
    reference_number VARCHAR(100),
    status VARCHAR(20) DEFAULT 'pending',

    -- Audit and compliance
    created_by VARCHAR(100),
    authorized_by VARCHAR(100),
    authorized_at TIMESTAMP,

    -- Geographic and regulatory
    processing_region VARCHAR(50),
    regulatory_flags JSONB,

    -- System metadata
    server_id VARCHAR(50),
    processing_node VARCHAR(50),

    CONSTRAINT valid_transaction_type CHECK (
        transaction_type IN ('deposit', 'withdrawal', 'transfer', 'fee', 'interest', 'adjustment')
    ),
    CONSTRAINT valid_status CHECK (
        status IN ('pending', 'processing', 'completed', 'failed', 'cancelled')
    ),
    CONSTRAINT valid_amount CHECK (amount != 0)
);

-- Complex indexing for performance across multiple servers
CREATE INDEX idx_transactions_account_timestamp ON critical_business_data(account_id, transaction_timestamp DESC);
CREATE INDEX idx_transactions_status_type ON critical_business_data(status, transaction_type, transaction_timestamp);
CREATE INDEX idx_transactions_reference ON critical_business_data(reference_number) WHERE reference_number IS NOT NULL;
CREATE INDEX idx_transactions_region ON critical_business_data(processing_region, transaction_timestamp);

-- Streaming replication configuration (requires extensive setup)
-- postgresql.conf settings required:
-- wal_level = replica
-- max_wal_senders = 3
-- max_replication_slots = 3
-- archive_mode = on
-- archive_command = 'cp %p /var/lib/postgresql/archive/%f'

-- Create replication user (security complexity)
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'complex_secure_password';
GRANT CONNECT ON DATABASE production TO replication_user;

-- Manual standby server setup required on each replica
-- pg_basebackup -h primary_server -D /var/lib/postgresql/standby -U replication_user -v -P -W

-- Standby server recovery configuration (recovery.conf)
-- standby_mode = 'on'
-- primary_conninfo = 'host=primary_server port=5432 user=replication_user password=complex_secure_password'
-- trigger_file = '/var/lib/postgresql/failover_trigger'

-- Connection pooling and load balancing (requires external tools)
-- HAProxy configuration for read/write splitting
-- backend postgresql_primary
--   server primary primary_server:5432 check
-- backend postgresql_standby
--   server standby1 standby1_server:5432 check
--   server standby2 standby2_server:5432 check

-- Monitoring and health checking (complex setup)
SELECT 
    client_addr,
    state,
    sync_state,
    sync_priority,

    -- Replication lag monitoring
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) / 1024 / 1024 as flush_lag_mb,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 as replay_lag_mb,

    -- Time-based lag analysis
    EXTRACT(EPOCH FROM (now() - backend_start)) as connection_age_seconds,
    EXTRACT(EPOCH FROM (now() - state_change)) as state_change_age_seconds

FROM pg_stat_replication
ORDER BY sync_priority, client_addr;

-- Manual failover procedure (complex and error-prone)
-- 1. Check replication status and lag
-- 2. Stop applications from connecting to primary
-- 3. Ensure all transactions are replicated
-- 4. Create trigger file on desired standby: touch /var/lib/postgresql/failover_trigger
-- 5. Update application connection strings
-- 6. Redirect traffic to new primary
-- 7. Reconfigure remaining standbys to follow new primary

-- Problems with traditional PostgreSQL HA:
-- 1. Complex manual setup and configuration management
-- 2. Manual failover procedures with potential for human error
-- 3. Split-brain scenarios without proper fencing mechanisms
-- 4. No automatic conflict resolution during network partitions
-- 5. Requires external load balancers and connection pooling solutions
-- 6. Limited built-in monitoring and alerting capabilities
-- 7. Difficult to add/remove replica servers dynamically
-- 8. Complex backup and recovery procedures across multiple servers
-- 9. No built-in read preference configuration
-- 10. Requires significant PostgreSQL expertise for proper maintenance

-- MySQL replication limitations (even more manual)
-- Enable binary logging on master:
-- log-bin=mysql-bin
-- server-id=1

-- Manual slave configuration:
CHANGE MASTER TO
  MASTER_HOST='master_server',
  MASTER_USER='replication_user',
  MASTER_PASSWORD='replication_password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;

START SLAVE;

-- Basic replication monitoring
SHOW SLAVE STATUS\G

-- MySQL HA problems:
-- - No automatic failover mechanisms
-- - Manual binary log position management
-- - Limited conflict resolution capabilities
-- - Basic monitoring and error reporting
-- - Complex setup for multi-master scenarios
-- - No built-in load balancing or read distribution
-- - Requires external tools for comprehensive HA solutions

MongoDB Replica Sets provide comprehensive high availability with minimal operational overhead:

// MongoDB Replica Sets - enterprise-ready high availability with automatic management
const { MongoClient } = require('mongodb');

// Replica set connection with automatic failover handling
const client = new MongoClient('mongodb://server1:27017,server2:27017,server3:27017/production?replicaSet=production-rs', {
  // Connection options optimized for high availability
  maxPoolSize: 50,
  minPoolSize: 5,
  maxIdleTimeMS: 300000, // 5 minutes
  serverSelectionTimeoutMS: 5000,
  heartbeatFrequencyMS: 10000, // 10 seconds

  // Read and write preferences for optimal performance
  readPreference: 'secondaryPreferred', // Distribute read load
  writeConcern: { w: 'majority', j: true, wtimeout: 5000 }, // Ensure data durability

  // Advanced replica set options
  maxStalenessSeconds: 90, // Maximum acceptable replication lag
  readConcern: { level: 'majority' }, // Ensure consistent reads

  // Connection resilience
  connectTimeoutMS: 10000,
  socketTimeoutMS: 30000,
  retryWrites: true,
  retryReads: true
});

const db = client.db('enterprise_production');

// Comprehensive business data model with replica set optimization
const setupEnterpriseCollections = async () => {
  console.log('Setting up enterprise collections with replica set optimization...');

  // Financial transactions with high availability requirements
  const transactions = db.collection('financial_transactions');

  // Sample enterprise transaction document structure
  const transactionDocument = {
    _id: new ObjectId(),

    // Transaction identification
    transactionId: "TXN-2025-11-10-001234567",
    externalReference: "EXT-REF-987654321",

    // Account information
    accounts: {
      sourceAccount: {
        accountId: new ObjectId("64a1b2c3d4e5f6789012347a"),
        accountNumber: "ACC-123456789",
        accountType: "checking",
        accountHolder: "Enterprise Customer LLC",
        bankCode: "ENTBANK001"
      },
      destinationAccount: {
        accountId: new ObjectId("64a1b2c3d4e5f6789012347b"), 
        accountNumber: "ACC-987654321",
        accountType: "savings",
        accountHolder: "Business Partner Inc",
        bankCode: "PARTNER002"
      }
    },

    // Transaction details
    transaction: {
      type: "wire_transfer", // wire_transfer, ach_transfer, check_deposit, etc.
      category: "business_payment",
      subcategory: "vendor_payment",

      // Financial amounts
      amount: {
        value: 125000.00,
        currency: "USD",
        precision: 2
      },

      fees: {
        processingFee: 25.00,
        wireTransferFee: 15.00,
        regulatoryFee: 2.50,
        totalFees: 42.50
      },

      // Exchange rate information (for international transfers)
      exchangeRate: {
        fromCurrency: "USD",
        toCurrency: "USD", 
        rate: 1.0000,
        rateTimestamp: new Date("2025-11-10T14:30:00Z"),
        rateProvider: "internal"
      }
    },

    // Status and workflow tracking
    status: {
      current: "pending_authorization", // pending, authorized, processing, completed, failed, cancelled
      workflow: [
        {
          status: "initiated",
          timestamp: new Date("2025-11-10T14:30:00Z"),
          userId: new ObjectId("64b2c3d4e5f6789012347c1a"),
          userRole: "customer",
          notes: "Transaction initiated via mobile banking"
        },
        {
          status: "validated",
          timestamp: new Date("2025-11-10T14:30:15Z"),
          userId: "system",
          userRole: "automated_validation",
          notes: "Account balance and limits validated"
        }
      ],

      // Authorization requirements
      authorization: {
        required: true,
        level: "dual_approval", // single, dual_approval, committee
        approvals: [
          {
            approverId: new ObjectId("64b2c3d4e5f6789012347c1b"),
            approverRole: "account_manager", 
            status: "approved",
            timestamp: new Date("2025-11-10T14:32:00Z"),
            notes: "Verified customer and transaction purpose"
          }
        ],
        pendingApprovals: [
          {
            approverId: new ObjectId("64b2c3d4e5f6789012347c1c"),
            approverRole: "compliance_officer",
            requiredBy: new Date("2025-11-10T16:30:00Z")
          }
        ]
      }
    },

    // Risk and compliance
    riskAssessment: {
      riskScore: 35, // 0-100 scale
      riskLevel: "medium", // low, medium, high, critical
      riskFactors: [
        {
          factor: "transaction_amount",
          score: 15,
          description: "Large transaction amount"
        },
        {
          factor: "customer_history",
          score: -5,
          description: "Established customer with good history"
        },
        {
          factor: "destination_account",
          score: 10,
          description: "New destination account"
        }
      ],

      complianceChecks: {
        amlScreening: {
          status: "completed",
          result: "clear",
          timestamp: new Date("2025-11-10T14:30:20Z"),
          provider: "compliance_engine"
        },
        sanctionsScreening: {
          status: "completed",
          result: "clear",
          timestamp: new Date("2025-11-10T14:30:22Z"),
          provider: "sanctions_database"
        },
        fraudDetection: {
          status: "completed",
          result: "low_risk", 
          score: 12,
          timestamp: new Date("2025-11-10T14:30:25Z"),
          provider: "fraud_detection_ai"
        }
      }
    },

    // Processing information
    processing: {
      scheduledProcessingTime: new Date("2025-11-10T15:00:00Z"),
      actualProcessingTime: null,
      processingServer: "txn-processor-03",
      processingRegion: "us-east-1",

      // Retry and error handling
      attemptCount: 1,
      maxAttempts: 3,
      lastAttemptTime: new Date("2025-11-10T14:30:00Z"),

      errors: [],

      // Performance tracking
      processingMetrics: {
        validationTimeMs: 150,
        riskAssessmentTimeMs: 250,
        complianceCheckTimeMs: 420,
        totalProcessingTimeMs: null
      }
    },

    // Audit and regulatory compliance
    audit: {
      createdAt: new Date("2025-11-10T14:30:00Z"),
      createdBy: new ObjectId("64b2c3d4e5f6789012347c1a"),
      updatedAt: new Date("2025-11-10T14:32:00Z"),
      updatedBy: new ObjectId("64b2c3d4e5f6789012347c1b"),
      version: 2,

      // Detailed change tracking
      changeLog: [
        {
          timestamp: new Date("2025-11-10T14:30:00Z"),
          userId: new ObjectId("64b2c3d4e5f6789012347c1a"),
          action: "transaction_initiated",
          changes: ["status.current", "transaction", "accounts"],
          ipAddress: "192.168.1.100",
          userAgent: "MobileBankingApp/2.1.3"
        },
        {
          timestamp: new Date("2025-11-10T14:32:00Z"),
          userId: new ObjectId("64b2c3d4e5f6789012347c1b"),
          action: "authorization_approved",
          changes: ["status.authorization.approvals"],
          ipAddress: "10.0.1.50",
          userAgent: "EnterprisePortal/1.8.2"
        }
      ]
    },

    // Geographic and regulatory context
    geography: {
      originatingCountry: "US",
      originatingState: "CA",
      destinationCountry: "US",
      destinationState: "NY",

      // Regulatory requirements by jurisdiction
      regulations: {
        uspCompliance: true,
        internationalTransfer: false,
        reportingThreshold: 10000.00,
        reportingRequired: true,
        reportingDeadline: new Date("2025-11-11T23:59:59Z")
      }
    },

    // System and operational metadata
    metadata: {
      environment: "production",
      dataCenter: "primary",
      applicationVersion: "banking-core-v3.2.1",
      correlationId: "corr-uuid-12345678-90ab-cdef",

      // High availability tracking
      replicaSet: {
        writeConcern: "majority",
        readPreference: "primary",
        maxStalenessSeconds: 60
      },

      // Performance optimization
      indexHints: {
        preferredIndex: "idx_transactions_account_status_date",
        queryOptimizer: "enabled"
      }
    }
  };

  // Insert sample transaction
  await transactions.insertOne(transactionDocument);

  // Create comprehensive indexes optimized for replica set performance
  await Promise.all([
    // Primary business query indexes
    transactions.createIndex(
      { 
        "accounts.sourceAccount.accountId": 1, 
        "audit.createdAt": -1,
        "status.current": 1 
      },
      { 
        name: "idx_transactions_source_account_date_status",
        background: true // Non-blocking index creation
      }
    ),

    transactions.createIndex(
      { 
        "transaction.type": 1,
        "status.current": 1,
        "audit.createdAt": -1 
      },
      { 
        name: "idx_transactions_type_status_date",
        background: true
      }
    ),

    // Risk and compliance queries
    transactions.createIndex(
      { 
        "riskAssessment.riskLevel": 1,
        "status.current": 1 
      },
      { 
        name: "idx_transactions_risk_status",
        background: true
      }
    ),

    // Authorization workflow queries
    transactions.createIndex(
      { 
        "status.authorization.pendingApprovals.approverId": 1,
        "status.current": 1 
      },
      { 
        name: "idx_transactions_pending_approvals",
        background: true
      }
    ),

    // Processing and scheduling queries
    transactions.createIndex(
      { 
        "processing.scheduledProcessingTime": 1,
        "status.current": 1 
      },
      { 
        name: "idx_transactions_scheduled_processing",
        background: true
      }
    ),

    // Geographic and regulatory reporting
    transactions.createIndex(
      { 
        "geography.originatingCountry": 1,
        "geography.regulations.reportingRequired": 1,
        "audit.createdAt": -1 
      },
      { 
        name: "idx_transactions_geography_reporting",
        background: true
      }
    ),

    // Full-text search for transaction descriptions and references
    transactions.createIndex(
      { 
        "transactionId": "text",
        "externalReference": "text",
        "transaction.category": "text",
        "accounts.sourceAccount.accountHolder": "text"
      },
      { 
        name: "idx_transactions_text_search",
        background: true
      }
    )
  ]);

  console.log('Enterprise collections and indexes created successfully');
  return { transactions };
};

// Replica Set Management and Monitoring
class ReplicaSetManager {
  constructor(client, db) {
    this.client = client;
    this.db = db;
    this.admin = client.db('admin');
    this.monitoringInterval = null;
    this.healthMetrics = {
      lastCheck: null,
      replicaSetStatus: null,
      memberHealth: [],
      replicationLag: {},
      alerts: []
    };
  }

  async initializeReplicaSetMonitoring() {
    console.log('Initializing replica set monitoring and management...');

    try {
      // Get initial replica set configuration
      const config = await this.getReplicaSetConfig();
      console.log('Current replica set configuration:', JSON.stringify(config, null, 2));

      // Start continuous health monitoring
      await this.startHealthMonitoring();

      // Setup automatic failover testing (in non-production environments)
      if (process.env.NODE_ENV !== 'production') {
        await this.setupFailoverTesting();
      }

      console.log('Replica set monitoring initialized successfully');

    } catch (error) {
      console.error('Failed to initialize replica set monitoring:', error);
      throw error;
    }
  }

  async getReplicaSetConfig() {
    try {
      // Get replica set configuration
      const config = await this.admin.command({ replSetGetConfig: 1 });

      return {
        setName: config.config._id,
        version: config.config.version,
        members: config.config.members.map(member => ({
          id: member._id,
          host: member.host,
          priority: member.priority,
          votes: member.votes,
          hidden: member.hidden || false,
          buildIndexes: member.buildIndexes !== false,
          tags: member.tags || {}
        })),
        settings: config.config.settings || {}
      };
    } catch (error) {
      console.error('Error getting replica set config:', error);
      throw error;
    }
  }

  async getReplicaSetStatus() {
    try {
      // Get current replica set status
      const status = await this.admin.command({ replSetGetStatus: 1 });

      const members = status.members.map(member => ({
        id: member._id,
        name: member.name,
        health: member.health,
        state: this.getStateDescription(member.state),
        stateStr: member.stateStr,
        uptime: member.uptime,
        optimeDate: member.optimeDate,
        lastHeartbeat: member.lastHeartbeat,
        lastHeartbeatRecv: member.lastHeartbeatRecv,
        pingMs: member.pingMs,
        syncSourceHost: member.syncSourceHost,
        syncSourceId: member.syncSourceId,

        // Replication lag calculation
        lag: status.members[0].optimeDate && member.optimeDate ? 
          Math.abs(status.members[0].optimeDate - member.optimeDate) / 1000 : 0
      }));

      return {
        setName: status.set,
        date: status.date,
        myState: this.getStateDescription(status.myState),
        primary: members.find(member => member.state === 'PRIMARY'),
        members: members,
        heartbeatIntervalMillis: status.heartbeatIntervalMillis
      };

    } catch (error) {
      console.error('Error getting replica set status:', error);
      throw error;
    }
  }

  getStateDescription(state) {
    const states = {
      0: 'STARTUP',
      1: 'PRIMARY',
      2: 'SECONDARY', 
      3: 'RECOVERING',
      5: 'STARTUP2',
      6: 'UNKNOWN',
      7: 'ARBITER',
      8: 'DOWN',
      9: 'ROLLBACK',
      10: 'REMOVED'
    };
    return states[state] || `UNKNOWN_STATE_${state}`;
  }

  async startHealthMonitoring() {
    console.log('Starting continuous replica set health monitoring...');

    // Monitor replica set health every 30 seconds
    this.monitoringInterval = setInterval(async () => {
      try {
        await this.performHealthCheck();
      } catch (error) {
        console.error('Health check failed:', error);
        this.healthMetrics.alerts.push({
          timestamp: new Date(),
          level: 'ERROR',
          message: 'Health check failed',
          error: error.message
        });
      }
    }, 30000);

    // Perform initial health check
    await this.performHealthCheck();
  }

  async performHealthCheck() {
    const checkStartTime = new Date();
    console.log('Performing replica set health check...');

    try {
      // Get current replica set status
      const status = await this.getReplicaSetStatus();
      this.healthMetrics.lastCheck = checkStartTime;
      this.healthMetrics.replicaSetStatus = status;

      // Check for alerts
      const alerts = [];

      // Check if primary is available
      if (!status.primary) {
        alerts.push({
          timestamp: checkStartTime,
          level: 'CRITICAL',
          message: 'No primary member found in replica set'
        });
      }

      // Check member health
      status.members.forEach(member => {
        if (member.health !== 1) {
          alerts.push({
            timestamp: checkStartTime,
            level: 'WARNING',
            message: `Member ${member.name} has health status ${member.health}`
          });
        }

        // Check replication lag
        if (member.lag > 30) { // 30 seconds
          alerts.push({
            timestamp: checkStartTime,
            level: 'WARNING',
            message: `Member ${member.name} has replication lag of ${member.lag} seconds`
          });
        }

        // Check for high ping times
        if (member.pingMs && member.pingMs > 100) { // 100ms
          alerts.push({
            timestamp: checkStartTime,
            level: 'INFO',
            message: `Member ${member.name} has high ping time of ${member.pingMs}ms`
          });
        }
      });

      // Update health metrics
      this.healthMetrics.memberHealth = status.members;
      this.healthMetrics.alerts = [...alerts, ...this.healthMetrics.alerts.slice(0, 50)]; // Keep last 50 alerts

      // Log health status
      if (alerts.length > 0) {
        console.warn(`Health check found ${alerts.length} issues:`, alerts);
      } else {
        console.log('Replica set health check passed - all members healthy');
      }

      // Store health metrics in database for historical analysis
      await this.storeHealthMetrics();

    } catch (error) {
      console.error('Health check error:', error);
      throw error;
    }
  }

  async storeHealthMetrics() {
    try {
      const healthCollection = this.db.collection('replica_set_health');

      const healthRecord = {
        timestamp: this.healthMetrics.lastCheck,
        replicaSetName: this.healthMetrics.replicaSetStatus?.setName,

        // Summary metrics
        summary: {
          totalMembers: this.healthMetrics.memberHealth?.length || 0,
          healthyMembers: this.healthMetrics.memberHealth?.filter(m => m.health === 1).length || 0,
          primaryAvailable: !!this.healthMetrics.replicaSetStatus?.primary,
          maxReplicationLag: Math.max(...this.healthMetrics.memberHealth?.map(m => m.lag) || [0]),
          alertCount: this.healthMetrics.alerts?.filter(a => a.timestamp > new Date(Date.now() - 300000)).length || 0 // Last 5 minutes
        },

        // Detailed member status
        members: this.healthMetrics.memberHealth?.map(member => ({
          name: member.name,
          state: member.state,
          health: member.health,
          uptime: member.uptime,
          replicationLag: member.lag,
          pingMs: member.pingMs,
          isPrimary: member.state === 'PRIMARY'
        })) || [],

        // Recent alerts
        recentAlerts: this.healthMetrics.alerts?.filter(a => 
          a.timestamp > new Date(Date.now() - 300000)
        ) || [],

        // Performance metrics
        performance: {
          healthCheckDuration: Date.now() - this.healthMetrics.lastCheck?.getTime(),
          heartbeatInterval: this.healthMetrics.replicaSetStatus?.heartbeatIntervalMillis
        }
      };

      // Insert with short TTL for cleanup
      await healthCollection.insertOne({
        ...healthRecord,
        expiresAt: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000) // 7 days
      });

    } catch (error) {
      console.warn('Failed to store health metrics:', error);
    }
  }

  async performReadWriteOperations() {
    console.log('Testing read/write operations across replica set...');

    const testCollection = this.db.collection('replica_set_tests');
    const testStartTime = new Date();

    try {
      // Test write operation (will go to primary)
      const writeResult = await testCollection.insertOne({
        testType: 'replica_set_write_test',
        timestamp: testStartTime,
        testData: 'Testing write operation to primary',
        serverId: 'test-operation'
      }, {
        writeConcern: { w: 'majority', j: true, wtimeout: 5000 }
      });

      console.log('Write operation successful:', writeResult.insertedId);

      // Test read from secondary (if available)
      const readFromSecondary = await testCollection.findOne(
        { _id: writeResult.insertedId },
        { 
          readPreference: 'secondary',
          maxStalenessSeconds: 90 
        }
      );

      if (readFromSecondary) {
        const replicationDelay = new Date() - readFromSecondary.timestamp;
        console.log(`Read from secondary successful, replication delay: ${replicationDelay}ms`);
      } else {
        console.log('Read from secondary not available or data not yet replicated');
      }

      // Test read from primary
      const readFromPrimary = await testCollection.findOne(
        { _id: writeResult.insertedId },
        { readPreference: 'primary' }
      );

      console.log('Read from primary successful:', !!readFromPrimary);

      // Cleanup test document
      await testCollection.deleteOne({ _id: writeResult.insertedId });

      return {
        writeSuccessful: true,
        readFromSecondarySuccessful: !!readFromSecondary,
        readFromPrimarySuccessful: !!readFromPrimary,
        testDuration: Date.now() - testStartTime
      };

    } catch (error) {
      console.error('Read/write operation test failed:', error);
      return {
        writeSuccessful: false,
        error: error.message,
        testDuration: Date.now() - testStartTime
      };
    }
  }

  async demonstrateReadPreferences() {
    console.log('Demonstrating various read preferences...');

    const testCollection = this.db.collection('financial_transactions');

    try {
      // 1. Read from primary (default)
      console.log('\n1. Reading from PRIMARY:');
      const primaryStart = Date.now();
      const primaryResult = await testCollection.find({}, {
        readPreference: 'primary'
      }).limit(5).toArray();
      console.log(`   - Read ${primaryResult.length} documents from primary in ${Date.now() - primaryStart}ms`);

      // 2. Read from secondary (load balancing)
      console.log('\n2. Reading from SECONDARY:');
      const secondaryStart = Date.now();
      const secondaryResult = await testCollection.find({}, {
        readPreference: 'secondary',
        maxStalenessSeconds: 120 // Accept data up to 2 minutes old
      }).limit(5).toArray();
      console.log(`   - Read ${secondaryResult.length} documents from secondary in ${Date.now() - secondaryStart}ms`);

      // 3. Read from secondary preferred (fallback to primary)
      console.log('\n3. Reading with SECONDARY PREFERRED:');
      const secondaryPrefStart = Date.now();
      const secondaryPrefResult = await testCollection.find({}, {
        readPreference: 'secondaryPreferred',
        maxStalenessSeconds: 90
      }).limit(5).toArray();
      console.log(`   - Read ${secondaryPrefResult.length} documents with secondary preference in ${Date.now() - secondaryPrefStart}ms`);

      // 4. Read from primary preferred (use secondary if primary unavailable)
      console.log('\n4. Reading with PRIMARY PREFERRED:');
      const primaryPrefStart = Date.now();
      const primaryPrefResult = await testCollection.find({}, {
        readPreference: 'primaryPreferred'
      }).limit(5).toArray();
      console.log(`   - Read ${primaryPrefResult.length} documents with primary preference in ${Date.now() - primaryPrefStart}ms`);

      // 5. Read from nearest (lowest latency)
      console.log('\n5. Reading from NEAREST:');
      const nearestStart = Date.now();
      const nearestResult = await testCollection.find({}, {
        readPreference: 'nearest'
      }).limit(5).toArray();
      console.log(`   - Read ${nearestResult.length} documents from nearest member in ${Date.now() - nearestStart}ms`);

      // 6. Tagged read preference (specific member characteristics)
      console.log('\n6. Reading with TAGGED preferences:');
      try {
        const taggedStart = Date.now();
        const taggedResult = await testCollection.find({}, {
          readPreference: 'secondary',
          readPreferenceTags: [{ region: 'us-east' }, { datacenter: 'primary' }] // Fallback tags
        }).limit(5).toArray();
        console.log(`   - Read ${taggedResult.length} documents with tagged preference in ${Date.now() - taggedStart}ms`);
      } catch (error) {
        console.log('   - Tagged read preference not available (members may not have matching tags)');
      }

      return {
        primaryLatency: Date.now() - primaryStart,
        secondaryLatency: Date.now() - secondaryStart,
        readPreferencesSuccessful: true
      };

    } catch (error) {
      console.error('Read preference demonstration failed:', error);
      return {
        readPreferencesSuccessful: false,
        error: error.message
      };
    }
  }

  async demonstrateWriteConcerns() {
    console.log('Demonstrating various write concerns for data durability...');

    const testCollection = this.db.collection('write_concern_tests');

    try {
      // 1. Default write concern (w: 1)
      console.log('\n1. Testing default write concern (w: 1):');
      const defaultStart = Date.now();
      const defaultResult = await testCollection.insertOne({
        testType: 'default_write_concern',
        timestamp: new Date(),
        data: 'Testing default write concern'
      });
      console.log(`   - Default write completed in ${Date.now() - defaultStart}ms`);

      // 2. Majority write concern (w: 'majority')
      console.log('\n2. Testing majority write concern:');
      const majorityStart = Date.now();
      const majorityResult = await testCollection.insertOne({
        testType: 'majority_write_concern',
        timestamp: new Date(),
        data: 'Testing majority write concern for high durability'
      }, {
        writeConcern: { 
          w: 'majority', 
          j: true, // Wait for journal acknowledgment
          wtimeout: 5000 
        }
      });
      console.log(`   - Majority write completed in ${Date.now() - majorityStart}ms`);

      // 3. Specific member count write concern
      console.log('\n3. Testing specific member count write concern (w: 2):');
      const specificStart = Date.now();
      try {
        const specificResult = await testCollection.insertOne({
          testType: 'specific_count_write_concern',
          timestamp: new Date(),
          data: 'Testing specific member count write concern'
        }, {
          writeConcern: { 
            w: 2, 
            j: true,
            wtimeout: 5000 
          }
        });
        console.log(`   - Specific count write completed in ${Date.now() - specificStart}ms`);
      } catch (error) {
        console.log(`   - Specific count write failed (may not have enough members): ${error.message}`);
      }

      // 4. Tagged write concern
      console.log('\n4. Testing tagged write concern:');
      const taggedStart = Date.now();
      try {
        const taggedResult = await testCollection.insertOne({
          testType: 'tagged_write_concern',
          timestamp: new Date(),
          data: 'Testing tagged write concern'
        }, {
          writeConcern: { 
            w: { region: 'us-east' }, // Write must be acknowledged by members with this tag
            wtimeout: 5000 
          }
        });
        console.log(`   - Tagged write completed in ${Date.now() - taggedStart}ms`);
      } catch (error) {
        console.log(`   - Tagged write failed (members may not have matching tags): ${error.message}`);
      }

      // 5. Unacknowledged write concern (w: 0) - not recommended for production
      console.log('\n5. Testing unacknowledged write concern (fire-and-forget):');
      const unackedStart = Date.now();
      const unackedResult = await testCollection.insertOne({
        testType: 'unacknowledged_write_concern',
        timestamp: new Date(),
        data: 'Testing unacknowledged write concern (not recommended for production)'
      }, {
        writeConcern: { w: 0 }
      });
      console.log(`   - Unacknowledged write completed in ${Date.now() - unackedStart}ms`);

      // Cleanup test documents
      await testCollection.deleteMany({ testType: { $regex: /_write_concern$/ } });

      return {
        allWriteConcernsSuccessful: true,
        performanceMetrics: {
          defaultLatency: Date.now() - defaultStart,
          majorityLatency: Date.now() - majorityStart
        }
      };

    } catch (error) {
      console.error('Write concern demonstration failed:', error);
      return {
        allWriteConcernsSuccessful: false,
        error: error.message
      };
    }
  }

  async getHealthSummary() {
    return {
      lastHealthCheck: this.healthMetrics.lastCheck,
      replicaSetName: this.healthMetrics.replicaSetStatus?.setName,
      totalMembers: this.healthMetrics.memberHealth?.length || 0,
      healthyMembers: this.healthMetrics.memberHealth?.filter(m => m.health === 1).length || 0,
      primaryAvailable: !!this.healthMetrics.replicaSetStatus?.primary,
      maxReplicationLag: Math.max(...this.healthMetrics.memberHealth?.map(m => m.lag) || [0]),
      recentAlertCount: this.healthMetrics.alerts?.filter(a => 
        a.timestamp > new Date(Date.now() - 300000)
      ).length || 0,
      isHealthy: this.isReplicaSetHealthy()
    };
  }

  isReplicaSetHealthy() {
    if (!this.healthMetrics.replicaSetStatus) return false;

    const hasHealthyPrimary = !!this.healthMetrics.replicaSetStatus.primary;
    const allMembersHealthy = this.healthMetrics.memberHealth?.every(m => m.health === 1) || false;
    const lowReplicationLag = Math.max(...this.healthMetrics.memberHealth?.map(m => m.lag) || [0]) < 30;
    const noRecentCriticalAlerts = !this.healthMetrics.alerts?.some(a => 
      a.level === 'CRITICAL' && a.timestamp > new Date(Date.now() - 300000)
    );

    return hasHealthyPrimary && allMembersHealthy && lowReplicationLag && noRecentCriticalAlerts;
  }

  async shutdown() {
    console.log('Shutting down replica set monitoring...');

    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
    }

    console.log('Replica set monitoring shutdown complete');
  }
}

// Advanced High Availability Operations
class HighAvailabilityOperations {
  constructor(client, db) {
    this.client = client;
    this.db = db;
    this.admin = client.db('admin');
  }

  async demonstrateFailoverScenarios() {
    console.log('Demonstrating automatic failover capabilities...');

    try {
      // Get initial replica set status
      const initialStatus = await this.admin.command({ replSetGetStatus: 1 });
      const currentPrimary = initialStatus.members.find(member => member.state === 1);

      console.log(`Current primary: ${currentPrimary?.name || 'Unknown'}`);

      // Simulate read/write operations during potential failover
      const operationsPromises = [];

      // Start continuous read operations
      operationsPromises.push(this.performContinuousReads());

      // Start continuous write operations
      operationsPromises.push(this.performContinuousWrites());

      // Monitor replica set status changes
      operationsPromises.push(this.monitorFailoverEvents());

      // Run operations for 60 seconds to demonstrate resilience
      console.log('Running continuous operations to test high availability...');
      await new Promise(resolve => setTimeout(resolve, 60000));

      console.log('High availability demonstration completed');

    } catch (error) {
      console.error('Failover demonstration failed:', error);
    }
  }

  async performContinuousReads() {
    const testCollection = this.db.collection('financial_transactions');
    let readCount = 0;
    let errorCount = 0;

    const readInterval = setInterval(async () => {
      try {
        // Perform read with secondaryPreferred to demonstrate load balancing
        await testCollection.find({}, {
          readPreference: 'secondaryPreferred',
          maxStalenessSeconds: 90
        }).limit(10).toArray();

        readCount++;

        if (readCount % 10 === 0) {
          console.log(`Continuous reads: ${readCount} successful, ${errorCount} errors`);
        }

      } catch (error) {
        errorCount++;
        console.warn(`Read operation failed: ${error.message}`);
      }
    }, 2000); // Read every 2 seconds

    // Stop after 60 seconds
    setTimeout(() => {
      clearInterval(readInterval);
      console.log(`Final read stats: ${readCount} successful, ${errorCount} errors`);
    }, 60000);
  }

  async performContinuousWrites() {
    const testCollection = this.db.collection('ha_test_operations');
    let writeCount = 0;
    let errorCount = 0;

    const writeInterval = setInterval(async () => {
      try {
        // Perform write with majority write concern for durability
        await testCollection.insertOne({
          operationType: 'ha_test_write',
          timestamp: new Date(),
          counter: writeCount + 1,
          testData: `Continuous write operation ${writeCount + 1}`
        }, {
          writeConcern: { w: 'majority', j: true, wtimeout: 5000 }
        });

        writeCount++;

        if (writeCount % 5 === 0) {
          console.log(`Continuous writes: ${writeCount} successful, ${errorCount} errors`);
        }

      } catch (error) {
        errorCount++;
        console.warn(`Write operation failed: ${error.message}`);

        // Implement exponential backoff on write failures
        await new Promise(resolve => setTimeout(resolve, Math.min(1000 * Math.pow(2, errorCount), 10000)));
      }
    }, 5000); // Write every 5 seconds

    // Stop after 60 seconds
    setTimeout(() => {
      clearInterval(writeInterval);
      console.log(`Final write stats: ${writeCount} successful, ${errorCount} errors`);

      // Cleanup test documents
      testCollection.deleteMany({ operationType: 'ha_test_write' }).catch(console.warn);
    }, 60000);
  }

  async monitorFailoverEvents() {
    let lastPrimaryName = null;

    const monitorInterval = setInterval(async () => {
      try {
        const status = await this.admin.command({ replSetGetStatus: 1 });
        const currentPrimary = status.members.find(member => member.state === 1);
        const currentPrimaryName = currentPrimary?.name;

        if (lastPrimaryName && lastPrimaryName !== currentPrimaryName) {
          console.log(`🔄 FAILOVER DETECTED: Primary changed from ${lastPrimaryName} to ${currentPrimaryName || 'NONE'}`);

          // Log failover event
          await this.logFailoverEvent(lastPrimaryName, currentPrimaryName);
        }

        lastPrimaryName = currentPrimaryName;

      } catch (error) {
        console.warn('Failed to monitor replica set status:', error.message);
      }
    }, 5000); // Check every 5 seconds

    // Stop monitoring after 60 seconds
    setTimeout(() => {
      clearInterval(monitorInterval);
    }, 60000);
  }

  async logFailoverEvent(oldPrimary, newPrimary) {
    try {
      const eventsCollection = this.db.collection('failover_events');

      await eventsCollection.insertOne({
        eventType: 'primary_failover',
        timestamp: new Date(),
        oldPrimary: oldPrimary,
        newPrimary: newPrimary,
        detectedBy: 'ha_operations_monitor',
        environment: process.env.NODE_ENV || 'development'
      });

      console.log('Failover event logged to database');

    } catch (error) {
      console.warn('Failed to log failover event:', error);
    }
  }

  async performDataConsistencyCheck() {
    console.log('Performing data consistency check across replica set members...');

    try {
      const testCollection = this.db.collection('consistency_test');

      // Insert test document with majority write concern
      const testDoc = {
        consistencyTestId: new ObjectId(),
        timestamp: new Date(),
        testData: 'Data consistency verification document',
        checksum: 'test-checksum-12345'
      };

      const insertResult = await testCollection.insertOne(testDoc, {
        writeConcern: { w: 'majority', j: true, wtimeout: 10000 }
      });

      console.log(`Test document inserted with ID: ${insertResult.insertedId}`);

      // Wait a moment for replication
      await new Promise(resolve => setTimeout(resolve, 2000));

      // Read from primary
      const primaryRead = await testCollection.findOne(
        { _id: insertResult.insertedId },
        { readPreference: 'primary' }
      );

      // Read from secondary (with retry logic)
      let secondaryRead = null;
      let retryCount = 0;
      const maxRetries = 5;

      while (!secondaryRead && retryCount < maxRetries) {
        try {
          secondaryRead = await testCollection.findOne(
            { _id: insertResult.insertedId },
            { 
              readPreference: 'secondary',
              maxStalenessSeconds: 120 
            }
          );

          if (!secondaryRead) {
            retryCount++;
            console.log(`Retry ${retryCount}: Secondary read returned null, waiting for replication...`);
            await new Promise(resolve => setTimeout(resolve, 1000));
          }
        } catch (error) {
          retryCount++;
          console.warn(`Secondary read attempt ${retryCount} failed: ${error.message}`);
          if (retryCount < maxRetries) {
            await new Promise(resolve => setTimeout(resolve, 1000));
          }
        }
      }

      // Compare results
      const consistent = primaryRead && secondaryRead && 
        primaryRead.checksum === secondaryRead.checksum &&
        primaryRead.testData === secondaryRead.testData;

      console.log(`Data consistency check: ${consistent ? 'PASSED' : 'FAILED'}`);
      console.log(`Primary read successful: ${!!primaryRead}`);
      console.log(`Secondary read successful: ${!!secondaryRead}`);

      if (consistent) {
        console.log('✅ Data is consistent across replica set members');
      } else {
        console.warn('⚠️  Data inconsistency detected');
        console.log('Primary data:', primaryRead);
        console.log('Secondary data:', secondaryRead);
      }

      // Cleanup
      await testCollection.deleteOne({ _id: insertResult.insertedId });

      return {
        consistent,
        primaryReadSuccessful: !!primaryRead,
        secondaryReadSuccessful: !!secondaryRead,
        retryCount
      };

    } catch (error) {
      console.error('Data consistency check failed:', error);
      return {
        consistent: false,
        error: error.message
      };
    }
  }
}

// Example usage and demonstration
const demonstrateReplicaSetCapabilities = async () => {
  console.log('Starting MongoDB Replica Set demonstration...\n');

  try {
    // Setup enterprise collections
    const collections = await setupEnterpriseCollections();
    console.log('✅ Enterprise collections created\n');

    // Initialize replica set management
    const rsManager = new ReplicaSetManager(client, db);
    await rsManager.initializeReplicaSetMonitoring();
    console.log('✅ Replica set monitoring initialized\n');

    // Get replica set configuration and status
    const config = await rsManager.getReplicaSetConfig();
    const status = await rsManager.getReplicaSetStatus();

    console.log('📊 Replica Set Status:');
    console.log(`   Set Name: ${status.setName}`);
    console.log(`   Primary: ${status.primary?.name || 'None'}`);
    console.log(`   Total Members: ${status.members.length}`);
    console.log(`   Healthy Members: ${status.members.filter(m => m.health === 1).length}\n`);

    // Demonstrate read preferences
    await rsManager.demonstrateReadPreferences();
    console.log('✅ Read preferences demonstrated\n');

    // Demonstrate write concerns
    await rsManager.demonstrateWriteConcerns();
    console.log('✅ Write concerns demonstrated\n');

    // Test read/write operations
    const rwTest = await rsManager.performReadWriteOperations();
    console.log('✅ Read/write operations tested:', rwTest);
    console.log();

    // High availability operations
    const haOps = new HighAvailabilityOperations(client, db);

    // Perform data consistency check
    const consistencyResult = await haOps.performDataConsistencyCheck();
    console.log('✅ Data consistency checked:', consistencyResult);
    console.log();

    // Get final health summary
    const healthSummary = await rsManager.getHealthSummary();
    console.log('📋 Final Health Summary:', healthSummary);

    // Cleanup
    await rsManager.shutdown();
    console.log('\n🏁 Replica Set demonstration completed successfully');

  } catch (error) {
    console.error('❌ Demonstration failed:', error);
  }
};

// Export for use in applications
module.exports = {
  setupEnterpriseCollections,
  ReplicaSetManager,
  HighAvailabilityOperations,
  demonstrateReplicaSetCapabilities
};

// Benefits of MongoDB Replica Sets:
// - Automatic failover with no application intervention required
// - Built-in data redundancy across multiple servers and data centers
// - Configurable read preferences for performance optimization
// - Strong consistency guarantees with majority write concerns
// - Rolling upgrades and maintenance without downtime
// - Geographic distribution for disaster recovery
// - Automatic recovery from network partitions and server failures
// - Real-time replication with minimal lag
// - Integration with MongoDB Atlas for managed high availability
// - Comprehensive monitoring and alerting capabilities

Understanding MongoDB Replica Set Architecture

Replica Set Configuration Patterns

MongoDB Replica Sets provide several deployment patterns for different availability and performance requirements:

// Advanced replica set configuration patterns for enterprise deployments
class EnterpriseReplicaSetArchitecture {
  constructor(client) {
    this.client = client;
    this.admin = client.db('admin');
    this.architecturePatterns = new Map();
  }

  async setupProductionArchitecture() {
    console.log('Setting up enterprise production replica set architecture...');

    // Pattern 1: Standard 3-Member Production Setup
    const standardProductionConfig = {
      _id: "production-rs",
      version: 1,
      members: [
        {
          _id: 0,
          host: "db-primary-01.company.com:27017",
          priority: 10, // Higher priority = preferred primary
          votes: 1,
          buildIndexes: true,
          tags: { 
            region: "us-east-1", 
            datacenter: "primary",
            nodeType: "standard",
            environment: "production"
          }
        },
        {
          _id: 1,
          host: "db-secondary-01.company.com:27017", 
          priority: 5,
          votes: 1,
          buildIndexes: true,
          tags: { 
            region: "us-east-1", 
            datacenter: "primary",
            nodeType: "standard",
            environment: "production"
          }
        },
        {
          _id: 2,
          host: "db-secondary-02.company.com:27017",
          priority: 5,
          votes: 1, 
          buildIndexes: true,
          tags: { 
            region: "us-west-2", 
            datacenter: "secondary",
            nodeType: "standard", 
            environment: "production"
          }
        }
      ],
      settings: {
        heartbeatIntervalMillis: 2000, // 2 second heartbeat
        heartbeatTimeoutSecs: 10,      // 10 second timeout
        electionTimeoutMillis: 10000,  // 10 second election timeout
        chainingAllowed: true,         // Allow secondary-to-secondary replication
        getLastErrorModes: {
          "datacenterMajority": { "datacenter": 2 }, // Require writes to both datacenters
          "regionMajority": { "region": 2 }          // Require writes to both regions
        }
      }
    };

    // Pattern 2: 5-Member High Availability with Arbiter
    const highAvailabilityConfig = {
      _id: "ha-production-rs",
      version: 1,
      members: [
        // Primary datacenter members
        {
          _id: 0,
          host: "db-primary-01.company.com:27017",
          priority: 10,
          votes: 1,
          tags: { region: "us-east-1", datacenter: "primary" }
        },
        {
          _id: 1, 
          host: "db-secondary-01.company.com:27017",
          priority: 8,
          votes: 1,
          tags: { region: "us-east-1", datacenter: "primary" }
        },

        // Disaster recovery datacenter members
        {
          _id: 2,
          host: "db-dr-01.company.com:27017",
          priority: 2, // Lower priority for DR
          votes: 1,
          tags: { region: "us-west-2", datacenter: "dr" }
        },
        {
          _id: 3,
          host: "db-dr-02.company.com:27017", 
          priority: 1,
          votes: 1,
          tags: { region: "us-west-2", datacenter: "dr" }
        },

        // Arbiter for odd number of votes (lightweight)
        {
          _id: 4,
          host: "db-arbiter-01.company.com:27017",
          arbiterOnly: true,
          votes: 1,
          tags: { region: "us-central-1", datacenter: "arbiter" }
        }
      ],
      settings: {
        getLastErrorModes: {
          "crossDatacenter": { "datacenter": 2 },
          "disasterRecovery": { "datacenter": 1, "region": 2 }
        }
      }
    };

    // Pattern 3: Analytics-Optimized with Hidden Members
    const analyticsOptimizedConfig = {
      _id: "analytics-rs", 
      version: 1,
      members: [
        // Production data serving members
        {
          _id: 0,
          host: "db-primary-01.company.com:27017",
          priority: 10,
          votes: 1,
          tags: { workload: "transactional", region: "us-east-1" }
        },
        {
          _id: 1,
          host: "db-secondary-01.company.com:27017",
          priority: 5,
          votes: 1,
          tags: { workload: "transactional", region: "us-east-1" }
        },

        // Hidden analytics members (never become primary)
        {
          _id: 2,
          host: "db-analytics-01.company.com:27017",
          priority: 0,    // Cannot become primary
          votes: 0,       // Does not participate in elections
          hidden: true,   // Hidden from application discovery
          buildIndexes: true,
          tags: { 
            workload: "analytics", 
            region: "us-east-1",
            purpose: "reporting" 
          }
        },
        {
          _id: 3,
          host: "db-analytics-02.company.com:27017",
          priority: 0,
          votes: 0, 
          hidden: true,
          buildIndexes: true,
          tags: { 
            workload: "analytics",
            region: "us-east-1", 
            purpose: "etl"
          }
        },

        // Delayed member for disaster recovery
        {
          _id: 4,
          host: "db-delayed-01.company.com:27017",
          priority: 0,
          votes: 0,
          hidden: true,
          buildIndexes: true,
          secondaryDelaySecs: 3600, // 1 hour delay
          tags: { 
            workload: "recovery",
            region: "us-west-2",
            purpose: "delayed_backup"
          }
        }
      ]
    };

    // Store configurations for reference
    this.architecturePatterns.set('standard-production', standardProductionConfig);
    this.architecturePatterns.set('high-availability', highAvailabilityConfig);
    this.architecturePatterns.set('analytics-optimized', analyticsOptimizedConfig);

    console.log('Enterprise replica set architectures configured');
    return this.architecturePatterns;
  }

  async implementReadPreferenceStrategies() {
    console.log('Implementing enterprise read preference strategies...');

    // Strategy 1: Load balancing with geographic preference
    const geographicLoadBalancing = {
      // Primary application reads from nearest secondary
      applicationReads: {
        readPreference: 'secondaryPreferred',
        readPreferenceTags: [
          { region: 'us-east-1', datacenter: 'primary' }, // Prefer local datacenter
          { region: 'us-east-1' },                       // Fallback to region
          {}                                             // Final fallback to any
        ],
        maxStalenessSeconds: 60
      },

      // Analytics reads from dedicated hidden members
      analyticsReads: {
        readPreference: 'secondary',
        readPreferenceTags: [
          { workload: 'analytics', purpose: 'reporting' },
          { workload: 'analytics' }
        ],
        maxStalenessSeconds: 300 // 5 minutes acceptable for analytics
      },

      // Real-time dashboard reads (require fresh data)
      dashboardReads: {
        readPreference: 'primaryPreferred',
        maxStalenessSeconds: 30
      },

      // ETL and batch processing reads
      etlReads: {
        readPreference: 'secondary',
        readPreferenceTags: [
          { purpose: 'etl' },
          { workload: 'analytics' }
        ],
        maxStalenessSeconds: 600 // 10 minutes acceptable for ETL
      }
    };

    // Strategy 2: Write concern patterns for different operations
    const writeConcernStrategies = {
      // Critical financial transactions
      criticalWrites: {
        writeConcern: { 
          w: 'datacenterMajority',  // Custom write concern
          j: true,
          wtimeout: 10000
        },
        description: 'Ensures writes to multiple datacenters'
      },

      // Standard application writes
      standardWrites: {
        writeConcern: {
          w: 'majority',
          j: true, 
          wtimeout: 5000
        },
        description: 'Balances durability and performance'
      },

      // High-volume logging writes
      loggingWrites: {
        writeConcern: {
          w: 1,
          j: false,
          wtimeout: 1000
        },
        description: 'Optimized for throughput'
      },

      // Audit trail writes (maximum durability)
      auditWrites: {
        writeConcern: {
          w: 'regionMajority', // Custom write concern
          j: true,
          wtimeout: 15000
        },
        description: 'Ensures geographic distribution'
      }
    };

    return {
      readPreferences: geographicLoadBalancing,
      writeConcerns: writeConcernStrategies
    };
  }

  async setupMonitoringAndAlerting() {
    console.log('Setting up comprehensive replica set monitoring...');

    const monitoringMetrics = {
      // Replication lag monitoring
      replicationLag: {
        warning: 10,   // seconds
        critical: 30,  // seconds
        query: 'db.runCommand({replSetGetStatus: 1})'
      },

      // Member health monitoring
      memberHealth: {
        checkInterval: 30, // seconds
        alertThreshold: 2, // consecutive failures
        metrics: ['health', 'state', 'uptime', 'lastHeartbeat']
      },

      // Oplog monitoring
      oplogUtilization: {
        warning: 75,   // percent
        critical: 90,  // percent
        retentionTarget: 24 // hours
      },

      // Connection monitoring
      connectionMetrics: {
        maxConnections: 1000,
        warningThreshold: 800,
        monitorActiveConnections: true,
        trackConnectionSources: true
      },

      // Performance monitoring
      performanceMetrics: {
        slowQueryThreshold: 1000, // ms
        indexUsageTracking: true,
        collectionStatsMonitoring: true,
        operationProfiling: {
          enabled: true,
          slowMs: 100,
          sampleRate: 0.1 // 10% sampling
        }
      }
    };

    // Automated alert conditions
    const alertConditions = {
      criticalAlerts: [
        'No primary member available',
        'Majority of members down',
        'Replication lag > 30 seconds',
        'Oplog utilization > 90%'
      ],

      warningAlerts: [
        'Member health issues',
        'Replication lag > 10 seconds', 
        'High connection usage',
        'Slow query patterns detected'
      ],

      infoAlerts: [
        'Primary election occurred',
        'Member added/removed',
        'Configuration change',
        'Index build completed'
      ]
    };

    return {
      metrics: monitoringMetrics,
      alerts: alertConditions
    };
  }

  async performMaintenanceOperations() {
    console.log('Demonstrating maintenance operations...');

    try {
      // 1. Check replica set status before maintenance
      const preMaintenanceStatus = await this.admin.command({ replSetGetStatus: 1 });
      console.log('Pre-maintenance replica set status obtained');

      // 2. Demonstrate rolling maintenance (simulation)
      console.log('Simulating rolling maintenance procedures...');

      const maintenanceProcedures = {
        // Step-by-step rolling upgrade
        rollingUpgrade: [
          '1. Start with secondary members (lowest priority first)',
          '2. Stop MongoDB service on secondary',
          '3. Upgrade MongoDB binaries', 
          '4. Restart with new version',
          '5. Verify member rejoins and catches up',
          '6. Repeat for remaining secondaries',
          '7. Step down primary to make it secondary',
          '8. Upgrade former primary',
          '9. Allow automatic primary election'
        ],

        // Rolling configuration changes
        configurationUpdate: [
          '1. Update secondary member configurations',
          '2. Verify changes take effect',
          '3. Step down primary',
          '4. Update former primary configuration', 
          '5. Verify replica set health'
        ],

        // Index building strategy  
        indexMaintenance: [
          '1. Build indexes on secondaries first',
          '2. Use background: true for minimal impact',
          '3. Monitor replication lag during build',
          '4. Step down primary after secondary indexes complete',
          '5. Build index on former primary'
        ]
      };

      console.log('Rolling maintenance procedures defined:', Object.keys(maintenanceProcedures));

      // 3. Demonstrate graceful primary stepdown
      console.log('Demonstrating graceful primary stepdown...');

      try {
        // Check if we have a primary
        const currentPrimary = preMaintenanceStatus.members.find(m => m.state === 1);

        if (currentPrimary) {
          console.log(`Current primary: ${currentPrimary.name}`);

          // In a real scenario, you would step down the primary:
          // await this.admin.command({ replSetStepDown: 60 }); // Step down for 60 seconds

          console.log('Primary stepdown would be executed here (skipped in demo)');
        } else {
          console.log('No primary found - replica set may be in election');
        }

      } catch (error) {
        console.log('Primary stepdown simulation completed (expected in demo environment)');
      }

      // 4. Maintenance completion verification
      console.log('Verifying replica set health after maintenance...');

      const postMaintenanceChecks = {
        replicationLag: 'Check all members have acceptable lag',
        memberHealth: 'Verify all members are healthy',
        primaryElection: 'Confirm primary is elected and stable', 
        dataConsistency: 'Validate data consistency across members',
        applicationConnectivity: 'Test application reconnection',
        performanceBaseline: 'Confirm performance metrics are normal'
      };

      console.log('Post-maintenance verification checklist:', Object.keys(postMaintenanceChecks));

      return {
        maintenanceProcedures,
        preMaintenanceStatus: preMaintenanceStatus.members.length,
        postMaintenanceChecks: Object.keys(postMaintenanceChecks).length
      };

    } catch (error) {
      console.error('Maintenance operations demonstration failed:', error);
      throw error;
    }
  }

  async demonstrateDisasterRecovery() {
    console.log('Demonstrating disaster recovery capabilities...');

    const disasterRecoveryPlan = {
      // Scenario 1: Primary datacenter failure
      primaryDatacenterFailure: {
        detectionMethods: [
          'Automated health checks detect connectivity loss',
          'Application connection failures increase',
          'Monitoring systems report member unavailability'
        ],

        automaticResponse: [
          'Remaining members detect primary datacenter loss',
          'Automatic election occurs among surviving members',
          'New primary elected in secondary datacenter',
          'Applications automatically reconnect to new primary'
        ],

        recoverySteps: [
          'Verify new primary is stable and accepting writes',
          'Update DNS/load balancer if necessary', 
          'Monitor replication lag on remaining secondaries',
          'Plan primary datacenter restoration'
        ]
      },

      // Scenario 2: Network partition (split-brain prevention)
      networkPartition: {
        scenario: 'Network split isolates primary from majority of members',

        mongodbResponse: [
          'Primary detects loss of majority and steps down',
          'Primary becomes secondary (read-only)',
          'Majority partition elects new primary',
          'Split-brain scenario prevented by majority rule'
        ],

        resolution: [
          'Network partition heals automatically or manually',
          'Isolated member(s) rejoin replica set',
          'Data consistency maintained through oplog replay',
          'Normal operations resume'
        ]
      },

      // Scenario 3: Data corruption recovery
      dataCorruption: {
        detectionMethods: [
          'Checksum validation failures',
          'Application data integrity checks', 
          'MongoDB internal consistency checks'
        ],

        recoveryOptions: [
          'Restore from delayed secondary (if available)',
          'Point-in-time recovery from backup',
          'Partial data recovery and manual intervention',
          'Full replica set restoration from backup'
        ]
      }
    };

    // Demonstrate backup and recovery verification
    const backupVerification = await this.verifyBackupProcedures();

    return {
      disasterScenarios: Object.keys(disasterRecoveryPlan),
      backupVerification
    };
  }

  async verifyBackupProcedures() {
    console.log('Verifying backup and recovery procedures...');

    try {
      // Create a test collection for backup verification
      const backupTestCollection = this.client.db('backup_test').collection('test_data');

      // Insert test data
      await backupTestCollection.insertMany([
        { testId: 1, data: 'Backup verification data 1', timestamp: new Date() },
        { testId: 2, data: 'Backup verification data 2', timestamp: new Date() },
        { testId: 3, data: 'Backup verification data 3', timestamp: new Date() }
      ]);

      // Simulate backup verification steps
      const backupProcedures = {
        backupVerification: [
          'Verify mongodump/mongorestore functionality',
          'Test point-in-time recovery capabilities',
          'Validate backup file integrity',
          'Confirm backup storage accessibility'
        ],

        recoveryTesting: [
          'Restore backup to test environment',
          'Verify data completeness and integrity',
          'Test application connectivity to restored data',
          'Measure recovery time objectives (RTO)'
        ],

        continuousBackup: [
          'Oplog-based continuous backup',
          'Incremental backup strategies',
          'Cross-region backup replication', 
          'Automated backup validation'
        ]
      };

      // Read back test data to verify
      const verificationCount = await backupTestCollection.countDocuments();
      console.log(`Backup verification: ${verificationCount} test documents created`);

      // Cleanup
      await backupTestCollection.drop();

      return {
        backupProceduresVerified: Object.keys(backupProcedures).length,
        testDataVerified: verificationCount === 3
      };

    } catch (error) {
      console.warn('Backup verification failed:', error.message);
      return {
        backupProceduresVerified: 0,
        testDataVerified: false,
        error: error.message
      };
    }
  }

  getArchitectureRecommendations() {
    return {
      production: {
        minimumMembers: 3,
        recommendedMembers: 5,
        arbiterUsage: 'Only when even number of data-bearing members',
        geographicDistribution: 'Multiple datacenters recommended',
        hiddenMembers: 'Use for analytics and backup workloads'
      },

      performance: {
        readPreferences: 'Configure based on workload patterns',
        writeConcerns: 'Balance durability with performance requirements', 
        indexStrategy: 'Build on secondaries first during maintenance',
        connectionPooling: 'Configure appropriate pool sizes'
      },

      monitoring: {
        replicationLag: 'Monitor continuously with alerts',
        memberHealth: 'Automated health checking essential',
        oplogSize: 'Size for expected downtime windows',
        backupTesting: 'Regular backup and recovery testing'
      }
    };
  }
}

// Export the enterprise architecture class
module.exports = { EnterpriseReplicaSetArchitecture };

SQL-Style Replica Set Operations with QueryLeaf

QueryLeaf provides SQL-familiar syntax for MongoDB Replica Set configuration and monitoring:

-- QueryLeaf replica set operations with SQL-familiar syntax

-- Create and configure replica sets using SQL-style syntax
CREATE REPLICA SET production_rs WITH (
  members = [
    { 
      host = 'db-primary-01.company.com:27017',
      priority = 10,
      votes = 1,
      tags = { region = 'us-east-1', datacenter = 'primary' }
    },
    {
      host = 'db-secondary-01.company.com:27017', 
      priority = 5,
      votes = 1,
      tags = { region = 'us-east-1', datacenter = 'primary' }
    },
    {
      host = 'db-secondary-02.company.com:27017',
      priority = 5,
      votes = 1, 
      tags = { region = 'us-west-2', datacenter = 'secondary' }
    }
  ],
  settings = {
    heartbeat_interval = '2 seconds',
    election_timeout = '10 seconds',
    write_concern_modes = {
      datacenter_majority = { datacenter = 2 },
      cross_region = { region = 2 }
    }
  }
);

-- Monitor replica set health with SQL queries
SELECT 
  member_name,
  member_state,
  health_status,
  uptime_seconds,
  replication_lag_seconds,
  last_heartbeat,

  -- Health assessment
  CASE 
    WHEN health_status = 1 AND member_state = 'PRIMARY' THEN 'Healthy Primary'
    WHEN health_status = 1 AND member_state = 'SECONDARY' THEN 'Healthy Secondary'
    WHEN health_status = 0 THEN 'Unhealthy Member'
    ELSE 'Unknown Status'
  END as status_description,

  -- Performance indicators
  CASE
    WHEN replication_lag_seconds > 30 THEN 'High Lag'
    WHEN replication_lag_seconds > 10 THEN 'Moderate Lag'  
    ELSE 'Low Lag'
  END as lag_status,

  -- Connection quality
  CASE
    WHEN ping_ms < 10 THEN 'Excellent'
    WHEN ping_ms < 50 THEN 'Good'
    WHEN ping_ms < 100 THEN 'Fair'
    ELSE 'Poor'
  END as connection_quality

FROM replica_set_status('production_rs')
ORDER BY 
  CASE member_state
    WHEN 'PRIMARY' THEN 1
    WHEN 'SECONDARY' THEN 2
    ELSE 3
  END,
  member_name;

-- Advanced read preference configuration with SQL
SELECT 
  account_id,
  transaction_date,
  amount,
  transaction_type,
  status

FROM financial_transactions 
WHERE transaction_date >= CURRENT_DATE - INTERVAL '30 days'
  AND account_id = '12345'

-- Use read preference for load balancing
WITH READ_PREFERENCE = 'secondary_preferred'
WITH READ_PREFERENCE_TAGS = [
  { region = 'us-east-1', datacenter = 'primary' },
  { region = 'us-east-1' },
  { }  -- fallback to any available member
]
WITH MAX_STALENESS = '60 seconds'

ORDER BY transaction_date DESC
LIMIT 100;

-- Write operations with custom write concerns
INSERT INTO critical_financial_data (
  transaction_id,
  account_from,
  account_to, 
  amount,
  transaction_type,
  created_at
) VALUES (
  'TXN-2025-001234',
  'ACC-123456789',
  'ACC-987654321', 
  1500.00,
  'wire_transfer',
  CURRENT_TIMESTAMP
)
-- Ensure write to multiple datacenters for critical data
WITH WRITE_CONCERN = {
  w = 'datacenter_majority',
  journal = true,
  timeout = '10 seconds'
};

-- Comprehensive replica set analytics
WITH replica_set_metrics AS (
  SELECT 
    rs.replica_set_name,
    rs.member_name,
    rs.member_state,
    rs.health_status,
    rs.uptime_seconds,
    rs.replication_lag_seconds,
    rs.ping_ms,

    -- Time-based analysis
    DATE_TRUNC('hour', rs.check_timestamp) as hour_bucket,
    DATE_TRUNC('day', rs.check_timestamp) as day_bucket,

    -- Performance categorization
    CASE 
      WHEN rs.replication_lag_seconds <= 5 THEN 'excellent'
      WHEN rs.replication_lag_seconds <= 15 THEN 'good'
      WHEN rs.replication_lag_seconds <= 30 THEN 'fair'
      ELSE 'poor'
    END as replication_performance,

    CASE
      WHEN rs.ping_ms <= 10 THEN 'excellent'
      WHEN rs.ping_ms <= 25 THEN 'good'
      WHEN rs.ping_ms <= 50 THEN 'fair'
      ELSE 'poor'
    END as connection_performance

  FROM replica_set_health_history rs
  WHERE rs.check_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),
performance_summary AS (
  SELECT 
    replica_set_name,
    hour_bucket,

    -- Member availability
    COUNT(*) as total_checks,
    COUNT(*) FILTER (WHERE health_status = 1) as healthy_checks,
    ROUND((COUNT(*) FILTER (WHERE health_status = 1)::numeric / COUNT(*)) * 100, 2) as availability_percent,

    -- Primary stability
    COUNT(DISTINCT member_name) FILTER (WHERE member_state = 'PRIMARY') as primary_changes,

    -- Replication performance
    AVG(replication_lag_seconds) as avg_replication_lag,
    MAX(replication_lag_seconds) as max_replication_lag,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY replication_lag_seconds) as p95_replication_lag,

    -- Connection performance  
    AVG(ping_ms) as avg_ping_ms,
    MAX(ping_ms) as max_ping_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ping_ms) as p95_ping_ms,

    -- Performance distribution
    COUNT(*) FILTER (WHERE replication_performance = 'excellent') as excellent_replication_count,
    COUNT(*) FILTER (WHERE replication_performance = 'good') as good_replication_count,
    COUNT(*) FILTER (WHERE replication_performance = 'fair') as fair_replication_count,
    COUNT(*) FILTER (WHERE replication_performance = 'poor') as poor_replication_count,

    COUNT(*) FILTER (WHERE connection_performance = 'excellent') as excellent_connection_count,
    COUNT(*) FILTER (WHERE connection_performance = 'good') as good_connection_count,
    COUNT(*) FILTER (WHERE connection_performance = 'fair') as fair_connection_count,
    COUNT(*) FILTER (WHERE connection_performance = 'poor') as poor_connection_count

  FROM replica_set_metrics
  GROUP BY replica_set_name, hour_bucket
),
alerting_analysis AS (
  SELECT 
    ps.*,

    -- SLA compliance (99.9% availability target)
    CASE WHEN ps.availability_percent >= 99.9 THEN 'SLA_COMPLIANT' ELSE 'SLA_BREACH' END as sla_status,

    -- Performance alerts
    CASE 
      WHEN ps.avg_replication_lag > 30 THEN 'CRITICAL_LAG'
      WHEN ps.avg_replication_lag > 15 THEN 'WARNING_LAG' 
      ELSE 'NORMAL_LAG'
    END as lag_alert_level,

    CASE
      WHEN ps.primary_changes > 1 THEN 'UNSTABLE_PRIMARY'
      WHEN ps.primary_changes = 1 THEN 'PRIMARY_CHANGE'
      ELSE 'STABLE_PRIMARY'
    END as primary_stability,

    -- Recommendations
    CASE
      WHEN ps.availability_percent < 99.0 THEN 'Investigate member failures and network issues'
      WHEN ps.avg_replication_lag > 30 THEN 'Check network bandwidth and server performance'
      WHEN ps.primary_changes > 1 THEN 'Analyze primary election patterns and member priorities'
      WHEN ps.avg_ping_ms > 50 THEN 'Investigate network latency between members'
      ELSE 'Performance within acceptable parameters'
    END as recommendation

  FROM performance_summary ps
)
SELECT 
  aa.replica_set_name,
  TO_CHAR(aa.hour_bucket, 'YYYY-MM-DD HH24:00') as monitoring_hour,

  -- Availability metrics
  aa.total_checks,
  aa.availability_percent,
  aa.sla_status,

  -- Replication metrics
  ROUND(aa.avg_replication_lag::numeric, 2) as avg_lag_seconds,
  ROUND(aa.max_replication_lag::numeric, 2) as max_lag_seconds,
  ROUND(aa.p95_replication_lag::numeric, 2) as p95_lag_seconds,
  aa.lag_alert_level,

  -- Connection metrics
  ROUND(aa.avg_ping_ms::numeric, 1) as avg_ping_ms,
  ROUND(aa.max_ping_ms::numeric, 1) as max_ping_ms,

  -- Stability metrics
  aa.primary_changes,
  aa.primary_stability,

  -- Performance distribution
  CONCAT(
    'Excellent: ', aa.excellent_replication_count, 
    ', Good: ', aa.good_replication_count,
    ', Fair: ', aa.fair_replication_count,
    ', Poor: ', aa.poor_replication_count
  ) as replication_distribution,

  -- Operational insights
  aa.recommendation,

  -- Trend indicators
  LAG(aa.availability_percent) OVER (
    PARTITION BY aa.replica_set_name 
    ORDER BY aa.hour_bucket
  ) as prev_hour_availability,

  aa.availability_percent - LAG(aa.availability_percent) OVER (
    PARTITION BY aa.replica_set_name 
    ORDER BY aa.hour_bucket  
  ) as availability_trend

FROM alerting_analysis aa
ORDER BY aa.replica_set_name, aa.hour_bucket DESC;

-- Failover simulation and testing
CREATE PROCEDURE test_failover_scenario(
  replica_set_name VARCHAR(100),
  test_type VARCHAR(50) -- 'primary_stepdown', 'network_partition', 'member_failure'
) AS
BEGIN
  -- Record test start
  INSERT INTO failover_tests (
    replica_set_name,
    test_type,
    test_start_time,
    status
  ) VALUES (
    replica_set_name,
    test_type,
    CURRENT_TIMESTAMP,
    'running'
  );

  -- Execute test based on type
  CASE test_type
    WHEN 'primary_stepdown' THEN
      -- Gracefully step down current primary
      CALL replica_set_step_down(replica_set_name, 60); -- 60 second stepdown

    WHEN 'network_partition' THEN
      -- Simulate network partition (requires external orchestration)
      CALL simulate_network_partition(replica_set_name, '30 seconds');

    WHEN 'member_failure' THEN
      -- Simulate member failure (test environment only)
      CALL simulate_member_failure(replica_set_name, 'secondary', '60 seconds');
  END CASE;

  -- Monitor failover process
  CALL monitor_failover_recovery(replica_set_name);

  -- Record test completion
  UPDATE failover_tests 
  SET 
    test_end_time = CURRENT_TIMESTAMP,
    status = 'completed',
    recovery_time_seconds = EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - test_start_time))
  WHERE replica_set_name = replica_set_name 
    AND test_start_time = (
      SELECT MAX(test_start_time) 
      FROM failover_tests 
      WHERE replica_set_name = replica_set_name
    );
END;

-- Backup and recovery verification
WITH backup_verification AS (
  SELECT 
    backup_name,
    backup_timestamp,
    backup_size_gb,
    backup_type, -- 'full', 'incremental', 'oplog'

    -- Backup validation
    backup_integrity_check,
    restoration_test_status,

    -- Recovery metrics
    estimated_recovery_time_minutes,
    recovery_point_objective_minutes,
    recovery_time_objective_minutes,

    -- Geographic distribution
    backup_locations,
    cross_region_replication_status

  FROM backup_history
  WHERE backup_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND backup_type IN ('full', 'incremental')
),
recovery_readiness AS (
  SELECT 
    COUNT(*) as total_backups,
    COUNT(*) FILTER (WHERE backup_integrity_check = 'passed') as verified_backups,
    COUNT(*) FILTER (WHERE restoration_test_status = 'success') as tested_backups,

    AVG(estimated_recovery_time_minutes) as avg_recovery_time,
    MAX(estimated_recovery_time_minutes) as max_recovery_time,

    -- Compliance assessment
    CASE 
      WHEN COUNT(*) FILTER (WHERE restoration_test_status = 'success') >= 3 THEN 'compliant'
      WHEN COUNT(*) FILTER (WHERE restoration_test_status = 'success') >= 1 THEN 'warning'
      ELSE 'non_compliant'
    END as backup_testing_compliance,

    -- Geographic redundancy
    COUNT(DISTINCT backup_locations) as backup_site_count,

    -- Recommendations
    CASE
      WHEN COUNT(*) FILTER (WHERE restoration_test_status = 'success') = 0 THEN 
        'Schedule immediate backup restoration testing'
      WHEN AVG(estimated_recovery_time_minutes) > recovery_time_objective_minutes THEN
        'Optimize backup strategy to meet RTO requirements'
      WHEN COUNT(DISTINCT backup_locations) < 2 THEN
        'Implement geographic backup distribution'
      ELSE 'Backup and recovery strategy meets requirements'
    END as recommendation

  FROM backup_verification
)
SELECT 
  rr.total_backups,
  rr.verified_backups,
  rr.tested_backups,
  ROUND((rr.tested_backups::numeric / rr.total_backups) * 100, 1) as testing_coverage_percent,

  ROUND(rr.avg_recovery_time, 1) as avg_recovery_time_minutes,
  ROUND(rr.max_recovery_time, 1) as max_recovery_time_minutes,

  rr.backup_testing_compliance,
  rr.backup_site_count,
  rr.recommendation

FROM recovery_readiness rr;

-- QueryLeaf provides comprehensive replica set capabilities:
-- 1. SQL-familiar replica set configuration and management syntax
-- 2. Advanced monitoring and alerting with SQL aggregation functions
-- 3. Read preference and write concern configuration using SQL expressions
-- 4. Comprehensive health analytics with time-series analysis
-- 5. Automated failover testing and recovery verification procedures
-- 6. Backup and recovery management with compliance tracking
-- 7. Performance optimization recommendations based on SQL analytics
-- 8. Integration with existing SQL-based monitoring and reporting systems
-- 9. Geographic distribution and disaster recovery planning with SQL queries
-- 10. Enterprise-grade high availability management using familiar SQL patterns

Best Practices for Replica Set Implementation

Production Deployment Strategies

Essential practices for enterprise MongoDB Replica Set deployments:

Member Configuration: Deploy odd numbers of voting members to prevent election ties
Geographic Distribution: Distribute members across multiple data centers for disaster recovery
Priority Settings: Configure member priorities to control primary election preferences
Hidden Members: Use hidden members for analytics workloads without affecting elections
Arbiter Usage: Deploy arbiters only when necessary to maintain odd voting member counts
Write Concerns: Configure appropriate write concerns for data durability requirements

Performance and Monitoring

Optimize Replica Sets for high-performance, production environments:

Read Preferences: Configure read preferences to distribute load and optimize performance
Replication Lag: Monitor replication lag continuously with automated alerting
Oplog Sizing: Size oplog appropriately for expected maintenance windows and downtime
Connection Pooling: Configure connection pools for optimal resource utilization
Index Building: Build indexes on secondaries first during maintenance windows
Health Monitoring: Implement comprehensive health checking and automated recovery

Conclusion

MongoDB Replica Sets provide comprehensive high availability and data resilience that eliminates the complexity and operational overhead of traditional database replication solutions while ensuring automatic failover, data consistency, and geographic distribution. The native integration with MongoDB's distributed architecture, combined with configurable read preferences and write concerns, makes building highly available applications both powerful and operationally simple.

Key Replica Set benefits include:

Automatic Failover: Seamless primary election and failover without manual intervention
Data Redundancy: Built-in data replication across multiple servers and geographic regions
Read Scaling: Configurable read preferences for optimal performance and load distribution
Strong Consistency: Majority write concerns ensure data durability and consistency
Zero-Downtime Maintenance: Rolling upgrades and maintenance without service interruption
Geographic Distribution: Cross-region deployment for disaster recovery and compliance

Whether you're building financial systems, e-commerce platforms, healthcare applications, or any mission-critical system requiring high availability, MongoDB Replica Sets with QueryLeaf's familiar SQL interface provides the foundation for enterprise-grade data resilience. This combination enables you to implement sophisticated high availability architectures while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Replica Set operations while providing SQL-familiar configuration syntax, monitoring queries, and health analytics functions. Advanced replica set management, read preference configuration, and failover testing are seamlessly handled through familiar SQL patterns, making high availability database management both powerful and accessible.

The integration of native high availability capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both enterprise-grade resilience and familiar database interaction patterns, ensuring your high availability solutions remain both effective and maintainable as they scale and evolve.

November 9, 2025
28 min read

MongoDB Aggregation Framework: Advanced Analytics and Real-Time Data Transformations for Enterprise Applications

Modern enterprise applications require sophisticated data processing capabilities that can handle complex transformations, real-time analytics, and multi-stage data aggregations with high performance and scalability. Traditional database approaches often struggle with complex analytical queries, requiring expensive joins, subqueries, and multiple round trips that create performance bottlenecks and operational complexity in production environments.

MongoDB's Aggregation Framework provides comprehensive data processing pipelines that enable sophisticated analytics, transformations, and real-time computations within the database itself. Unlike traditional SQL approaches that require complex joins and expensive operations, MongoDB's aggregation pipelines deliver optimized, single-pass data processing with automatic query optimization, distributed processing capabilities, and native support for complex document transformations.

The Traditional Analytics Challenge

Conventional relational database approaches to complex analytics face significant performance and scalability limitations:

-- Traditional PostgreSQL analytics - complex joins and expensive operations

-- Multi-table sales analytics with complex aggregations
WITH customer_segments AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        c.email,
        c.registration_date,
        c.country,
        c.state,

        -- Customer segmentation logic
        CASE 
            WHEN c.registration_date >= CURRENT_DATE - INTERVAL '90 days' THEN 'new_customer'
            WHEN c.last_order_date >= CURRENT_DATE - INTERVAL '30 days' THEN 'active_customer'
            WHEN c.last_order_date >= CURRENT_DATE - INTERVAL '180 days' THEN 'returning_customer'
            ELSE 'dormant_customer'
        END as customer_segment,

        -- Calculate customer lifetime metrics
        c.total_orders,
        c.total_spent,
        c.average_order_value,
        c.last_order_date,

        -- Geographic classification
        CASE 
            WHEN c.country = 'US' THEN 'domestic'
            WHEN c.country IN ('CA', 'MX') THEN 'north_america'
            WHEN c.country IN ('GB', 'DE', 'FR', 'IT', 'ES') THEN 'europe'
            ELSE 'international'
        END as geographic_segment

    FROM customers c
    WHERE c.is_active = true
),

order_analytics AS (
    SELECT 
        o.order_id,
        o.customer_id,
        o.order_date,
        o.order_status,
        o.total_amount,
        o.discount_amount,
        o.tax_amount,
        o.shipping_amount,

        -- Time-based analytics
        DATE_TRUNC('month', o.order_date) as order_month,
        DATE_TRUNC('quarter', o.order_date) as order_quarter,
        DATE_TRUNC('year', o.order_date) as order_year,
        EXTRACT(dow FROM o.order_date) as day_of_week,
        EXTRACT(hour FROM o.order_date) as hour_of_day,

        -- Order categorization
        CASE 
            WHEN o.total_amount >= 1000 THEN 'high_value'
            WHEN o.total_amount >= 500 THEN 'medium_value'
            WHEN o.total_amount >= 100 THEN 'low_value'
            ELSE 'micro_transaction'
        END as order_value_segment,

        -- Seasonal analysis
        CASE 
            WHEN EXTRACT(month FROM o.order_date) IN (12, 1, 2) THEN 'winter'
            WHEN EXTRACT(month FROM o.order_date) IN (3, 4, 5) THEN 'spring'
            WHEN EXTRACT(month FROM o.order_date) IN (6, 7, 8) THEN 'summer'
            ELSE 'fall'
        END as season,

        -- Payment method analysis
        o.payment_method,
        o.payment_processor,

        -- Fulfillment metrics
        o.shipping_method,
        o.warehouse_id,
        EXTRACT(EPOCH FROM (o.shipped_date - o.order_date)) / 86400 as fulfillment_days

    FROM orders o
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '2 years'
      AND o.order_status IN ('completed', 'shipped', 'delivered')
),

product_analytics AS (
    SELECT 
        oi.order_id,
        oi.product_id,
        p.product_name,
        p.category,
        p.subcategory,
        p.brand,
        p.supplier_id,
        oi.quantity,
        oi.unit_price,
        oi.total_price,
        oi.discount_amount as item_discount,

        -- Product performance metrics
        p.cost_per_unit,
        (oi.unit_price - p.cost_per_unit) as unit_margin,
        (oi.unit_price - p.cost_per_unit) * oi.quantity as total_margin,

        -- Product categorization
        CASE 
            WHEN p.category = 'Electronics' THEN 'tech'
            WHEN p.category IN ('Clothing', 'Shoes', 'Accessories') THEN 'fashion'
            WHEN p.category IN ('Home', 'Garden', 'Furniture') THEN 'home'
            ELSE 'other'
        END as product_group,

        -- Inventory and supply chain
        p.current_stock,
        p.reorder_level,
        CASE 
            WHEN p.current_stock <= p.reorder_level THEN 'low_stock'
            WHEN p.current_stock <= p.reorder_level * 2 THEN 'medium_stock'
            ELSE 'high_stock'
        END as stock_status,

        -- Supplier performance
        s.supplier_name,
        s.supplier_rating,
        s.average_lead_time

    FROM order_items oi
    JOIN products p ON oi.product_id = p.product_id
    JOIN suppliers s ON p.supplier_id = s.supplier_id
    WHERE p.is_active = true
),

comprehensive_sales_analytics AS (
    SELECT 
        cs.customer_id,
        cs.customer_name,
        cs.customer_segment,
        cs.geographic_segment,

        oa.order_id,
        oa.order_date,
        oa.order_month,
        oa.order_quarter,
        oa.order_value_segment,
        oa.season,
        oa.payment_method,
        oa.shipping_method,
        oa.fulfillment_days,

        pa.product_id,
        pa.product_name,
        pa.category,
        pa.brand,
        pa.product_group,
        pa.quantity,
        pa.unit_price,
        pa.total_price,
        pa.total_margin,
        pa.stock_status,
        pa.supplier_name,

        -- Advanced calculations requiring window functions
        SUM(pa.total_price) OVER (
            PARTITION BY cs.customer_id, oa.order_month
        ) as customer_monthly_spend,

        AVG(pa.unit_price) OVER (
            PARTITION BY pa.category, oa.order_quarter
        ) as category_avg_price_quarterly,

        ROW_NUMBER() OVER (
            PARTITION BY cs.customer_id 
            ORDER BY oa.order_date DESC
        ) as customer_order_recency,

        RANK() OVER (
            PARTITION BY oa.order_month 
            ORDER BY pa.total_margin DESC
        ) as product_margin_rank_monthly,

        -- Complex aggregations with multiple groupings
        COUNT(*) OVER (
            PARTITION BY cs.geographic_segment, oa.season
        ) as segment_seasonal_order_count,

        SUM(pa.total_price) OVER (
            PARTITION BY pa.brand, oa.order_quarter
        ) as brand_quarterly_revenue

    FROM customer_segments cs
    JOIN order_analytics oa ON cs.customer_id = oa.customer_id
    JOIN product_analytics pa ON oa.order_id = pa.order_id
),

performance_metrics AS (
    SELECT 
        csa.*,

        -- Customer behavior analysis
        CASE 
            WHEN customer_order_recency <= 3 THEN 'frequent_buyer'
            WHEN customer_order_recency <= 10 THEN 'regular_buyer'
            ELSE 'occasional_buyer'
        END as buying_frequency,

        -- Product performance analysis
        CASE 
            WHEN product_margin_rank_monthly <= 10 THEN 'top_margin_product'
            WHEN product_margin_rank_monthly <= 50 THEN 'good_margin_product'
            ELSE 'low_margin_product'
        END as margin_performance,

        -- Market analysis
        ROUND(
            (customer_monthly_spend / NULLIF(segment_seasonal_order_count::DECIMAL, 0)) * 100, 
            2
        ) as customer_segment_contribution_pct,

        ROUND(
            (brand_quarterly_revenue / SUM(brand_quarterly_revenue) OVER ()) * 100,
            2
        ) as brand_market_share_pct

    FROM comprehensive_sales_analytics csa
)

SELECT 
    -- Dimensional attributes
    customer_segment,
    geographic_segment,
    order_quarter,
    season,
    product_group,
    category,
    brand,
    payment_method,
    shipping_method,

    -- Aggregated metrics
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(DISTINCT order_id) as total_orders,
    COUNT(DISTINCT product_id) as unique_products,

    -- Revenue metrics
    SUM(total_price) as total_revenue,
    AVG(total_price) as avg_order_value,
    SUM(total_margin) as total_margin,
    ROUND(AVG(total_margin), 2) as avg_margin_per_item,
    ROUND((SUM(total_margin) / SUM(total_price)) * 100, 1) as margin_percentage,

    -- Customer metrics
    AVG(customer_monthly_spend) as avg_customer_monthly_spend,
    COUNT(DISTINCT CASE WHEN buying_frequency = 'frequent_buyer' THEN customer_id END) as frequent_buyers,
    COUNT(DISTINCT CASE WHEN buying_frequency = 'regular_buyer' THEN customer_id END) as regular_buyers,

    -- Product performance
    COUNT(CASE WHEN margin_performance = 'top_margin_product' THEN 1 END) as top_margin_products,
    AVG(category_avg_price_quarterly) as avg_category_price,

    -- Operational metrics
    AVG(fulfillment_days) as avg_fulfillment_days,
    COUNT(CASE WHEN stock_status = 'low_stock' THEN 1 END) as low_stock_items,
    COUNT(DISTINCT supplier_name) as unique_suppliers,

    -- Time-based trends
    AVG(brand_market_share_pct) as avg_brand_market_share,
    ROUND(AVG(customer_segment_contribution_pct), 1) as avg_segment_contribution,

    -- Growth indicators (comparing to previous period)
    LAG(SUM(total_price)) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
    ) as prev_quarter_revenue,

    ROUND(
        ((SUM(total_price) - LAG(SUM(total_price)) OVER (
            PARTITION BY customer_segment, geographic_segment, product_group
            ORDER BY order_quarter
        )) / NULLIF(LAG(SUM(total_price)) OVER (
            PARTITION BY customer_segment, geographic_segment, product_group
            ORDER BY order_quarter
        ), 0)) * 100,
        1
    ) as revenue_growth_pct

FROM performance_metrics
GROUP BY 
    customer_segment, geographic_segment, order_quarter, season,
    product_group, category, brand, payment_method, shipping_method
HAVING 
    COUNT(DISTINCT customer_id) >= 10  -- Filter for statistical significance
    AND SUM(total_price) >= 1000       -- Minimum revenue threshold
ORDER BY 
    order_quarter DESC,
    total_revenue DESC,
    unique_customers DESC
LIMIT 1000;

-- Problems with traditional SQL analytics approach:
-- 1. Extremely complex query structure with multiple CTEs and window functions
-- 2. Expensive JOIN operations across multiple large tables
-- 3. Poor performance due to multiple aggregation passes
-- 4. Limited support for nested data structures and arrays
-- 5. Difficult to maintain and modify complex analytical logic
-- 6. Memory-intensive operations with large intermediate result sets
-- 7. No native support for document-based data transformations
-- 8. Complex indexing requirements for optimal performance
-- 9. Difficult real-time processing due to query complexity
-- 10. Limited horizontal scaling for large analytical workloads

-- MySQL analytical limitations (even more restrictive)
SELECT 
    c.customer_segment,
    DATE_FORMAT(o.order_date, '%Y-%m') as order_month,
    COUNT(DISTINCT o.customer_id) as customers,
    SUM(o.total_amount) as revenue,
    AVG(o.total_amount) as avg_order_value
FROM (
    SELECT 
        customer_id,
        CASE 
            WHEN registration_date >= DATE_SUB(NOW(), INTERVAL 90 DAY) THEN 'new'
            WHEN last_order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY) THEN 'active'  
            ELSE 'dormant'
        END as customer_segment
    FROM customers 
) c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY c.customer_segment, DATE_FORMAT(o.order_date, '%Y-%m')
ORDER BY order_month DESC, revenue DESC;

-- MySQL limitations for analytics:
-- - No window functions in older versions (pre-8.0)
-- - Limited CTE support 
-- - Poor JSON handling for complex nested data
-- - Basic aggregation functions only
-- - No advanced analytical functions
-- - Limited support for complex data transformations
-- - Poor performance with large analytical queries
-- - No native support for real-time streaming analytics

MongoDB Aggregation Framework provides powerful, optimized data processing pipelines:

// MongoDB Aggregation Framework - Comprehensive analytics and data transformation
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('enterprise_analytics');

// Comprehensive Enterprise Analytics with MongoDB Aggregation Framework
class AdvancedAnalyticsProcessor {
  constructor(db) {
    this.db = db;
    this.collections = {
      customers: db.collection('customers'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics_results'),
      realTimeMetrics: db.collection('real_time_metrics')
    };

    // Performance optimization settings
    this.aggregationOptions = {
      allowDiskUse: true,
      maxTimeMS: 300000, // 5 minutes timeout
      hint: null, // Will be set dynamically based on query
      explain: false,
      comment: 'enterprise_analytics_query'
    };

    this.setupAnalyticsIndexes();
  }

  async setupAnalyticsIndexes() {
    console.log('Setting up optimized indexes for analytics...');

    try {
      // Customer collection indexes
      await this.collections.customers.createIndexes([
        { key: { customerId: 1 }, background: true, name: 'customer_id_idx' },
        { key: { registrationDate: -1, customerSegment: 1 }, background: true, name: 'registration_segment_idx' },
        { key: { 'address.country': 1, 'address.state': 1 }, background: true, name: 'geographic_idx' },
        { key: { loyaltyTier: 1, totalSpent: -1 }, background: true, name: 'loyalty_spending_idx' },
        { key: { lastOrderDate: -1, isActive: 1 }, background: true, name: 'activity_idx' }
      ]);

      // Orders collection indexes
      await this.collections.orders.createIndexes([
        { key: { customerId: 1, orderDate: -1 }, background: true, name: 'customer_date_idx' },
        { key: { orderDate: -1, status: 1 }, background: true, name: 'date_status_idx' },
        { key: { 'financial.total': -1, orderDate: -1 }, background: true, name: 'value_date_idx' },
        { key: { 'items.productId': 1, orderDate: -1 }, background: true, name: 'product_date_idx' },
        { key: { 'shipping.region': 1, orderDate: -1 }, background: true, name: 'region_date_idx' }
      ]);

      // Products collection indexes  
      await this.collections.products.createIndexes([
        { key: { productId: 1 }, background: true, name: 'product_id_idx' },
        { key: { category: 1, subcategory: 1 }, background: true, name: 'category_idx' },
        { key: { brand: 1, 'pricing.currentPrice': -1 }, background: true, name: 'brand_price_idx' },
        { key: { 'inventory.currentStock': 1, 'inventory.reorderLevel': 1 }, background: true, name: 'inventory_idx' },
        { key: { supplierId: 1, isActive: 1 }, background: true, name: 'supplier_active_idx' }
      ]);

      console.log('Analytics indexes created successfully');

    } catch (error) {
      console.error('Error creating analytics indexes:', error);
    }
  }

  async performComprehensiveCustomerAnalytics(timeRange = 'last_12_months', customerSegments = null) {
    console.log(`Performing comprehensive customer analytics for ${timeRange}...`);

    const startTime = Date.now();

    // Calculate date range
    const dateRanges = {
      'last_30_days': new Date(Date.now() - 30 * 24 * 60 * 60 * 1000),
      'last_90_days': new Date(Date.now() - 90 * 24 * 60 * 60 * 1000),
      'last_6_months': new Date(Date.now() - 6 * 30 * 24 * 60 * 60 * 1000),
      'last_12_months': new Date(Date.now() - 12 * 30 * 24 * 60 * 60 * 1000),
      'last_2_years': new Date(Date.now() - 2 * 365 * 24 * 60 * 60 * 1000)
    };

    const startDate = dateRanges[timeRange] || dateRanges['last_12_months'];

    const pipeline = [
      // Stage 1: Match orders within time range
      {
        $match: {
          orderDate: { $gte: startDate },
          status: { $in: ['completed', 'shipped', 'delivered'] }
        }
      },

      // Stage 2: Lookup customer information
      {
        $lookup: {
          from: 'customers',
          localField: 'customerId',
          foreignField: 'customerId',
          as: 'customer'
        }
      },

      // Stage 3: Unwind customer array (should be single document)
      {
        $unwind: '$customer'
      },

      // Stage 4: Filter by customer segments if specified
      ...(customerSegments ? [{
        $match: {
          'customer.segment': { $in: customerSegments }
        }
      }] : []),

      // Stage 5: Lookup product information for each order item
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'productDetails'
        }
      },

      // Stage 6: Add comprehensive calculated fields
      {
        $addFields: {
          // Time-based dimensions
          orderMonth: {
            $dateTrunc: {
              date: '$orderDate',
              unit: 'month'
            }
          },
          orderQuarter: {
            $concat: [
              { $toString: { $year: '$orderDate' } },
              '-Q',
              { $toString: {
                $ceil: { $divide: [{ $month: '$orderDate' }, 3] }
              }}
            ]
          },
          orderYear: { $year: '$orderDate' },
          dayOfWeek: { $dayOfWeek: '$orderDate' },
          hourOfDay: { $hour: '$orderDate' },

          // Seasonal classification
          season: {
            $switch: {
              branches: [
                {
                  case: { $in: [{ $month: '$orderDate' }, [12, 1, 2]] },
                  then: 'winter'
                },
                {
                  case: { $in: [{ $month: '$orderDate' }, [3, 4, 5]] },
                  then: 'spring'
                },
                {
                  case: { $in: [{ $month: '$orderDate' }, [6, 7, 8]] },
                  then: 'summer'
                }
              ],
              default: 'fall'
            }
          },

          // Customer segmentation
          customerSegment: {
            $switch: {
              branches: [
                {
                  case: {
                    $gte: [
                      '$customer.registrationDate',
                      new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'new_customer'
                },
                {
                  case: {
                    $gte: [
                      '$customer.lastOrderDate',
                      new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'active_customer'
                },
                {
                  case: {
                    $gte: [
                      '$customer.lastOrderDate',
                      new Date(Date.now() - 180 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'returning_customer'
                }
              ],
              default: 'dormant_customer'
            }
          },

          // Geographic classification
          geographicSegment: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$customer.address.country', 'US'] },
                  then: 'domestic'
                },
                {
                  case: { $in: ['$customer.address.country', ['CA', 'MX']] },
                  then: 'north_america'
                },
                {
                  case: { $in: ['$customer.address.country', ['GB', 'DE', 'FR', 'IT', 'ES']] },
                  then: 'europe'
                }
              ],
              default: 'international'
            }
          },

          // Order value classification
          orderValueSegment: {
            $switch: {
              branches: [
                {
                  case: { $gte: ['$financial.total', 1000] },
                  then: 'high_value'
                },
                {
                  case: { $gte: ['$financial.total', 500] },
                  then: 'medium_value'
                },
                {
                  case: { $gte: ['$financial.total', 100] },
                  then: 'low_value'
                }
              ],
              default: 'micro_transaction'
            }
          },

          // Enhanced item analysis with product details
          enrichedItems: {
            $map: {
              input: '$items',
              as: 'item',
              in: {
                $mergeObjects: [
                  '$$item',
                  {
                    productDetails: {
                      $arrayElemAt: [
                        {
                          $filter: {
                            input: '$productDetails',
                            cond: { $eq: ['$$this.productId', '$$item.productId'] }
                          }
                        },
                        0
                      ]
                    }
                  },
                  {
                    // Calculate margins and performance metrics
                    unitMargin: {
                      $subtract: [
                        '$$item.unitPrice',
                        {
                          $arrayElemAt: [
                            {
                              $map: {
                                input: {
                                  $filter: {
                                    input: '$productDetails',
                                    cond: { $eq: ['$$this.productId', '$$item.productId'] }
                                  }
                                },
                                in: '$$this.costPerUnit'
                              }
                            },
                            0
                          ]
                        }
                      ]
                    },

                    categoryGroup: {
                      $let: {
                        vars: {
                          category: {
                            $arrayElemAt: [
                              {
                                $map: {
                                  input: {
                                    $filter: {
                                      input: '$productDetails',
                                      cond: { $eq: ['$$this.productId', '$$item.productId'] }
                                    }
                                  },
                                  in: '$$this.category'
                                }
                              },
                              0
                            ]
                          }
                        },
                        in: {
                          $switch: {
                            branches: [
                              { case: { $eq: ['$$category', 'Electronics'] }, then: 'tech' },
                              { case: { $in: ['$$category', ['Clothing', 'Shoes', 'Accessories']] }, then: 'fashion' },
                              { case: { $in: ['$$category', ['Home', 'Garden', 'Furniture']] }, then: 'home' }
                            ],
                            default: 'other'
                          }
                        }
                      }
                    }
                  }
                ]
              }
            }
          },

          // Customer lifetime metrics (approximation)
          estimatedCustomerValue: {
            $multiply: [
              '$financial.total',
              { $add: ['$customer.averageOrdersPerYear', 1] }
            ]
          },

          // Fulfillment performance
          fulfillmentDays: {
            $cond: {
              if: { $and: ['$fulfillment.shippedAt', '$orderDate'] },
              then: {
                $divide: [
                  { $subtract: ['$fulfillment.shippedAt', '$orderDate'] },
                  86400000 // Convert milliseconds to days
                ]
              },
              else: null
            }
          }
        }
      },

      // Stage 7: Group by multiple dimensions for comprehensive analytics
      {
        $group: {
          _id: {
            customerSegment: '$customerSegment',
            geographicSegment: '$geographicSegment',
            orderMonth: '$orderMonth',
            orderQuarter: '$orderQuarter',
            season: '$season',
            orderValueSegment: '$orderValueSegment'
          },

          // Customer metrics
          uniqueCustomers: { $addToSet: '$customerId' },
          totalOrders: { $sum: 1 },

          // Financial metrics
          totalRevenue: { $sum: '$financial.total' },
          totalDiscount: { $sum: '$financial.discount' },
          totalTax: { $sum: '$financial.tax' },
          totalShipping: { $sum: '$financial.shipping' },

          // Order value statistics
          avgOrderValue: { $avg: '$financial.total' },
          maxOrderValue: { $max: '$financial.total' },
          minOrderValue: { $min: '$financial.total' },

          // Product and item metrics
          totalItems: { $sum: { $size: '$items' } },
          avgItemsPerOrder: { $avg: { $size: '$items' } },
          uniqueProducts: { 
            $addToSet: {
              $reduce: {
                input: '$items',
                initialValue: [],
                in: { $concatArrays: ['$$value', ['$$this.productId']] }
              }
            }
          },

          // Category distribution
          categoryBreakdown: {
            $push: {
              $map: {
                input: '$enrichedItems',
                in: '$$this.categoryGroup'
              }
            }
          },

          // Customer behavior metrics
          avgCustomerValue: { $avg: '$estimatedCustomerValue' },
          loyaltyTierDistribution: { $push: '$customer.loyaltyTier' },

          // Operational metrics
          avgFulfillmentDays: { $avg: '$fulfillmentDays' },
          paymentMethodDistribution: { $push: '$payment.method' },
          shippingMethodDistribution: { $push: '$shipping.method' },

          // Geographic insights
          stateDistribution: { $push: '$customer.address.state' },
          countryDistribution: { $push: '$customer.address.country' },

          // Time-based patterns
          dayOfWeekDistribution: { $push: '$dayOfWeek' },
          hourOfDayDistribution: { $push: '$hourOfDay' },

          // Customer acquisition and retention
          newCustomersCount: {
            $sum: {
              $cond: [{ $eq: ['$customerSegment', 'new_customer'] }, 1, 0]
            }
          },
          returningCustomersCount: {
            $sum: {
              $cond: [{ $eq: ['$customerSegment', 'returning_customer'] }, 1, 0]
            }
          },

          // First and last order dates for trend analysis
          firstOrderDate: { $min: '$orderDate' },
          lastOrderDate: { $max: '$orderDate' }
        }
      },

      // Stage 8: Calculate derived metrics and insights
      {
        $addFields: {
          // Calculate actual unique counts
          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          uniqueProductCount: {
            $size: {
              $reduce: {
                input: '$uniqueProducts',
                initialValue: [],
                in: { $setUnion: ['$$value', '$$this'] }
              }
            }
          },

          // Revenue per customer
          revenuePerCustomer: {
            $cond: {
              if: { $gt: [{ $size: '$uniqueCustomers' }, 0] },
              then: { $divide: ['$totalRevenue', { $size: '$uniqueCustomers' }] },
              else: 0
            }
          },

          // Margin analysis
          grossMargin: { $subtract: ['$totalRevenue', '$totalDiscount'] },
          marginPercentage: {
            $multiply: [
              { $divide: [{ $subtract: ['$totalRevenue', '$totalDiscount'] }, '$totalRevenue'] },
              100
            ]
          },

          // Category insights
          topCategories: {
            $slice: [
              {
                $map: {
                  input: {
                    $sortArray: {
                      input: {
                        $objectToArray: {
                          $reduce: {
                            input: {
                              $reduce: {
                                input: '$categoryBreakdown',
                                initialValue: [],
                                in: { $concatArrays: ['$$value', '$$this'] }
                              }
                            },
                            initialValue: {},
                            in: {
                              $mergeObjects: [
                                '$$value',
                                { ['$$this']: { $add: [{ $ifNull: [{ $getField: ['$$this', '$$value'] }, 0] }, 1] } }
                              ]
                            }
                          }
                        }
                      },
                      sortBy: { v: -1 }
                    }
                  },
                  in: { category: '$$this.k', count: '$$this.v' }
                }
              },
              5 // Top 5 categories
            ]
          },

          // Customer distribution insights
          customerSegmentMetrics: {
            newCustomerPercentage: {
              $multiply: [
                { $divide: ['$newCustomersCount', '$totalOrders'] },
                100
              ]
            },
            returningCustomerPercentage: {
              $multiply: [
                { $divide: ['$returningCustomersCount', '$totalOrders'] },
                100
              ]
            }
          },

          // Time range analysis
          analysisPeriodDays: {
            $divide: [
              { $subtract: ['$lastOrderDate', '$firstOrderDate'] },
              86400000 // Convert to days
            ]
          },

          // Performance indicators
          performanceMetrics: {
            ordersPerDay: {
              $divide: [
                '$totalOrders',
                { $divide: [{ $subtract: ['$lastOrderDate', '$firstOrderDate'] }, 86400000] }
              ]
            },
            avgRevenuePerDay: {
              $divide: [
                '$totalRevenue',
                { $divide: [{ $subtract: ['$lastOrderDate', '$firstOrderDate'] }, 86400000] }
              ]
            }
          }
        }
      },

      // Stage 9: Project final results with clean structure
      {
        $project: {
          _id: 0,

          // Dimensions
          dimensions: '$_id',

          // Core metrics
          metrics: {
            customers: {
              total: '$uniqueCustomerCount',
              new: '$newCustomersCount',
              returning: '$returningCustomersCount',
              newPercentage: '$customerSegmentMetrics.newCustomerPercentage',
              returningPercentage: '$customerSegmentMetrics.returningCustomerPercentage'
            },

            orders: {
              total: '$totalOrders',
              averageValue: '$avgOrderValue',
              maxValue: '$maxOrderValue',
              minValue: '$minOrderValue',
              itemsPerOrder: '$avgItemsPerOrder'
            },

            revenue: {
              total: '$totalRevenue',
              gross: '$grossMargin',
              marginPercentage: '$marginPercentage',
              revenuePerCustomer: '$revenuePerCustomer',
              totalDiscount: '$totalDiscount',
              totalTax: '$totalTax',
              totalShipping: '$totalShipping'
            },

            products: {
              uniqueCount: '$uniqueProductCount',
              totalItems: '$totalItems',
              topCategories: '$topCategories'
            },

            operations: {
              avgFulfillmentDays: '$avgFulfillmentDays',
              analysisPeriodDays: '$analysisPeriodDays'
            },

            performance: '$performanceMetrics'
          },

          // Insights and distributions
          insights: {
            loyaltyTiers: '$loyaltyTierDistribution',
            paymentMethods: '$paymentMethodDistribution',
            shippingMethods: '$shippingMethodDistribution',
            geographic: {
              states: '$stateDistribution',
              countries: '$countryDistribution'
            },
            temporal: {
              daysOfWeek: '$dayOfWeekDistribution',
              hoursOfDay: '$hourOfDayDistribution'
            }
          },

          // Time range
          timeRange: {
            startDate: '$firstOrderDate',
            endDate: '$lastOrderDate'
          }
        }
      },

      // Stage 10: Sort results by significance
      {
        $sort: {
          'metrics.revenue.total': -1,
          'metrics.customers.total': -1
        }
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, this.aggregationOptions).toArray();

      const processingTime = Date.now() - startTime;
      console.log(`Customer analytics completed in ${processingTime}ms, found ${results.length} result groups`);

      return {
        success: true,
        processingTimeMs: processingTime,
        timeRange: timeRange,
        resultCount: results.length,
        analytics: results,
        metadata: {
          queryComplexity: 'high',
          stagesCount: pipeline.length,
          indexesUsed: 'multiple_compound_indexes',
          aggregationFeatures: [
            'lookup_joins',
            'complex_expressions',
            'grouping_aggregations', 
            'conditional_logic',
            'array_operations',
            'date_functions',
            'mathematical_operations'
          ]
        }
      };

    } catch (error) {
      console.error('Error performing customer analytics:', error);
      return {
        success: false,
        error: error.message,
        processingTimeMs: Date.now() - startTime
      };
    }
  }

  async performRealTimeProductAnalytics(refreshInterval = 60000) {
    console.log('Starting real-time product performance analytics...');

    const pipeline = [
      // Stage 1: Match recent orders (last 24 hours)
      {
        $match: {
          orderDate: { 
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) 
          },
          status: { $in: ['completed', 'processing', 'shipped'] }
        }
      },

      // Stage 2: Unwind order items for item-level analysis
      {
        $unwind: '$items'
      },

      // Stage 3: Lookup product details
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'product'
        }
      },

      // Stage 4: Unwind product (should be single document)
      {
        $unwind: '$product'
      },

      // Stage 5: Lookup current inventory levels
      {
        $lookup: {
          from: 'inventory',
          localField: 'items.productId',
          foreignField: 'productId',
          as: 'inventory'
        }
      },

      // Stage 6: Add calculated fields for real-time metrics
      {
        $addFields: {
          // Time buckets for real-time analysis
          hourBucket: {
            $dateTrunc: {
              date: '$orderDate',
              unit: 'hour'
            }
          },

          // Product performance metrics
          itemRevenue: { $multiply: ['$items.quantity', '$items.unitPrice'] },
          itemMargin: {
            $multiply: [
              '$items.quantity',
              { $subtract: ['$items.unitPrice', '$product.costPerUnit'] }
            ]
          },

          // Inventory status
          currentStock: { $arrayElemAt: ['$inventory.currentStock', 0] },
          reorderLevel: { $arrayElemAt: ['$inventory.reorderLevel', 0] },

          // Product categorization
          categoryGroup: {
            $switch: {
              branches: [
                { case: { $eq: ['$product.category', 'Electronics'] }, then: 'tech' },
                { case: { $in: ['$product.category', ['Clothing', 'Shoes']] }, then: 'fashion' },
                { case: { $in: ['$product.category', ['Home', 'Garden']] }, then: 'home' }
              ],
              default: 'other'
            }
          },

          // Price performance
          pricePoint: {
            $switch: {
              branches: [
                { case: { $gte: ['$items.unitPrice', 500] }, then: 'premium' },
                { case: { $gte: ['$items.unitPrice', 100] }, then: 'mid_range' },
                { case: { $gte: ['$items.unitPrice', 25] }, then: 'budget' }
              ],
              default: 'economy'
            }
          },

          // Velocity indicators
          orderRecency: {
            $divide: [
              { $subtract: [new Date(), '$orderDate'] },
              3600000 // Convert to hours
            ]
          }
        }
      },

      // Stage 7: Group by product and time buckets for real-time aggregation
      {
        $group: {
          _id: {
            productId: '$items.productId',
            productName: '$product.name',
            category: '$product.category',
            categoryGroup: '$categoryGroup',
            brand: '$product.brand',
            pricePoint: '$pricePoint',
            hourBucket: '$hourBucket'
          },

          // Sales metrics
          totalQuantitySold: { $sum: '$items.quantity' },
          totalRevenue: { $sum: '$itemRevenue' },
          totalMargin: { $sum: '$itemMargin' },
          uniqueOrders: { $addToSet: '$_id' },
          avgOrderQuantity: { $avg: '$items.quantity' },

          // Pricing metrics
          avgSellingPrice: { $avg: '$items.unitPrice' },
          maxSellingPrice: { $max: '$items.unitPrice' },
          minSellingPrice: { $min: '$items.unitPrice' },

          // Inventory insights
          currentStockLevel: { $first: '$currentStock' },
          reorderThreshold: { $first: '$reorderLevel' },

          // Time-based insights
          avgOrderRecency: { $avg: '$orderRecency' },
          latestOrderTime: { $max: '$orderDate' },
          earliestOrderTime: { $min: '$orderDate' },

          // Customer insights
          uniqueCustomers: { $addToSet: '$customerId' },

          // Geographic distribution
          regions: { $addToSet: '$customer.address.state' },

          // Order characteristics
          avgOrderValue: { $avg: '$financial.total' },
          shippingMethodsUsed: { $addToSet: '$shipping.method' }
        }
      },

      // Stage 8: Calculate performance indicators and rankings
      {
        $addFields: {
          // Performance calculations
          marginPercentage: {
            $cond: {
              if: { $gt: ['$totalRevenue', 0] },
              then: { $multiply: [{ $divide: ['$totalMargin', '$totalRevenue'] }, 100] },
              else: 0
            }
          },

          uniqueOrderCount: { $size: '$uniqueOrders' },
          uniqueCustomerCount: { $size: '$uniqueCustomers' },

          // Inventory health
          stockStatus: {
            $switch: {
              branches: [
                {
                  case: { $lte: ['$currentStockLevel', 0] },
                  then: 'out_of_stock'
                },
                {
                  case: { $lte: ['$currentStockLevel', '$reorderThreshold'] },
                  then: 'low_stock'
                },
                {
                  case: { $lte: ['$currentStockLevel', { $multiply: ['$reorderThreshold', 2] }] },
                  then: 'medium_stock'
                }
              ],
              default: 'high_stock'
            }
          },

          // Velocity metrics
          salesVelocity: {
            $divide: ['$totalQuantitySold', { $max: ['$avgOrderRecency', 1] }]
          },

          // Customer engagement
          customerRetention: {
            $divide: ['$uniqueCustomerCount', '$uniqueOrderCount']
          },

          // Regional penetration
          regionalReach: { $size: '$regions' }
        }
      },

      // Stage 9: Add ranking and performance classification
      {
        $setWindowFields: {
          sortBy: { totalRevenue: -1 },
          output: {
            revenueRank: { $rank: {} },
            revenuePercentile: { $percentRank: {} }
          }
        }
      },

      {
        $setWindowFields: {
          partitionBy: '$_id.categoryGroup',
          sortBy: { totalQuantitySold: -1 },
          output: {
            categoryRank: { $rank: {} }
          }
        }
      },

      // Stage 10: Add performance classification
      {
        $addFields: {
          performanceClassification: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $lte: ['$revenueRank', 10] },
                      { $gt: ['$marginPercentage', 20] },
                      { $gt: ['$uniqueCustomerCount', 5] }
                    ]
                  },
                  then: 'star_performer'
                },
                {
                  case: {
                    $and: [
                      { $lte: ['$revenueRank', 50] },
                      { $gt: ['$marginPercentage', 15] }
                    ]
                  },
                  then: 'strong_performer'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalRevenue', 100] },
                      { $gt: ['$marginPercentage', 10] }
                    ]
                  },
                  then: 'solid_performer'
                },
                {
                  case: { $lte: ['$totalRevenue', 50] },
                  then: 'low_performer'
                }
              ],
              default: 'average_performer'
            }
          },

          // Action recommendations
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$stockStatus', 'out_of_stock'] },
                  then: ['urgent_restock', 'review_demand_forecast']
                },
                {
                  case: { $eq: ['$stockStatus', 'low_stock'] },
                  then: ['schedule_restock', 'monitor_sales_velocity']
                },
                {
                  case: {
                    $and: [
                      { $eq: ['$performanceClassification', 'star_performer'] },
                      { $gt: ['$currentStockLevel', '$reorderThreshold'] }
                    ]
                  },
                  then: ['increase_marketing', 'optimize_pricing', 'expand_availability']
                },
                {
                  case: { $eq: ['$performanceClassification', 'low_performer'] },
                  then: ['review_pricing', 'improve_marketing', 'consider_discontinuation']
                }
              ],
              default: ['monitor_performance', 'optimize_inventory_levels']
            }
          },

          // Real-time alerts
          alerts: {
            $filter: {
              input: [
                {
                  $cond: {
                    if: { $eq: ['$stockStatus', 'out_of_stock'] },
                    then: {
                      type: 'critical',
                      message: 'Product is out of stock with active sales',
                      priority: 'high'
                    },
                    else: null
                  }
                },
                {
                  $cond: {
                    if: {
                      $and: [
                        { $eq: ['$performanceClassification', 'star_performer'] },
                        { $eq: ['$stockStatus', 'low_stock'] }
                      ]
                    },
                    then: {
                      type: 'opportunity',
                      message: 'High-performing product running low on stock',
                      priority: 'medium'
                    },
                    else: null
                  }
                },
                {
                  $cond: {
                    if: { $lt: ['$marginPercentage', 5] },
                    then: {
                      type: 'margin_concern',
                      message: 'Product margin below threshold',
                      priority: 'low'
                    },
                    else: null
                  }
                }
              ],
              cond: { $ne: ['$$this', null] }
            }
          }
        }
      },

      // Stage 11: Final projection with structured output
      {
        $project: {
          _id: 0,

          // Product identification
          product: {
            id: '$_id.productId',
            name: '$_id.productName',
            category: '$_id.category',
            categoryGroup: '$_id.categoryGroup',
            brand: '$_id.brand',
            pricePoint: '$_id.pricePoint'
          },

          // Time context
          timeContext: {
            hourBucket: '$_id.hourBucket',
            latestOrder: '$latestOrderTime',
            earliestOrder: '$earliestOrderTime',
            avgOrderRecencyHours: '$avgOrderRecency'
          },

          // Performance metrics
          performance: {
            totalQuantitySold: '$totalQuantitySold',
            totalRevenue: { $round: ['$totalRevenue', 2] },
            totalMargin: { $round: ['$totalMargin', 2] },
            marginPercentage: { $round: ['$marginPercentage', 1] },
            uniqueOrders: '$uniqueOrderCount',
            uniqueCustomers: '$uniqueCustomerCount',
            avgOrderQuantity: { $round: ['$avgOrderQuantity', 2] },
            salesVelocity: { $round: ['$salesVelocity', 3] },
            customerRetention: { $round: ['$customerRetention', 3] }
          },

          // Pricing insights
          pricing: {
            avgSellingPrice: { $round: ['$avgSellingPrice', 2] },
            maxSellingPrice: '$maxSellingPrice',
            minSellingPrice: '$minSellingPrice',
            priceVariation: { $subtract: ['$maxSellingPrice', '$minSellingPrice'] }
          },

          // Inventory status
          inventory: {
            currentStock: '$currentStockLevel',
            reorderLevel: '$reorderThreshold',
            stockStatus: '$stockStatus',
            stockTurnover: {
              $cond: {
                if: { $gt: ['$currentStockLevel', 0] },
                then: { $divide: ['$totalQuantitySold', '$currentStockLevel'] },
                else: null
              }
            }
          },

          // Market position
          marketPosition: {
            revenueRank: '$revenueRank',
            revenuePercentile: { $round: ['$revenuePercentile', 3] },
            categoryRank: '$categoryRank',
            performanceClass: '$performanceClassification'
          },

          // Geographic and market reach
          marketReach: {
            regionalReach: '$regionalReach',
            regions: '$regions',
            avgOrderValue: { $round: ['$avgOrderValue', 2] },
            shippingMethods: '$shippingMethodsUsed'
          },

          // Actionable insights
          insights: {
            recommendations: '$recommendations',
            alerts: '$alerts',

            // Key insights derived from data
            keyInsights: {
              $filter: {
                input: [
                  {
                    $cond: {
                      if: { $gt: ['$uniqueCustomerCount', 10] },
                      then: 'High customer engagement - good repeat purchase potential',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$regionalReach', 5] },
                      then: 'Strong geographic distribution - consider expanding marketing',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$salesVelocity', 1] },
                      then: 'Fast-moving product - ensure adequate inventory levels',
                      else: null
                    }
                  }
                ],
                cond: { $ne: ['$$this', null] }
              }
            }
          }
        }
      },

      // Stage 12: Sort by performance and significance
      {
        $sort: {
          'performance.totalRevenue': -1,
          'performance.uniqueCustomers': -1,
          'inventory.stockTurnover': -1
        }
      },

      // Stage 13: Limit to top performers for real-time display
      {
        $limit: 100
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, {
        ...this.aggregationOptions,
        maxTimeMS: 30000 // Shorter timeout for real-time queries
      }).toArray();

      // Store results for real-time dashboard
      await this.collections.realTimeMetrics.replaceOne(
        { type: 'product_performance' },
        {
          type: 'product_performance',
          timestamp: new Date(),
          refreshInterval: refreshInterval,
          dataCount: results.length,
          data: results
        },
        { upsert: true }
      );

      console.log(`Real-time product analytics completed: ${results.length} products analyzed`);

      return {
        success: true,
        timestamp: new Date(),
        productCount: results.length,
        analytics: results,
        summary: {
          totalRevenue: results.reduce((sum, product) => sum + product.performance.totalRevenue, 0),
          totalQuantitySold: results.reduce((sum, product) => sum + product.performance.totalQuantitySold, 0),
          avgMarginPercentage: results.reduce((sum, product) => sum + product.performance.marginPercentage, 0) / results.length,
          outOfStockProducts: results.filter(product => product.inventory.stockStatus === 'out_of_stock').length,
          starPerformers: results.filter(product => product.marketPosition.performanceClass === 'star_performer').length,
          criticalAlerts: results.reduce((sum, product) => 
            sum + product.insights.alerts.filter(alert => alert.priority === 'high').length, 0
          )
        }
      };

    } catch (error) {
      console.error('Error performing real-time product analytics:', error);
      return {
        success: false,
        error: error.message,
        timestamp: new Date()
      };
    }
  }

  async performAdvancedCohortAnalysis(cohortType = 'monthly', lookbackPeriods = 12) {
    console.log(`Performing ${cohortType} cohort analysis for ${lookbackPeriods} periods...`);

    const startTime = Date.now();

    // Calculate cohort periods based on type
    const cohortConfig = {
      'weekly': { unit: 'week', periodMs: 7 * 24 * 60 * 60 * 1000 },
      'monthly': { unit: 'month', periodMs: 30 * 24 * 60 * 60 * 1000 },
      'quarterly': { unit: 'quarter', periodMs: 90 * 24 * 60 * 60 * 1000 }
    };

    const config = cohortConfig[cohortType] || cohortConfig['monthly'];
    const startDate = new Date(Date.now() - lookbackPeriods * config.periodMs);

    const pipeline = [
      // Stage 1: Get all customer first orders to establish cohorts
      {
        $match: {
          orderDate: { $gte: startDate },
          status: { $in: ['completed', 'shipped', 'delivered'] }
        }
      },

      // Stage 2: Get customer first order dates
      {
        $group: {
          _id: '$customerId',
          firstOrderDate: { $min: '$orderDate' },
          allOrders: {
            $push: {
              orderId: '$_id',
              orderDate: '$orderDate',
              total: '$financial.total',
              items: '$items'
            }
          },
          totalOrders: { $sum: 1 },
          totalSpent: { $sum: '$financial.total' },
          lastOrderDate: { $max: '$orderDate' }
        }
      },

      // Stage 3: Calculate cohort membership and period analysis
      {
        $addFields: {
          // Determine which cohort this customer belongs to
          cohortPeriod: {
            $dateTrunc: {
              date: '$firstOrderDate',
              unit: config.unit
            }
          },

          // Calculate customer lifetime span
          lifetimeSpanDays: {
            $divide: [
              { $subtract: ['$lastOrderDate', '$firstOrderDate'] },
              86400000
            ]
          },

          // Analyze orders by period
          ordersByPeriod: {
            $map: {
              input: '$allOrders',
              as: 'order',
              in: {
                $mergeObjects: [
                  '$$order',
                  {
                    orderPeriod: {
                      $dateTrunc: {
                        date: '$$order.orderDate',
                        unit: config.unit
                      }
                    },
                    periodsFromFirstOrder: {
                      $divide: [
                        {
                          $subtract: [
                            {
                              $dateTrunc: {
                                date: '$$order.orderDate',
                                unit: config.unit
                              }
                            },
                            {
                              $dateTrunc: {
                                date: '$firstOrderDate',
                                unit: config.unit
                              }
                            }
                          ]
                        },
                        config.periodMs
                      ]
                    }
                  }
                ]
              }
            }
          }
        }
      },

      // Stage 4: Unwind orders to analyze period-by-period behavior
      {
        $unwind: '$ordersByPeriod'
      },

      // Stage 5: Group by cohort and period for retention analysis
      {
        $group: {
          _id: {
            cohortPeriod: '$cohortPeriod',
            orderPeriod: '$ordersByPeriod.orderPeriod',
            periodsFromFirst: { 
              $floor: '$ordersByPeriod.periodsFromFirstOrder' 
            }
          },

          // Customer retention metrics
          activeCustomers: { $addToSet: '$_id' },
          totalOrders: { $sum: 1 },
          totalRevenue: { $sum: '$ordersByPeriod.total' },
          avgOrderValue: { $avg: '$ordersByPeriod.total' },

          // Customer behavior metrics
          avgLifetimeSpan: { $avg: '$lifetimeSpanDays' },
          totalCustomerLifetimeValue: { $avg: '$totalSpent' },
          avgOrdersPerCustomer: { $avg: '$totalOrders' },

          // Period-specific insights
          newCustomersInPeriod: {
            $sum: {
              $cond: [
                { $eq: ['$ordersByPeriod.periodsFromFirstOrder', 0] },
                1,
                0
              ]
            }
          },

          // Revenue distribution
          revenueDistribution: {
            $push: '$ordersByPeriod.total'
          },

          // Order frequency analysis
          orderFrequencyDistribution: {
            $push: '$totalOrders'
          }
        }
      },

      // Stage 6: Calculate cohort size (initial customers in each cohort)
      {
        $lookup: {
          from: 'orders',
          pipeline: [
            {
              $match: {
                orderDate: { $gte: startDate },
                status: { $in: ['completed', 'shipped', 'delivered'] }
              }
            },
            {
              $group: {
                _id: '$customerId',
                firstOrderDate: { $min: '$orderDate' }
              }
            },
            {
              $addFields: {
                cohortPeriod: {
                  $dateTrunc: {
                    date: '$firstOrderDate',
                    unit: config.unit
                  }
                }
              }
            },
            {
              $group: {
                _id: '$cohortPeriod',
                cohortSize: { $sum: 1 }
              }
            }
          ],
          as: 'cohortSizes'
        }
      },

      // Stage 7: Add cohort size information
      {
        $addFields: {
          cohortSize: {
            $let: {
              vars: {
                matchingCohort: {
                  $arrayElemAt: [
                    {
                      $filter: {
                        input: '$cohortSizes',
                        cond: { $eq: ['$$this._id', '$_id.cohortPeriod'] }
                      }
                    },
                    0
                  ]
                }
              },
              in: '$$matchingCohort.cohortSize'
            }
          },

          activeCustomerCount: { $size: '$activeCustomers' },

          // Calculate retention rate
          retentionRate: {
            $let: {
              vars: {
                cohortSize: {
                  $arrayElemAt: [
                    {
                      $map: {
                        input: {
                          $filter: {
                            input: '$cohortSizes',
                            cond: { $eq: ['$$this._id', '$_id.cohortPeriod'] }
                          }
                        },
                        in: '$$this.cohortSize'
                      }
                    },
                    0
                  ]
                }
              },
              in: {
                $multiply: [
                  { $divide: [{ $size: '$activeCustomers' }, '$$cohortSize'] },
                  100
                ]
              }
            }
          }
        }
      },

      // Stage 8: Calculate advanced cohort metrics
      {
        $addFields: {
          // Revenue per customer in this period
          revenuePerCustomer: {
            $divide: ['$totalRevenue', '$activeCustomerCount']
          },

          // Customer engagement score
          engagementScore: {
            $multiply: [
              { $divide: ['$totalOrders', '$activeCustomerCount'] },
              { $divide: ['$retentionRate', 100] }
            ]
          },

          // Revenue distribution analysis
          revenueMetrics: {
            median: {
              $arrayElemAt: [
                {
                  $sortArray: {
                    input: '$revenueDistribution',
                    sortBy: 1
                  }
                },
                { $floor: { $divide: [{ $size: '$revenueDistribution' }, 2] } }
              ]
            },
            total: '$totalRevenue',
            average: '$avgOrderValue',
            max: { $max: '$revenueDistribution' },
            min: { $min: '$revenueDistribution' }
          },

          // Period classification
          periodClassification: {
            $switch: {
              branches: [
                { case: { $eq: ['$_id.periodsFromFirst', 0] }, then: 'acquisition' },
                { case: { $lte: ['$_id.periodsFromFirst', 3] }, then: 'early_engagement' },
                { case: { $lte: ['$_id.periodsFromFirst', 12] }, then: 'mature_relationship' }
              ],
              default: 'long_term_loyalty'
            }
          }
        }
      },

      // Stage 9: Group by cohort for final analysis
      {
        $group: {
          _id: '$_id.cohortPeriod',
          cohortSize: { $first: '$cohortSize' },

          // Retention analysis by period
          retentionByPeriod: {
            $push: {
              period: '$_id.periodsFromFirst',
              orderPeriod: '$_id.orderPeriod',
              activeCustomers: '$activeCustomerCount',
              retentionRate: '$retentionRate',
              totalRevenue: '$totalRevenue',
              revenuePerCustomer: '$revenuePerCustomer',
              avgOrderValue: '$avgOrderValue',
              totalOrders: '$totalOrders',
              engagementScore: '$engagementScore',
              periodClassification: '$periodClassification',
              revenueMetrics: '$revenueMetrics'
            }
          },

          // Aggregate cohort metrics
          totalLifetimeRevenue: { $sum: '$totalRevenue' },
          avgLifetimeValue: { $avg: '$totalCustomerLifetimeValue' },
          peakRetentionRate: { $max: '$retentionRate' },
          finalRetentionRate: { $last: '$retentionRate' },
          avgEngagementScore: { $avg: '$engagementScore' },

          // Cohort performance classification
          cohortHealth: {
            $avg: {
              $cond: [
                { $gte: ['$retentionRate', 30] }, // 30% retention considered healthy
                1,
                0
              ]
            }
          }
        }
      },

      // Stage 10: Calculate cohort performance indicators
      {
        $addFields: {
          // Lifetime value per customer in cohort
          lifetimeValuePerCustomer: {
            $divide: ['$totalLifetimeRevenue', '$cohortSize']
          },

          // Retention curve analysis
          retentionTrend: {
            $let: {
              vars: {
                firstPeriodRetention: {
                  $arrayElemAt: [
                    {
                      $map: {
                        input: {
                          $filter: {
                            input: '$retentionByPeriod',
                            cond: { $eq: ['$$this.period', 0] }
                          }
                        },
                        in: '$$this.retentionRate'
                      }
                    },
                    0
                  ]
                },
                lastPeriodRetention: '$finalRetentionRate'
              },
              in: {
                $subtract: ['$$lastPeriodRetention', '$$firstPeriodRetention']
              }
            }
          },

          // Cohort quality classification
          cohortQuality: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 50] },
                      { $gte: ['$avgEngagementScore', 1.5] },
                      { $gte: ['$lifetimeValuePerCustomer', 500] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 35] },
                      { $gte: ['$avgEngagementScore', 1.0] },
                      { $gte: ['$lifetimeValuePerCustomer', 250] }
                    ]
                  },
                  then: 'good'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$peakRetentionRate', 20] },
                      { $gte: ['$avgEngagementScore', 0.5] }
                    ]
                  },
                  then: 'fair'
                }
              ],
              default: 'poor'
            }
          },

          // Strategic recommendations
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$peakRetentionRate', 20] },
                  then: ['improve_onboarding', 'enhance_early_engagement', 'review_product_fit']
                },
                {
                  case: { $lt: ['$finalRetentionRate', 10] },
                  then: ['develop_loyalty_program', 'improve_long_term_value', 'increase_engagement']
                },
                {
                  case: { $lt: ['$avgEngagementScore', 0.5] },
                  then: ['enhance_customer_experience', 'increase_purchase_frequency', 'improve_product_recommendations']
                }
              ],
              default: ['maintain_excellence', 'scale_successful_strategies', 'explore_expansion_opportunities']
            }
          }
        }
      },

      // Stage 11: Final projection and formatting
      {
        $project: {
          _id: 0,

          // Cohort identification
          cohortPeriod: '$_id',
          cohortSize: 1,
          cohortQuality: 1,

          // Key performance metrics
          performance: {
            lifetimeValuePerCustomer: { $round: ['$lifetimeValuePerCustomer', 2] },
            avgLifetimeValue: { $round: ['$avgLifetimeValue', 2] },
            totalLifetimeRevenue: { $round: ['$totalLifetimeRevenue', 2] },
            peakRetentionRate: { $round: ['$peakRetentionRate', 1] },
            finalRetentionRate: { $round: ['$finalRetentionRate', 1] },
            retentionTrend: { $round: ['$retentionTrend', 1] },
            avgEngagementScore: { $round: ['$avgEngagementScore', 2] },
            cohortHealth: { $round: ['$cohortHealth', 2] }
          },

          // Detailed retention analysis
          retentionAnalysis: {
            $map: {
              input: { $sortArray: { input: '$retentionByPeriod', sortBy: { period: 1 } } },
              in: {
                period: '$$this.period',
                orderPeriod: '$$this.orderPeriod',
                activeCustomers: '$$this.activeCustomers',
                retentionRate: { $round: ['$$this.retentionRate', 1] },
                revenuePerCustomer: { $round: ['$$this.revenuePerCustomer', 2] },
                avgOrderValue: { $round: ['$$this.avgOrderValue', 2] },
                totalRevenue: { $round: ['$$this.totalRevenue', 2] },
                totalOrders: '$$this.totalOrders',
                engagementScore: { $round: ['$$this.engagementScore', 2] },
                periodClassification: '$$this.periodClassification',
                revenueMetrics: {
                  median: { $round: ['$$this.revenueMetrics.median', 2] },
                  average: { $round: ['$$this.revenueMetrics.average', 2] },
                  max: '$$this.revenueMetrics.max',
                  min: '$$this.revenueMetrics.min'
                }
              }
            }
          },

          // Strategic insights
          insights: {
            recommendations: '$recommendations',
            keyInsights: {
              $filter: {
                input: [
                  {
                    $cond: {
                      if: { $gt: ['$peakRetentionRate', 40] },
                      then: 'High-quality cohort with strong initial engagement',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$retentionTrend', 0] },
                      then: 'Retention improving over time - successful loyalty building',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$avgEngagementScore', 2] },
                      then: 'Highly engaged cohort with frequent repeat purchases',
                      else: null
                    }
                  },
                  {
                    $cond: {
                      if: { $gt: ['$lifetimeValuePerCustomer', 1000] },
                      then: 'High-value cohort - focus on retention and expansion',
                      else: null
                    }
                  }
                ],
                cond: { $ne: ['$$this', null] }
              }
            }
          }
        }
      },

      // Stage 12: Sort by cohort period (most recent first)
      {
        $sort: { cohortPeriod: -1 }
      }
    ];

    try {
      const results = await this.collections.orders.aggregate(pipeline, this.aggregationOptions).toArray();

      const processingTime = Date.now() - startTime;
      console.log(`Cohort analysis completed in ${processingTime}ms, analyzed ${results.length} cohorts`);

      // Calculate cross-cohort insights
      const crossCohortInsights = {
        totalCohorts: results.length,
        avgCohortSize: results.reduce((sum, cohort) => sum + cohort.cohortSize, 0) / results.length,
        avgLifetimeValue: results.reduce((sum, cohort) => sum + cohort.performance.lifetimeValuePerCustomer, 0) / results.length,
        bestPerformingCohort: results.reduce((best, current) => 
          current.performance.lifetimeValuePerCustomer > best.performance.lifetimeValuePerCustomer ? current : best, results[0]
        ),
        retentionTrendAvg: results.reduce((sum, cohort) => sum + cohort.performance.retentionTrend, 0) / results.length,
        excellentCohorts: results.filter(cohort => cohort.cohortQuality === 'excellent').length,
        improvingCohorts: results.filter(cohort => cohort.performance.retentionTrend > 0).length
      };

      return {
        success: true,
        processingTimeMs: processingTime,
        cohortType: cohortType,
        lookbackPeriods: lookbackPeriods,
        analysisDate: new Date(),
        cohortCount: results.length,
        cohorts: results,
        crossCohortInsights: crossCohortInsights,
        metadata: {
          aggregationComplexity: 'very_high',
          stagesCount: pipeline.length,
          analyticsFeatures: [
            'customer_lifetime_value',
            'retention_analysis',
            'cohort_segmentation',
            'behavioral_analysis',
            'revenue_attribution',
            'trend_analysis',
            'performance_classification',
            'strategic_recommendations'
          ]
        }
      };

    } catch (error) {
      console.error('Error performing cohort analysis:', error);
      return {
        success: false,
        error: error.message,
        processingTimeMs: Date.now() - startTime
      };
    }
  }
}

// Benefits of MongoDB Aggregation Framework:
// - Single-pass processing for complex multi-stage analytics
// - Native document transformation without expensive JOINs
// - Automatic query optimization and index utilization
// - Horizontal scaling across sharded clusters
// - Real-time processing capabilities with streaming aggregation
// - Rich expression language for complex calculations
// - Built-in statistical and analytical functions
// - Memory-efficient processing with spill-to-disk support
// - Integration with MongoDB's native features (GeoSpatial, Text Search, etc.)
// - SQL-compatible operations through QueryLeaf integration

module.exports = {
  AdvancedAnalyticsProcessor
};

Understanding MongoDB Aggregation Framework Architecture

Advanced Pipeline Optimization and Performance Patterns

Implement sophisticated aggregation strategies for enterprise MongoDB deployments:

// Production-optimized MongoDB Aggregation with advanced performance tuning
class EnterpriseAggregationOptimizer {
  constructor(db, optimizationConfig) {
    this.db = db;
    this.config = {
      ...optimizationConfig,
      enableQueryPlanCache: true,
      enableParallelProcessing: true,
      enableIncrementalProcessing: true,
      maxMemoryUsage: '2GB',
      enableIndexHints: true,
      enableResultCaching: true
    };

    this.queryPlanCache = new Map();
    this.resultCache = new Map();
    this.performanceMetrics = new Map();
  }

  async optimizeAggregationPipeline(pipeline, collectionName, options = {}) {
    console.log(`Optimizing aggregation pipeline for ${collectionName}...`);

    const optimizationStrategies = [
      this.moveMatchToBeginning,
      this.optimizeIndexUsage,
      this.enableEarlyFiltering,
      this.minimizeDataMovement,
      this.optimizeGroupingOperations,
      this.enableParallelExecution
    ];

    let optimizedPipeline = [...pipeline];

    for (const strategy of optimizationStrategies) {
      optimizedPipeline = await strategy.call(this, optimizedPipeline, collectionName, options);
    }

    return {
      originalStages: pipeline.length,
      optimizedStages: optimizedPipeline.length,
      optimizedPipeline: optimizedPipeline,
      estimatedPerformanceGain: this.calculatePerformanceGain(pipeline, optimizedPipeline)
    };
  }

  async enableRealTimeAggregation(pipeline, collectionName, refreshInterval = 5000) {
    console.log(`Setting up real-time aggregation for ${collectionName}...`);

    // Implementation of real-time aggregation with Change Streams
    const changeStream = this.db.collection(collectionName).watch([
      {
        $match: {
          operationType: { $in: ['insert', 'update', 'delete'] }
        }
      }
    ]);

    const realTimeProcessor = {
      pipeline: pipeline,
      lastResults: null,
      isProcessing: false,

      async processChanges() {
        if (this.isProcessing) return;

        this.isProcessing = true;
        try {
          const results = await this.db.collection(collectionName)
            .aggregate(pipeline, { allowDiskUse: true })
            .toArray();

          this.lastResults = results;

          // Emit real-time results to subscribers
          this.emitResults(results);

        } catch (error) {
          console.error('Real-time aggregation error:', error);
        } finally {
          this.isProcessing = false;
        }
      }
    };

    // Process changes as they occur
    changeStream.on('change', () => {
      realTimeProcessor.processChanges();
    });

    return realTimeProcessor;
  }

  async implementIncrementalAggregation(pipeline, collectionName, incrementField = 'updatedAt') {
    console.log(`Setting up incremental aggregation for ${collectionName}...`);

    // Track last processed timestamp
    let lastProcessedTime = await this.getLastProcessedTime(collectionName);

    const incrementalPipeline = [
      // Only process new/updated documents
      {
        $match: {
          [incrementField]: { $gt: lastProcessedTime }
        }
      },
      ...pipeline
    ];

    const results = await this.db.collection(collectionName)
      .aggregate(incrementalPipeline, { allowDiskUse: true })
      .toArray();

    // Update last processed time
    await this.updateLastProcessedTime(collectionName, new Date());

    return {
      incrementalResults: results,
      lastProcessedTime: lastProcessedTime,
      newProcessedTime: new Date(),
      documentsProcessed: results.length
    };
  }
}

SQL-Style Aggregation Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Aggregation Framework operations:

-- QueryLeaf aggregation operations with SQL-familiar syntax

-- Complex customer analytics with CTEs and window functions
WITH customer_segments AS (
  SELECT 
    customer_id,
    customer_name,
    registration_date,

    -- Customer segmentation using CASE expressions
    CASE 
      WHEN registration_date >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'new_customer'
      WHEN last_order_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'active_customer'
      WHEN last_order_date >= CURRENT_TIMESTAMP - INTERVAL '180 days' THEN 'returning_customer'
      ELSE 'dormant_customer'
    END as customer_segment,

    -- Geographic classification using nested CASE
    CASE 
      WHEN JSON_EXTRACT(address, '$.country') = 'US' THEN 'domestic'
      WHEN JSON_EXTRACT(address, '$.country') IN ('CA', 'MX') THEN 'north_america'
      WHEN JSON_EXTRACT(address, '$.country') IN ('GB', 'DE', 'FR', 'IT', 'ES') THEN 'europe'
      ELSE 'international'
    END as geographic_segment,

    -- Customer value classification
    total_spent,
    total_orders,
    average_order_value,
    loyalty_tier

  FROM customers
  WHERE is_active = true
),

order_analytics AS (
  SELECT 
    o._id as order_id,
    o.customer_id,
    o.order_date,

    -- Time-based dimensions using date functions
    DATE_TRUNC('month', o.order_date) as order_month,
    DATE_TRUNC('quarter', o.order_date) as order_quarter,
    EXTRACT(year FROM o.order_date) as order_year,
    EXTRACT(dow FROM o.order_date) as day_of_week,
    EXTRACT(hour FROM o.order_date) as hour_of_day,

    -- Seasonal analysis
    CASE 
      WHEN EXTRACT(month FROM o.order_date) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(month FROM o.order_date) IN (3, 4, 5) THEN 'spring'
      WHEN EXTRACT(month FROM o.order_date) IN (6, 7, 8) THEN 'summer'
      ELSE 'fall'
    END as season,

    -- Financial metrics
    JSON_EXTRACT(financial, '$.total') as order_total,
    JSON_EXTRACT(financial, '$.discount') as discount_amount,
    JSON_EXTRACT(financial, '$.tax') as tax_amount,
    JSON_EXTRACT(financial, '$.shipping') as shipping_amount,

    -- Order classification
    CASE 
      WHEN JSON_EXTRACT(financial, '$.total') >= 1000 THEN 'high_value'
      WHEN JSON_EXTRACT(financial, '$.total') >= 500 THEN 'medium_value'
      WHEN JSON_EXTRACT(financial, '$.total') >= 100 THEN 'low_value'
      ELSE 'micro_transaction'
    END as order_value_segment,

    -- Item analysis using JSON functions
    JSON_ARRAY_LENGTH(items) as item_count,

    -- Payment and shipping insights
    JSON_EXTRACT(payment, '$.method') as payment_method,
    JSON_EXTRACT(shipping, '$.method') as shipping_method,
    JSON_EXTRACT(shipping, '$.region') as shipping_region

  FROM orders o
  WHERE o.order_date >= CURRENT_TIMESTAMP - INTERVAL '12 months'
    AND o.status IN ('completed', 'shipped', 'delivered')
),

product_performance AS (
  SELECT 
    oi.order_id,

    -- Unnest items array for item-level analysis
    JSON_EXTRACT(item, '$.product_id') as product_id,
    JSON_EXTRACT(item, '$.quantity') as quantity,
    JSON_EXTRACT(item, '$.unit_price') as unit_price,
    JSON_EXTRACT(item, '$.total_price') as item_total,

    -- Product details from JOIN
    p.product_name,
    p.category,
    p.brand,
    p.cost_per_unit,

    -- Calculate margins
    (JSON_EXTRACT(item, '$.unit_price') - p.cost_per_unit) as unit_margin,
    (JSON_EXTRACT(item, '$.unit_price') - p.cost_per_unit) * JSON_EXTRACT(item, '$.quantity') as total_margin,

    -- Product categorization
    CASE 
      WHEN p.category = 'Electronics' THEN 'tech'
      WHEN p.category IN ('Clothing', 'Shoes', 'Accessories') THEN 'fashion'
      WHEN p.category IN ('Home', 'Garden', 'Furniture') THEN 'home'
      ELSE 'other'
    END as product_group

  FROM order_analytics oa
  CROSS JOIN JSON_TABLE(
    oa.items, '$[*]' COLUMNS (
      item JSON PATH '$'
    )
  ) AS items_table
  JOIN products p ON JSON_EXTRACT(items_table.item, '$.product_id') = p.product_id
  WHERE p.is_active = true
),

comprehensive_analytics AS (
  SELECT 
    -- Dimensional attributes
    cs.customer_segment,
    cs.geographic_segment,
    oa.order_month,
    oa.order_quarter,
    oa.season,
    oa.order_value_segment,
    pp.product_group,
    pp.category,
    pp.brand,

    -- Aggregated metrics using window functions
    COUNT(DISTINCT cs.customer_id) as unique_customers,
    COUNT(DISTINCT oa.order_id) as total_orders,
    COUNT(DISTINCT pp.product_id) as unique_products,

    -- Revenue metrics
    SUM(oa.order_total) as total_revenue,
    AVG(oa.order_total) as avg_order_value,
    SUM(pp.total_margin) as total_margin,

    -- Customer metrics with window functions
    AVG(SUM(oa.order_total)) OVER (
      PARTITION BY cs.customer_id
    ) as avg_customer_monthly_spend,

    -- Product performance with rankings
    RANK() OVER (
      PARTITION BY oa.order_month
      ORDER BY SUM(pp.total_margin) DESC
    ) as product_margin_rank,

    -- Time-based analysis
    COUNT(*) OVER (
      PARTITION BY cs.geographic_segment, oa.season
    ) as segment_seasonal_orders,

    -- Advanced statistical functions
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY oa.order_total) as median_order_value,
    STDDEV_POP(oa.order_total) as order_value_stddev,

    -- Cohort analysis elements
    MIN(oa.order_date) OVER (PARTITION BY cs.customer_id) as customer_first_order,
    MAX(oa.order_date) OVER (PARTITION BY cs.customer_id) as customer_last_order,

    -- Calculate customer lifetime metrics
    COUNT(*) OVER (PARTITION BY cs.customer_id) as customer_total_orders,
    SUM(oa.order_total) OVER (PARTITION BY cs.customer_id) as customer_lifetime_value

  FROM customer_segments cs
  JOIN order_analytics oa ON cs.customer_id = oa.customer_id
  JOIN product_performance pp ON oa.order_id = pp.order_id
),

final_analytics AS (
  SELECT 
    customer_segment,
    geographic_segment,
    order_quarter,
    season,
    product_group,
    category,
    brand,

    -- Core metrics
    unique_customers,
    total_orders,
    unique_products,
    ROUND(total_revenue, 2) as total_revenue,
    ROUND(avg_order_value, 2) as avg_order_value,
    ROUND(total_margin, 2) as total_margin,
    ROUND((total_margin / total_revenue) * 100, 1) as margin_percentage,

    -- Customer insights
    ROUND(avg_customer_monthly_spend, 2) as avg_customer_monthly_spend,
    ROUND(median_order_value, 2) as median_order_value,
    ROUND(order_value_stddev, 2) as order_value_stddev,

    -- Performance indicators
    CASE 
      WHEN product_margin_rank <= 10 THEN 'top_performer'
      WHEN product_margin_rank <= 50 THEN 'good_performer'
      ELSE 'average_performer'
    END as performance_tier,

    -- Customer behavior analysis
    AVG(EXTRACT(days FROM (customer_last_order - customer_first_order))) as avg_customer_lifespan_days,
    AVG(customer_total_orders) as avg_orders_per_customer,
    ROUND(AVG(customer_lifetime_value), 2) as avg_customer_lifetime_value,

    -- Growth analysis using LAG function
    LAG(total_revenue) OVER (
      PARTITION BY customer_segment, geographic_segment, product_group
      ORDER BY order_quarter
    ) as prev_quarter_revenue,

    -- Calculate growth rate
    ROUND(
      ((total_revenue - LAG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
      )) / NULLIF(LAG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
        ORDER BY order_quarter
      ), 0)) * 100,
      1
    ) as revenue_growth_pct,

    -- Market share analysis
    ROUND(
      (total_revenue / SUM(total_revenue) OVER (PARTITION BY order_quarter)) * 100,
      2
    ) as market_share_pct,

    -- Seasonal performance indexing
    ROUND(
      total_revenue / AVG(total_revenue) OVER (
        PARTITION BY customer_segment, geographic_segment, product_group
      ) * 100,
      1
    ) as seasonal_index

  FROM comprehensive_analytics
)

SELECT 
  -- Dimensional attributes
  customer_segment,
  geographic_segment,
  order_quarter,
  season,
  product_group,
  category,
  brand,

  -- Core metrics
  unique_customers,
  total_orders,
  unique_products,
  total_revenue,
  avg_order_value,
  total_margin,
  margin_percentage,

  -- Customer insights
  avg_customer_monthly_spend,
  median_order_value,
  order_value_stddev,
  avg_customer_lifespan_days,
  avg_orders_per_customer,
  avg_customer_lifetime_value,

  -- Performance classification
  performance_tier,

  -- Growth metrics
  prev_quarter_revenue,
  revenue_growth_pct,
  market_share_pct,
  seasonal_index,

  -- Business insights and recommendations
  CASE 
    WHEN revenue_growth_pct > 25 THEN 'high_growth_opportunity'
    WHEN revenue_growth_pct > 10 THEN 'steady_growth'
    WHEN revenue_growth_pct > 0 THEN 'slow_growth'
    WHEN revenue_growth_pct IS NULL THEN 'new_segment'
    ELSE 'declining_segment'
  END as growth_classification,

  CASE 
    WHEN margin_percentage > 30 AND revenue_growth_pct > 15 THEN 'invest_and_expand'
    WHEN margin_percentage > 30 AND revenue_growth_pct < 0 THEN 'optimize_and_retain'  
    WHEN margin_percentage < 15 AND revenue_growth_pct > 15 THEN 'improve_margins'
    WHEN margin_percentage < 15 AND revenue_growth_pct < 0 THEN 'consider_exit'
    ELSE 'monitor_and_optimize'
  END as strategic_recommendation,

  -- Key performance indicators
  CASE 
    WHEN avg_customer_lifetime_value > 1000 AND avg_orders_per_customer > 5 THEN 'high_value_loyal'
    WHEN avg_customer_lifetime_value > 500 THEN 'high_value'
    WHEN avg_orders_per_customer > 3 THEN 'loyal_customers'
    ELSE 'acquisition_focus'
  END as customer_strategy

FROM final_analytics
WHERE total_revenue > 1000  -- Filter for statistical significance
ORDER BY 
  total_revenue DESC,
  revenue_growth_pct DESC NULLS LAST,
  margin_percentage DESC
LIMIT 500;

-- Real-time product performance dashboard
CREATE VIEW real_time_product_performance AS
WITH hourly_product_metrics AS (
  SELECT 
    JSON_EXTRACT(item, '$.product_id') as product_id,
    DATE_TRUNC('hour', order_date) as hour_bucket,

    -- Sales metrics
    SUM(JSON_EXTRACT(item, '$.quantity')) as total_quantity_sold,
    SUM(JSON_EXTRACT(item, '$.total_price')) as total_revenue,
    COUNT(DISTINCT order_id) as unique_orders,
    COUNT(DISTINCT customer_id) as unique_customers,

    -- Pricing analysis
    AVG(JSON_EXTRACT(item, '$.unit_price')) as avg_selling_price,
    MAX(JSON_EXTRACT(item, '$.unit_price')) as max_selling_price,
    MIN(JSON_EXTRACT(item, '$.unit_price')) as min_selling_price,

    -- Performance indicators
    AVG(JSON_EXTRACT(financial, '$.total')) as avg_order_value,
    SUM(JSON_EXTRACT(item, '$.quantity')) / COUNT(DISTINCT order_id) as avg_quantity_per_order

  FROM orders o
  CROSS JOIN JSON_TABLE(
    o.items, '$[*]' COLUMNS (
      item JSON PATH '$'
    )
  ) AS items_unnested
  WHERE o.order_date >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND o.status IN ('completed', 'processing', 'shipped')
  GROUP BY 
    JSON_EXTRACT(item, '$.product_id'),
    DATE_TRUNC('hour', order_date)
),

product_rankings AS (
  SELECT 
    hpm.*,
    p.product_name,
    p.category,
    p.brand,
    p.cost_per_unit,

    -- Calculate margins
    (hpm.avg_selling_price - p.cost_per_unit) as unit_margin,
    ((hpm.avg_selling_price - p.cost_per_unit) * hpm.total_quantity_sold) as total_margin,

    -- Performance rankings using window functions
    RANK() OVER (ORDER BY total_revenue DESC) as revenue_rank,
    RANK() OVER (ORDER BY total_quantity_sold DESC) as quantity_rank,
    RANK() OVER (PARTITION BY p.category ORDER BY total_revenue DESC) as category_rank,

    -- Percentile rankings
    PERCENT_RANK() OVER (ORDER BY total_revenue) as revenue_percentile,
    PERCENT_RANK() OVER (ORDER BY total_quantity_sold) as quantity_percentile,

    -- Growth analysis (comparing to previous hour)
    LAG(total_revenue) OVER (
      PARTITION BY product_id 
      ORDER BY hour_bucket
    ) as prev_hour_revenue,

    LAG(total_quantity_sold) OVER (
      PARTITION BY product_id 
      ORDER BY hour_bucket
    ) as prev_hour_quantity

  FROM hourly_product_metrics hpm
  JOIN products p ON hpm.product_id = p.product_id
  WHERE p.is_active = true
)

SELECT 
  product_id,
  product_name,
  category,
  brand,
  hour_bucket,

  -- Sales performance
  total_quantity_sold,
  ROUND(total_revenue, 2) as total_revenue,
  unique_orders,
  unique_customers,

  -- Pricing metrics
  ROUND(avg_selling_price, 2) as avg_selling_price,
  max_selling_price,
  min_selling_price,
  ROUND(unit_margin, 2) as unit_margin,
  ROUND(total_margin, 2) as total_margin,
  ROUND((total_margin / total_revenue) * 100, 1) as margin_percentage,

  -- Performance rankings
  revenue_rank,
  quantity_rank,
  category_rank,
  ROUND(revenue_percentile * 100, 1) as revenue_percentile_score,

  -- Growth metrics
  ROUND(
    CASE 
      WHEN prev_hour_revenue > 0 THEN
        ((total_revenue - prev_hour_revenue) / prev_hour_revenue) * 100
      ELSE NULL
    END,
    1
  ) as hourly_revenue_growth_pct,

  ROUND(
    CASE 
      WHEN prev_hour_quantity > 0 THEN
        ((total_quantity_sold - prev_hour_quantity) / prev_hour_quantity::DECIMAL) * 100
      ELSE NULL
    END,
    1
  ) as hourly_quantity_growth_pct,

  -- Customer metrics
  ROUND(avg_order_value, 2) as avg_order_value,
  ROUND(avg_quantity_per_order, 2) as avg_quantity_per_order,
  ROUND(total_revenue / unique_customers, 2) as revenue_per_customer,

  -- Performance classification
  CASE 
    WHEN revenue_rank <= 10 AND margin_percentage > 20 THEN 'star_performer'
    WHEN revenue_rank <= 50 AND margin_percentage > 15 THEN 'strong_performer'
    WHEN revenue_rank <= 100 THEN 'solid_performer'
    ELSE 'monitor_performance'
  END as performance_classification,

  -- Alert indicators
  CASE 
    WHEN hourly_revenue_growth_pct > 50 THEN 'trending_up'
    WHEN hourly_revenue_growth_pct < -30 THEN 'trending_down'
    WHEN revenue_rank <= 20 AND margin_percentage < 10 THEN 'margin_concern'
    ELSE 'normal'
  END as alert_status,

  -- Recommendations
  CASE 
    WHEN performance_classification = 'star_performer' THEN 'increase_inventory_and_marketing'
    WHEN alert_status = 'trending_down' THEN 'investigate_declining_performance'
    WHEN margin_percentage < 10 THEN 'review_pricing_strategy'
    WHEN revenue_rank > 100 THEN 'consider_promotion_or_discontinuation'
    ELSE 'maintain_current_strategy'
  END as recommendation

FROM product_rankings
WHERE hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
  hour_bucket DESC,
  revenue_rank ASC,
  margin_percentage DESC;

-- Advanced cohort analysis with SQL window functions
WITH customer_cohorts AS (
  SELECT 
    customer_id,
    DATE_TRUNC('month', MIN(order_date)) as cohort_month,
    MIN(order_date) as first_order_date,
    COUNT(*) as total_orders,
    SUM(JSON_EXTRACT(financial, '$.total')) as total_spent

  FROM orders
  WHERE status IN ('completed', 'delivered')
    AND order_date >= CURRENT_TIMESTAMP - INTERVAL '24 months'
  GROUP BY customer_id
),

cohort_periods AS (
  SELECT 
    cc.customer_id,
    cc.cohort_month,
    cc.first_order_date,
    cc.total_orders,
    cc.total_spent,

    o.order_date,
    o._id as order_id,
    JSON_EXTRACT(o.financial, '$.total') as order_value,

    -- Calculate periods since first order
    FLOOR(
      MONTHS_BETWEEN(DATE_TRUNC('month', o.order_date), cc.cohort_month)
    ) as periods_since_first_order,

    DATE_TRUNC('month', o.order_date) as order_month

  FROM customer_cohorts cc
  JOIN orders o ON cc.customer_id = o.customer_id
  WHERE o.status IN ('completed', 'delivered')
    AND o.order_date >= cc.first_order_date
),

cohort_analysis AS (
  SELECT 
    cohort_month,
    periods_since_first_order,
    order_month,

    -- Cohort metrics
    COUNT(DISTINCT customer_id) as active_customers,
    COUNT(DISTINCT order_id) as total_orders,
    SUM(order_value) as total_revenue,
    AVG(order_value) as avg_order_value,

    -- Customer behavior
    AVG(total_orders) as avg_lifetime_orders,
    AVG(total_spent) as avg_lifetime_value,

    -- Period-specific insights
    COUNT(DISTINCT customer_id) / 
    FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
      PARTITION BY cohort_month 
      ORDER BY periods_since_first_order
      ROWS UNBOUNDED PRECEDING
    ) as retention_rate

  FROM cohort_periods
  GROUP BY cohort_month, periods_since_first_order, order_month
),

cohort_summary AS (
  SELECT 
    cohort_month,

    -- Cohort size (customers who made first purchase in this month)
    MAX(CASE WHEN periods_since_first_order = 0 THEN active_customers END) as cohort_size,

    -- Retention rates by period
    MAX(CASE WHEN periods_since_first_order = 1 THEN retention_rate END) as month_1_retention,
    MAX(CASE WHEN periods_since_first_order = 3 THEN retention_rate END) as month_3_retention,
    MAX(CASE WHEN periods_since_first_order = 6 THEN retention_rate END) as month_6_retention,
    MAX(CASE WHEN periods_since_first_order = 12 THEN retention_rate END) as month_12_retention,

    -- Revenue metrics
    SUM(total_revenue) as cohort_total_revenue,
    AVG(avg_lifetime_value) as avg_customer_ltv,

    -- Performance indicators
    MAX(periods_since_first_order) as max_observed_periods,
    AVG(avg_order_value) as cohort_avg_order_value

  FROM cohort_analysis
  GROUP BY cohort_month
)

SELECT 
  cohort_month,
  cohort_size,

  -- Retention analysis
  ROUND(month_1_retention * 100, 1) as month_1_retention_pct,
  ROUND(month_3_retention * 100, 1) as month_3_retention_pct,
  ROUND(month_6_retention * 100, 1) as month_6_retention_pct,
  ROUND(month_12_retention * 100, 1) as month_12_retention_pct,

  -- Financial metrics
  ROUND(cohort_total_revenue, 2) as cohort_total_revenue,
  ROUND(avg_customer_ltv, 2) as avg_customer_ltv,
  ROUND(cohort_avg_order_value, 2) as avg_order_value,
  ROUND(cohort_total_revenue / cohort_size, 2) as revenue_per_customer,

  -- Cohort performance classification
  CASE 
    WHEN month_3_retention >= 0.4 AND avg_customer_ltv >= 500 THEN 'excellent'
    WHEN month_3_retention >= 0.3 AND avg_customer_ltv >= 300 THEN 'good'
    WHEN month_3_retention >= 0.2 OR avg_customer_ltv >= 200 THEN 'fair'
    ELSE 'poor'
  END as cohort_quality,

  -- Growth trend analysis
  ROUND(
    (month_6_retention - month_1_retention) * 100,
    1
  ) as retention_trend,

  -- Business insights
  CASE 
    WHEN month_1_retention < 0.2 THEN 'improve_onboarding'
    WHEN month_12_retention < 0.1 THEN 'enhance_loyalty_program'
    WHEN avg_customer_ltv < 100 THEN 'increase_customer_value'
    ELSE 'maintain_performance'
  END as recommendation,

  max_observed_periods

FROM cohort_summary
WHERE cohort_size >= 10  -- Filter for statistical significance
ORDER BY cohort_month DESC;

-- QueryLeaf provides comprehensive aggregation capabilities:
-- 1. SQL-familiar syntax for complex MongoDB aggregation pipelines
-- 2. Advanced analytics with CTEs, window functions, and statistical operations
-- 3. Real-time processing with familiar SQL patterns and aggregation functions  
-- 4. Complex customer segmentation and behavioral analysis using SQL constructs
-- 5. Product performance analytics with rankings and growth calculations
-- 6. Cohort analysis with retention rates and lifetime value calculations
-- 7. Integration with MongoDB's native aggregation optimizations
-- 8. Familiar SQL data types, functions, and expression syntax
-- 9. Advanced time-series analysis and trend detection capabilities
-- 10. Enterprise-ready analytics with performance optimization and scalability

Best Practices for Aggregation Framework Implementation

Performance Optimization and Pipeline Design

Essential strategies for effective MongoDB Aggregation Framework usage:

Early Stage Filtering: Place $match stages as early as possible to reduce data processing volume
Index Utilization: Design compound indexes that support aggregation pipeline operations
Memory Management: Use allowDiskUse: true for large aggregations and monitor memory usage
Pipeline Ordering: Arrange stages to minimize data movement and intermediate result sizes
Expression Optimization: Use efficient expressions and avoid complex nested operations when possible
Result Set Limiting: Apply $limit stages strategically to control output size

Enterprise Analytics Architecture

Design scalable aggregation systems for production deployments:

Distributed Processing: Leverage MongoDB's sharding to distribute aggregation workloads
Caching Strategies: Implement result caching for frequently accessed aggregations
Real-time Processing: Combine aggregation pipelines with Change Streams for live analytics
Incremental Updates: Design incremental aggregation patterns for large, frequently updated datasets
Performance Monitoring: Track aggregation performance and optimize based on usage patterns
Resource Planning: Size clusters appropriately for expected aggregation workloads and data volumes

Conclusion

MongoDB's Aggregation Framework provides comprehensive data processing capabilities that eliminate the complexity and performance limitations of traditional SQL analytics approaches through optimized single-pass processing, native document transformations, and distributed execution capabilities. The rich expression language and extensive operator library enable sophisticated analytics while maintaining high performance and operational simplicity.

Key MongoDB Aggregation Framework benefits include:

Unified Processing: Single-pass analytics without expensive JOINs or multiple query rounds
Rich Expressions: Comprehensive mathematical, statistical, and analytical operations
Document-Native: Native handling of nested documents, arrays, and complex data structures
Performance Optimization: Automatic query optimization with index utilization and parallel processing
Horizontal Scaling: Distributed aggregation processing across sharded MongoDB clusters
Real-time Capabilities: Integration with Change Streams for live analytical processing

Whether you're building business intelligence platforms, real-time analytics systems, customer segmentation tools, or complex reporting solutions, MongoDB's Aggregation Framework with QueryLeaf's familiar SQL interface provides the foundation for powerful, scalable, and maintainable analytical processing.

QueryLeaf Integration: QueryLeaf seamlessly translates SQL analytics queries into optimized MongoDB aggregation pipelines while providing familiar SQL syntax for complex analytics, statistical functions, and reporting operations. Advanced aggregation patterns including cohort analysis, customer segmentation, and real-time analytics are elegantly handled through familiar SQL constructs, making sophisticated data processing both powerful and accessible to SQL-oriented analytics teams.

The combination of MongoDB's robust aggregation capabilities with SQL-style analytical operations makes it an ideal platform for applications requiring both advanced analytics functionality and familiar database interaction patterns, ensuring your analytical infrastructure can deliver insights efficiently while maintaining developer productivity and operational excellence.

November 8, 2025
30 min read

MongoDB Time Series Collections for IoT Data Management: Real-Time Analytics and High-Performance Data Processing

Modern IoT applications generate massive volumes of time-stamped sensor data that require specialized storage and processing capabilities to handle millions of data points per second while enabling real-time analytics and efficient historical data queries. Traditional database approaches struggle with the scale, write-heavy workloads, and time-based query patterns characteristic of IoT systems, often requiring complex partitioning schemes, multiple storage tiers, and custom optimization strategies that increase operational complexity and development overhead.

MongoDB Time Series Collections provide purpose-built storage optimization for time-stamped data with automatic bucketing, compression, and query optimization specifically designed for IoT workloads. Unlike traditional approaches that require manual time-based partitioning and complex indexing strategies, Time Series Collections automatically organize data by time ranges, apply intelligent compression, and optimize queries for time-based access patterns while maintaining MongoDB's flexible document model and powerful aggregation capabilities.

The Traditional IoT Data Storage Challenge

Conventional approaches to storing and processing IoT time series data face significant scalability and performance limitations:

-- Traditional PostgreSQL time series approach - complex partitioning and limited scalability

-- IoT sensor data with traditional table design
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL PRIMARY KEY,
    device_id VARCHAR(100) NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    location VARCHAR(200),

    -- Time series data
    timestamp TIMESTAMPTZ NOT NULL,
    value DECIMAL(15,6) NOT NULL,
    unit VARCHAR(20),
    quality_score DECIMAL(3,2) DEFAULT 1.0,

    -- Device and context metadata
    device_metadata JSONB,
    environmental_conditions JSONB,

    -- Data processing flags
    processed BOOLEAN DEFAULT FALSE,
    anomaly_detected BOOLEAN DEFAULT FALSE,
    data_source VARCHAR(100),

    -- Partitioning helper columns
    year_month INTEGER GENERATED ALWAYS AS (EXTRACT(YEAR FROM timestamp) * 100 + EXTRACT(MONTH FROM timestamp)) STORED,
    date_partition DATE GENERATED ALWAYS AS (DATE(timestamp)) STORED
);

-- Complex time-based partitioning (manual maintenance required)
CREATE TABLE sensor_readings_2024_01 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE sensor_readings_2024_02 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE sensor_readings_2024_03 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Additional partitions must be created manually each month
-- Automation required to prevent partition overflow

-- Indexing strategy for time series queries (expensive maintenance)
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, timestamp DESC);
CREATE INDEX idx_sensor_readings_sensor_type_time ON sensor_readings (sensor_type, timestamp DESC);
CREATE INDEX idx_sensor_readings_location_time ON sensor_readings (location, timestamp DESC);
CREATE INDEX idx_sensor_readings_timestamp_only ON sensor_readings (timestamp DESC);
CREATE INDEX idx_sensor_readings_processed_flag ON sensor_readings (processed, timestamp DESC);

-- Additional indexes for different query patterns
CREATE INDEX idx_sensor_readings_anomaly_time ON sensor_readings (anomaly_detected, timestamp DESC) WHERE anomaly_detected = TRUE;
CREATE INDEX idx_sensor_readings_device_type_time ON sensor_readings (device_id, sensor_type, timestamp DESC);

-- Materialized view for real-time aggregations (complex maintenance)
CREATE MATERIALIZED VIEW sensor_readings_hourly_summary AS
WITH hourly_aggregations AS (
    SELECT 
        device_id,
        sensor_type,
        location,
        DATE_TRUNC('hour', timestamp) as hour_bucket,

        -- Statistical aggregations
        COUNT(*) as reading_count,
        AVG(value) as avg_value,
        MIN(value) as min_value,
        MAX(value) as max_value,
        STDDEV(value) as stddev_value,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median_value,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95_value,

        -- Data quality metrics
        AVG(quality_score) as avg_quality,
        COUNT(*) FILTER (WHERE quality_score < 0.8) as low_quality_readings,
        COUNT(*) FILTER (WHERE anomaly_detected = true) as anomaly_count,

        -- Value change analysis
        (MAX(value) - MIN(value)) as value_range,
        CASE 
            WHEN COUNT(*) > 1 THEN
                (LAST_VALUE(value) OVER (ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 
                 FIRST_VALUE(value) OVER (ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
            ELSE 0
        END as value_change_in_hour,

        -- Processing statistics
        COUNT(*) FILTER (WHERE processed = true) as processed_readings,
        (COUNT(*) FILTER (WHERE processed = true)::DECIMAL / COUNT(*)) * 100 as processing_rate_percent,

        -- Time coverage analysis
        (EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 3600) as time_coverage_hours,
        COUNT(*)::DECIMAL / (EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 60) as readings_per_minute

    FROM sensor_readings
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY device_id, sensor_type, location, DATE_TRUNC('hour', timestamp)
)
SELECT 
    ha.*,

    -- Additional calculated metrics
    CASE 
        WHEN ha.reading_count < 50 THEN 'sparse'
        WHEN ha.reading_count < 200 THEN 'normal'
        WHEN ha.reading_count < 500 THEN 'dense'
        ELSE 'very_dense'
    END as data_density_category,

    CASE 
        WHEN ha.avg_quality >= 0.95 THEN 'excellent'
        WHEN ha.avg_quality >= 0.8 THEN 'good'
        WHEN ha.avg_quality >= 0.6 THEN 'fair'
        ELSE 'poor'
    END as quality_category,

    -- Anomaly rate analysis
    CASE 
        WHEN ha.anomaly_count = 0 THEN 'normal'
        WHEN (ha.anomaly_count::DECIMAL / ha.reading_count) < 0.01 THEN 'low_anomalies'
        WHEN (ha.anomaly_count::DECIMAL / ha.reading_count) < 0.05 THEN 'moderate_anomalies'
        ELSE 'high_anomalies'
    END as anomaly_level,

    -- Performance indicators
    CASE 
        WHEN ha.readings_per_minute >= 10 THEN 'high_frequency'
        WHEN ha.readings_per_minute >= 1 THEN 'medium_frequency'
        WHEN ha.readings_per_minute >= 0.1 THEN 'low_frequency'
        ELSE 'very_low_frequency'
    END as sampling_frequency_category

FROM hourly_aggregations ha;

-- Must be refreshed periodically (expensive operation)
CREATE UNIQUE INDEX idx_sensor_hourly_unique ON sensor_readings_hourly_summary (device_id, sensor_type, location, hour_bucket);

-- Complex query for real-time analytics (resource-intensive)
WITH device_performance AS (
    SELECT 
        sr.device_id,
        sr.sensor_type,
        sr.location,
        DATE_TRUNC('minute', sr.timestamp) as minute_bucket,

        -- Real-time aggregations (expensive on large datasets)
        COUNT(*) as readings_per_minute,
        AVG(sr.value) as avg_value,
        STDDEV(sr.value) as value_stability,

        -- Change detection (requires window functions)
        LAG(AVG(sr.value)) OVER (
            PARTITION BY sr.device_id, sr.sensor_type 
            ORDER BY DATE_TRUNC('minute', sr.timestamp)
        ) as prev_minute_avg,

        -- Quality assessment
        AVG(sr.quality_score) as avg_quality,
        COUNT(*) FILTER (WHERE sr.anomaly_detected) as anomaly_count,

        -- Processing lag calculation
        AVG(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - sr.timestamp)) as avg_processing_lag_seconds

    FROM sensor_readings sr
    WHERE 
        sr.timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
        AND sr.processed = true
    GROUP BY sr.device_id, sr.sensor_type, sr.location, DATE_TRUNC('minute', sr.timestamp)
),

real_time_alerts AS (
    SELECT 
        dp.*,

        -- Alert conditions
        CASE 
            WHEN ABS(dp.avg_value - dp.prev_minute_avg) > (dp.value_stability * 3) THEN 'value_spike'
            WHEN dp.avg_quality < 0.7 THEN 'quality_degradation'
            WHEN dp.anomaly_count > 0 THEN 'anomalies_detected'
            WHEN dp.avg_processing_lag_seconds > 300 THEN 'processing_delay'
            WHEN dp.readings_per_minute < 0.5 THEN 'data_gap'
            ELSE 'normal'
        END as alert_type,

        -- Severity calculation
        CASE 
            WHEN dp.anomaly_count > 10 OR dp.avg_quality < 0.5 THEN 'critical'
            WHEN dp.anomaly_count > 5 OR dp.avg_quality < 0.7 OR dp.avg_processing_lag_seconds > 600 THEN 'high'
            WHEN dp.anomaly_count > 2 OR dp.avg_quality < 0.8 OR dp.avg_processing_lag_seconds > 300 THEN 'medium'
            ELSE 'low'
        END as alert_severity,

        -- Performance assessment
        CASE 
            WHEN dp.readings_per_minute >= 30 AND dp.avg_quality >= 0.9 THEN 'optimal'
            WHEN dp.readings_per_minute >= 10 AND dp.avg_quality >= 0.8 THEN 'good'
            WHEN dp.readings_per_minute >= 1 AND dp.avg_quality >= 0.6 THEN 'acceptable'
            ELSE 'poor'
        END as performance_status

    FROM device_performance dp
    WHERE dp.minute_bucket >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
),

device_health_summary AS (
    SELECT 
        rta.device_id,
        COUNT(*) as total_minutes_analyzed,

        -- Health metrics
        COUNT(*) FILTER (WHERE rta.alert_type != 'normal') as minutes_with_alerts,
        COUNT(*) FILTER (WHERE rta.alert_severity IN ('critical', 'high')) as critical_minutes,
        COUNT(*) FILTER (WHERE rta.performance_status IN ('optimal', 'good')) as good_performance_minutes,

        -- Overall device status
        AVG(rta.avg_quality) as overall_quality,
        AVG(rta.readings_per_minute) as avg_data_rate,
        SUM(rta.anomaly_count) as total_anomalies,

        -- Most recent status
        LAST_VALUE(rta.performance_status) OVER (
            PARTITION BY rta.device_id 
            ORDER BY rta.minute_bucket 
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) as current_status,

        LAST_VALUE(rta.alert_type) OVER (
            PARTITION BY rta.device_id 
            ORDER BY rta.minute_bucket 
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) as current_alert_type

    FROM real_time_alerts rta
    GROUP BY rta.device_id
)

-- Final real-time dashboard query
SELECT 
    dhs.device_id,
    dhs.current_status,
    dhs.current_alert_type,

    -- Health indicators
    ROUND(dhs.overall_quality, 3) as quality_score,
    ROUND(dhs.avg_data_rate, 1) as data_rate_per_minute,
    dhs.total_anomalies,

    -- Alert summary
    dhs.minutes_with_alerts,
    dhs.critical_minutes,
    dhs.good_performance_minutes,

    -- Performance assessment
    ROUND((dhs.good_performance_minutes::DECIMAL / dhs.total_minutes_analyzed) * 100, 1) as uptime_percentage,
    ROUND((dhs.minutes_with_alerts::DECIMAL / dhs.total_minutes_analyzed) * 100, 1) as alert_percentage,

    -- Device health classification
    CASE 
        WHEN dhs.critical_minutes > 2 OR dhs.overall_quality < 0.6 THEN 'unhealthy'
        WHEN dhs.minutes_with_alerts > 5 OR dhs.overall_quality < 0.8 THEN 'degraded'
        WHEN dhs.good_performance_minutes >= (dhs.total_minutes_analyzed * 0.8) THEN 'healthy'
        ELSE 'monitoring'
    END as device_health_status,

    -- Recommendations
    CASE 
        WHEN dhs.total_anomalies > 20 THEN 'investigate_sensor_calibration'
        WHEN dhs.avg_data_rate < 1 THEN 'check_connectivity'
        WHEN dhs.overall_quality < 0.7 THEN 'review_sensor_maintenance'
        WHEN dhs.critical_minutes > 0 THEN 'immediate_attention_required'
        ELSE 'operating_normally'
    END as recommended_action

FROM device_health_summary dhs
ORDER BY 
    CASE dhs.current_status
        WHEN 'poor' THEN 1
        WHEN 'acceptable' THEN 2
        WHEN 'good' THEN 3
        WHEN 'optimal' THEN 4
    END,
    dhs.critical_minutes DESC,
    dhs.total_anomalies DESC;

-- Traditional time series problems:
-- 1. Complex manual partitioning and maintenance overhead
-- 2. Expensive materialized view refreshes for real-time analytics
-- 3. Limited compression and storage optimization for time series data
-- 4. Complex indexing strategies with high maintenance costs
-- 5. Poor write performance under high-volume IoT workloads
-- 6. Difficult horizontal scaling for time series data
-- 7. Limited time-based query optimization
-- 8. Complex time window and rollup aggregations
-- 9. Expensive historical data archiving and cleanup operations
-- 10. No built-in time series specific features and optimizations

MongoDB Time Series Collections provide comprehensive IoT data management with automatic optimization and intelligent compression:

// MongoDB Time Series Collections - Optimized IoT data storage and analytics
const { MongoClient, ObjectId } = require('mongodb');

// Comprehensive IoT Time Series Data Manager
class IoTTimeSeriesManager {
  constructor(connectionString, iotConfig = {}) {
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    this.config = {
      // Time series configuration
      defaultGranularity: iotConfig.defaultGranularity || 'seconds',
      enableAutomaticIndexing: iotConfig.enableAutomaticIndexing !== false,
      enableCompression: iotConfig.enableCompression !== false,

      // IoT-specific features
      enableRealTimeAlerts: iotConfig.enableRealTimeAlerts !== false,
      enableAnomalyDetection: iotConfig.enableAnomalyDetection !== false,
      enablePredictiveAnalytics: iotConfig.enablePredictiveAnalytics !== false,
      enableDataQualityMonitoring: iotConfig.enableDataQualityMonitoring !== false,

      // Performance optimization
      batchWriteSize: iotConfig.batchWriteSize || 1000,
      writeConcern: iotConfig.writeConcern || { w: 1, j: true },
      readPreference: iotConfig.readPreference || 'primaryPreferred',
      maxConnectionPoolSize: iotConfig.maxConnectionPoolSize || 100,

      // Data retention and archival
      enableDataLifecycleManagement: iotConfig.enableDataLifecycleManagement !== false,
      defaultRetentionDays: iotConfig.defaultRetentionDays || 365,
      enableAutomaticArchiving: iotConfig.enableAutomaticArchiving !== false,

      // Analytics and processing
      enableStreamProcessing: iotConfig.enableStreamProcessing !== false,
      enableRealTimeAggregation: iotConfig.enableRealTimeAggregation !== false,
      aggregationWindowSize: iotConfig.aggregationWindowSize || '1 minute',

      ...iotConfig
    };

    // Time series collections for different data types
    this.timeSeriesCollections = new Map();
    this.aggregationCollections = new Map();
    this.alertCollections = new Map();

    // Real-time processing components
    this.changeStreams = new Map();
    this.processingPipelines = new Map();
    this.alertRules = new Map();

    // Performance metrics
    this.performanceMetrics = {
      totalDataPoints: 0,
      writeOperationsPerSecond: 0,
      queryOperationsPerSecond: 0,
      averageLatency: 0,
      compressionRatio: 0,
      alertsTriggered: 0
    };
  }

  async initializeIoTTimeSeriesSystem() {
    console.log('Initializing MongoDB IoT Time Series system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString, {
        maxPoolSize: this.config.maxConnectionPoolSize,
        writeConcern: this.config.writeConcern,
        readPreference: this.config.readPreference
      });

      await this.client.connect();
      this.db = this.client.db();

      // Create time series collections for different sensor types
      await this.createTimeSeriesCollections();

      // Setup real-time processing pipelines
      if (this.config.enableStreamProcessing) {
        await this.setupStreamProcessing();
      }

      // Initialize real-time aggregations
      if (this.config.enableRealTimeAggregation) {
        await this.setupRealTimeAggregations();
      }

      // Setup anomaly detection
      if (this.config.enableAnomalyDetection) {
        await this.setupAnomalyDetection();
      }

      // Initialize data lifecycle management
      if (this.config.enableDataLifecycleManagement) {
        await this.setupDataLifecycleManagement();
      }

      console.log('IoT Time Series system initialized successfully');

    } catch (error) {
      console.error('Error initializing IoT Time Series system:', error);
      throw error;
    }
  }

  async createTimeSeriesCollections() {
    console.log('Creating optimized time series collections...');

    // Sensor readings time series collection with automatic bucketing
    await this.createOptimizedTimeSeriesCollection('sensor_readings', {
      timeField: 'timestamp',
      metaField: 'device',
      granularity: this.config.defaultGranularity,
      bucketMaxSpanSeconds: 3600, // 1 hour buckets
      bucketRoundingSeconds: 60,  // Round to nearest minute

      // Optimize for IoT data patterns
      expireAfterSeconds: this.config.defaultRetentionDays * 24 * 60 * 60,

      // Index optimization for common IoT queries
      additionalIndexes: [
        { 'device.id': 1, 'timestamp': -1 },
        { 'device.type': 1, 'timestamp': -1 },
        { 'device.location': 1, 'timestamp': -1 },
        { 'sensor.type': 1, 'timestamp': -1 }
      ]
    });

    // Environmental monitoring time series
    await this.createOptimizedTimeSeriesCollection('environmental_data', {
      timeField: 'timestamp',
      metaField: 'location',
      granularity: 'minutes',
      bucketMaxSpanSeconds: 7200, // 2 hour buckets for slower changing data

      additionalIndexes: [
        { 'location.facility': 1, 'location.zone': 1, 'timestamp': -1 },
        { 'sensor_type': 1, 'timestamp': -1 }
      ]
    });

    // Equipment performance monitoring
    await this.createOptimizedTimeSeriesCollection('equipment_metrics', {
      timeField: 'timestamp',
      metaField: 'equipment',
      granularity: 'seconds',
      bucketMaxSpanSeconds: 1800, // 30 minute buckets for high-frequency data

      additionalIndexes: [
        { 'equipment.id': 1, 'equipment.type': 1, 'timestamp': -1 },
        { 'metric_type': 1, 'timestamp': -1 }
      ]
    });

    // Energy consumption tracking
    await this.createOptimizedTimeSeriesCollection('energy_consumption', {
      timeField: 'timestamp',
      metaField: 'meter',
      granularity: 'minutes',
      bucketMaxSpanSeconds: 3600, // 1 hour buckets

      additionalIndexes: [
        { 'meter.id': 1, 'timestamp': -1 },
        { 'meter.building': 1, 'meter.floor': 1, 'timestamp': -1 }
      ]
    });

    // Vehicle telemetry data
    await this.createOptimizedTimeSeriesCollection('vehicle_telemetry', {
      timeField: 'timestamp',
      metaField: 'vehicle',
      granularity: 'seconds',
      bucketMaxSpanSeconds: 900, // 15 minute buckets for mobile data

      additionalIndexes: [
        { 'vehicle.id': 1, 'timestamp': -1 },
        { 'vehicle.route': 1, 'timestamp': -1 },
        { 'telemetry_type': 1, 'timestamp': -1 }
      ]
    });

    console.log('Time series collections created successfully');
  }

  async createOptimizedTimeSeriesCollection(collectionName, config) {
    console.log(`Creating time series collection: ${collectionName}`);

    try {
      // Create time series collection with MongoDB's native optimization
      const collection = await this.db.createCollection(collectionName, {
        timeseries: {
          timeField: config.timeField,
          metaField: config.metaField,
          granularity: config.granularity,
          bucketMaxSpanSeconds: config.bucketMaxSpanSeconds,
          bucketRoundingSeconds: config.bucketRoundingSeconds || 60
        },

        // Set TTL for automatic data expiration
        ...(config.expireAfterSeconds && {
          expireAfterSeconds: config.expireAfterSeconds
        }),

        // Enable compression for storage optimization
        storageEngine: {
          wiredTiger: {
            configString: 'block_compressor=zstd'
          }
        }
      });

      // Create additional indexes for query optimization
      if (config.additionalIndexes) {
        await collection.createIndexes(
          config.additionalIndexes.map(indexSpec => ({
            key: indexSpec,
            background: true,
            name: `idx_${Object.keys(indexSpec).join('_')}`
          }))
        );
      }

      // Store collection reference and configuration
      this.timeSeriesCollections.set(collectionName, {
        collection: collection,
        config: config,
        stats: {
          documentsInserted: 0,
          bytesStored: 0,
          compressionRatio: 0,
          lastInsertTime: null
        }
      });

      console.log(`Time series collection ${collectionName} created successfully`);

    } catch (error) {
      console.error(`Error creating time series collection ${collectionName}:`, error);
      throw error;
    }
  }

  async insertSensorData(collectionName, sensorDataPoints) {
    const startTime = Date.now();

    try {
      const collectionInfo = this.timeSeriesCollections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection ${collectionName} not found`);
      }

      const collection = collectionInfo.collection;

      // Prepare data points with enhanced metadata
      const enhancedDataPoints = sensorDataPoints.map(dataPoint => ({
        // Time series fields
        timestamp: dataPoint.timestamp || new Date(),

        // Device/sensor metadata (automatically indexed as metaField)
        device: {
          id: dataPoint.deviceId,
          type: dataPoint.deviceType || 'generic_sensor',
          location: {
            facility: dataPoint.facility || 'unknown',
            zone: dataPoint.zone || 'default',
            coordinates: dataPoint.coordinates || null,
            floor: dataPoint.floor || null,
            room: dataPoint.room || null
          },
          firmware: dataPoint.firmwareVersion || null,
          manufacturer: dataPoint.manufacturer || null,
          model: dataPoint.model || null,
          installDate: dataPoint.installDate || null
        },

        // Sensor information
        sensor: {
          type: dataPoint.sensorType,
          unit: dataPoint.unit || null,
          precision: dataPoint.precision || null,
          calibrationDate: dataPoint.calibrationDate || null,
          maintenanceSchedule: dataPoint.maintenanceSchedule || null
        },

        // Measurement data
        value: dataPoint.value,
        rawValue: dataPoint.rawValue || dataPoint.value,

        // Data quality indicators
        quality: {
          score: dataPoint.qualityScore || 1.0,
          flags: dataPoint.qualityFlags || [],
          confidence: dataPoint.confidence || 1.0,
          calibrationStatus: dataPoint.calibrationStatus || 'valid',
          sensorHealth: dataPoint.sensorHealth || 'healthy'
        },

        // Environmental context
        environmentalConditions: {
          temperature: dataPoint.ambientTemperature || null,
          humidity: dataPoint.ambientHumidity || null,
          pressure: dataPoint.atmosphericPressure || null,
          vibration: dataPoint.vibrationLevel || null,
          electricalNoise: dataPoint.electricalNoise || null
        },

        // Processing metadata
        processing: {
          receivedAt: new Date(),
          source: dataPoint.dataSource || 'direct',
          protocol: dataPoint.protocol || 'unknown',
          gateway: dataPoint.gatewayId || null,
          processingLatency: dataPoint.processingLatency || null,
          networkLatency: dataPoint.networkLatency || null
        },

        // Alert and anomaly flags
        alerts: {
          anomalyDetected: dataPoint.anomalyDetected || false,
          thresholdViolation: dataPoint.thresholdViolation || null,
          alertLevel: dataPoint.alertLevel || 'normal',
          alertReason: dataPoint.alertReason || null
        },

        // Business context
        businessContext: {
          assetId: dataPoint.assetId || null,
          processId: dataPoint.processId || null,
          operationalMode: dataPoint.operationalMode || 'normal',
          shiftId: dataPoint.shiftId || null,
          operatorId: dataPoint.operatorId || null
        },

        // Additional custom metadata
        customMetadata: dataPoint.customMetadata || {}
      }));

      // Perform batch insert with write concern
      const result = await collection.insertMany(enhancedDataPoints, {
        writeConcern: this.config.writeConcern,
        ordered: false // Allow partial success for better performance
      });

      const insertTime = Date.now() - startTime;

      // Update collection statistics
      collectionInfo.stats.documentsInserted += result.insertedCount;
      collectionInfo.stats.lastInsertTime = new Date();

      // Update performance metrics
      this.updatePerformanceMetrics('insert', result.insertedCount, insertTime);

      // Trigger real-time processing if enabled
      if (this.config.enableStreamProcessing) {
        await this.processRealTimeData(collectionName, enhancedDataPoints);
      }

      // Check for alerts if enabled
      if (this.config.enableRealTimeAlerts) {
        await this.checkAlertConditions(collectionName, enhancedDataPoints);
      }

      console.log(`Inserted ${result.insertedCount} sensor data points into ${collectionName} in ${insertTime}ms`);

      return {
        success: true,
        collection: collectionName,
        insertedCount: result.insertedCount,
        insertTime: insertTime,
        insertedIds: result.insertedIds
      };

    } catch (error) {
      console.error(`Error inserting sensor data into ${collectionName}:`, error);
      return {
        success: false,
        collection: collectionName,
        error: error.message,
        insertTime: Date.now() - startTime
      };
    }
  }

  async queryTimeSeriesData(collectionName, query) {
    const startTime = Date.now();

    try {
      const collectionInfo = this.timeSeriesCollections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection ${collectionName} not found`);
      }

      const collection = collectionInfo.collection;

      // Build comprehensive aggregation pipeline for time series analysis
      const pipeline = [
        // Time range filtering (optimized for time series collections)
        {
          $match: {
            timestamp: {
              $gte: query.startTime,
              $lte: query.endTime || new Date()
            },
            ...(query.deviceIds && { 'device.id': { $in: query.deviceIds } }),
            ...(query.deviceTypes && { 'device.type': { $in: query.deviceTypes } }),
            ...(query.sensorTypes && { 'sensor.type': { $in: query.sensorTypes } }),
            ...(query.locations && { 'device.location.facility': { $in: query.locations } }),
            ...(query.minQualityScore && { 'quality.score': { $gte: query.minQualityScore } }),
            ...(query.alertLevel && { 'alerts.alertLevel': query.alertLevel })
          }
        },

        // Time-based grouping and aggregation
        {
          $group: {
            _id: {
              deviceId: '$device.id',
              deviceType: '$device.type',
              sensorType: '$sensor.type',
              location: '$device.location',

              // Time bucketing based on query granularity
              timeBucket: query.granularity === 'hour' 
                ? { $dateTrunc: { date: '$timestamp', unit: 'hour' } }
                : query.granularity === 'minute'
                ? { $dateTrunc: { date: '$timestamp', unit: 'minute' } }
                : query.granularity === 'day'
                ? { $dateTrunc: { date: '$timestamp', unit: 'day' } }
                : '$timestamp' // Raw timestamp for second-level granularity
            },

            // Statistical aggregations
            count: { $sum: 1 },
            avgValue: { $avg: '$value' },
            minValue: { $min: '$value' },
            maxValue: { $max: '$value' },
            sumValue: { $sum: '$value' },

            // Advanced statistical measures
            stdDevValue: { $stdDevPop: '$value' },
            varianceValue: { $pow: [{ $stdDevPop: '$value' }, 2] },

            // Percentile calculations using $percentile (MongoDB 7.0+)
            percentiles: {
              $percentile: {
                input: '$value',
                p: [0.25, 0.5, 0.75, 0.9, 0.95, 0.99],
                method: 'approximate'
              }
            },

            // Data quality metrics
            avgQualityScore: { $avg: '$quality.score' },
            minQualityScore: { $min: '$quality.score' },
            lowQualityCount: {
              $sum: { $cond: [{ $lt: ['$quality.score', 0.8] }, 1, 0] }
            },

            // Alert and anomaly statistics
            anomalyCount: {
              $sum: { $cond: ['$alerts.anomalyDetected', 1, 0] }
            },
            alertCounts: {
              $push: {
                $cond: [
                  { $ne: ['$alerts.alertLevel', 'normal'] },
                  '$alerts.alertLevel',
                  '$$REMOVE'
                ]
              }
            },

            // Time-based metrics
            firstReading: { $min: '$timestamp' },
            lastReading: { $max: '$timestamp' },

            // Value change analysis
            valueRange: { $subtract: [{ $max: '$value' }, { $min: '$value' }] },

            // Environmental conditions (if available)
            avgAmbientTemp: { $avg: '$environmentalConditions.temperature' },
            avgAmbientHumidity: { $avg: '$environmentalConditions.humidity' },

            // Processing performance
            avgProcessingLatency: { $avg: '$processing.processingLatency' },
            maxProcessingLatency: { $max: '$processing.processingLatency' },

            // Raw data points (if requested for detailed analysis)
            ...(query.includeRawData && {
              rawDataPoints: {
                $push: {
                  timestamp: '$timestamp',
                  value: '$value',
                  quality: '$quality.score',
                  anomaly: '$alerts.anomalyDetected'
                }
              }
            })
          }
        },

        // Calculate additional derived metrics
        {
          $addFields: {
            // Time coverage and sampling rate analysis
            timeCoverageSeconds: {
              $divide: [
                { $subtract: ['$lastReading', '$firstReading'] },
                1000
              ]
            },

            // Data completeness analysis
            expectedReadings: {
              $cond: [
                { $eq: [query.granularity, 'minute'] },
                { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 60000] },
                { $cond: [
                  { $eq: [query.granularity, 'hour'] },
                  { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 3600000] },
                  '$count'
                ]}
              ]
            },

            // Statistical analysis
            coefficientOfVariation: {
              $cond: [
                { $ne: ['$avgValue', 0] },
                { $divide: ['$stdDevValue', '$avgValue'] },
                0
              ]
            },

            // Data quality percentage
            qualityPercentage: {
              $multiply: [
                { $divide: [
                  { $subtract: ['$count', '$lowQualityCount'] },
                  '$count'
                ]},
                100
              ]
            },

            // Anomaly rate
            anomalyRate: {
              $multiply: [
                { $divide: ['$anomalyCount', '$count'] },
                100
              ]
            },

            // Alert distribution
            alertDistribution: {
              $reduce: {
                input: '$alertCounts',
                initialValue: {},
                in: {
                  $mergeObjects: [
                    '$$value',
                    { ['$$this']: { $add: [{ $ifNull: [{ $getField: { field: '$$this', input: '$$value' } }, 0] }, 1] } }
                  ]
                }
              }
            },

            // Performance classification
            performanceCategory: {
              $switch: {
                branches: [
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 95] },
                        { $lt: ['$anomalyRate', 1] },
                        { $lte: ['$avgProcessingLatency', 100] }
                      ]
                    }, 
                    then: 'excellent' 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 85] },
                        { $lt: ['$anomalyRate', 5] },
                        { $lte: ['$avgProcessingLatency', 300] }
                      ]
                    }, 
                    then: 'good' 
                  },
                  { 
                    case: { 
                      $and: [
                        { $gte: ['$qualityPercentage', 70] },
                        { $lt: ['$anomalyRate', 10] }
                      ]
                    }, 
                    then: 'fair' 
                  }
                ],
                default: 'poor'
              }
            },

            // Trending analysis (basic)
            valueTrend: {
              $cond: [
                { $and: [
                  { $ne: ['$minValue', '$maxValue'] },
                  { $gt: ['$count', 1] }
                ]},
                {
                  $switch: {
                    branches: [
                      { case: { $gt: ['$valueRange', { $multiply: ['$avgValue', 0.2] }] }, then: 'volatile' },
                      { case: { $gt: ['$coefficientOfVariation', 0.3] }, then: 'variable' },
                      { case: { $lt: ['$coefficientOfVariation', 0.1] }, then: 'stable' }
                    ],
                    default: 'moderate'
                  }
                },
                'insufficient_data'
              ]
            }
          }
        },

        // Data completeness analysis
        {
          $addFields: {
            dataCompleteness: {
              $multiply: [
                { $divide: ['$count', { $max: ['$expectedReadings', 1] }] },
                100
              ]
            },

            // Sampling rate (readings per minute)
            samplingRate: {
              $cond: [
                { $gt: ['$timeCoverageSeconds', 0] },
                { $divide: ['$count', { $divide: ['$timeCoverageSeconds', 60] }] },
                0
              ]
            }
          }
        },

        // Final projection and organization
        {
          $project: {
            // Identity fields
            deviceId: '$_id.deviceId',
            deviceType: '$_id.deviceType',
            sensorType: '$_id.sensorType',
            location: '$_id.location',
            timeBucket: '$_id.timeBucket',

            // Basic statistics
            dataPoints: '$count',
            statistics: {
              avg: { $round: ['$avgValue', 4] },
              min: '$minValue',
              max: '$maxValue',
              sum: { $round: ['$sumValue', 2] },
              stdDev: { $round: ['$stdDevValue', 4] },
              variance: { $round: ['$varianceValue', 4] },
              coefficientOfVariation: { $round: ['$coefficientOfVariation', 4] },
              valueRange: { $round: ['$valueRange', 4] },
              percentiles: '$percentiles'
            },

            // Data quality metrics
            dataQuality: {
              avgScore: { $round: ['$avgQualityScore', 3] },
              minScore: { $round: ['$minQualityScore', 3] },
              qualityPercentage: { $round: ['$qualityPercentage', 1] },
              lowQualityCount: '$lowQualityCount'
            },

            // Alert and anomaly information
            alerts: {
              anomalyCount: '$anomalyCount',
              anomalyRate: { $round: ['$anomalyRate', 2] },
              alertDistribution: '$alertDistribution'
            },

            // Time-based analysis
            temporal: {
              firstReading: '$firstReading',
              lastReading: '$lastReading',
              timeCoverageSeconds: { $round: ['$timeCoverageSeconds', 0] },
              dataCompleteness: { $round: ['$dataCompleteness', 1] },
              samplingRate: { $round: ['$samplingRate', 2] }
            },

            // Environmental context
            environment: {
              avgTemperature: { $round: ['$avgAmbientTemp', 1] },
              avgHumidity: { $round: ['$avgAmbientHumidity', 1] }
            },

            // Performance metrics
            performance: {
              avgProcessingLatency: { $round: ['$avgProcessingLatency', 0] },
              maxProcessingLatency: { $round: ['$maxProcessingLatency', 0] },
              performanceCategory: '$performanceCategory'
            },

            // Analysis results
            analysis: {
              valueTrend: '$valueTrend',
              overallAssessment: {
                $switch: {
                  branches: [
                    { 
                      case: { 
                        $and: [
                          { $eq: ['$performanceCategory', 'excellent'] },
                          { $gte: ['$dataCompleteness', 95] }
                        ]
                      }, 
                      then: 'optimal_performance' 
                    },
                    { 
                      case: { 
                        $and: [
                          { $in: ['$performanceCategory', ['good', 'excellent']] },
                          { $gte: ['$dataCompleteness', 80] }
                        ]
                      }, 
                      then: 'good_performance' 
                    },
                    { 
                      case: { $lt: ['$dataCompleteness', 50] }, 
                      then: 'data_gaps_detected' 
                    },
                    { 
                      case: { $gt: ['$anomalyRate', 15] }, 
                      then: 'high_anomaly_rate' 
                    },
                    { 
                      case: { $lt: ['$qualityPercentage', 70] }, 
                      then: 'quality_issues' 
                    }
                  ],
                  default: 'acceptable_performance'
                }
              }
            },

            // Include raw data if requested
            ...(query.includeRawData && { rawDataPoints: 1 })
          }
        },

        // Sort results
        { $sort: { deviceId: 1, timeBucket: 1 } },

        // Apply result limits
        ...(query.limit && [{ $limit: query.limit }])
      ];

      // Execute aggregation pipeline
      const results = await collection.aggregate(pipeline, {
        allowDiskUse: true,
        maxTimeMS: 30000
      }).toArray();

      const queryTime = Date.now() - startTime;

      // Update performance metrics
      this.updatePerformanceMetrics('query', results.length, queryTime);

      console.log(`Time series query completed: ${results.length} results in ${queryTime}ms`);

      return {
        success: true,
        collection: collectionName,
        results: results,
        resultCount: results.length,
        queryTime: queryTime,
        queryMetadata: {
          timeRange: {
            start: query.startTime,
            end: query.endTime || new Date()
          },
          granularity: query.granularity || 'raw',
          filters: {
            deviceIds: query.deviceIds?.length || 0,
            deviceTypes: query.deviceTypes?.length || 0,
            sensorTypes: query.sensorTypes?.length || 0,
            locations: query.locations?.length || 0
          },
          optimizationsApplied: ['time_series_bucketing', 'statistical_aggregation', 'index_optimization']
        }
      };

    } catch (error) {
      console.error(`Error querying time series data from ${collectionName}:`, error);
      return {
        success: false,
        collection: collectionName,
        error: error.message,
        queryTime: Date.now() - startTime
      };
    }
  }

  async setupRealTimeAggregations() {
    console.log('Setting up real-time aggregation pipelines...');

    // Create aggregation collections for different time windows
    const aggregationConfigs = [
      {
        name: 'sensor_readings_1min',
        sourceCollection: 'sensor_readings',
        windowSize: '1 minute',
        retentionDays: 7
      },
      {
        name: 'sensor_readings_5min',
        sourceCollection: 'sensor_readings',
        windowSize: '5 minutes',
        retentionDays: 30
      },
      {
        name: 'sensor_readings_1hour',
        sourceCollection: 'sensor_readings', 
        windowSize: '1 hour',
        retentionDays: 365
      },
      {
        name: 'sensor_readings_1day',
        sourceCollection: 'sensor_readings',
        windowSize: '1 day',
        retentionDays: 1825 // 5 years
      }
    ];

    for (const config of aggregationConfigs) {
      await this.createAggregationPipeline(config);
    }

    console.log('Real-time aggregation pipelines setup completed');
  }

  async createAggregationPipeline(config) {
    console.log(`Creating aggregation pipeline: ${config.name}`);

    // Create collection for storing aggregated data
    const aggregationCollection = await this.db.createCollection(config.name, {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'device',
        granularity: config.windowSize.includes('minute') ? 'minutes' : 
                    config.windowSize.includes('hour') ? 'hours' : 'days'
      },
      expireAfterSeconds: config.retentionDays * 24 * 60 * 60
    });

    this.aggregationCollections.set(config.name, {
      collection: aggregationCollection,
      config: config
    });
  }

  async processRealTimeData(collectionName, dataPoints) {
    console.log(`Processing real-time data for ${collectionName}: ${dataPoints.length} points`);

    // Update real-time aggregations
    for (const [aggName, aggInfo] of this.aggregationCollections.entries()) {
      if (aggInfo.config.sourceCollection === collectionName) {
        await this.updateRealTimeAggregation(aggName, dataPoints);
      }
    }

    // Process data through ML pipelines if enabled
    if (this.config.enablePredictiveAnalytics) {
      await this.processPredictiveAnalytics(dataPoints);
    }
  }

  async checkAlertConditions(collectionName, dataPoints) {
    console.log(`Checking alert conditions for ${dataPoints.length} data points`);

    const alertsTriggered = [];

    for (const dataPoint of dataPoints) {
      // Check various alert conditions
      const alerts = [];

      // Value threshold alerts
      if (dataPoint.sensor.type === 'temperature' && dataPoint.value > 80) {
        alerts.push({
          type: 'threshold_violation',
          severity: 'high',
          message: `Temperature ${dataPoint.value}°C exceeds threshold`,
          deviceId: dataPoint.device.id
        });
      }

      // Quality score alerts
      if (dataPoint.quality.score < 0.7) {
        alerts.push({
          type: 'quality_degradation',
          severity: 'medium',
          message: `Quality score ${dataPoint.quality.score} below acceptable level`,
          deviceId: dataPoint.device.id
        });
      }

      // Anomaly alerts
      if (dataPoint.alerts.anomalyDetected) {
        alerts.push({
          type: 'anomaly_detected',
          severity: 'high',
          message: `Anomaly detected in sensor reading`,
          deviceId: dataPoint.device.id
        });
      }

      // Processing latency alerts
      if (dataPoint.processing.processingLatency > 5000) { // 5 seconds
        alerts.push({
          type: 'processing_delay',
          severity: 'medium',
          message: `Processing latency ${dataPoint.processing.processingLatency}ms exceeds threshold`,
          deviceId: dataPoint.device.id
        });
      }

      if (alerts.length > 0) {
        alertsTriggered.push(...alerts);
        this.performanceMetrics.alertsTriggered += alerts.length;
      }
    }

    // Store alerts if any were triggered
    if (alertsTriggered.length > 0) {
      await this.storeAlerts(alertsTriggered);
    }

    return alertsTriggered;
  }

  async storeAlerts(alerts) {
    try {
      // Create alerts collection if it doesn't exist
      if (!this.alertCollections.has('iot_alerts')) {
        const alertsCollection = await this.db.createCollection('iot_alerts');
        await alertsCollection.createIndexes([
          { key: { deviceId: 1, timestamp: -1 }, background: true },
          { key: { severity: 1, timestamp: -1 }, background: true },
          { key: { type: 1, timestamp: -1 }, background: true }
        ]);

        this.alertCollections.set('iot_alerts', alertsCollection);
      }

      const alertsCollection = this.alertCollections.get('iot_alerts');

      const alertDocuments = alerts.map(alert => ({
        ...alert,
        timestamp: new Date(),
        acknowledged: false,
        resolvedAt: null
      }));

      await alertsCollection.insertMany(alertDocuments);

      console.log(`Stored ${alertDocuments.length} alerts`);

    } catch (error) {
      console.error('Error storing alerts:', error);
    }
  }

  updatePerformanceMetrics(operation, count, duration) {
    if (operation === 'insert') {
      this.performanceMetrics.totalDataPoints += count;
      this.performanceMetrics.writeOperationsPerSecond = 
        (count / duration) * 1000;
    } else if (operation === 'query') {
      this.performanceMetrics.queryOperationsPerSecond = 
        (count / duration) * 1000;
    }

    // Update average latency
    this.performanceMetrics.averageLatency = 
      (this.performanceMetrics.averageLatency + duration) / 2;
  }

  async getSystemStatistics() {
    console.log('Gathering IoT Time Series system statistics...');

    const stats = {
      collections: {},
      performance: this.performanceMetrics,
      aggregations: {},
      systemHealth: 'healthy'
    };

    // Gather statistics from each time series collection
    for (const [collectionName, collectionInfo] of this.timeSeriesCollections.entries()) {
      try {
        const collection = collectionInfo.collection;

        const [collectionStats, sampleData] = await Promise.all([
          collection.stats(),
          collection.find().sort({ timestamp: -1 }).limit(1).toArray()
        ]);

        stats.collections[collectionName] = {
          documentCount: collectionStats.count || 0,
          storageSize: collectionStats.size || 0,
          indexSize: collectionStats.totalIndexSize || 0,
          avgDocumentSize: collectionStats.avgObjSize || 0,
          compressionRatio: collectionStats.size > 0 ? 
            (collectionStats.storageSize / collectionStats.size) : 1,
          lastDataPoint: sampleData[0]?.timestamp || null,
          configuration: collectionInfo.config,
          performance: collectionInfo.stats
        };

      } catch (error) {
        stats.collections[collectionName] = {
          error: error.message,
          available: false
        };
      }
    }

    return stats;
  }

  async shutdown() {
    console.log('Shutting down IoT Time Series Manager...');

    // Close change streams
    for (const [streamName, changeStream] of this.changeStreams.entries()) {
      try {
        await changeStream.close();
        console.log(`Closed change stream: ${streamName}`);
      } catch (error) {
        console.error(`Error closing change stream ${streamName}:`, error);
      }
    }

    // Close MongoDB connection
    if (this.client) {
      await this.client.close();
    }

    console.log('IoT Time Series Manager shutdown complete');
  }
}

// Benefits of MongoDB Time Series Collections:
// - Native time series optimization with automatic bucketing and compression
// - Specialized indexing and query optimization for time-based data patterns
// - Efficient storage with automatic data lifecycle management
// - Real-time aggregation pipelines for IoT analytics
// - Built-in support for high-volume write workloads
// - Intelligent compression reducing storage costs by up to 90%
// - Seamless integration with MongoDB's distributed architecture
// - SQL-compatible time series operations through QueryLeaf integration
// - Native support for IoT-specific query patterns and analytics
// - Automatic data archiving and retention management

module.exports = {
  IoTTimeSeriesManager
};

Understanding MongoDB Time Series Collections Architecture

IoT Data Patterns and Optimization Strategies

MongoDB Time Series Collections are specifically designed for the unique characteristics of IoT and time-stamped data:

// Advanced IoT Time Series Processing with Enterprise Features
class EnterpriseIoTProcessor extends IoTTimeSeriesManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableEdgeComputing: true,
      enablePredictiveAnalytics: true,
      enableDigitalTwins: true,
      enableMLPipelines: true,
      enableAdvancedVisualization: true,
      enableMultiTenancy: true
    };

    this.setupEnterpriseFeatures();
    this.initializeMLPipelines();
    this.setupDigitalTwins();
  }

  async implementAdvancedIoTStrategies() {
    console.log('Implementing enterprise IoT data strategies...');

    const strategies = {
      // Edge computing integration
      edgeComputing: {
        edgeDataAggregation: true,
        intelligentFiltering: true,
        localAnomalyDetection: true,
        bandwidthOptimization: true
      },

      // Predictive analytics
      predictiveAnalytics: {
        equipmentFailurePrediction: true,
        energyOptimization: true,
        maintenanceScheduling: true,
        capacityPlanning: true
      },

      // Digital twin implementation
      digitalTwins: {
        realTimeSimulation: true,
        processOptimization: true,
        scenarioModeling: true,
        performanceAnalytics: true
      }
    };

    return await this.deployEnterpriseIoTStrategies(strategies);
  }

  async setupAdvancedAnalytics() {
    console.log('Setting up advanced IoT analytics capabilities...');

    const analyticsConfig = {
      // Real-time processing
      realTimeProcessing: {
        streamProcessing: true,
        complexEventProcessing: true,
        patternRecognition: true,
        correlationAnalysis: true
      },

      // Machine learning integration
      machineLearning: {
        anomalyDetection: true,
        predictiveModeling: true,
        classificationModels: true,
        reinforcementLearning: true
      },

      // Advanced visualization
      visualization: {
        realTimeDashboards: true,
        historicalAnalytics: true,
        geospatialVisualization: true,
        threeDimensionalModeling: true
      }
    };

    return await this.deployAdvancedAnalytics(analyticsConfig);
  }
}

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Time Series operations and IoT data management:

-- QueryLeaf Time Series operations with SQL-familiar syntax for IoT data

-- Create optimized time series collections for different IoT data types
CREATE TIME_SERIES_COLLECTION sensor_readings (
  timestamp TIMESTAMPTZ,
  device_id STRING,
  sensor_type STRING,
  value DECIMAL,
  quality_score DECIMAL,
  location OBJECT,
  metadata OBJECT
)
WITH (
  time_field = 'timestamp',
  meta_field = 'device_metadata',
  granularity = 'seconds',
  bucket_max_span_seconds = 3600,
  bucket_rounding_seconds = 60,
  expire_after_seconds = 31536000, -- 1 year retention
  enable_compression = true,
  compression_algorithm = 'zstd'
);

-- Create specialized collections for different IoT use cases
CREATE TIME_SERIES_COLLECTION equipment_telemetry (
  timestamp TIMESTAMPTZ,
  equipment_id STRING,
  metric_type STRING,
  value DECIMAL,
  operational_status STRING,
  maintenance_flags ARRAY,
  performance_indicators OBJECT
)
WITH (
  granularity = 'seconds',
  bucket_max_span_seconds = 1800, -- 30 minute buckets for high-frequency data
  enable_automatic_indexing = true
);

CREATE TIME_SERIES_COLLECTION environmental_monitoring (
  timestamp TIMESTAMPTZ,
  location_id STRING,
  sensor_network STRING,
  measurements OBJECT,
  weather_conditions OBJECT,
  air_quality_index DECIMAL
)
WITH (
  granularity = 'minutes',
  bucket_max_span_seconds = 7200, -- 2 hour buckets for environmental data
  enable_predictive_analytics = true
);

-- Advanced IoT data insertion with comprehensive metadata
INSERT INTO sensor_readings (
  timestamp, device_metadata, sensor_info, measurements, quality_metrics, context
)
WITH iot_data_enrichment AS (
  SELECT 
    reading_timestamp as timestamp,

    -- Device and location metadata (optimized as metaField)
    JSON_OBJECT(
      'device_id', device_identifier,
      'device_type', equipment_type,
      'firmware_version', firmware_ver,
      'location', JSON_OBJECT(
        'facility', facility_name,
        'zone', zone_identifier,
        'coordinates', JSON_OBJECT('lat', latitude, 'lng', longitude),
        'floor', floor_number,
        'room', room_identifier
      ),
      'network_info', JSON_OBJECT(
        'gateway_id', gateway_identifier,
        'signal_strength', rssi_value,
        'protocol', communication_protocol,
        'network_latency', network_delay_ms
      )
    ) as device_metadata,

    -- Sensor information and calibration data
    JSON_OBJECT(
      'sensor_type', sensor_category,
      'model_number', sensor_model,
      'serial_number', sensor_serial,
      'calibration_date', last_calibration,
      'maintenance_schedule', maintenance_interval,
      'measurement_unit', measurement_units,
      'precision', sensor_precision,
      'accuracy', sensor_accuracy_percent,
      'operating_range', JSON_OBJECT(
        'min_value', minimum_measurable,
        'max_value', maximum_measurable,
        'optimal_range', optimal_operating_range
      )
    ) as sensor_info,

    -- Measurement data with statistical context
    JSON_OBJECT(
      'primary_value', sensor_reading,
      'raw_value', unprocessed_reading,
      'calibrated_value', calibration_adjusted_value,
      'statistical_context', JSON_OBJECT(
        'recent_average', rolling_average_10min,
        'recent_min', rolling_min_10min,
        'recent_max', rolling_max_10min,
        'trend_indicator', trend_direction,
        'volatility_index', measurement_volatility
      ),
      'related_measurements', JSON_OBJECT(
        'secondary_sensors', related_sensor_readings,
        'environmental_factors', ambient_conditions,
        'operational_context', equipment_operating_mode
      )
    ) as measurements,

    -- Comprehensive quality assessment
    JSON_OBJECT(
      'overall_score', data_quality_score,
      'confidence_level', measurement_confidence,
      'quality_factors', JSON_OBJECT(
        'sensor_health', sensor_status_indicator,
        'calibration_validity', calibration_status,
        'environmental_conditions', environmental_suitability,
        'signal_integrity', signal_quality_assessment,
        'power_status', power_supply_stability
      ),
      'quality_flags', quality_warning_flags,
      'anomaly_indicators', JSON_OBJECT(
        'statistical_anomaly', statistical_outlier_flag,
        'temporal_anomaly', temporal_pattern_anomaly,
        'contextual_anomaly', contextual_deviation_flag,
        'severity_level', anomaly_severity_rating
      )
    ) as quality_metrics,

    -- Business and operational context
    JSON_OBJECT(
      'business_context', JSON_OBJECT(
        'asset_id', primary_asset_identifier,
        'process_id', manufacturing_process_id,
        'production_batch', current_batch_identifier,
        'shift_information', JSON_OBJECT(
          'shift_id', current_shift,
          'operator_id', responsible_operator,
          'supervisor_id', shift_supervisor
        )
      ),
      'operational_context', JSON_OBJECT(
        'equipment_mode', current_operational_mode,
        'production_rate', current_production_speed,
        'efficiency_metrics', operational_efficiency_data,
        'maintenance_status', equipment_maintenance_state,
        'compliance_flags', regulatory_compliance_status
      ),
      'alert_configuration', JSON_OBJECT(
        'threshold_settings', alert_threshold_values,
        'notification_rules', alert_notification_config,
        'escalation_procedures', alert_escalation_rules,
        'suppression_conditions', alert_suppression_rules
      )
    ) as context

  FROM raw_iot_data_stream
  WHERE 
    data_timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
    AND data_quality_preliminary >= 0.5
    AND device_status != 'maintenance_mode'
)
SELECT 
  timestamp,
  device_metadata,
  sensor_info,
  measurements,
  quality_metrics,
  context,

  -- Processing metadata
  JSON_OBJECT(
    'ingestion_timestamp', CURRENT_TIMESTAMP,
    'processing_latency_ms', 
      EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp)) * 1000,
    'data_pipeline_version', '2.1.0',
    'enrichment_applied', JSON_ARRAY(
      'metadata_enhancement',
      'quality_assessment',
      'anomaly_detection',
      'contextual_enrichment'
    )
  ) as processing_metadata

FROM iot_data_enrichment
WHERE 
  -- Final data quality validation
  JSON_EXTRACT(quality_metrics, '$.overall_score') >= 0.6
  AND JSON_EXTRACT(measurements, '$.primary_value') IS NOT NULL

ORDER BY timestamp;

-- Real-time IoT analytics with time-based aggregations
WITH real_time_sensor_analytics AS (
  SELECT 
    DATE_TRUNC('minute', timestamp) as time_bucket,
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(device_metadata, '$.device_type') as device_type,
    JSON_EXTRACT(device_metadata, '$.location.facility') as facility,
    JSON_EXTRACT(device_metadata, '$.location.zone') as zone,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,

    -- Statistical aggregations optimized for time series
    COUNT(*) as reading_count,
    AVG(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as avg_value,
    MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as min_value,
    MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as max_value,
    STDDEV(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as stddev_value,

    -- Percentile calculations for distribution analysis
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p25_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as median_value,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p75_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as p95_value,

    -- Data quality aggregations
    AVG(JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL) as avg_quality_score,
    MIN(JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL) as min_quality_score,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL < 0.8) as low_quality_readings,

    -- Anomaly detection aggregations
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.statistical_anomaly')::BOOLEAN = true) as statistical_anomalies,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.temporal_anomaly')::BOOLEAN = true) as temporal_anomalies,
    COUNT(*) FILTER (WHERE JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.contextual_anomaly')::BOOLEAN = true) as contextual_anomalies,

    -- Value change and trend analysis
    (MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) - 
     MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL)) as value_range,

    -- Time coverage and sampling analysis
    (EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp))) as time_span_seconds,
    COUNT(*)::DECIMAL / GREATEST(1, EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / 60) as readings_per_minute,

    -- Processing performance metrics
    AVG(JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL) as avg_processing_latency,
    MAX(JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL) as max_processing_latency,

    -- Network performance indicators
    AVG(JSON_EXTRACT(device_metadata, '$.network_info.network_latency')::DECIMAL) as avg_network_latency,
    AVG(JSON_EXTRACT(device_metadata, '$.network_info.signal_strength')::DECIMAL) as avg_signal_strength,

    -- Environmental context aggregations
    AVG(JSON_EXTRACT(measurements, '$.related_measurements.environmental_factors.temperature')::DECIMAL) as avg_ambient_temp,
    AVG(JSON_EXTRACT(measurements, '$.related_measurements.environmental_factors.humidity')::DECIMAL) as avg_ambient_humidity,

    -- Operational context
    MODE() WITHIN GROUP (ORDER BY JSON_EXTRACT(context, '$.operational_context.equipment_mode')::STRING) as primary_equipment_mode,
    AVG(JSON_EXTRACT(context, '$.operational_context.production_rate')::DECIMAL) as avg_production_rate

  FROM sensor_readings
  WHERE 
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL >= 0.5
  GROUP BY 
    time_bucket, device_id, device_type, facility, zone, sensor_type
),

performance_analysis AS (
  SELECT 
    rtsa.*,

    -- Data quality assessment
    ROUND((rtsa.reading_count - rtsa.low_quality_readings)::DECIMAL / rtsa.reading_count * 100, 2) as quality_percentage,

    -- Anomaly rate calculations
    ROUND((rtsa.statistical_anomalies + rtsa.temporal_anomalies + rtsa.contextual_anomalies)::DECIMAL / rtsa.reading_count * 100, 2) as total_anomaly_rate,

    -- Statistical analysis
    CASE 
      WHEN rtsa.avg_value != 0 THEN ROUND(rtsa.stddev_value / ABS(rtsa.avg_value), 4)
      ELSE 0
    END as coefficient_of_variation,

    -- Data completeness analysis (expected vs actual readings)
    ROUND(rtsa.reading_count / GREATEST(1, rtsa.time_span_seconds / 60) * 100, 1) as data_completeness_percent,

    -- Performance classification
    CASE 
      WHEN rtsa.avg_quality_score >= 0.95 AND rtsa.total_anomaly_rate <= 1 THEN 'excellent'
      WHEN rtsa.avg_quality_score >= 0.85 AND rtsa.total_anomaly_rate <= 5 THEN 'good'
      WHEN rtsa.avg_quality_score >= 0.70 AND rtsa.total_anomaly_rate <= 10 THEN 'acceptable'
      ELSE 'poor'
    END as performance_category,

    -- Trend analysis
    CASE 
      WHEN rtsa.coefficient_of_variation > 0.5 THEN 'highly_variable'
      WHEN rtsa.coefficient_of_variation > 0.3 THEN 'variable'
      WHEN rtsa.coefficient_of_variation > 0.1 THEN 'moderate'
      ELSE 'stable'
    END as stability_classification,

    -- Alert conditions
    CASE 
      WHEN rtsa.avg_quality_score < 0.7 THEN 'quality_alert'
      WHEN rtsa.total_anomaly_rate > 15 THEN 'anomaly_alert'
      WHEN rtsa.avg_processing_latency > 5000 THEN 'latency_alert'
      WHEN rtsa.data_completeness_percent < 80 THEN 'data_gap_alert'
      WHEN ABS(rtsa.avg_signal_strength) < -80 THEN 'connectivity_alert'
      ELSE 'normal'
    END as alert_status,

    -- Operational efficiency indicators
    CASE 
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 95 THEN 'optimal_efficiency'
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 80 THEN 'good_efficiency'
      WHEN rtsa.primary_equipment_mode = 'production' AND rtsa.avg_production_rate >= 60 THEN 'reduced_efficiency'
      WHEN rtsa.primary_equipment_mode = 'maintenance' THEN 'maintenance_mode'
      ELSE 'unknown_efficiency'
    END as operational_efficiency,

    -- Time-based patterns
    EXTRACT(HOUR FROM rtsa.time_bucket) as hour_of_day,
    EXTRACT(DOW FROM rtsa.time_bucket) as day_of_week

  FROM real_time_sensor_analytics rtsa
),

device_health_assessment AS (
  SELECT 
    pa.device_id,
    pa.device_type,
    pa.facility,
    pa.zone,
    pa.sensor_type,

    -- Current status indicators
    LAST_VALUE(pa.performance_category) OVER (
      PARTITION BY pa.device_id 
      ORDER BY pa.time_bucket 
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as current_performance_status,

    LAST_VALUE(pa.alert_status) OVER (
      PARTITION BY pa.device_id 
      ORDER BY pa.time_bucket 
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as current_alert_status,

    -- Performance trends over the analysis window
    COUNT(*) as analysis_periods,
    COUNT(*) FILTER (WHERE pa.performance_category IN ('excellent', 'good')) as good_periods,
    COUNT(*) FILTER (WHERE pa.alert_status != 'normal') as alert_periods,

    -- Average performance metrics
    ROUND(AVG(pa.avg_quality_score), 3) as overall_avg_quality,
    ROUND(AVG(pa.total_anomaly_rate), 2) as overall_anomaly_rate,
    ROUND(AVG(pa.readings_per_minute), 2) as overall_data_rate,
    ROUND(AVG(pa.avg_processing_latency), 0) as overall_processing_latency,

    -- Stability and consistency
    ROUND(AVG(pa.coefficient_of_variation), 4) as average_stability_index,
    ROUND(AVG(pa.data_completeness_percent), 1) as average_data_completeness,

    -- Network and connectivity
    ROUND(AVG(pa.avg_network_latency), 0) as average_network_latency,
    ROUND(AVG(pa.avg_signal_strength), 1) as average_signal_strength,

    -- Environmental context
    ROUND(AVG(pa.avg_ambient_temp), 1) as average_ambient_temperature,
    ROUND(AVG(pa.avg_ambient_humidity), 1) as average_ambient_humidity,

    -- Operational efficiency
    MODE() WITHIN GROUP (ORDER BY pa.operational_efficiency) as predominant_efficiency_level,

    -- Value statistics across all time periods
    ROUND(AVG(pa.avg_value), 4) as overall_average_value,
    ROUND(AVG(pa.stddev_value), 4) as overall_value_variability,
    MIN(pa.min_value) as absolute_minimum_value,
    MAX(pa.max_value) as absolute_maximum_value

  FROM performance_analysis pa
  GROUP BY 
    pa.device_id, pa.device_type, pa.facility, pa.zone, pa.sensor_type
)

-- Comprehensive IoT device health and performance report
SELECT 
  dha.device_id,
  dha.device_type,
  dha.sensor_type,
  dha.facility,
  dha.zone,

  -- Current status
  dha.current_performance_status,
  dha.current_alert_status,

  -- Performance summary
  dha.overall_avg_quality as quality_score,
  dha.overall_anomaly_rate as anomaly_rate_percent,
  dha.overall_data_rate as readings_per_minute,
  dha.overall_processing_latency as avg_latency_ms,

  -- Reliability indicators
  ROUND((dha.good_periods::DECIMAL / dha.analysis_periods) * 100, 1) as uptime_percentage,
  ROUND((dha.alert_periods::DECIMAL / dha.analysis_periods) * 100, 1) as alert_percentage,
  dha.average_data_completeness as data_completeness_percent,

  -- Performance classification
  CASE 
    WHEN dha.overall_avg_quality >= 0.9 AND dha.overall_anomaly_rate <= 2 AND dha.uptime_percentage >= 95 THEN 'optimal'
    WHEN dha.overall_avg_quality >= 0.8 AND dha.overall_anomaly_rate <= 5 AND dha.uptime_percentage >= 90 THEN 'good'
    WHEN dha.overall_avg_quality >= 0.6 AND dha.overall_anomaly_rate <= 10 AND dha.uptime_percentage >= 80 THEN 'acceptable'
    WHEN dha.overall_avg_quality < 0.5 OR dha.overall_anomaly_rate > 20 THEN 'critical'
    ELSE 'needs_attention'
  END as device_health_classification,

  -- Operational context
  dha.predominant_efficiency_level,
  dha.overall_average_value as typical_reading_value,
  dha.overall_value_variability as measurement_stability,

  -- Environmental factors
  dha.average_ambient_temperature,
  dha.average_ambient_humidity,

  -- Connectivity and infrastructure
  dha.average_network_latency as network_latency_ms,
  dha.average_signal_strength as signal_strength_dbm,

  -- Recommendations and next actions
  CASE 
    WHEN dha.current_alert_status = 'quality_alert' THEN 'calibrate_sensor_immediate'
    WHEN dha.current_alert_status = 'anomaly_alert' THEN 'investigate_anomaly_patterns'
    WHEN dha.current_alert_status = 'latency_alert' THEN 'optimize_data_pipeline'
    WHEN dha.current_alert_status = 'connectivity_alert' THEN 'check_network_infrastructure'
    WHEN dha.current_alert_status = 'data_gap_alert' THEN 'verify_sensor_connectivity'
    WHEN dha.overall_avg_quality < 0.8 THEN 'schedule_maintenance'
    WHEN dha.overall_anomaly_rate > 10 THEN 'review_operating_conditions'
    WHEN dha.uptime_percentage < 90 THEN 'improve_system_reliability'
    ELSE 'continue_monitoring'
  END as recommended_action,

  -- Priority level for action
  CASE 
    WHEN dha.device_health_classification = 'critical' THEN 'immediate'
    WHEN dha.device_health_classification = 'needs_attention' THEN 'high'
    WHEN dha.current_alert_status != 'normal' THEN 'medium'
    WHEN dha.device_health_classification = 'acceptable' THEN 'low'
    ELSE 'routine'
  END as action_priority

FROM device_health_assessment dha
ORDER BY 
  CASE action_priority
    WHEN 'immediate' THEN 1
    WHEN 'high' THEN 2  
    WHEN 'medium' THEN 3
    WHEN 'low' THEN 4
    ELSE 5
  END,
  dha.overall_anomaly_rate DESC,
  dha.overall_avg_quality ASC;

-- Time series forecasting and predictive analytics
WITH historical_patterns AS (
  SELECT 
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    AVG(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_avg_value,
    COUNT(*) as readings_count,
    MIN(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_min,
    MAX(JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL) as hourly_max,

    -- Time-based features for forecasting
    EXTRACT(HOUR FROM timestamp) as hour_of_day,
    EXTRACT(DOW FROM timestamp) as day_of_week,
    EXTRACT(DAY FROM timestamp) as day_of_month,

    -- Seasonal indicators
    CASE 
      WHEN EXTRACT(MONTH FROM timestamp) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(MONTH FROM timestamp) IN (3, 4, 5) THEN 'spring'
      WHEN EXTRACT(MONTH FROM timestamp) IN (6, 7, 8) THEN 'summer'
      ELSE 'autumn'
    END as season,

    -- Operational context features
    MODE() WITHIN GROUP (ORDER BY JSON_EXTRACT(context, '$.operational_context.equipment_mode')) as predominant_mode

  FROM sensor_readings
  WHERE 
    timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    AND JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL >= 0.8
  GROUP BY 
    device_id, sensor_type, hour_bucket, hour_of_day, day_of_week, day_of_month, season
),

trend_analysis AS (
  SELECT 
    hp.*,

    -- Moving averages for trend analysis
    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
    ) as moving_avg_24h,

    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket 
      ROWS BETWEEN 167 PRECEDING AND CURRENT ROW  -- 7 days * 24 hours
    ) as moving_avg_7d,

    -- Lag values for change detection
    LAG(hp.hourly_avg_value, 1) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_hour_value,

    LAG(hp.hourly_avg_value, 24) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_day_same_hour_value,

    LAG(hp.hourly_avg_value, 168) OVER (
      PARTITION BY hp.device_id, hp.sensor_type 
      ORDER BY hp.hour_bucket
    ) as prev_week_same_hour_value,

    -- Seasonal comparison
    AVG(hp.hourly_avg_value) OVER (
      PARTITION BY hp.device_id, hp.sensor_type, hp.hour_of_day, hp.day_of_week
      ORDER BY hp.hour_bucket
      ROWS BETWEEN 672 PRECEDING AND 672 PRECEDING  -- 4 weeks ago, same hour/day
    ) as seasonal_baseline

  FROM historical_patterns hp
),

predictive_indicators AS (
  SELECT 
    ta.*,

    -- Change calculations
    COALESCE(ta.hourly_avg_value - ta.prev_hour_value, 0) as hourly_change,
    COALESCE(ta.hourly_avg_value - ta.prev_day_same_hour_value, 0) as daily_change,
    COALESCE(ta.hourly_avg_value - ta.prev_week_same_hour_value, 0) as weekly_change,
    COALESCE(ta.hourly_avg_value - ta.seasonal_baseline, 0) as seasonal_deviation,

    -- Trend direction indicators
    CASE 
      WHEN ta.hourly_avg_value > ta.moving_avg_24h * 1.05 THEN 'upward'
      WHEN ta.hourly_avg_value < ta.moving_avg_24h * 0.95 THEN 'downward'  
      ELSE 'stable'
    END as short_term_trend,

    CASE 
      WHEN ta.moving_avg_24h > ta.moving_avg_7d * 1.02 THEN 'increasing'
      WHEN ta.moving_avg_24h < ta.moving_avg_7d * 0.98 THEN 'decreasing'
      ELSE 'steady'
    END as long_term_trend,

    -- Volatility measures
    ABS(ta.hourly_avg_value - ta.moving_avg_24h) / NULLIF(ta.moving_avg_24h, 0) as relative_volatility,

    -- Anomaly scoring
    CASE 
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.3) THEN 'high_anomaly'
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.15) THEN 'moderate_anomaly'
      WHEN ABS(ta.hourly_avg_value - ta.seasonal_baseline) > (ta.moving_avg_7d * 0.05) THEN 'low_anomaly'
      ELSE 'normal'
    END as anomaly_level,

    -- Predictive risk assessment
    CASE 
      WHEN ta.short_term_trend = 'upward' AND ta.long_term_trend = 'increasing' AND ta.relative_volatility > 0.2 THEN 'high_risk'
      WHEN ta.short_term_trend = 'downward' AND ta.long_term_trend = 'decreasing' AND ta.relative_volatility > 0.15 THEN 'high_risk'
      WHEN ta.relative_volatility > 0.25 THEN 'moderate_risk'
      WHEN ta.anomaly_level IN ('high_anomaly', 'moderate_anomaly') THEN 'moderate_risk'
      ELSE 'low_risk'
    END as predictive_risk_level

  FROM trend_analysis ta
  WHERE ta.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)

-- Predictive analytics and forecasting results
SELECT 
  pi.device_id,
  pi.sensor_type,
  pi.hour_bucket,

  -- Current values and trends
  ROUND(pi.hourly_avg_value, 4) as current_value,
  ROUND(pi.moving_avg_24h, 4) as trend_24h,
  ROUND(pi.moving_avg_7d, 4) as trend_7d,

  -- Change analysis
  ROUND(pi.hourly_change, 4) as hour_to_hour_change,
  ROUND(pi.daily_change, 4) as day_to_day_change,
  ROUND(pi.weekly_change, 4) as week_to_week_change,
  ROUND(pi.seasonal_deviation, 4) as seasonal_variance,

  -- Trend classification
  pi.short_term_trend,
  pi.long_term_trend,
  pi.anomaly_level,
  pi.predictive_risk_level,

  -- Risk indicators
  ROUND(pi.relative_volatility * 100, 2) as volatility_percent,

  -- Simple linear forecast (next hour prediction)
  ROUND(
    pi.hourly_avg_value + 
    (COALESCE(pi.hourly_change, 0) * 0.7) + 
    (COALESCE(pi.daily_change, 0) * 0.2) + 
    (COALESCE(pi.weekly_change, 0) * 0.1), 
    4
  ) as predicted_next_hour_value,

  -- Confidence level for prediction
  CASE 
    WHEN pi.relative_volatility < 0.05 AND pi.anomaly_level = 'normal' THEN 'high'
    WHEN pi.relative_volatility < 0.15 AND pi.anomaly_level IN ('normal', 'low_anomaly') THEN 'medium'
    WHEN pi.relative_volatility < 0.30 THEN 'low'
    ELSE 'very_low'
  END as prediction_confidence,

  -- Maintenance and operational recommendations
  CASE 
    WHEN pi.predictive_risk_level = 'high_risk' THEN 'schedule_immediate_inspection'
    WHEN pi.anomaly_level = 'high_anomaly' THEN 'investigate_root_cause'
    WHEN pi.long_term_trend = 'decreasing' AND pi.sensor_type = 'efficiency' THEN 'schedule_maintenance'
    WHEN pi.relative_volatility > 0.2 THEN 'check_sensor_calibration'
    WHEN pi.short_term_trend != pi.long_term_trend THEN 'monitor_closely'
    ELSE 'continue_routine_monitoring'
  END as maintenance_recommendation

FROM predictive_indicators pi
WHERE pi.hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
  CASE pi.predictive_risk_level
    WHEN 'high_risk' THEN 1
    WHEN 'moderate_risk' THEN 2
    ELSE 3
  END,
  pi.relative_volatility DESC,
  pi.device_id,
  pi.hour_bucket DESC;

-- Real-time alerting and notification system
WITH real_time_monitoring AS (
  SELECT 
    JSON_EXTRACT(device_metadata, '$.device_id') as device_id,
    JSON_EXTRACT(device_metadata, '$.device_type') as device_type,
    JSON_EXTRACT(device_metadata, '$.location.facility') as facility,
    JSON_EXTRACT(sensor_info, '$.sensor_type') as sensor_type,
    timestamp,
    JSON_EXTRACT(measurements, '$.primary_value')::DECIMAL as current_value,
    JSON_EXTRACT(quality_metrics, '$.overall_score')::DECIMAL as quality_score,

    -- Alert thresholds from configuration
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.critical_high')::DECIMAL as critical_high_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.critical_low')::DECIMAL as critical_low_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.warning_high')::DECIMAL as warning_high_threshold,
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.warning_low')::DECIMAL as warning_low_threshold,

    -- Quality thresholds
    JSON_EXTRACT(context, '$.alert_configuration.threshold_settings.min_quality_score')::DECIMAL as min_quality_threshold,

    -- Anomaly flags
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.statistical_anomaly')::BOOLEAN as statistical_anomaly,
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.temporal_anomaly')::BOOLEAN as temporal_anomaly,
    JSON_EXTRACT(quality_metrics, '$.anomaly_indicators.contextual_anomaly')::BOOLEAN as contextual_anomaly,

    -- Processing performance
    JSON_EXTRACT(processing_metadata, '$.processing_latency_ms')::DECIMAL as processing_latency,
    JSON_EXTRACT(device_metadata, '$.network_info.network_latency')::DECIMAL as network_latency

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
),

alert_evaluation AS (
  SELECT 
    rtm.*,

    -- Value-based alerts
    CASE 
      WHEN rtm.current_value >= rtm.critical_high_threshold THEN 'critical_high_value'
      WHEN rtm.current_value <= rtm.critical_low_threshold THEN 'critical_low_value'
      WHEN rtm.current_value >= rtm.warning_high_threshold THEN 'warning_high_value'
      WHEN rtm.current_value <= rtm.warning_low_threshold THEN 'warning_low_value'
      ELSE null
    END as value_alert_type,

    -- Quality-based alerts
    CASE 
      WHEN rtm.quality_score < rtm.min_quality_threshold THEN 'quality_degradation'
      ELSE null
    END as quality_alert_type,

    -- Anomaly-based alerts
    CASE 
      WHEN rtm.statistical_anomaly = true THEN 'statistical_anomaly_detected'
      WHEN rtm.temporal_anomaly = true THEN 'temporal_pattern_anomaly'
      WHEN rtm.contextual_anomaly = true THEN 'contextual_anomaly_detected'
      ELSE null
    END as anomaly_alert_type,

    -- Performance-based alerts
    CASE 
      WHEN rtm.processing_latency > 5000 THEN 'high_processing_latency'
      WHEN rtm.network_latency > 2000 THEN 'high_network_latency'
      ELSE null
    END as performance_alert_type,

    -- Severity calculation
    CASE 
      WHEN rtm.current_value >= rtm.critical_high_threshold OR rtm.current_value <= rtm.critical_low_threshold THEN 'critical'
      WHEN rtm.quality_score < (rtm.min_quality_threshold * 0.7) THEN 'critical'
      WHEN rtm.statistical_anomaly = true OR rtm.temporal_anomaly = true THEN 'high'
      WHEN rtm.current_value >= rtm.warning_high_threshold OR rtm.current_value <= rtm.warning_low_threshold THEN 'medium'
      WHEN rtm.quality_score < rtm.min_quality_threshold THEN 'medium'
      WHEN rtm.contextual_anomaly = true OR rtm.processing_latency > 5000 THEN 'low'
      ELSE null
    END as alert_severity

  FROM real_time_monitoring rtm
),

active_alerts AS (
  SELECT 
    ae.device_id,
    ae.device_type,
    ae.facility,
    ae.sensor_type,
    ae.timestamp as alert_timestamp,
    ae.current_value,
    ae.quality_score,

    -- Consolidate all alert types
    COALESCE(ae.value_alert_type, ae.quality_alert_type, ae.anomaly_alert_type, ae.performance_alert_type) as alert_type,
    ae.alert_severity,

    -- Alert context
    JSON_OBJECT(
      'current_reading', ae.current_value,
      'quality_score', ae.quality_score,
      'thresholds', JSON_OBJECT(
        'critical_high', ae.critical_high_threshold,
        'critical_low', ae.critical_low_threshold,
        'warning_high', ae.warning_high_threshold,
        'warning_low', ae.warning_low_threshold
      ),
      'anomaly_indicators', JSON_OBJECT(
        'statistical_anomaly', ae.statistical_anomaly,
        'temporal_anomaly', ae.temporal_anomaly,
        'contextual_anomaly', ae.contextual_anomaly
      ),
      'performance_metrics', JSON_OBJECT(
        'processing_latency_ms', ae.processing_latency,
        'network_latency_ms', ae.network_latency
      )
    ) as alert_context,

    -- Notification urgency
    CASE 
      WHEN ae.alert_severity = 'critical' THEN 'immediate'
      WHEN ae.alert_severity = 'high' THEN 'within_15_minutes'
      WHEN ae.alert_severity = 'medium' THEN 'within_1_hour'
      ELSE 'next_business_day'
    END as notification_urgency,

    -- Recommended actions
    CASE 
      WHEN ae.value_alert_type IN ('critical_high_value', 'critical_low_value') THEN 'emergency_shutdown_consider'
      WHEN ae.quality_alert_type = 'quality_degradation' THEN 'sensor_maintenance_required'
      WHEN ae.anomaly_alert_type IN ('statistical_anomaly_detected', 'temporal_pattern_anomaly') THEN 'investigate_anomaly_cause'
      WHEN ae.performance_alert_type = 'high_processing_latency' THEN 'check_system_resources'
      WHEN ae.performance_alert_type = 'high_network_latency' THEN 'check_network_connectivity'
      ELSE 'standard_investigation'
    END as recommended_action

  FROM alert_evaluation ae
  WHERE ae.alert_severity IS NOT NULL
)

-- Active alerts requiring immediate attention
SELECT 
  aa.alert_timestamp,
  aa.device_id,
  aa.device_type,
  aa.sensor_type,
  aa.facility,

  -- Alert details
  aa.alert_type,
  aa.alert_severity,
  aa.notification_urgency,
  aa.recommended_action,

  -- Current status
  aa.current_value as current_reading,
  aa.quality_score as current_quality,

  -- Alert context for operators
  aa.alert_context,

  -- Time since alert
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - aa.alert_timestamp)) / 60 as minutes_since_alert,

  -- Business impact assessment
  CASE 
    WHEN aa.alert_severity = 'critical' AND aa.device_type = 'safety_system' THEN 'safety_risk'
    WHEN aa.alert_severity = 'critical' AND aa.device_type = 'production_equipment' THEN 'production_impact'
    WHEN aa.alert_severity IN ('critical', 'high') AND aa.sensor_type = 'environmental' THEN 'compliance_risk'
    WHEN aa.alert_severity IN ('critical', 'high') THEN 'operational_impact'
    ELSE 'monitoring_required'
  END as business_impact_level,

  -- Next steps for operators
  JSON_OBJECT(
    'immediate_action', aa.recommended_action,
    'escalation_required', 
      CASE aa.alert_severity 
        WHEN 'critical' THEN true 
        ELSE false 
      END,
    'estimated_resolution_time', 
      CASE aa.alert_type
        WHEN 'quality_degradation' THEN '30-60 minutes'
        WHEN 'statistical_anomaly_detected' THEN '1-4 hours'
        WHEN 'critical_high_value' THEN '15-30 minutes'
        WHEN 'critical_low_value' THEN '15-30 minutes'
        ELSE '1-2 hours'
      END,
    'required_expertise', 
      CASE aa.alert_type
        WHEN 'quality_degradation' THEN 'maintenance_technician'
        WHEN 'statistical_anomaly_detected' THEN 'process_engineer'
        WHEN 'high_processing_latency' THEN 'it_support'
        WHEN 'high_network_latency' THEN 'network_administrator'
        ELSE 'operations_supervisor'
      END
  ) as operational_guidance

FROM active_alerts aa
ORDER BY 
  CASE aa.alert_severity
    WHEN 'critical' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  aa.alert_timestamp DESC;

-- QueryLeaf provides comprehensive IoT time series capabilities:
-- 1. SQL-familiar time series collection creation and optimization
-- 2. High-performance IoT data ingestion with automatic bucketing
-- 3. Real-time analytics and aggregation for sensor data
-- 4. Predictive analytics and trend analysis
-- 5. Comprehensive anomaly detection and alerting
-- 6. Performance monitoring and health assessment
-- 7. Integration with MongoDB's native time series optimizations
-- 8. Enterprise-ready IoT data management with familiar SQL syntax
-- 9. Automatic data lifecycle management and archiving
-- 10. Production-ready scalability for high-volume IoT workloads

Best Practices for Time Series Implementation

IoT Data Architecture and Performance Optimization

Essential principles for effective MongoDB Time Series deployment in IoT environments:

Collection Design: Create purpose-built time series collections with optimal bucketing strategies for different sensor types and data frequencies
Metadata Strategy: Design comprehensive metadata schemas that enable efficient filtering and provide rich context for analytics
Ingestion Optimization: Implement batch ingestion patterns and write concern configurations optimized for IoT write workloads
Query Patterns: Design aggregation pipelines that leverage time series optimizations for common IoT analytics patterns
Real-Time Processing: Implement change streams and real-time processing pipelines for immediate anomaly detection and alerting
Data Lifecycle: Establish automated data retention and archiving strategies to manage long-term storage costs

Production IoT Systems and Operational Excellence

Design time series systems for enterprise IoT requirements:

Scalable Architecture: Implement horizontally scalable time series infrastructure with proper sharding and distribution strategies
Performance Monitoring: Establish comprehensive monitoring for write performance, query latency, and storage utilization
Alert Management: Create intelligent alerting systems that reduce noise while ensuring critical issues are detected promptly
Edge Integration: Design systems that work efficiently with edge computing environments and intermittent connectivity
Security Implementation: Implement device authentication, data encryption, and access controls appropriate for IoT environments
Compliance Features: Build in data governance, audit trails, and regulatory compliance capabilities for industrial applications

Conclusion

MongoDB Time Series Collections provide comprehensive IoT data management capabilities that eliminate the complexity of traditional time-based partitioning and manual optimization through automatic bucketing, intelligent compression, and purpose-built query optimization. The native support for high-volume writes, real-time aggregations, and time-based analytics makes Time Series Collections ideal for modern IoT applications requiring both scale and performance.

Key Time Series Collections benefits include:

Automatic Optimization: Native bucketing and compression eliminate manual partitioning and maintenance overhead
High-Performance Writes: Optimized storage engine designed for high-volume, time-stamped data ingestion
Intelligent Compression: Automatic compression reduces storage costs by up to 90% compared to traditional approaches
Real-Time Analytics: Built-in aggregation optimization for time-based queries and real-time processing
Flexible Data Models: Rich document structure accommodates complex IoT metadata alongside time series measurements
SQL Accessibility: Familiar SQL-style time series operations through QueryLeaf for accessible IoT data management

Whether you're building industrial monitoring systems, smart city infrastructure, environmental sensors, or enterprise IoT platforms, MongoDB Time Series Collections with QueryLeaf's familiar SQL interface provides the foundation for scalable, efficient IoT data management.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB Time Series Collections while providing SQL-familiar syntax for time series data operations, real-time analytics, and IoT-specific query patterns. Advanced time series capabilities including automatic bucketing, predictive analytics, and enterprise alerting are elegantly handled through familiar SQL constructs, making sophisticated IoT data management both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's robust time series capabilities with SQL-style data operations makes it an ideal platform for applications requiring both high-performance IoT data storage and familiar database interaction patterns, ensuring your time series infrastructure can scale efficiently while maintaining operational simplicity and developer productivity.