MongoDB Backup and Recovery Strategies: Advanced Disaster Recovery and Data Protection for Mission-Critical Applications

Production database environments require robust backup and recovery strategies that protect against data loss, system failures, and disaster scenarios while enabling rapid recovery with minimal business disruption. Traditional backup approaches often struggle with large database sizes, complex recovery procedures, and inconsistent scheduling. The result is extended recovery times, potential data loss, and operational complexity that can compromise business continuity during critical incidents.

MongoDB provides comprehensive backup and recovery capabilities through native tools such as mongodump and mongorestore, replica set oplogs that enable point-in-time recovery, and managed backup services, supporting full, incremental, and point-in-time strategies with modest performance impact. Rather than relying on ad-hoc backup scripting and manual recovery procedures, these capabilities can be combined with compression, consistency verification, and automated scheduling to build streamlined backup and recovery workflows.
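
The sketch below illustrates the oplog window that underpins point-in-time recovery: on a replica set member, the oldest and newest entries in local.oplog.rs bound the range of timestamps a restore can replay to. This is a minimal illustration, assuming a replica set deployment and a placeholder connection string.

// Minimal sketch: inspect the oplog window available for point-in-time recovery
const { MongoClient } = require('mongodb');

async function getOplogWindow(uri) {
  const client = new MongoClient(uri);
  await client.connect();

  try {
    const oplog = client.db('local').collection('oplog.rs');

    // The oldest and newest oplog entries define the recoverable time window
    const first = await oplog.find().sort({ $natural: 1 }).limit(1).next();
    const last = await oplog.find().sort({ $natural: -1 }).limit(1).next();

    return { earliestRecoveryPoint: first.wall, latestRecoveryPoint: last.wall };
  } finally {
    await client.close();
  }
}

// Example: getOplogWindow('mongodb://localhost:27017').then(console.log);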

The Traditional Backup and Recovery Challenge

Conventional database backup approaches face significant limitations in enterprise environments:

-- Traditional PostgreSQL backup management - manual processes with limited automation capabilities

-- Basic backup tracking table with minimal functionality
CREATE TABLE backup_jobs (
    backup_id SERIAL PRIMARY KEY,
    backup_name VARCHAR(255) NOT NULL,
    backup_type VARCHAR(100) NOT NULL, -- full, incremental, differential
    database_name VARCHAR(100) NOT NULL,

    -- Backup execution tracking
    backup_start_time TIMESTAMP NOT NULL,
    backup_end_time TIMESTAMP,
    backup_status VARCHAR(50) DEFAULT 'running',

    -- Basic size and performance metrics (limited visibility)
    backup_size_bytes BIGINT,
    backup_duration_seconds INTEGER,
    backup_compression_ratio DECIMAL(5,2),

    -- File location tracking (manual)
    backup_file_path TEXT,
    backup_storage_location VARCHAR(200),
    backup_retention_days INTEGER DEFAULT 30,

    -- Basic validation (very limited)
    backup_checksum VARCHAR(64),
    backup_verification_status VARCHAR(50),
    backup_verification_time TIMESTAMP,

    -- Error tracking
    backup_error_message TEXT,
    backup_warning_count INTEGER DEFAULT 0,

    -- Metadata
    created_by VARCHAR(100) DEFAULT current_user,
    backup_method VARCHAR(100) DEFAULT 'pg_dump'
);

-- Simple backup scheduling table (no real automation)
CREATE TABLE backup_schedules (
    schedule_id SERIAL PRIMARY KEY,
    schedule_name VARCHAR(255) NOT NULL,
    database_name VARCHAR(100) NOT NULL,
    backup_type VARCHAR(100) NOT NULL,

    -- Basic scheduling (cron-like but manual)
    schedule_frequency VARCHAR(50), -- daily, weekly, monthly
    schedule_time TIME,
    schedule_days VARCHAR(20), -- comma-separated day numbers

    -- Basic configuration
    retention_days INTEGER DEFAULT 30,
    backup_location VARCHAR(200),
    compression_enabled BOOLEAN DEFAULT true,

    -- Status tracking
    schedule_enabled BOOLEAN DEFAULT true,
    last_backup_time TIMESTAMP,
    last_backup_status VARCHAR(50),
    next_backup_time TIMESTAMP,

    -- Error tracking
    consecutive_failures INTEGER DEFAULT 0,
    last_error_message TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Manual backup execution function (very basic functionality)
CREATE OR REPLACE FUNCTION execute_backup(
    database_name_param VARCHAR(100),
    backup_type_param VARCHAR(100) DEFAULT 'full'
) RETURNS TABLE (
    backup_id INTEGER,
    backup_status VARCHAR(50),
    backup_duration_seconds INTEGER,
    backup_size_mb INTEGER,
    backup_file_path TEXT,
    error_message TEXT
) AS $$
DECLARE
    new_backup_id INTEGER;
    backup_start TIMESTAMP;
    backup_end TIMESTAMP;
    backup_command TEXT;
    backup_filename TEXT;
    backup_directory TEXT := '/backup/postgresql/';
    command_result INTEGER;
    backup_size BIGINT;
    final_status VARCHAR(50) := 'completed';
    error_msg TEXT := '';
BEGIN
    backup_start := clock_timestamp();

    -- Generate backup filename
    backup_filename := database_name_param || '_' || 
                      backup_type_param || '_' || 
                      TO_CHAR(backup_start, 'YYYY-MM-DD_HH24-MI-SS') || '.sql';

    -- Create backup job record
    INSERT INTO backup_jobs (
        backup_name, backup_type, database_name, 
        backup_start_time, backup_file_path, backup_method
    )
    VALUES (
        backup_filename, backup_type_param, database_name_param,
        backup_start, backup_directory || backup_filename, 'pg_dump'
    )
    RETURNING backup_jobs.backup_id INTO new_backup_id;

    BEGIN
        -- Execute backup command (this is a simulation - real implementation would call external command)
        -- In reality: pg_dump -h localhost -U postgres -d database_name -f backup_file

        -- Simulate backup process with basic validation
        IF database_name_param NOT IN (SELECT datname FROM pg_database) THEN
            RAISE EXCEPTION 'Database % does not exist', database_name_param;
        END IF;

        -- Simulate backup time based on type
        CASE backup_type_param
            WHEN 'full' THEN PERFORM pg_sleep(2.0);  -- Simulate 2 seconds for full backup
            WHEN 'incremental' THEN PERFORM pg_sleep(0.5);  -- Simulate 0.5 seconds for incremental
            ELSE PERFORM pg_sleep(1.0);
        END CASE;

        -- Simulate backup size calculation (very basic)
        SELECT pg_database_size(database_name_param) INTO backup_size;

        -- Basic compression simulation
        backup_size := backup_size * 0.3;  -- Assume 70% compression

    EXCEPTION WHEN OTHERS THEN
        final_status := 'failed';
        error_msg := SQLERRM;
        backup_size := 0;
    END;

    backup_end := clock_timestamp();

    -- Update backup job record
    UPDATE backup_jobs 
    SET 
        backup_end_time = backup_end,
        backup_status = final_status,
        backup_size_bytes = backup_size,
        backup_duration_seconds = EXTRACT(EPOCH FROM (backup_end - backup_start))::INTEGER,
        backup_compression_ratio = CASE WHEN backup_size > 0 THEN 70.0 ELSE 0 END,
        backup_error_message = CASE WHEN final_status = 'failed' THEN error_msg ELSE NULL END
    WHERE backup_jobs.backup_id = new_backup_id;

    -- Return results
    RETURN QUERY SELECT 
        new_backup_id,
        final_status,
        EXTRACT(EPOCH FROM (backup_end - backup_start))::INTEGER,
        (backup_size / 1024 / 1024)::INTEGER,
        backup_directory || backup_filename,
        CASE WHEN final_status = 'failed' THEN error_msg ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Execute a backup (basic functionality)
SELECT * FROM execute_backup('production_db', 'full');

-- Basic backup verification function (very limited)
CREATE OR REPLACE FUNCTION verify_backup(backup_id_param INTEGER)
RETURNS TABLE (
    backup_id INTEGER,
    verification_status VARCHAR(50),
    verification_duration_seconds INTEGER,
    file_exists BOOLEAN,
    file_size_mb INTEGER,
    checksum_valid BOOLEAN,
    error_message TEXT
) AS $$
DECLARE
    backup_record RECORD;
    verification_start TIMESTAMP;
    verification_end TIMESTAMP;
    file_size BIGINT;
    verification_error TEXT := '';
    verification_result VARCHAR(50) := 'valid';
BEGIN
    verification_start := clock_timestamp();

    -- Get backup record
    SELECT * INTO backup_record
    FROM backup_jobs
    WHERE backup_jobs.backup_id = backup_id_param;

    IF NOT FOUND THEN
        RETURN QUERY SELECT 
            backup_id_param,
            'not_found'::VARCHAR(50),
            0,
            false,
            0,
            false,
            'Backup record not found'::TEXT;
        RETURN;
    END IF;

    BEGIN
        -- Simulate file verification (in reality would check actual file)
        -- Check if backup was successful
        IF backup_record.backup_status != 'completed' THEN
            verification_result := 'invalid';
            verification_error := 'Original backup failed';
        END IF;

        -- Simulate file size check
        file_size := backup_record.backup_size_bytes;

        -- Basic integrity simulation
        IF file_size = 0 OR backup_record.backup_duration_seconds = 0 THEN
            verification_result := 'invalid';
            verification_error := 'Backup file appears to be empty or corrupted';
        END IF;

        -- Simulate verification time
        PERFORM pg_sleep(0.1);

    EXCEPTION WHEN OTHERS THEN
        verification_result := 'error';
        verification_error := SQLERRM;
    END;

    verification_end := clock_timestamp();

    -- Update backup record with verification results
    UPDATE backup_jobs
    SET 
        backup_verification_status = verification_result,
        backup_verification_time = verification_end
    WHERE backup_jobs.backup_id = backup_id_param;

    -- Return verification results
    RETURN QUERY SELECT 
        backup_id_param,
        verification_result,
        EXTRACT(EPOCH FROM (verification_end - verification_start))::INTEGER,
        CASE WHEN file_size > 0 THEN true ELSE false END,
        (file_size / 1024 / 1024)::INTEGER,
        CASE WHEN verification_result = 'valid' THEN true ELSE false END,
        CASE WHEN verification_result != 'valid' THEN verification_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Recovery function (very basic and manual)
CREATE OR REPLACE FUNCTION restore_backup(
    backup_id_param INTEGER,
    target_database_name VARCHAR(100)
) RETURNS TABLE (
    restore_success BOOLEAN,
    restore_duration_seconds INTEGER,
    restored_size_mb INTEGER,
    error_message TEXT
) AS $$
DECLARE
    backup_record RECORD;
    restore_start TIMESTAMP;
    restore_end TIMESTAMP;
    restore_error TEXT := '';
    restore_result BOOLEAN := true;
BEGIN
    restore_start := clock_timestamp();

    -- Get backup information
    SELECT * INTO backup_record
    FROM backup_jobs
    WHERE backup_id = backup_id_param
    AND backup_status = 'completed';

    IF NOT FOUND THEN
        RETURN QUERY SELECT 
            false,
            0,
            0,
            'Valid backup not found for restore operation'::TEXT;
        RETURN;
    END IF;

    BEGIN
        -- Simulate restore process (in reality would execute psql command)
        -- psql -h localhost -U postgres -d target_database -f backup_file

        -- Basic validation
        IF target_database_name IS NULL OR LENGTH(target_database_name) = 0 THEN
            RAISE EXCEPTION 'Target database name is required';
        END IF;

        -- Simulate restore time proportional to backup size
        PERFORM pg_sleep(LEAST(backup_record.backup_duration_seconds * 1.5, 10.0));

    EXCEPTION WHEN OTHERS THEN
        restore_result := false;
        restore_error := SQLERRM;
    END;

    restore_end := clock_timestamp();

    -- Return restore results
    RETURN QUERY SELECT 
        restore_result,
        EXTRACT(EPOCH FROM (restore_end - restore_start))::INTEGER,
        (backup_record.backup_size_bytes / 1024 / 1024)::INTEGER,
        CASE WHEN NOT restore_result THEN restore_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Basic backup monitoring and cleanup
WITH backup_status_summary AS (
    SELECT 
        DATE_TRUNC('day', backup_start_time) as backup_date,
        database_name,
        backup_type,
        COUNT(*) as total_backups,
        COUNT(*) FILTER (WHERE backup_status = 'completed') as successful_backups,
        COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups,
        SUM(backup_size_bytes) as total_backup_size_bytes,
        AVG(backup_duration_seconds) as avg_backup_duration,
        MIN(backup_start_time) as first_backup,
        MAX(backup_start_time) as last_backup

    FROM backup_jobs
    WHERE backup_start_time >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY DATE_TRUNC('day', backup_start_time), database_name, backup_type
)
SELECT 
    backup_date,
    database_name,
    backup_type,
    total_backups,
    successful_backups,
    failed_backups,

    -- Success rate
    CASE 
        WHEN total_backups > 0 THEN
            ROUND((successful_backups::DECIMAL / total_backups) * 100, 1)
        ELSE 0
    END as success_rate_percent,

    -- Size and performance metrics
    ROUND((total_backup_size_bytes / 1024.0 / 1024.0), 1) as total_size_mb,
    ROUND(avg_backup_duration::NUMERIC, 1) as avg_duration_seconds,

    -- Backup frequency analysis
    (EXTRACT(EPOCH FROM (last_backup - first_backup)) / 3600)::INTEGER as backup_window_hours,

    -- Health assessment
    CASE 
        WHEN failed_backups > 0 THEN 'issues'
        WHEN successful_backups = 0 THEN 'no_backups'
        ELSE 'healthy'
    END as backup_health,

    -- Recommendations
    CASE 
        WHEN failed_backups > total_backups * 0.2 THEN 'investigate_failures'
        WHEN avg_backup_duration > 3600 THEN 'optimize_performance'
        WHEN total_backup_size_bytes > 100 * 1024 * 1024 * 1024 THEN 'consider_compression'
        ELSE 'monitor'
    END as recommendation

FROM backup_status_summary
ORDER BY backup_date DESC, database_name, backup_type;

-- Cleanup old backups (manual process)
WITH old_backups AS (
    SELECT backup_id, backup_file_path, backup_size_bytes
    FROM backup_jobs
    WHERE backup_start_time < CURRENT_DATE - INTERVAL '90 days'
    AND backup_status = 'completed'
),
cleanup_summary AS (
    DELETE FROM backup_jobs
    WHERE backup_id IN (SELECT backup_id FROM old_backups)
    RETURNING backup_id, backup_size_bytes
)
SELECT 
    COUNT(*) as backups_cleaned,
    SUM(backup_size_bytes) as total_space_freed_bytes,
    ROUND(SUM(backup_size_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as space_freed_gb
FROM cleanup_summary;

-- Problems with traditional backup approaches:
-- 1. Manual backup execution with no automation or scheduling
-- 2. Limited backup verification and integrity checking
-- 3. No point-in-time recovery capabilities
-- 4. Basic error handling with no automatic retry mechanisms
-- 5. No incremental backup support or optimization
-- 6. Manual cleanup and retention management
-- 7. Limited monitoring and alerting capabilities
-- 8. No support for distributed backup strategies
-- 9. Complex recovery procedures requiring manual intervention
-- 10. No integration with cloud storage or disaster recovery systems

MongoDB provides comprehensive backup and recovery capabilities with automated scheduling and management:

// MongoDB Advanced Backup and Recovery - comprehensive data protection with automated disaster recovery
const { MongoClient, GridFSBucket } = require('mongodb');
const { spawn } = require('child_process');
const fs = require('fs').promises;
const { createReadStream } = require('fs');
const path = require('path');
const { createHash } = require('crypto');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Backup and Recovery Manager
class AdvancedBackupRecoveryManager extends EventEmitter {
  constructor(connectionString, backupConfig = {}) {
    super();
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    // Advanced backup and recovery configuration
    this.config = {
      // Backup strategy configuration
      enableAutomatedBackups: backupConfig.enableAutomatedBackups !== false,
      enableIncrementalBackups: backupConfig.enableIncrementalBackups || false,
      enablePointInTimeRecovery: backupConfig.enablePointInTimeRecovery || false,
      enableCompression: backupConfig.enableCompression !== false,

      // Backup scheduling
      fullBackupSchedule: backupConfig.fullBackupSchedule || '0 2 * * *', // Daily at 2 AM
      incrementalBackupSchedule: backupConfig.incrementalBackupSchedule || '0 */6 * * *', // Every 6 hours

      // Storage configuration
      backupStoragePath: backupConfig.backupStoragePath || './backups',
      maxBackupSize: backupConfig.maxBackupSize || 10 * 1024 * 1024 * 1024, // 10GB
      compressionLevel: backupConfig.compressionLevel || 6,

      // Retention policies
      dailyBackupRetention: backupConfig.dailyBackupRetention || 30, // 30 days
      weeklyBackupRetention: backupConfig.weeklyBackupRetention || 12, // 12 weeks
      monthlyBackupRetention: backupConfig.monthlyBackupRetention || 12, // 12 months

      // Backup validation
      enableBackupVerification: backupConfig.enableBackupVerification !== false,
      verificationSampleSize: backupConfig.verificationSampleSize || 1000,
      enableChecksumValidation: backupConfig.enableChecksumValidation !== false,

      // Recovery configuration
      enableParallelRecovery: backupConfig.enableParallelRecovery || false,
      maxRecoveryThreads: backupConfig.maxRecoveryThreads || 4,
      recoveryBatchSize: backupConfig.recoveryBatchSize || 1000,

      // Monitoring and alerting
      enableBackupMonitoring: backupConfig.enableBackupMonitoring !== false,
      enableRecoveryTesting: backupConfig.enableRecoveryTesting || false,
      alertThresholds: {
        backupFailureCount: backupConfig.backupFailureThreshold || 3,
        backupDurationMinutes: backupConfig.backupDurationThreshold || 120,
        backupSizeVariation: backupConfig.backupSizeVariationThreshold || 50
      },

      // Disaster recovery
      enableReplication: backupConfig.enableReplication || false,
      replicationTargets: backupConfig.replicationTargets || [],
      enableCloudSync: backupConfig.enableCloudSync || false,
      cloudSyncConfig: backupConfig.cloudSyncConfig || {}
    };

    // Backup and recovery state management
    this.backupJobs = new Map();
    this.scheduledBackups = new Map();
    this.recoveryOperations = new Map();
    this.backupMetrics = {
      totalBackups: 0,
      successfulBackups: 0,
      failedBackups: 0,
      totalDataBackedUp: 0,
      averageBackupDuration: 0
    };

    // Backup history and metadata
    this.backupHistory = [];
    this.recoveryHistory = [];

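    // Note: initialization runs asynchronously (constructors cannot await); callers should
    // wait for it to complete before starting backup or recovery operations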
    this.initializeBackupSystem();
  }

  async initializeBackupSystem() {
    console.log('Initializing advanced backup and recovery system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Setup backup infrastructure
      await this.setupBackupInfrastructure();

      // Initialize automated backup scheduling
      if (this.config.enableAutomatedBackups) {
        await this.setupAutomatedBackups();
      }

      // Setup backup monitoring
      if (this.config.enableBackupMonitoring) {
        await this.setupBackupMonitoring();
      }

      // Initialize point-in-time recovery if enabled
      if (this.config.enablePointInTimeRecovery) {
        await this.setupPointInTimeRecovery();
      }

      console.log('Advanced backup and recovery system initialized successfully');

    } catch (error) {
      console.error('Error initializing backup system:', error);
      throw error;
    }
  }

  async setupBackupInfrastructure() {
    console.log('Setting up backup infrastructure...');

    try {
      // Create backup storage directory
      await fs.mkdir(this.config.backupStoragePath, { recursive: true });

      // Create subdirectories for different backup types
      const backupDirs = ['full', 'incremental', 'logs', 'metadata', 'recovery-points'];
      for (const dir of backupDirs) {
        await fs.mkdir(path.join(this.config.backupStoragePath, dir), { recursive: true });
      }

      // Setup backup metadata collections
      const collections = {
        backupJobs: this.db.collection('backup_jobs'),
        backupMetadata: this.db.collection('backup_metadata'),
        recoveryOperations: this.db.collection('recovery_operations'),
        backupSchedules: this.db.collection('backup_schedules')
      };

      // Create indexes for backup operations
      await collections.backupJobs.createIndex(
        { startTime: -1, status: 1 },
        { background: true }
      );

      await collections.backupMetadata.createIndex(
        { backupId: 1, backupType: 1, timestamp: -1 },
        { background: true }
      );

      await collections.recoveryOperations.createIndex(
        { recoveryId: 1, startTime: -1 },
        { background: true }
      );

      this.collections = collections;

    } catch (error) {
      console.error('Error setting up backup infrastructure:', error);
      throw error;
    }
  }

  async createFullBackup(backupOptions = {}) {
    console.log('Starting full database backup...');

    const backupId = this.generateBackupId('full');
    const startTime = new Date();

    try {
      // Create backup job record
      const backupJob = {
        backupId: backupId,
        backupType: 'full',
        startTime: startTime,
        status: 'running',

        // Backup configuration
        options: {
          compression: this.config.enableCompression,
          compressionLevel: this.config.compressionLevel,
          includeIndexes: backupOptions.includeIndexes !== false,
          includeSystemCollections: backupOptions.includeSystemCollections || false,
          oplogCapture: this.config.enablePointInTimeRecovery
        },

        // Progress tracking
        progress: {
          collectionsProcessed: 0,
          totalCollections: 0,
          documentsProcessed: 0,
          totalDocuments: 0,
          bytesProcessed: 0,
          estimatedTotalBytes: 0
        },

        // Performance metrics
        performance: {
          throughputMBps: 0,
          compressionRatio: 0,
          parallelStreams: 1
        }
      };

      await this.collections.backupJobs.insertOne(backupJob);
      this.backupJobs.set(backupId, backupJob);

      // Get database statistics for progress tracking
      const dbStats = await this.db.stats();
      backupJob.progress.estimatedTotalBytes = dbStats.dataSize;

      // Get collection list and metadata
      const collections = await this.db.listCollections().toArray();
      backupJob.progress.totalCollections = collections.length;

      // Calculate total document count across collections
      let totalDocuments = 0;
      for (const collectionInfo of collections) {
        if (collectionInfo.type === 'collection') {
          const collection = this.db.collection(collectionInfo.name);
          const count = await collection.estimatedDocumentCount();
          totalDocuments += count;
        }
      }
      backupJob.progress.totalDocuments = totalDocuments;

      // Create backup using mongodump
      const backupResult = await this.executeMongoDump(backupId, backupJob);

      // Verify backup integrity
      if (this.config.enableBackupVerification) {
        await this.verifyBackupIntegrity(backupId, backupResult);
      }

      // Calculate backup metrics
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();
      const backupSizeBytes = backupResult.backupSize;
      const compressionRatio = backupResult.originalSize > 0 ? 
        (backupResult.originalSize - backupSizeBytes) / backupResult.originalSize : 0;

      // Update backup job with results
      const completedJob = {
        ...backupJob,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        backupSize: backupSizeBytes,
        originalSize: backupResult.originalSize,
        compressionRatio: compressionRatio,
        backupPath: backupResult.backupPath,
        checksum: backupResult.checksum,

        // Final performance metrics
        performance: {
          throughputMBps: (backupSizeBytes / 1024 / 1024) / (duration / 1000),
          compressionRatio: compressionRatio,
          parallelStreams: backupResult.parallelStreams || 1
        }
      };

      await this.collections.backupJobs.replaceOne(
        { backupId: backupId },
        completedJob
      );

      // Update backup metrics
      this.updateBackupMetrics(completedJob);

      // Store backup metadata for recovery operations
      await this.storeBackupMetadata(completedJob);

      this.emit('backupCompleted', {
        backupId: backupId,
        backupType: 'full',
        duration: duration,
        backupSize: backupSizeBytes,
        compressionRatio: compressionRatio
      });

      console.log(`Full backup completed: ${backupId} (${Math.round(backupSizeBytes / 1024 / 1024)} MB, ${Math.round(duration / 1000)}s)`);

      return {
        success: true,
        backupId: backupId,
        backupSize: backupSizeBytes,
        duration: duration,
        compressionRatio: compressionRatio,
        backupPath: backupResult.backupPath
      };

    } catch (error) {
      console.error(`Full backup failed for ${backupId}:`, error);

      // Update backup job with error
      await this.collections.backupJobs.updateOne(
        { backupId: backupId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      this.backupMetrics.failedBackups++;

      this.emit('backupFailed', {
        backupId: backupId,
        backupType: 'full',
        error: error.message
      });

      return {
        success: false,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async executeMongoDump(backupId, backupJob) {
    console.log(`Executing mongodump for backup: ${backupId}`);

    return new Promise((resolve, reject) => {
      const backupPath = path.join(
        this.config.backupStoragePath,
        'full',
        `${backupId}.archive`
      );

      // Build mongodump command arguments
      const mongodumpArgs = [
        '--uri', this.connectionString,
        '--archive=' + backupPath,
        '--gzip'
      ];

      // Add additional options based on configuration
      if (backupJob.options.oplogCapture) {
        mongodumpArgs.push('--oplog');
      }

      if (!backupJob.options.includeSystemCollections) {
        // mongodump exclude flags do not accept wildcards; use the prefix form instead
        mongodumpArgs.push('--excludeCollectionsWithPrefix=system.');
      }

      // Execute mongodump
      const mongodumpProcess = spawn('mongodump', mongodumpArgs);

      let stdoutData = '';
      let stderrData = '';

      mongodumpProcess.stdout.on('data', (data) => {
        stdoutData += data.toString();
      });

      // mongodump writes its progress and log output to stderr rather than stdout
      mongodumpProcess.stderr.on('data', (data) => {
        stderrData += data.toString();
        this.parseBackupProgress(backupId, data.toString());
      });

      mongodumpProcess.on('close', async (code) => {
        try {
          if (code === 0) {
            // Get backup file statistics
            const stats = await fs.stat(backupPath);
            const backupSize = stats.size;

            // Calculate checksum for integrity verification
            const checksum = await this.calculateFileChecksum(backupPath);

            resolve({
              backupPath: backupPath,
              backupSize: backupSize,
              originalSize: backupJob.progress.estimatedTotalBytes,
              checksum: checksum,
              stdout: stdoutData,
              parallelStreams: 1
            });
          } else {
            reject(new Error(`mongodump failed with exit code ${code}: ${stderrData}`));
          }
        } catch (error) {
          reject(error);
        }
      });

      mongodumpProcess.on('error', (error) => {
        reject(new Error(`Failed to start mongodump: ${error.message}`));
      });
    });
  }

  parseBackupProgress(backupId, output) {
    // Parse mongodump output to extract progress information
    const backupJob = this.backupJobs.get(backupId);
    if (!backupJob) return;

    // Look for progress indicators in mongodump output
    const progressMatches = output.match(/(\d+)\s+documents?\s+to\s+(\w+)\.(\w+)/g);
    if (progressMatches) {
      for (const match of progressMatches) {
        const [, docCount, dbName, collectionName] = match.match(/(\d+)\s+documents?\s+to\s+(\w+)\.(\w+)/);

        backupJob.progress.documentsProcessed += parseInt(docCount);
        backupJob.progress.collectionsProcessed++;

        // Emit progress update
        this.emit('backupProgress', {
          backupId: backupId,
          progress: {
            collectionsProcessed: backupJob.progress.collectionsProcessed,
            totalCollections: backupJob.progress.totalCollections,
            documentsProcessed: backupJob.progress.documentsProcessed,
            totalDocuments: backupJob.progress.totalDocuments,
            percentComplete: (backupJob.progress.documentsProcessed / backupJob.progress.totalDocuments) * 100
          }
        });
      }
    }
  }

  async calculateFileChecksum(filePath) {
    console.log(`Calculating checksum for: ${filePath}`);

    // Stream the file through the hash so large backup archives are not loaded into memory
    return new Promise((resolve, reject) => {
      const hash = createHash('sha256');
      const stream = createReadStream(filePath);

      stream.on('data', (chunk) => hash.update(chunk));
      stream.on('end', () => resolve(hash.digest('hex')));
      stream.on('error', (error) => {
        console.error('Error calculating file checksum:', error);
        reject(error);
      });
    });
  }

  async verifyBackupIntegrity(backupId, backupResult) {
    console.log(`Verifying backup integrity: ${backupId}`);

    try {
      const verification = {
        backupId: backupId,
        verificationTime: new Date(),
        checksumVerified: false,
        sampleVerified: false,
        errors: []
      };

      // Verify file checksum
      const currentChecksum = await this.calculateFileChecksum(backupResult.backupPath);
      verification.checksumVerified = currentChecksum === backupResult.checksum;

      if (!verification.checksumVerified) {
        verification.errors.push('Checksum verification failed - file may be corrupted');
      }

      // Perform sample restore verification
      if (this.config.verificationSampleSize > 0) {
        const sampleResult = await this.performSampleRestoreTest(backupId, backupResult);
        verification.sampleVerified = sampleResult.success;

        if (!sampleResult.success) {
          verification.errors.push(`Sample restore failed: ${sampleResult.error}`);
        }
      }

      // Store verification results
      await this.collections.backupMetadata.updateOne(
        { backupId: backupId },
        {
          $set: {
            verification: verification,
            lastVerificationTime: verification.verificationTime
          }
        },
        { upsert: true }
      );

      this.emit('backupVerified', {
        backupId: backupId,
        verification: verification
      });

      return verification;

    } catch (error) {
      console.error(`Backup verification failed for ${backupId}:`, error);
      throw error;
    }
  }

  async performSampleRestoreTest(backupId, backupResult) {
    console.log(`Performing sample restore test for backup: ${backupId}`);

    try {
      // Create temporary database for restore test
      const testDbName = `backup_test_${backupId}_${Date.now()}`;

      // Execute mongorestore on sample data
      const restoreResult = await this.executeSampleRestore(
        backupResult.backupPath,
        testDbName
      );

      // Verify restored data integrity
      const verificationResult = await this.verifySampleData(testDbName);

      // Cleanup test database
      await this.cleanupTestDatabase(testDbName);

      return {
        success: restoreResult.success && verificationResult.success,
        error: restoreResult.error || verificationResult.error,
        restoredDocuments: restoreResult.documentCount,
        verificationDetails: verificationResult
      };

    } catch (error) {
      console.error(`Sample restore test failed for ${backupId}:`, error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async createIncrementalBackup(baseBackupId, backupOptions = {}) {
    console.log(`Starting incremental backup based on: ${baseBackupId}`);

    const backupId = this.generateBackupId('incremental');
    const startTime = new Date();

    try {
      // Get base backup metadata
      const baseBackup = await this.collections.backupJobs.findOne({ backupId: baseBackupId });
      if (!baseBackup) {
        throw new Error(`Base backup not found: ${baseBackupId}`);
      }

      // Create incremental backup job record
      const backupJob = {
        backupId: backupId,
        backupType: 'incremental',
        baseBackupId: baseBackupId,
        startTime: startTime,
        status: 'running',

        // Incremental backup specific configuration
        options: {
          ...backupOptions,
          fromTimestamp: baseBackup.endTime,
          toTimestamp: startTime,
          oplogOnly: true,
          compression: this.config.enableCompression
        },

        progress: {
          oplogEntriesProcessed: 0,
          totalOplogEntries: 0,
          bytesProcessed: 0
        }
      };

      await this.collections.backupJobs.insertOne(backupJob);
      this.backupJobs.set(backupId, backupJob);

      // Execute incremental backup using oplog
      const backupResult = await this.executeOplogBackup(backupId, backupJob);

      // Update backup job with results
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();

      const completedJob = {
        ...backupJob,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        backupSize: backupResult.backupSize,
        oplogEntries: backupResult.oplogEntries,
        backupPath: backupResult.backupPath,
        checksum: backupResult.checksum
      };

      await this.collections.backupJobs.replaceOne(
        { backupId: backupId },
        completedJob
      );

      this.updateBackupMetrics(completedJob);
      await this.storeBackupMetadata(completedJob);

      this.emit('backupCompleted', {
        backupId: backupId,
        backupType: 'incremental',
        baseBackupId: baseBackupId,
        duration: duration,
        backupSize: backupResult.backupSize,
        oplogEntries: backupResult.oplogEntries
      });

      console.log(`Incremental backup completed: ${backupId}`);

      return {
        success: true,
        backupId: backupId,
        baseBackupId: baseBackupId,
        backupSize: backupResult.backupSize,
        duration: duration,
        oplogEntries: backupResult.oplogEntries
      };

    } catch (error) {
      console.error(`Incremental backup failed for ${backupId}:`, error);

      await this.collections.backupJobs.updateOne(
        { backupId: backupId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      return {
        success: false,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async restoreFromBackup(backupId, restoreOptions = {}) {
    console.log(`Starting database restore from backup: ${backupId}`);

    const recoveryId = this.generateRecoveryId();
    const startTime = new Date();

    try {
      // Get backup metadata
      const backupJob = await this.collections.backupJobs.findOne({ backupId: backupId });
      if (!backupJob || backupJob.status !== 'completed') {
        throw new Error(`Valid backup not found: ${backupId}`);
      }

      // Create recovery operation record
      const recoveryOperation = {
        recoveryId: recoveryId,
        backupId: backupId,
        backupType: backupJob.backupType,
        startTime: startTime,
        status: 'running',

        // Recovery configuration
        options: {
          targetDatabase: restoreOptions.targetDatabase || this.db.databaseName,
          dropBeforeRestore: restoreOptions.dropBeforeRestore || false,
          restoreIndexes: restoreOptions.restoreIndexes !== false,
          parallelRecovery: this.config.enableParallelRecovery,
          batchSize: this.config.recoveryBatchSize
        },

        progress: {
          collectionsRestored: 0,
          totalCollections: 0,
          documentsRestored: 0,
          totalDocuments: 0,
          bytesRestored: 0
        }
      };

      await this.collections.recoveryOperations.insertOne(recoveryOperation);
      this.recoveryOperations.set(recoveryId, recoveryOperation);

      // Execute restore process
      const restoreResult = await this.executeRestore(recoveryId, backupJob, recoveryOperation);

      // Verify restore integrity
      if (this.config.enableBackupVerification) {
        await this.verifyRestoreIntegrity(recoveryId, restoreResult);
      }

      // Update recovery operation with results
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();

      const completedRecovery = {
        ...recoveryOperation,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored,
        collectionsRestored: restoreResult.collectionsRestored
      };

      await this.collections.recoveryOperations.replaceOne(
        { recoveryId: recoveryId },
        completedRecovery
      );

      this.recoveryHistory.push(completedRecovery);

      this.emit('restoreCompleted', {
        recoveryId: recoveryId,
        backupId: backupId,
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored
      });

      console.log(`Database restore completed: ${recoveryId}`);

      return {
        success: true,
        recoveryId: recoveryId,
        backupId: backupId,
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored,
        collectionsRestored: restoreResult.collectionsRestored
      };

    } catch (error) {
      console.error(`Database restore failed for ${recoveryId}:`, error);

      await this.collections.recoveryOperations.updateOne(
        { recoveryId: recoveryId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      return {
        success: false,
        recoveryId: recoveryId,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async getBackupStatus(backupId = null) {
    console.log(`Getting backup status${backupId ? ' for: ' + backupId : ' (all backups)'}`);

    try {
      let query = {};
      if (backupId) {
        query.backupId = backupId;
      }

      const backups = await this.collections.backupJobs
        .find(query)
        .sort({ startTime: -1 })
        .limit(backupId ? 1 : 50)
        .toArray();

      const backupStatuses = backups.map(backup => ({
        backupId: backup.backupId,
        backupType: backup.backupType,
        status: backup.status,
        startTime: backup.startTime,
        endTime: backup.endTime,
        duration: backup.duration,
        backupSize: backup.backupSize,
        compressionRatio: backup.compressionRatio,
        documentsProcessed: backup.progress?.documentsProcessed || 0,
        collectionsProcessed: backup.progress?.collectionsProcessed || 0,
        error: backup.error?.message || null,

        // Additional metadata
        baseBackupId: backup.baseBackupId || null,
        checksum: backup.checksum || null,
        backupPath: backup.backupPath || null,

        // Performance metrics
        throughputMBps: backup.performance?.throughputMBps || 0,

        // Health indicators
        healthStatus: this.assessBackupHealth(backup),
        lastVerificationTime: backup.verification?.verificationTime || null,
        verificationStatus: backup.verification?.checksumVerified ? 'verified' : 'pending'
      }));

      return {
        success: true,
        backups: backupStatuses,
        totalBackups: backups.length,

        // System-wide metrics
        systemMetrics: {
          totalBackups: this.backupMetrics.totalBackups,
          successfulBackups: this.backupMetrics.successfulBackups,
          failedBackups: this.backupMetrics.failedBackups,
          averageBackupDuration: this.backupMetrics.averageBackupDuration,
          totalDataBackedUp: this.backupMetrics.totalDataBackedUp
        }
      };

    } catch (error) {
      console.error('Error getting backup status:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  assessBackupHealth(backup) {
    if (backup.status === 'failed') return 'unhealthy';
    if (backup.status === 'running') return 'in_progress';
    if (backup.status !== 'completed') return 'unknown';

    // Check verification status
    if (backup.verification && !backup.verification.checksumVerified) {
      return 'verification_failed';
    }

    // Check backup age
    const ageHours = (Date.now() - backup.startTime.getTime()) / (1000 * 60 * 60);
    if (ageHours > 24 * 7) return 'stale'; // Older than 1 week

    return 'healthy';
  }

  updateBackupMetrics(backupJob) {
    this.backupMetrics.totalBackups++;

    if (backupJob.status === 'completed') {
      this.backupMetrics.successfulBackups++;
      this.backupMetrics.totalDataBackedUp += backupJob.backupSize || 0;

      // Update average duration
      const currentAvg = this.backupMetrics.averageBackupDuration;
      const totalSuccessful = this.backupMetrics.successfulBackups;
      this.backupMetrics.averageBackupDuration = 
        ((currentAvg * (totalSuccessful - 1)) + (backupJob.duration || 0)) / totalSuccessful;
    } else if (backupJob.status === 'failed') {
      this.backupMetrics.failedBackups++;
    }
  }

  async storeBackupMetadata(backupJob) {
    const metadata = {
      backupId: backupJob.backupId,
      backupType: backupJob.backupType,
      timestamp: backupJob.startTime,
      backupSize: backupJob.backupSize,
      backupPath: backupJob.backupPath,
      checksum: backupJob.checksum,
      compressionRatio: backupJob.compressionRatio,
      baseBackupId: backupJob.baseBackupId || null,

      // Retention information
      retentionPolicy: this.determineRetentionPolicy(backupJob),
      expirationDate: this.calculateExpirationDate(backupJob),

      // Recovery information
      recoveryMetadata: {
        documentsCount: backupJob.progress?.documentsProcessed || 0,
        collectionsCount: backupJob.progress?.collectionsProcessed || 0,
        indexesIncluded: backupJob.options?.includeIndexes !== false,
        oplogIncluded: backupJob.options?.oplogCapture === true
      }
    };

    await this.collections.backupMetadata.replaceOne(
      { backupId: backupJob.backupId },
      metadata,
      { upsert: true }
    );
  }

  determineRetentionPolicy(backupJob) {
    const dayOfWeek = backupJob.startTime.getDay();
    const dayOfMonth = backupJob.startTime.getDate();

    if (dayOfMonth === 1) return 'monthly';
    if (dayOfWeek === 0) return 'weekly'; // Sunday
    return 'daily';
  }

  calculateExpirationDate(backupJob) {
    const retentionPolicy = this.determineRetentionPolicy(backupJob);
    const startTime = backupJob.startTime;

    switch (retentionPolicy) {
      case 'monthly':
        return new Date(startTime.getTime() + (this.config.monthlyBackupRetention * 30 * 24 * 60 * 60 * 1000));
      case 'weekly':
        return new Date(startTime.getTime() + (this.config.weeklyBackupRetention * 7 * 24 * 60 * 60 * 1000));
      default:
        return new Date(startTime.getTime() + (this.config.dailyBackupRetention * 24 * 60 * 60 * 1000));
    }
  }

  generateBackupId(type) {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    return `backup_${type}_${timestamp}_${Math.random().toString(36).slice(2, 11)}`;
  }

  generateRecoveryId() {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    return `recovery_${timestamp}_${Math.random().toString(36).slice(2, 11)}`;
  }

  async shutdown() {
    console.log('Shutting down backup and recovery manager...');

    try {
      // Stop all scheduled backups
      for (const [scheduleId, schedule] of this.scheduledBackups.entries()) {
        clearInterval(schedule.interval);
      }

      // Wait for active backup jobs to complete
      for (const [backupId, backupJob] of this.backupJobs.entries()) {
        if (backupJob.status === 'running') {
          console.log(`Waiting for backup to complete: ${backupId}`);
          // In a real implementation, we would wait for or gracefully cancel the backup
        }
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Backup and recovery manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }

  // Additional methods would include implementations for:
  // - setupAutomatedBackups()
  // - setupBackupMonitoring() 
  // - setupPointInTimeRecovery()
  // - executeOplogBackup()  (an illustrative sketch is shown below)
  // - executeRestore()
  // - executeSampleRestore()
  // - verifySampleData()
  // - cleanupTestDatabase()
  // - verifyRestoreIntegrity()
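
  // Illustrative sketch of executeOplogBackup() (an assumption, not from the original
  // article): capture oplog entries written after the base backup completed, using the
  // replica set's local.oplog.rs collection. A production implementation would stream
  // BSON to disk and handle resumability; JSON output here is for demonstration only.
  async executeOplogBackup(backupId, backupJob) {
    const backupPath = path.join(
      this.config.backupStoragePath,
      'incremental',
      `${backupId}.oplog.json`
    );

    // Requires a replica set member; 'wall' is the wall-clock timestamp on oplog entries
    const oplog = this.client.db('local').collection('oplog.rs');
    const entries = await oplog
      .find({ wall: { $gt: backupJob.options.fromTimestamp } })
      .toArray();

    backupJob.progress.oplogEntriesProcessed = entries.length;

    // Persist captured entries and record size/checksum for later verification
    await fs.writeFile(backupPath, JSON.stringify(entries));
    const stats = await fs.stat(backupPath);
    const checksum = await this.calculateFileChecksum(backupPath);

    return {
      backupPath: backupPath,
      backupSize: stats.size,
      oplogEntries: entries.length,
      checksum: checksum
    };
  }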
}

// Benefits of MongoDB Advanced Backup and Recovery:
// - Automated backup scheduling with flexible retention policies
// - Comprehensive backup verification and integrity checking
// - Point-in-time recovery capabilities with oplog integration
// - Incremental backup support for efficient storage utilization
// - Advanced compression and optimization for large databases
// - Parallel backup and recovery operations for improved performance
// - Comprehensive monitoring and alerting for backup operations
// - Disaster recovery capabilities with replication and cloud sync
// - SQL-compatible backup management through QueryLeaf integration
// - Production-ready backup automation with minimal configuration

module.exports = {
  AdvancedBackupRecoveryManager
};
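
A minimal usage sketch follows, assuming a local MongoDB deployment, a writable ./backups directory, and a placeholder module path; the explicit wait for initialization is a simplification, since the constructor above starts initialization asynchronously.

// Usage sketch: run a full backup and inspect its status
const { AdvancedBackupRecoveryManager } = require('./advanced-backup-recovery-manager');

async function runNightlyBackup() {
  const manager = new AdvancedBackupRecoveryManager('mongodb://localhost:27017/production_db', {
    backupStoragePath: './backups',
    enableBackupVerification: true,
    dailyBackupRetention: 30
  });

  // The constructor kicks off async initialization; a production version would expose
  // and await an explicit ready promise instead of sleeping here
  await new Promise((resolve) => setTimeout(resolve, 5000));

  manager.on('backupProgress', (update) => console.log('progress:', update.progress));

  const result = await manager.createFullBackup({ includeIndexes: true });
  console.log('Backup result:', result);

  const status = await manager.getBackupStatus(result.backupId);
  console.log('Backup status:', status);

  await manager.shutdown();
}

runNightlyBackup().catch(console.error);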

Understanding MongoDB Backup and Recovery Architecture

Advanced Backup Strategy Design and Implementation Patterns

Implement comprehensive backup and recovery workflows for enterprise MongoDB deployments:

// Enterprise-grade MongoDB backup and recovery with advanced disaster recovery capabilities
class EnterpriseBackupStrategy extends AdvancedBackupRecoveryManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableGeographicReplication: true,
      enableComplianceAuditing: true,
      enableAutomatedTesting: true,
      enableDisasterRecoveryProcedures: true,
      enableCapacityPlanning: true
    };

    this.setupEnterpriseBackupStrategy();
    this.initializeDisasterRecoveryProcedures();
    this.setupComplianceAuditing();
  }

  async implementAdvancedBackupStrategy() {
    console.log('Implementing enterprise backup strategy...');

    const backupStrategy = {
      // Multi-tier backup strategy
      backupTiers: {
        primaryBackups: {
          frequency: 'daily',
          retentionDays: 30,
          compressionLevel: 9,
          verificationLevel: 'full'
        },
        secondaryBackups: {
          frequency: 'hourly',
          retentionDays: 7,
          compressionLevel: 6,
          verificationLevel: 'checksum'
        },
        archivalBackups: {
          frequency: 'monthly',
          retentionMonths: 84, // 7 years for compliance
          compressionLevel: 9,
          verificationLevel: 'full'
        }
      },

      // Disaster recovery configuration
      disasterRecovery: {
        geographicReplication: true,
        crossRegionBackups: true,
        automatedFailoverTesting: true,
        recoveryTimeObjective: 4 * 60 * 60 * 1000, // 4 hours
        recoveryPointObjective: 15 * 60 * 1000 // 15 minutes
      },

      // Performance optimization
      performanceOptimization: {
        parallelBackupStreams: 8,
        networkOptimization: true,
        storageOptimization: true,
        resourceThrottling: true
      }
    };

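    // deployEnterpriseStrategy() is assumed to be implemented elsewhere and is not shown here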
    return await this.deployEnterpriseStrategy(backupStrategy);
  }

  async setupComplianceAuditing() {
    console.log('Setting up compliance auditing for backup operations...');

    const auditingConfig = {
      // Regulatory compliance
      regulations: ['SOX', 'GDPR', 'HIPAA', 'PCI-DSS'],
      auditTrailRetention: 7 * 365, // 7 years
      encryptionStandards: ['AES-256', 'RSA-2048'],
      accessControlAuditing: true,

      // Data governance
      dataClassification: {
        sensitiveDataHandling: true,
        dataRetentionPolicies: true,
        dataLineageTracking: true,
        privacyCompliance: true
      }
    };

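    // deployComplianceFramework() is assumed to be implemented elsewhere and is not shown here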
    return await this.deployComplianceFramework(auditingConfig);
  }
}
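
The disaster recovery block above states a 4-hour recovery time objective (RTO) and a 15-minute recovery point objective (RPO). A simple way to sanity-check a backup strategy against those targets is to compare the incremental/oplog capture interval with the RPO and the slowest observed restore time with the RTO. The sketch below is an illustration only; the metric names and thresholds are assumptions, not part of the class above.

// Sketch: check whether a backup strategy can meet its stated RPO/RTO targets
function assessRecoveryObjectives(strategy, observedMetrics) {
  const { recoveryPointObjective, recoveryTimeObjective } = strategy.disasterRecovery;

  // RPO: worst-case data loss equals the interval between incremental/oplog captures
  const rpoSatisfied = observedMetrics.maxIncrementalIntervalMs <= recoveryPointObjective;

  // RTO: the slowest observed restore must complete within the recovery time objective
  const rtoSatisfied = observedMetrics.p95RestoreDurationMs <= recoveryTimeObjective;

  return {
    rpoSatisfied,
    rtoSatisfied,
    recommendation: !rpoSatisfied
      ? 'increase incremental/oplog backup frequency'
      : !rtoSatisfied
        ? 'parallelize restores or reduce restore unit size'
        : 'objectives currently met'
  };
}

// Example: 6-hourly incrementals cannot satisfy a 15-minute RPO
console.log(assessRecoveryObjectives(
  { disasterRecovery: { recoveryPointObjective: 15 * 60 * 1000, recoveryTimeObjective: 4 * 60 * 60 * 1000 } },
  { maxIncrementalIntervalMs: 6 * 60 * 60 * 1000, p95RestoreDurationMs: 90 * 60 * 1000 }
));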

SQL-Style Backup and Recovery with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB backup and recovery operations:

-- QueryLeaf advanced backup and recovery with SQL-familiar syntax for MongoDB

-- Configure comprehensive backup strategy
CONFIGURE BACKUP_STRATEGY 
SET strategy_name = 'enterprise_backup',
    backup_types = ['full', 'incremental', 'differential'],

    -- Full backup configuration
    full_backup_schedule = '0 2 * * 0',  -- Weekly on Sunday at 2 AM
    full_backup_retention_days = 90,
    full_backup_compression_level = 9,

    -- Incremental backup configuration  
    incremental_backup_schedule = '0 */6 * * *',  -- Every 6 hours
    incremental_backup_retention_days = 14,
    incremental_backup_compression_level = 6,

    -- Point-in-time recovery
    enable_point_in_time_recovery = true,
    oplog_retention_hours = 168,  -- 7 days
    recovery_point_objective_minutes = 15,
    recovery_time_objective_hours = 4,

    -- Storage and performance
    backup_storage_path = '/backup/mongodb',
    enable_compression = true,
    enable_encryption = true,
    parallel_backup_streams = 8,
    max_backup_bandwidth_mbps = 1000,

    -- Verification and validation
    enable_backup_verification = true,
    verification_sample_size = 10000,
    enable_checksum_validation = true,
    enable_restore_testing = true,

    -- Disaster recovery
    enable_geographic_replication = true,
    cross_region_backup_locations = ['us-east-1', 'eu-west-1'],
    enable_automated_failover_testing = true,

    -- Monitoring and alerting
    enable_backup_monitoring = true,
    alert_on_backup_failure = true,
    alert_on_backup_delay_minutes = 60,
    alert_on_verification_failure = true;

-- Execute comprehensive backup with monitoring
WITH backup_execution AS (
  SELECT 
    backup_id,
    backup_type,
    backup_start_time,
    backup_end_time,
    backup_status,
    backup_size_bytes,
    compression_ratio,

    -- Performance metrics
    EXTRACT(EPOCH FROM (backup_end_time - backup_start_time)) as backup_duration_seconds,
    CASE 
      WHEN EXTRACT(EPOCH FROM (backup_end_time - backup_start_time)) > 0 THEN
        (backup_size_bytes / 1024.0 / 1024.0) / EXTRACT(EPOCH FROM (backup_end_time - backup_start_time))
      ELSE 0
    END as throughput_mbps,

    -- Progress tracking
    collections_processed,
    total_collections,
    documents_processed,
    total_documents,
    CASE 
      WHEN total_documents > 0 THEN 
        (documents_processed * 100.0) / total_documents
      ELSE 0
    END as completion_percentage,

    -- Quality metrics
    backup_checksum,
    verification_status,
    verification_timestamp,

    -- Storage efficiency
    original_size_bytes,
    CASE 
      WHEN original_size_bytes > 0 THEN
        ((original_size_bytes - backup_size_bytes) * 100.0) / original_size_bytes
      ELSE 0
    END as compression_percentage,

    -- Error tracking
    error_message,
    warning_count,
    retry_count

  FROM BACKUP_JOBS('full', 'production_db')
  WHERE backup_start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),

performance_analysis AS (
  SELECT 
    backup_type,
    COUNT(*) as total_backups,
    COUNT(*) FILTER (WHERE backup_status = 'completed') as successful_backups,
    COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups,
    COUNT(*) FILTER (WHERE backup_status = 'running') as in_progress_backups,

    -- Performance statistics
    AVG(backup_duration_seconds) as avg_duration_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY backup_duration_seconds) as p95_duration_seconds,
    AVG(throughput_mbps) as avg_throughput_mbps,
    MAX(throughput_mbps) as max_throughput_mbps,

    -- Size and compression analysis
    SUM(backup_size_bytes) as total_backup_size_bytes,
    AVG(compression_percentage) as avg_compression_percentage,

    -- Quality metrics
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups,
    COUNT(*) FILTER (WHERE error_message IS NOT NULL) as backups_with_errors,
    AVG(warning_count) as avg_warnings_per_backup,

    -- Success rate calculations
    CASE 
      WHEN COUNT(*) > 0 THEN
        (COUNT(*) FILTER (WHERE backup_status = 'completed') * 100.0) / COUNT(*)
      ELSE 0
    END as success_rate_percentage,

    -- Recent trends
    COUNT(*) FILTER (
      WHERE backup_start_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
      AND backup_status = 'completed'
    ) as successful_backups_last_week

  FROM backup_execution
  GROUP BY backup_type
),

storage_analysis AS (
  SELECT 
    DATE_TRUNC('day', backup_start_time) as backup_date,
    SUM(backup_size_bytes) as daily_backup_size_bytes,
    COUNT(*) as daily_backup_count,
    AVG(compression_ratio) as avg_daily_compression_ratio,

    -- Growth analysis
    LAG(SUM(backup_size_bytes)) OVER (
      ORDER BY DATE_TRUNC('day', backup_start_time)
    ) as prev_day_backup_size,

    -- Storage efficiency
    SUM(original_size_bytes - backup_size_bytes) as daily_space_saved_bytes,

    -- Quality indicators
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups_per_day,
    COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups_per_day

  FROM backup_execution
  WHERE backup_start_time >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY DATE_TRUNC('day', backup_start_time)
)

SELECT 
  pa.backup_type,
  pa.total_backups,
  pa.successful_backups,
  pa.failed_backups,
  pa.in_progress_backups,

  -- Performance summary
  ROUND(pa.avg_duration_seconds, 1) as avg_backup_time_seconds,
  ROUND(pa.p95_duration_seconds, 1) as p95_backup_time_seconds,
  ROUND(pa.avg_throughput_mbps, 2) as avg_throughput_mbps,
  ROUND(pa.max_throughput_mbps, 2) as max_throughput_mbps,

  -- Storage summary
  ROUND(pa.total_backup_size_bytes / 1024.0 / 1024.0 / 1024.0, 2) as total_backup_size_gb,
  ROUND(pa.avg_compression_percentage, 1) as avg_compression_percent,

  -- Quality assessment
  pa.verified_backups,
  ROUND((pa.verified_backups * 100.0) / NULLIF(pa.successful_backups, 0), 1) as verification_rate_percent,
  pa.success_rate_percentage,

  -- Health indicators
  CASE 
    WHEN pa.success_rate_percentage < 95 THEN 'critical'
    WHEN pa.success_rate_percentage < 98 THEN 'warning'
    WHEN pa.avg_duration_seconds > 7200 THEN 'warning'  -- 2 hours
    ELSE 'healthy'
  END as backup_health_status,

  -- Operational recommendations
  CASE 
    WHEN pa.failed_backups > pa.total_backups * 0.05 THEN 'investigate_failures'
    WHEN pa.avg_duration_seconds > 3600 THEN 'optimize_performance'
    WHEN pa.avg_compression_percentage < 50 THEN 'review_compression_settings'
    WHEN pa.verified_backups < pa.successful_backups * 0.9 THEN 'improve_verification_coverage'
    ELSE 'monitor_continued'
  END as recommendation,

  -- Recent activity
  pa.successful_backups_last_week,
  CASE 
    WHEN pa.successful_backups_last_week < 7 AND pa.backup_type = 'full' THEN 'backup_frequency_low'
    WHEN pa.successful_backups_last_week < 28 AND pa.backup_type = 'incremental' THEN 'backup_frequency_low'
    ELSE 'backup_frequency_adequate'
  END as frequency_assessment,

  -- Storage trends from storage_analysis
  (SELECT 
     ROUND(AVG(sa.daily_backup_size_bytes) / 1024.0 / 1024.0, 1) 
   FROM storage_analysis sa 
   WHERE sa.backup_date >= CURRENT_DATE - INTERVAL '7 days'
  ) as avg_daily_backup_size_mb,

  (SELECT 
     ROUND(SUM(sa.daily_space_saved_bytes) / 1024.0 / 1024.0 / 1024.0, 2) 
   FROM storage_analysis sa 
   WHERE sa.backup_date >= CURRENT_DATE - INTERVAL '30 days'
  ) as total_space_saved_last_month_gb

FROM performance_analysis pa
ORDER BY pa.backup_type;

-- Point-in-time recovery analysis and recommendations
WITH recovery_scenarios AS (
  SELECT 
    recovery_id,
    backup_id,
    recovery_type,
    target_timestamp,
    recovery_start_time,
    recovery_end_time,
    recovery_status,

    -- Recovery performance
    EXTRACT(EPOCH FROM (recovery_end_time - recovery_start_time)) as recovery_duration_seconds,
    documents_restored,
    collections_restored,
    restored_data_size_bytes,

    -- Recovery quality
    data_consistency_verified,
    index_rebuild_required,
    post_recovery_validation_status,

    -- Business impact
    downtime_seconds,
    affected_applications,
    recovery_point_achieved,
    recovery_time_objective_met,

    -- Error tracking
    recovery_errors,
    manual_intervention_required

  FROM recovery_operations
  WHERE recovery_start_time >= CURRENT_TIMESTAMP - INTERVAL '90 days'
),

recovery_performance AS (
  SELECT 
    recovery_type,
    COUNT(*) as total_recoveries,
    COUNT(*) FILTER (WHERE recovery_status = 'completed') as successful_recoveries,
    COUNT(*) FILTER (WHERE recovery_status = 'failed') as failed_recoveries,

    -- Performance metrics
    AVG(recovery_duration_seconds) as avg_recovery_time_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY recovery_duration_seconds) as p95_recovery_time_seconds,
    AVG(downtime_seconds) as avg_downtime_seconds,

    -- Data recovery metrics
    SUM(documents_restored) as total_documents_recovered,
    AVG(restored_data_size_bytes) as avg_data_size_recovered,

    -- Quality metrics
    COUNT(*) FILTER (WHERE data_consistency_verified = true) as verified_recoveries,
    COUNT(*) FILTER (WHERE recovery_time_objective_met = true) as rto_met_count,
    COUNT(*) FILTER (WHERE manual_intervention_required = true) as manual_intervention_count,

    -- Success rate
    CASE 
      WHEN COUNT(*) > 0 THEN
        (COUNT(*) FILTER (WHERE recovery_status = 'completed') * 100.0) / COUNT(*)
      ELSE 0
    END as recovery_success_rate_percent

  FROM recovery_scenarios
  GROUP BY recovery_type
),

backup_recovery_readiness AS (
  SELECT 
    backup_id,
    backup_type,
    backup_timestamp,
    backup_size_bytes,
    backup_status,
    verification_status,

    -- Recovery readiness assessment
    CASE 
      WHEN backup_status = 'completed' AND verification_status = 'verified' THEN 'ready'
      WHEN backup_status = 'completed' AND verification_status = 'pending' THEN 'needs_verification'
      WHEN backup_status = 'completed' AND verification_status = 'failed' THEN 'not_reliable'
      WHEN backup_status = 'failed' THEN 'not_available'
      ELSE 'unknown'
    END as recovery_readiness,

    -- Age assessment for recovery planning
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - backup_timestamp)) as backup_age_days,
    CASE
      WHEN EXTRACT(DAY FROM (CURRENT_TIMESTAMP - backup_timestamp)) <= 1 THEN 'very_recent'
      WHEN EXTRACT(DAY FROM (CURRENT_TIMESTAMP - backup_timestamp)) <= 7 THEN 'recent'
      WHEN EXTRACT(DAY FROM (CURRENT_TIMESTAMP - backup_timestamp)) <= 30 THEN 'moderate'
      ELSE 'old'
    END as backup_age_category,

    -- Estimated recovery time based on size
    CASE 
      WHEN backup_size_bytes < 1024 * 1024 * 1024 THEN 'fast'      -- < 1GB
      WHEN backup_size_bytes < 10 * 1024 * 1024 * 1024 THEN 'moderate' -- < 10GB  
      WHEN backup_size_bytes < 100 * 1024 * 1024 * 1024 THEN 'slow'     -- < 100GB
      ELSE 'very_slow'                                                   -- >= 100GB
    END as estimated_recovery_speed

  FROM backup_jobs
  WHERE backup_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  AND backup_type IN ('full', 'incremental')
)

SELECT 
  rp.recovery_type,
  rp.total_recoveries,
  rp.successful_recoveries,
  rp.failed_recoveries,
  ROUND(rp.recovery_success_rate_percent, 1) as success_rate_percent,

  -- Performance summary
  ROUND(rp.avg_recovery_time_seconds / 60.0, 1) as avg_recovery_time_minutes,
  ROUND(rp.p95_recovery_time_seconds / 60.0, 1) as p95_recovery_time_minutes,
  ROUND(rp.avg_downtime_seconds / 60.0, 1) as avg_downtime_minutes,

  -- Data recovery summary  
  rp.total_documents_recovered,
  ROUND(rp.avg_data_size_recovered / 1024.0 / 1024.0, 1) as avg_data_recovered_mb,

  -- Quality assessment
  rp.verified_recoveries,
  ROUND((rp.verified_recoveries * 100.0) / NULLIF(rp.successful_recoveries, 0), 1) as verification_rate_percent,
  rp.rto_met_count,
  ROUND((rp.rto_met_count * 100.0) / NULLIF(rp.total_recoveries, 0), 1) as rto_achievement_percent,

  -- Operational indicators
  rp.manual_intervention_count,
  CASE 
    WHEN rp.recovery_success_rate_percent < 95 THEN 'critical'
    WHEN rp.avg_recovery_time_seconds > 14400 THEN 'warning'  -- 4 hours
    WHEN rp.manual_intervention_count > rp.total_recoveries * 0.2 THEN 'warning'
    ELSE 'healthy'
  END as recovery_health_status,

  -- Backup readiness summary
  (SELECT COUNT(*) 
   FROM backup_recovery_readiness brr 
   WHERE brr.recovery_readiness = 'ready' 
   AND brr.backup_age_category IN ('very_recent', 'recent')
  ) as ready_recent_backups,

  (SELECT COUNT(*) 
   FROM backup_recovery_readiness brr 
   WHERE brr.recovery_readiness = 'needs_verification'
  ) as backups_needing_verification,

  -- Recovery capability assessment
  CASE 
    WHEN rp.avg_recovery_time_seconds <= 3600 THEN 'excellent'  -- ≤ 1 hour
    WHEN rp.avg_recovery_time_seconds <= 14400 THEN 'good'      -- ≤ 4 hours  
    WHEN rp.avg_recovery_time_seconds <= 28800 THEN 'acceptable' -- ≤ 8 hours
    ELSE 'needs_improvement'
  END as recovery_capability_rating,

  -- Recommendations
  ARRAY[
    CASE WHEN rp.recovery_success_rate_percent < 98 THEN 'Improve backup verification processes' END,
    CASE WHEN rp.avg_recovery_time_seconds > 7200 THEN 'Optimize recovery performance' END,
    CASE WHEN rp.manual_intervention_count > 0 THEN 'Automate recovery procedures' END,
    CASE WHEN (rp.rto_met_count * 100.0) / NULLIF(rp.total_recoveries, 0) < 90 THEN 'Review recovery time objectives' END
  ]::TEXT[] as improvement_recommendations

FROM recovery_performance rp
ORDER BY rp.recovery_type;

-- Disaster recovery readiness assessment
CREATE VIEW disaster_recovery_dashboard AS
WITH current_backup_status AS (
  SELECT 
    backup_type,
    COUNT(*) as total_backups,
    COUNT(*) FILTER (WHERE backup_status = 'completed') as completed_backups,
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups,
    MAX(backup_timestamp) as latest_backup_time,

    -- Recovery point assessment
    MIN(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 60) as minutes_since_latest,

    -- Geographic distribution
    COUNT(DISTINCT backup_location) as backup_locations,
    COUNT(*) FILTER (WHERE backup_location LIKE '%cross-region%') as cross_region_backups

  FROM backup_jobs
  WHERE backup_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY backup_type
),

disaster_scenarios AS (
  SELECT 
    scenario_name,
    scenario_type,
    estimated_data_loss_minutes,
    estimated_recovery_time_hours,
    recovery_success_probability,
    last_tested_date,
    test_result_status

  FROM disaster_recovery_tests
  WHERE test_date >= CURRENT_TIMESTAMP - INTERVAL '90 days'
),

compliance_status AS (
  SELECT 
    regulation_name,
    compliance_status,
    last_audit_date,
    next_audit_due_date,
    backup_retention_requirement_days,
    encryption_requirement_met,
    access_control_requirement_met

  FROM compliance_audits
  WHERE audit_type = 'backup_recovery'
)

SELECT 
  CURRENT_TIMESTAMP as dashboard_timestamp,

  -- Overall backup health
  (SELECT 
     CASE 
       WHEN MIN(minutes_since_latest) <= 60 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 95 THEN 'excellent'
       WHEN MIN(minutes_since_latest) <= 240 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 90 THEN 'good'  
       WHEN MIN(minutes_since_latest) <= 1440 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 85 THEN 'acceptable'
       ELSE 'critical'
     END 
   FROM current_backup_status) as overall_backup_health,

  -- Recovery readiness
  (SELECT 
     CASE
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.95) = COUNT(*) THEN 'fully_ready'
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.90) >= COUNT(*) * 0.8 THEN 'mostly_ready' 
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.75) >= COUNT(*) * 0.6 THEN 'partially_ready'
       ELSE 'not_ready'
     END
   FROM disaster_scenarios) as disaster_recovery_readiness,

  -- Compliance status
  (SELECT 
     CASE 
       WHEN COUNT(*) FILTER (WHERE compliance_status = 'compliant') = COUNT(*) THEN 'fully_compliant'
       WHEN COUNT(*) FILTER (WHERE compliance_status = 'compliant') >= COUNT(*) * 0.8 THEN 'mostly_compliant'
       ELSE 'non_compliant'
     END
   FROM compliance_status) as regulatory_compliance_status,

  -- Detailed metrics
  (SELECT JSON_AGG(
     JSON_BUILD_OBJECT(
       'backup_type', backup_type,
       'completion_rate', ROUND((completed_backups * 100.0) / total_backups, 1),
       'verification_rate', ROUND((verified_backups * 100.0) / NULLIF(completed_backups, 0), 1),
       'minutes_since_latest', minutes_since_latest,
       'geographic_distribution', backup_locations,
       'cross_region_backups', cross_region_backups
     )
   ) FROM current_backup_status) as backup_status_details,

  -- Critical alerts
  ARRAY[
    CASE WHEN (SELECT MIN(minutes_since_latest) FROM current_backup_status) > 1440 
         THEN 'CRITICAL: No recent backups found (>24 hours)' END,
    CASE WHEN (SELECT COUNT(*) FROM disaster_scenarios WHERE last_tested_date < CURRENT_DATE - INTERVAL '90 days') > 0
         THEN 'WARNING: Disaster recovery procedures not recently tested' END,
    CASE WHEN (SELECT COUNT(*) FROM compliance_status WHERE compliance_status != 'compliant') > 0
         THEN 'WARNING: Compliance violations detected' END,
    CASE WHEN (SELECT AVG((verified_backups * 100.0) / NULLIF(completed_backups, 0)) FROM current_backup_status) < 90
         THEN 'WARNING: Low backup verification rate' END
  ]::TEXT[] as critical_alerts;

-- QueryLeaf provides comprehensive backup and recovery capabilities:
-- 1. SQL-familiar syntax for MongoDB backup configuration and management
-- 2. Advanced backup scheduling with flexible retention policies
-- 3. Comprehensive backup verification and integrity monitoring
-- 4. Point-in-time recovery capabilities with oplog integration
-- 5. Disaster recovery planning and readiness assessment
-- 6. Compliance auditing and regulatory requirement management
-- 7. Performance monitoring and optimization recommendations
-- 8. Automated backup testing and recovery validation
-- 9. Enterprise-grade backup management with minimal configuration
-- 10. Production-ready disaster recovery automation and procedures

Best Practices for Production Backup and Recovery

Backup Strategy Design Principles

Essential principles for effective MongoDB backup and recovery deployment:

  1. Multi-Tier Backup Strategy: Implement multiple backup frequencies and retention policies for different recovery scenarios
  2. Verification and Testing: Establish comprehensive backup verification and regular recovery testing procedures
  3. Point-in-Time Recovery: Configure oplog capture and incremental backups for granular recovery capabilities (see the oplog window sketch after this list)
  4. Geographic Distribution: Implement cross-region backup replication for disaster recovery protection
  5. Performance Optimization: Balance backup frequency with system performance impact through intelligent scheduling
  6. Compliance Integration: Ensure backup procedures meet regulatory requirements and audit standards

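To make the point-in-time recovery principle concrete, the sketch below (an illustrative example rather than part of any tooling described above) uses the Node.js driver to read the oldest and newest entries of the replica set oplog and report the recovery window they cover; the connection URI is an assumption.

// Minimal sketch: report the replica set oplog window so point-in-time recovery
// coverage can be compared against the backup schedule. Assumes the URI points at
// a replica set member and the connecting user can read the local database.
const { MongoClient } = require('mongodb');

async function reportOplogWindow(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const oplog = client.db('local').collection('oplog.rs');

    // The oplog is a capped collection, so natural order matches insertion order.
    const [first] = await oplog.find({}, { projection: { ts: 1, wall: 1 } })
      .sort({ $natural: 1 }).limit(1).toArray();
    const [last] = await oplog.find({}, { projection: { ts: 1, wall: 1 } })
      .sort({ $natural: -1 }).limit(1).toArray();

    // 'wall' is the wall-clock time recorded with each oplog entry on recent MongoDB versions.
    console.log('Oldest oplog entry:', first.wall);
    console.log('Newest oplog entry:', last.wall);
    console.log('Oplog window (hours):', ((last.wall - first.wall) / 3600000).toFixed(2));
  } finally {
    await client.close();
  }
}

reportOplogWindow().catch(console.error);
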
Enterprise Backup Architecture

Design backup systems for enterprise-scale requirements:

  1. Automated Scheduling: Implement intelligent backup scheduling based on business requirements and system load (a minimal mongodump sketch follows this list)
  2. Storage Management: Optimize backup storage with compression, deduplication, and lifecycle management
  3. Monitoring Integration: Integrate backup monitoring with existing alerting and operational workflows
  4. Security Controls: Implement encryption, access controls, and audit trails for backup security
  5. Disaster Recovery: Design comprehensive disaster recovery procedures with automated failover capabilities
  6. Capacity Planning: Monitor backup growth patterns and plan storage capacity requirements
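
As a concrete sketch of automated, compressed backups (illustrative only; the URI, backup directory, and six-hour interval are assumptions, and a production scheduler would more likely use cron or an orchestration tool), a Node.js process can invoke mongodump with a gzip-compressed archive target and report the outcome to the monitoring layer:

// Minimal sketch: run a compressed mongodump archive on a fixed interval.
const { spawn } = require('child_process');
const path = require('path');

function runArchiveBackup(uri, backupDir) {
  return new Promise((resolve, reject) => {
    const archivePath = path.join(backupDir, `backup-${Date.now()}.archive.gz`);
    const dump = spawn('mongodump', [
      `--uri=${uri}`,
      `--archive=${archivePath}`,
      '--gzip'
    ]);

    dump.stderr.on('data', (chunk) => process.stdout.write(chunk)); // mongodump reports progress on stderr
    dump.on('error', reject);
    dump.on('close', (code) => {
      if (code === 0) resolve(archivePath);
      else reject(new Error(`mongodump exited with code ${code}`));
    });
  });
}

// Illustrative schedule: one archive every six hours.
setInterval(() => {
  runArchiveBackup('mongodb://localhost:27017/appdb', '/var/backups/mongodb')
    .then((file) => console.log('Backup archive written:', file))
    .catch((err) => console.error('Backup failed:', err.message));
}, 6 * 60 * 60 * 1000);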

Conclusion

MongoDB backup and recovery provides comprehensive data protection capabilities that enable robust disaster recovery, regulatory compliance, and business continuity through automated backup scheduling, point-in-time recovery, and advanced verification features. The native backup tools and integrated recovery procedures ensure that critical data is protected with minimal operational overhead.

Key MongoDB Backup and Recovery benefits include:

  • Automated Protection: Intelligent backup scheduling with comprehensive retention policies and automated lifecycle management
  • Advanced Recovery Options: Point-in-time recovery capabilities with oplog integration and incremental backup support
  • Enterprise Reliability: Production-ready backup verification, disaster recovery procedures, and compliance auditing
  • Performance Optimization: Efficient backup compression, parallel processing, and minimal performance impact
  • Operational Excellence: Comprehensive monitoring, alerting, and automated testing for backup system reliability
  • SQL Accessibility: Familiar SQL-style backup management operations through QueryLeaf for accessible data protection

Whether you're protecting mission-critical applications, meeting regulatory compliance requirements, implementing disaster recovery procedures, or managing enterprise backup operations, MongoDB backup and recovery with QueryLeaf's familiar SQL interface provides the foundation for comprehensive, reliable data protection.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB backup and recovery operations while providing SQL-familiar syntax for backup configuration, monitoring, and recovery procedures. Advanced backup strategies, disaster recovery planning, and compliance auditing are seamlessly handled through familiar SQL constructs, making sophisticated data protection accessible to SQL-oriented operations teams.

The combination of MongoDB's robust backup capabilities with SQL-style data protection operations makes it an ideal platform for applications requiring both comprehensive data protection and familiar database management patterns, ensuring your critical data remains secure and recoverable as your systems scale and evolve.

MongoDB Data Pipeline Management and Stream Processing: Advanced Real-Time Data Processing and ETL Pipelines for Modern Applications

Modern data-driven applications require sophisticated data processing pipelines that can handle real-time data ingestion, complex transformations, and reliable data delivery across multiple systems and formats. Traditional batch processing approaches struggle with latency requirements, data volume scalability, and the complexity of managing distributed processing workflows. Effective data pipeline management demands real-time stream processing, incremental data transformations, and intelligent error handling mechanisms.

MongoDB's comprehensive data pipeline capabilities provide advanced stream processing features through Change Streams, Aggregation Framework, and native pipeline orchestration that enable sophisticated real-time data processing workflows. Unlike traditional ETL systems that require separate infrastructure components and complex coordination mechanisms, MongoDB integrates stream processing directly into the database with optimized pipeline execution, automatic scaling, and built-in fault tolerance.
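
As a quick taste of the core API before the comparison below, a minimal change stream consumer needs only the native driver (the database and collection names here are illustrative):

// Minimal sketch: subscribe to insert events on an 'orders' collection via a change stream.
const { MongoClient } = require('mongodb');

async function watchOrders(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Surface only inserts and include the full document with each event.
  const stream = orders.watch(
    [{ $match: { operationType: 'insert' } }],
    { fullDocument: 'updateLookup' }
  );

  stream.on('change', (event) => {
    console.log('New order received:', event.fullDocument);
  });
}

watchOrders().catch(console.error);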

The Traditional Data Pipeline Challenge

Conventional approaches to data pipeline management in relational systems face significant limitations in real-time processing:

-- Traditional PostgreSQL data pipeline management - complex batch processing with limited real-time capabilities

-- Basic ETL tracking table with limited functionality
CREATE TABLE etl_job_runs (
    run_id SERIAL PRIMARY KEY,
    job_name VARCHAR(255) NOT NULL,
    job_type VARCHAR(100) NOT NULL,
    source_system VARCHAR(100),
    target_system VARCHAR(100),

    -- Job execution tracking
    start_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    end_time TIMESTAMP,
    status VARCHAR(50) DEFAULT 'running',

    -- Basic metrics (very limited)
    records_processed INTEGER DEFAULT 0,
    records_inserted INTEGER DEFAULT 0,
    records_updated INTEGER DEFAULT 0,
    records_deleted INTEGER DEFAULT 0,
    records_failed INTEGER DEFAULT 0,

    -- Error tracking (basic)
    error_message TEXT,
    error_count INTEGER DEFAULT 0,

    -- Resource usage (manual tracking)
    cpu_usage_percent DECIMAL(5,2),
    memory_usage_mb INTEGER,
    disk_io_mb INTEGER,

    -- Basic configuration
    batch_size INTEGER DEFAULT 1000,
    parallel_workers INTEGER DEFAULT 1,
    retry_attempts INTEGER DEFAULT 3
);

-- Data transformation rules (static and inflexible)
CREATE TABLE transformation_rules (
    rule_id SERIAL PRIMARY KEY,
    rule_name VARCHAR(255) NOT NULL,
    source_table VARCHAR(255),
    target_table VARCHAR(255),
    transformation_type VARCHAR(100),

    -- Transformation logic (limited SQL expressions)
    source_columns TEXT[],
    target_columns TEXT[],
    transformation_sql TEXT,

    -- Basic validation rules
    validation_rules TEXT[],
    data_quality_checks TEXT[],

    -- Rule metadata
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(100)
);

-- Simple batch processing function (no real-time capabilities)
CREATE OR REPLACE FUNCTION execute_batch_etl(
    job_name_param VARCHAR(255),
    batch_size_param INTEGER DEFAULT 1000
) RETURNS TABLE (
    run_id INTEGER,
    records_processed INTEGER,
    execution_time_seconds INTEGER,
    status VARCHAR(50),
    error_message TEXT
) AS $$
DECLARE
    current_run_id INTEGER;
    processing_start TIMESTAMP;
    processing_end TIMESTAMP;
    batch_count INTEGER := 0;
    total_records INTEGER := 0;
    error_msg TEXT := '';
    processing_status VARCHAR(50) := 'completed';
BEGIN
    -- Start new job run
    INSERT INTO etl_job_runs (job_name, job_type, status)
    VALUES (job_name_param, 'batch_etl', 'running')
    RETURNING etl_job_runs.run_id INTO current_run_id;

    processing_start := clock_timestamp();

    BEGIN
        -- Very basic batch processing loop
        LOOP
            -- Simulate batch processing (would be actual data transformation in reality)
            PERFORM pg_sleep(0.1); -- Simulate processing time

            batch_count := batch_count + 1;
            total_records := total_records + batch_size_param;

            -- Simple exit condition (no real data source integration)
            EXIT WHEN batch_count >= 10; -- Process 10 batches maximum

        END LOOP;

    EXCEPTION WHEN OTHERS THEN
        error_msg := SQLERRM;
        processing_status := 'failed';

    END;

    processing_end := clock_timestamp();

    -- Update job run status
    UPDATE etl_job_runs 
    SET 
        end_time = processing_end,
        status = processing_status,
        records_processed = total_records,
        records_inserted = total_records,
        error_message = error_msg,
        error_count = CASE WHEN processing_status = 'failed' THEN 1 ELSE 0 END
    WHERE etl_job_runs.run_id = current_run_id;

    -- Return execution results
    RETURN QUERY SELECT 
        current_run_id,
        total_records,
        EXTRACT(EPOCH FROM (processing_end - processing_start))::INTEGER,
        processing_status,
        error_msg;

END;
$$ LANGUAGE plpgsql;

-- Execute batch ETL job (very basic functionality)
SELECT * FROM execute_batch_etl('customer_data_sync', 500);

-- Data quality monitoring (limited real-time capabilities)
WITH data_quality_metrics AS (
    SELECT 
        ejr.job_name,
        ejr.run_id,
        ejr.start_time,
        ejr.end_time,
        ejr.records_processed,
        ejr.records_failed,

        -- Basic quality calculations
        CASE 
            WHEN ejr.records_processed > 0 THEN 
                ROUND((ejr.records_processed - ejr.records_failed)::DECIMAL / ejr.records_processed * 100, 2)
            ELSE 0
        END as success_rate_percent,

        -- Processing rate
        CASE 
            WHEN EXTRACT(EPOCH FROM (ejr.end_time - ejr.start_time)) > 0 THEN
                ROUND(ejr.records_processed::DECIMAL / EXTRACT(EPOCH FROM (ejr.end_time - ejr.start_time)), 2)
            ELSE 0
        END as records_per_second,

        -- Basic status assessment
        CASE ejr.status
            WHEN 'completed' THEN 'success'
            WHEN 'failed' THEN 'failure'
            ELSE 'unknown'
        END as quality_status

    FROM etl_job_runs ejr
    WHERE ejr.start_time >= CURRENT_DATE - INTERVAL '7 days'
),

quality_summary AS (
    SELECT 
        job_name,
        COUNT(*) as total_runs,
        COUNT(*) FILTER (WHERE quality_status = 'success') as successful_runs,
        COUNT(*) FILTER (WHERE quality_status = 'failure') as failed_runs,

        -- Quality metrics
        AVG(success_rate_percent) as avg_success_rate,
        AVG(records_per_second) as avg_processing_rate,
        SUM(records_processed) as total_records_processed,
        SUM(records_failed) as total_records_failed,

        -- Time-based analysis
        AVG(EXTRACT(EPOCH FROM (end_time - start_time))) as avg_execution_seconds,
        MAX(EXTRACT(EPOCH FROM (end_time - start_time))) as max_execution_seconds,
        MIN(start_time) as first_run,
        MAX(end_time) as last_run

    FROM data_quality_metrics
    GROUP BY job_name
)

SELECT 
    job_name,
    total_runs,
    successful_runs,
    failed_runs,

    -- Success rates
    CASE 
        WHEN total_runs > 0 THEN 
            ROUND((successful_runs::DECIMAL / total_runs) * 100, 1)
        ELSE 0
    END as job_success_rate_percent,

    -- Performance metrics
    ROUND(avg_success_rate, 1) as avg_record_success_rate_percent,
    ROUND(avg_processing_rate, 1) as avg_records_per_second,
    total_records_processed,
    total_records_failed,

    -- Timing analysis
    ROUND(avg_execution_seconds, 1) as avg_duration_seconds,
    ROUND(max_execution_seconds, 1) as max_duration_seconds,

    -- Data quality assessment
    CASE 
        WHEN failed_runs = 0 AND avg_success_rate > 98 THEN 'excellent'
        WHEN failed_runs <= total_runs * 0.05 AND avg_success_rate > 95 THEN 'good'
        WHEN failed_runs <= total_runs * 0.1 AND avg_success_rate > 90 THEN 'acceptable'
        ELSE 'poor'
    END as data_quality_rating,

    -- Recommendations
    CASE 
        WHEN failed_runs > total_runs * 0.1 THEN 'investigate_failures'
        WHEN avg_processing_rate < 100 THEN 'optimize_performance'
        WHEN max_execution_seconds > avg_execution_seconds * 3 THEN 'check_consistency'
        ELSE 'monitor_continued'
    END as recommendation

FROM quality_summary
ORDER BY total_records_processed DESC;

-- Real-time data change tracking (very limited functionality)
CREATE TABLE data_changes (
    change_id SERIAL PRIMARY KEY,
    table_name VARCHAR(255) NOT NULL,
    operation_type VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
    record_id VARCHAR(100),

    -- Change tracking (basic)
    old_values JSONB,
    new_values JSONB,
    changed_columns TEXT[],

    -- Metadata
    change_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_id VARCHAR(100),
    application_name VARCHAR(100),

    -- Processing status
    processed BOOLEAN DEFAULT false,
    processing_attempts INTEGER DEFAULT 0,
    last_processing_attempt TIMESTAMP,
    processing_error TEXT
);

-- Basic trigger function for change tracking
CREATE OR REPLACE FUNCTION track_data_changes()
RETURNS TRIGGER AS $$
BEGIN
    -- Insert change record (very basic functionality)
    INSERT INTO data_changes (
        table_name,
        operation_type,
        record_id,
        old_values,
        new_values,
        user_id
    )
    VALUES (
        TG_TABLE_NAME,
        TG_OP,
        CASE 
            WHEN TG_OP = 'DELETE' THEN OLD.id::TEXT
            ELSE NEW.id::TEXT
        END,
        CASE 
            WHEN TG_OP = 'DELETE' THEN to_jsonb(OLD)
            WHEN TG_OP = 'UPDATE' THEN to_jsonb(OLD)
            ELSE NULL
        END,
        CASE 
            WHEN TG_OP = 'DELETE' THEN NULL
            ELSE to_jsonb(NEW)
        END,
        current_user
    );

    -- Return appropriate record
    CASE TG_OP
        WHEN 'DELETE' THEN RETURN OLD;
        ELSE RETURN NEW;
    END CASE;

EXCEPTION WHEN OTHERS THEN
    -- Basic error handling (logs errors but doesn't stop operations)
    RAISE WARNING 'Change tracking failed: %', SQLERRM;
    CASE TG_OP
        WHEN 'DELETE' THEN RETURN OLD;
        ELSE RETURN NEW;
    END CASE;
END;
$$ LANGUAGE plpgsql;

-- Process pending changes (batch processing only)
WITH pending_changes AS (
    SELECT 
        change_id,
        table_name,
        operation_type,
        new_values,
        old_values,
        change_timestamp,

        -- Group changes by time windows for batch processing
        DATE_TRUNC('minute', change_timestamp) as processing_window

    FROM data_changes
    WHERE processed = false 
    AND processing_attempts < 3
    ORDER BY change_timestamp
    LIMIT 1000
),

change_summary AS (
    SELECT 
        processing_window,
        table_name,
        operation_type,
        COUNT(*) as change_count,
        MIN(change_timestamp) as first_change,
        MAX(change_timestamp) as last_change,

        -- Basic aggregations (very limited analysis)
        COUNT(*) FILTER (WHERE operation_type = 'INSERT') as inserts,
        COUNT(*) FILTER (WHERE operation_type = 'UPDATE') as updates,
        COUNT(*) FILTER (WHERE operation_type = 'DELETE') as deletes

    FROM pending_changes
    GROUP BY processing_window, table_name, operation_type
)

SELECT 
    processing_window,
    table_name,
    operation_type,
    change_count,
    first_change,
    last_change,

    -- Change rate analysis
    CASE 
        WHEN EXTRACT(EPOCH FROM (last_change - first_change)) > 0 THEN
            ROUND(change_count::DECIMAL / EXTRACT(EPOCH FROM (last_change - first_change)), 2)
        ELSE change_count
    END as changes_per_second,

    -- Processing recommendations (very basic)
    CASE 
        WHEN change_count > 1000 THEN 'high_volume_batch'
        WHEN change_count > 100 THEN 'medium_batch'
        ELSE 'small_batch'
    END as processing_strategy,

    -- Simple priority assessment
    CASE table_name
        WHEN 'users' THEN 'high'
        WHEN 'orders' THEN 'high'
        WHEN 'products' THEN 'medium'
        ELSE 'low'
    END as processing_priority

FROM change_summary
ORDER BY processing_window DESC, change_count DESC;

-- Problems with traditional data pipeline approaches:
-- 1. No real-time processing - only batch operations with delays
-- 2. Limited transformation capabilities - basic SQL only
-- 3. Poor scalability - single-threaded processing
-- 4. Manual error handling and recovery
-- 5. No automatic schema evolution or data type handling
-- 6. Limited monitoring and observability
-- 7. Complex integration with external systems
-- 8. No built-in data quality validation
-- 9. Difficult to maintain and debug complex pipelines
-- 10. No support for stream processing or event-driven architectures

MongoDB provides comprehensive data pipeline management with advanced stream processing capabilities:

// MongoDB Advanced Data Pipeline Management and Stream Processing
const { MongoClient, ChangeStream } = require('mongodb');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Data Pipeline Manager
class AdvancedDataPipelineManager extends EventEmitter {
  constructor(mongoUri, pipelineConfig = {}) {
    super();
    this.mongoUri = mongoUri;
    this.client = null;
    this.db = null;

    // Advanced pipeline configuration
    this.config = {
      // Processing configuration
      enableRealTimeProcessing: pipelineConfig.enableRealTimeProcessing !== false,
      enableBatchProcessing: pipelineConfig.enableBatchProcessing !== false,
      enableStreamProcessing: pipelineConfig.enableStreamProcessing !== false,

      // Performance settings
      maxConcurrentPipelines: pipelineConfig.maxConcurrentPipelines || 10,
      batchSize: pipelineConfig.batchSize || 1000,
      maxRetries: pipelineConfig.maxRetries || 3,
      retryDelay: pipelineConfig.retryDelay || 1000,

      // Change stream configuration
      enableChangeStreams: pipelineConfig.enableChangeStreams !== false,
      changeStreamOptions: pipelineConfig.changeStreamOptions || {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      },

      // Data quality and validation
      enableDataValidation: pipelineConfig.enableDataValidation !== false,
      enableSchemaEvolution: pipelineConfig.enableSchemaEvolution || false,
      enableDataLineage: pipelineConfig.enableDataLineage || false,

      // Monitoring and observability
      enableMetrics: pipelineConfig.enableMetrics !== false,
      enablePipelineMonitoring: pipelineConfig.enablePipelineMonitoring !== false,
      enableErrorTracking: pipelineConfig.enableErrorTracking !== false,

      // Advanced features
      enableIncrementalProcessing: pipelineConfig.enableIncrementalProcessing || false,
      enableDataDeduplication: pipelineConfig.enableDataDeduplication || false,
      enableDataEnrichment: pipelineConfig.enableDataEnrichment || false
    };

    // Pipeline registry and state management
    this.pipelines = new Map();
    this.changeStreams = new Map();
    this.pipelineMetrics = new Map();
    this.activeProcessing = new Map();

    // Error tracking and recovery
    this.errorHistory = [];
    this.retryQueues = new Map();

    this.initializeDataPipelines();
  }

  async initializeDataPipelines() {
    console.log('Initializing advanced data pipeline management system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.mongoUri);
      await this.client.connect();
      this.db = this.client.db();

      // Setup pipeline infrastructure
      await this.setupPipelineInfrastructure();

      // Initialize change streams if enabled
      if (this.config.enableChangeStreams) {
        await this.setupChangeStreams();
      }

      // Start pipeline monitoring
      if (this.config.enablePipelineMonitoring) {
        await this.startPipelineMonitoring();
      }

      console.log('Advanced data pipeline system initialized successfully');

    } catch (error) {
      console.error('Error initializing data pipeline system:', error);
      throw error;
    }
  }

  async setupPipelineInfrastructure() {
    console.log('Setting up pipeline infrastructure...');

    try {
      // Create collections for pipeline management
      const collections = {
        pipelineDefinitions: this.db.collection('pipeline_definitions'),
        pipelineRuns: this.db.collection('pipeline_runs'),
        pipelineMetrics: this.db.collection('pipeline_metrics'),
        dataLineage: this.db.collection('data_lineage'),
        pipelineErrors: this.db.collection('pipeline_errors'),
        transformationRules: this.db.collection('transformation_rules')
      };

      // Create indexes for optimal performance
      await collections.pipelineRuns.createIndex(
        { pipelineId: 1, startTime: -1 },
        { background: true }
      );

      await collections.pipelineMetrics.createIndex(
        { pipelineId: 1, timestamp: -1 },
        { background: true }
      );

      await collections.dataLineage.createIndex(
        { sourceCollection: 1, targetCollection: 1, timestamp: -1 },
        { background: true }
      );

      this.collections = collections;

    } catch (error) {
      console.error('Error setting up pipeline infrastructure:', error);
      throw error;
    }
  }

  async registerDataPipeline(pipelineDefinition) {
    console.log(`Registering data pipeline: ${pipelineDefinition.name}`);

    try {
      // Validate pipeline definition
      const validatedDefinition = await this.validatePipelineDefinition(pipelineDefinition);

      // Enhanced pipeline definition with metadata
      const enhancedDefinition = {
        ...validatedDefinition,
        pipelineId: this.generatePipelineId(validatedDefinition.name),

        // Pipeline metadata
        registeredAt: new Date(),
        version: pipelineDefinition.version || '1.0.0',
        status: 'registered',

        // Processing configuration
        processingMode: pipelineDefinition.processingMode || 'stream', // stream, batch, hybrid
        triggerType: pipelineDefinition.triggerType || 'change_stream', // change_stream, schedule, batch

        // Data transformation pipeline
        transformationStages: pipelineDefinition.transformationStages || [],

        // Data sources and targets
        dataSources: pipelineDefinition.dataSources || [],
        dataTargets: pipelineDefinition.dataTargets || [],

        // Quality and validation rules
        dataQualityRules: pipelineDefinition.dataQualityRules || [],
        schemaValidationRules: pipelineDefinition.schemaValidationRules || [],

        // Performance configuration
        performance: {
          batchSize: pipelineDefinition.batchSize || this.config.batchSize,
          maxConcurrency: pipelineDefinition.maxConcurrency || 5,
          timeoutMs: pipelineDefinition.timeoutMs || 300000,

          // Resource limits
          maxMemoryMB: pipelineDefinition.maxMemoryMB || 1024,
          maxCpuPercent: pipelineDefinition.maxCpuPercent || 80
        },

        // Error handling configuration
        errorHandling: {
          retryStrategy: pipelineDefinition.retryStrategy || 'exponential_backoff',
          maxRetries: pipelineDefinition.maxRetries || this.config.maxRetries,
          deadLetterQueue: pipelineDefinition.deadLetterQueue !== false,
          errorNotifications: pipelineDefinition.errorNotifications || []
        },

        // Monitoring configuration
        monitoring: {
          enableMetrics: pipelineDefinition.enableMetrics !== false,
          metricsInterval: pipelineDefinition.metricsInterval || 60000,
          alertThresholds: pipelineDefinition.alertThresholds || {}
        }
      };

      // Store pipeline definition
      await this.collections.pipelineDefinitions.replaceOne(
        { pipelineId: enhancedDefinition.pipelineId },
        enhancedDefinition,
        { upsert: true }
      );

      // Register pipeline in memory
      this.pipelines.set(enhancedDefinition.pipelineId, {
        definition: enhancedDefinition,
        status: 'registered',
        lastRun: null,
        statistics: {
          totalRuns: 0,
          successfulRuns: 0,
          failedRuns: 0,
          totalRecordsProcessed: 0,
          averageProcessingTime: 0
        }
      });

      console.log(`Pipeline '${enhancedDefinition.name}' registered successfully with ID: ${enhancedDefinition.pipelineId}`);

      // Start pipeline if configured for automatic startup
      if (enhancedDefinition.autoStart) {
        await this.startPipeline(enhancedDefinition.pipelineId);
      }

      return {
        success: true,
        pipelineId: enhancedDefinition.pipelineId,
        definition: enhancedDefinition
      };

    } catch (error) {
      console.error(`Error registering pipeline '${pipelineDefinition.name}':`, error);
      return {
        success: false,
        error: error.message,
        pipelineDefinition: pipelineDefinition
      };
    }
  }
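
  /*
   * Illustrative usage (not from the original implementation; collection, field,
   * and target names are assumptions): a hypothetical definition accepted by
   * registerDataPipeline() above.
   *
   *   await pipelineManager.registerDataPipeline({
   *     name: 'orders_to_analytics',
   *     processingMode: 'stream',
   *     triggerType: 'change_stream',
   *     autoStart: true,
   *     dataSources: [{ collection: 'orders' }],
   *     transformationStages: [
   *       { type: 'field_mapping', fieldMappings: { 'customer.id': 'fullDocument.customerId' } }
   *     ],
   *     dataTargets: [
   *       { name: 'analytics_store', type: 'mongodb_collection', collection: 'order_analytics', writeMode: 'upsert' }
   *     ]
   *   });
   */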

  async startPipeline(pipelineId) {
    console.log(`Starting data pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      if (pipeline.status === 'running') {
        console.log(`Pipeline ${pipelineId} is already running`);
        return { success: true, status: 'already_running' };
      }

      const definition = pipeline.definition;

      // Create pipeline run record
      const runRecord = {
        runId: this.generateRunId(),
        pipelineId: pipelineId,
        pipelineName: definition.name,
        startTime: new Date(),
        status: 'running',

        // Processing metrics
        recordsProcessed: 0,
        recordsSuccessful: 0,
        recordsFailed: 0,

        // Performance tracking
        processingTimeMs: 0,
        throughputRecordsPerSecond: 0,

        // Resource usage
        memoryUsageMB: 0,
        cpuUsagePercent: 0,

        // Error tracking
        errors: [],
        retryAttempts: 0
      };

      await this.collections.pipelineRuns.insertOne(runRecord);

      // Start processing based on trigger type
      switch (definition.triggerType) {
        case 'change_stream':
          await this.startChangeStreamPipeline(pipelineId, definition, runRecord);
          break;

        case 'schedule':
          await this.startScheduledPipeline(pipelineId, definition, runRecord);
          break;

        case 'batch':
          await this.startBatchPipeline(pipelineId, definition, runRecord);
          break;

        default:
          throw new Error(`Unsupported trigger type: ${definition.triggerType}`);
      }

      // Update pipeline status
      pipeline.status = 'running';
      pipeline.lastRun = runRecord;

      this.emit('pipelineStarted', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        startTime: runRecord.startTime
      });

      return {
        success: true,
        pipelineId: pipelineId,
        runId: runRecord.runId,
        status: 'running'
      };

    } catch (error) {
      console.error(`Error starting pipeline ${pipelineId}:`, error);

      // Update pipeline status to error
      const pipeline = this.pipelines.get(pipelineId);
      if (pipeline) {
        pipeline.status = 'error';
      }

      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  async startChangeStreamPipeline(pipelineId, definition, runRecord) {
    console.log(`Starting change stream pipeline: ${pipelineId}`);

    try {
      const dataSources = definition.dataSources;

      for (const dataSource of dataSources) {
        const collection = this.db.collection(dataSource.collection);

        // Configure change stream options
        const changeStreamOptions = {
          ...this.config.changeStreamOptions,
          ...dataSource.changeStreamOptions
        };

        // Apply pipeline-specific filters as a $match stage on the change stream pipeline
        const watchPipeline = dataSource.filter
          ? [{ $match: dataSource.filter }]
          : [];

        // Create change stream
        const changeStream = collection.watch(watchPipeline, changeStreamOptions);

        // Store change stream reference
        this.changeStreams.set(`${pipelineId}_${dataSource.collection}`, changeStream);

        // Setup change stream event handlers
        changeStream.on('change', async (changeEvent) => {
          await this.processChangeEvent(pipelineId, definition, runRecord, changeEvent);
        });

        changeStream.on('error', async (error) => {
          console.error(`Change stream error for pipeline ${pipelineId}:`, error);
          await this.handlePipelineError(pipelineId, runRecord, error);
        });

        changeStream.on('close', () => {
          console.log(`Change stream closed for pipeline ${pipelineId}`);
          this.emit('pipelineStreamClosed', { pipelineId, collection: dataSource.collection });
        });
      }

    } catch (error) {
      console.error(`Error starting change stream pipeline ${pipelineId}:`, error);
      throw error;
    }
  }

  async processChangeEvent(pipelineId, definition, runRecord, changeEvent) {
    try {
      // Track processing start
      const processingStart = Date.now();

      // Apply transformation stages
      let processedData = changeEvent;

      for (const transformationStage of definition.transformationStages) {
        processedData = await this.applyTransformation(
          processedData, 
          transformationStage, 
          definition
        );
      }

      // Apply data quality validation
      if (this.config.enableDataValidation) {
        const validationResult = await this.validateData(
          processedData, 
          definition.dataQualityRules
        );

        if (!validationResult.isValid) {
          await this.handleValidationError(pipelineId, runRecord, processedData, validationResult);
          return;
        }
      }

      // Write to target destinations
      const writeResults = await this.writeToTargets(
        processedData, 
        definition.dataTargets, 
        definition
      );

      // Update run metrics
      const processingTime = Date.now() - processingStart;

      await this.updateRunMetrics(runRecord, {
        recordsProcessed: 1,
        recordsSuccessful: writeResults.successCount,
        recordsFailed: writeResults.failureCount,
        processingTimeMs: processingTime
      });

      // Record data lineage if enabled
      if (this.config.enableDataLineage) {
        await this.recordDataLineage(pipelineId, changeEvent, processedData, definition);
      }

      this.emit('recordProcessed', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        changeEvent: changeEvent,
        processedData: processedData,
        processingTime: processingTime
      });

    } catch (error) {
      console.error(`Error processing change event for pipeline ${pipelineId}:`, error);
      await this.handleProcessingError(pipelineId, runRecord, changeEvent, error);
    }
  }

  async applyTransformation(data, transformationStage, pipelineDefinition) {
    console.log(`Applying transformation: ${transformationStage.type}`);

    try {
      switch (transformationStage.type) {
        case 'aggregation':
          return await this.applyAggregationTransformation(data, transformationStage);

        case 'field_mapping':
          return await this.applyFieldMapping(data, transformationStage);

        case 'data_enrichment':
          return await this.applyDataEnrichment(data, transformationStage, pipelineDefinition);

        case 'filtering':
          return await this.applyFiltering(data, transformationStage);

        case 'normalization':
          return await this.applyNormalization(data, transformationStage);

        case 'custom_function':
          return await this.applyCustomFunction(data, transformationStage);

        default:
          console.warn(`Unknown transformation type: ${transformationStage.type}`);
          return data;
      }

    } catch (error) {
      console.error(`Error applying transformation ${transformationStage.type}:`, error);
      throw error;
    }
  }

  async applyAggregationTransformation(data, transformationStage) {
    // Apply MongoDB aggregation pipeline to transform data
    const pipeline = transformationStage.aggregationPipeline;

    if (!Array.isArray(pipeline) || pipeline.length === 0) {
      return data;
    }

    try {
      // Execute aggregation on source data
      // This would work with the actual data structure in a real implementation
      let transformedData = data;

      // Simulate aggregation operations
      for (const stage of pipeline) {
        if (stage.$project) {
          transformedData = this.projectFields(transformedData, stage.$project);
        } else if (stage.$match) {
          transformedData = this.matchFilter(transformedData, stage.$match);
          if (transformedData === null) break; // document filtered out; skip remaining stages
        } else if (stage.$addFields) {
          transformedData = this.addFields(transformedData, stage.$addFields);
        }
        // Add more aggregation operators as needed
      }

      return transformedData;

    } catch (error) {
      console.error('Error in aggregation transformation:', error);
      throw error;
    }
  }

  async applyFieldMapping(data, transformationStage) {
    // Apply field mapping transformation
    const mappings = transformationStage.fieldMappings;

    if (!mappings || Object.keys(mappings).length === 0) {
      return data;
    }

    try {
      let mappedData = { ...data };

      // Apply field mappings
      Object.entries(mappings).forEach(([targetField, sourceField]) => {
        const sourceValue = this.getNestedValue(data, sourceField);
        this.setNestedValue(mappedData, targetField, sourceValue);
      });

      return mappedData;

    } catch (error) {
      console.error('Error in field mapping transformation:', error);
      throw error;
    }
  }

  async applyDataEnrichment(data, transformationStage, pipelineDefinition) {
    // Apply data enrichment from external sources
    const enrichmentConfig = transformationStage.enrichmentConfig;

    try {
      let enrichedData = { ...data };

      for (const enrichment of enrichmentConfig?.enrichments || []) {
        switch (enrichment.type) {
          case 'lookup':
            enrichedData = await this.applyLookupEnrichment(enrichedData, enrichment);
            break;

          case 'calculation':
            enrichedData = await this.applyCalculationEnrichment(enrichedData, enrichment);
            break;

          case 'external_api':
            enrichedData = await this.applyExternalApiEnrichment(enrichedData, enrichment);
            break;
        }
      }

      return enrichedData;

    } catch (error) {
      console.error('Error in data enrichment transformation:', error);
      throw error;
    }
  }

  async writeToTargets(processedData, dataTargets, pipelineDefinition) {
    console.log('Writing processed data to targets...');

    const writeResults = {
      successCount: 0,
      failureCount: 0,
      results: []
    };

    try {
      const writePromises = dataTargets.map(async (target) => {
        try {
          const result = await this.writeToTarget(processedData, target, pipelineDefinition);
          writeResults.successCount++;
          writeResults.results.push({ target: target.name, success: true, result });
          return result;

        } catch (error) {
          console.error(`Error writing to target ${target.name}:`, error);
          writeResults.failureCount++;
          writeResults.results.push({ 
            target: target.name, 
            success: false, 
            error: error.message 
          });
          throw error;
        }
      });

      await Promise.allSettled(writePromises);

      return writeResults;

    } catch (error) {
      console.error('Error writing to targets:', error);
      throw error;
    }
  }

  async writeToTarget(processedData, target, pipelineDefinition) {
    console.log(`Writing to target: ${target.name} (${target.type})`);

    try {
      switch (target.type) {
        case 'mongodb_collection':
          return await this.writeToMongoDBCollection(processedData, target);

        case 'file':
          return await this.writeToFile(processedData, target);

        case 'external_api':
          return await this.writeToExternalAPI(processedData, target);

        case 'message_queue':
          return await this.writeToMessageQueue(processedData, target);

        default:
          throw new Error(`Unsupported target type: ${target.type}`);
      }

    } catch (error) {
      console.error(`Error writing to target ${target.name}:`, error);
      throw error;
    }
  }

  async writeToMongoDBCollection(processedData, target) {
    const collection = this.db.collection(target.collection);

    try {
      switch (target.writeMode || 'insert') {
        case 'insert':
          const insertResult = await collection.insertOne(processedData);
          return { operation: 'insert', insertedId: insertResult.insertedId };

        case 'upsert':
          const upsertResult = await collection.replaceOne(
            target.upsertFilter || { _id: processedData._id },
            processedData,
            { upsert: true }
          );
          return { 
            operation: 'upsert', 
            modifiedCount: upsertResult.modifiedCount,
            upsertedId: upsertResult.upsertedId
          };

        case 'update':
          const updateResult = await collection.updateOne(
            target.updateFilter || { _id: processedData._id },
            { $set: processedData }
          );
          return {
            operation: 'update',
            matchedCount: updateResult.matchedCount,
            modifiedCount: updateResult.modifiedCount
          };

        default:
          throw new Error(`Unsupported write mode: ${target.writeMode}`);
      }

    } catch (error) {
      console.error('Error writing to MongoDB collection:', error);
      throw error;
    }
  }

  async getPipelineMetrics(pipelineId, timeRange = {}) {
    console.log(`Getting metrics for pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      // Build time range filter
      const timeFilter = {};
      if (timeRange.startTime) {
        timeFilter.$gte = new Date(timeRange.startTime);
      }
      if (timeRange.endTime) {
        timeFilter.$lte = new Date(timeRange.endTime);
      }

      const matchStage = { pipelineId: pipelineId };
      if (Object.keys(timeFilter).length > 0) {
        matchStage.startTime = timeFilter;
      }

      // Aggregate pipeline metrics
      const metricsAggregation = [
        { $match: matchStage },
        {
          $group: {
            _id: '$pipelineId',
            totalRuns: { $sum: 1 },
            successfulRuns: { 
              $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] } 
            },
            failedRuns: { 
              $sum: { $cond: [{ $eq: ['$status', 'failed'] }, 1, 0] } 
            },
            totalRecordsProcessed: { $sum: '$recordsProcessed' },
            totalRecordsSuccessful: { $sum: '$recordsSuccessful' },
            totalRecordsFailed: { $sum: '$recordsFailed' },

            // Performance metrics
            averageProcessingTime: { $avg: '$processingTimeMs' },
            maxProcessingTime: { $max: '$processingTimeMs' },
            minProcessingTime: { $min: '$processingTimeMs' },

            // Throughput metrics
            averageThroughput: { $avg: '$throughputRecordsPerSecond' },
            maxThroughput: { $max: '$throughputRecordsPerSecond' },

            // Resource usage
            averageMemoryUsage: { $avg: '$memoryUsageMB' },
            maxMemoryUsage: { $max: '$memoryUsageMB' },
            averageCpuUsage: { $avg: '$cpuUsagePercent' },
            maxCpuUsage: { $max: '$cpuUsagePercent' },

            // Time range
            firstRun: { $min: '$startTime' },
            lastRun: { $max: '$startTime' }
          }
        }
      ];

      const metricsResult = await this.collections.pipelineRuns
        .aggregate(metricsAggregation)
        .toArray();

      const metrics = metricsResult[0] || {
        _id: pipelineId,
        totalRuns: 0,
        successfulRuns: 0,
        failedRuns: 0,
        totalRecordsProcessed: 0,
        totalRecordsSuccessful: 0,
        totalRecordsFailed: 0,
        averageProcessingTime: 0,
        averageThroughput: 0,
        averageMemoryUsage: 0,
        averageCpuUsage: 0
      };

      // Calculate additional derived metrics
      const successRate = metrics.totalRuns > 0 ? 
        (metrics.successfulRuns / metrics.totalRuns) * 100 : 0;

      const dataQualityRate = metrics.totalRecordsProcessed > 0 ? 
        (metrics.totalRecordsSuccessful / metrics.totalRecordsProcessed) * 100 : 0;

      return {
        success: true,
        pipelineId: pipelineId,
        timeRange: timeRange,

        // Basic metrics
        totalRuns: metrics.totalRuns,
        successfulRuns: metrics.successfulRuns,
        failedRuns: metrics.failedRuns,
        successRate: Math.round(successRate * 100) / 100,

        // Data processing metrics
        totalRecordsProcessed: metrics.totalRecordsProcessed,
        totalRecordsSuccessful: metrics.totalRecordsSuccessful,
        totalRecordsFailed: metrics.totalRecordsFailed,
        dataQualityRate: Math.round(dataQualityRate * 100) / 100,

        // Performance metrics
        performance: {
          averageProcessingTimeMs: Math.round(metrics.averageProcessingTime || 0),
          maxProcessingTimeMs: metrics.maxProcessingTime || 0,
          minProcessingTimeMs: metrics.minProcessingTime || 0,
          averageThroughputRps: Math.round((metrics.averageThroughput || 0) * 100) / 100,
          maxThroughputRps: Math.round((metrics.maxThroughput || 0) * 100) / 100
        },

        // Resource usage
        resourceUsage: {
          averageMemoryMB: Math.round(metrics.averageMemoryUsage || 0),
          maxMemoryMB: metrics.maxMemoryUsage || 0,
          averageCpuPercent: Math.round((metrics.averageCpuUsage || 0) * 100) / 100,
          maxCpuPercent: Math.round((metrics.maxCpuUsage || 0) * 100) / 100
        },

        // Time range
        timeSpan: {
          firstRun: metrics.firstRun,
          lastRun: metrics.lastRun,
          duration: metrics.firstRun && metrics.lastRun ? 
            metrics.lastRun.getTime() - metrics.firstRun.getTime() : 0
        },

        // Pipeline status
        currentStatus: pipeline.status,
        lastRunStatus: pipeline.lastRun ? pipeline.lastRun.status : null
      };

    } catch (error) {
      console.error(`Error getting pipeline metrics for ${pipelineId}:`, error);
      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }
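
  /*
   * Illustrative usage (hypothetical pipeline ID and time range):
   *
   *   const metrics = await pipelineManager.getPipelineMetrics(pipelineId, {
   *     startTime: '2024-06-01T00:00:00Z',
   *     endTime: '2024-06-30T23:59:59Z'
   *   });
   *   console.log(`Success rate: ${metrics.successRate}%, ` +
   *               `avg throughput: ${metrics.performance.averageThroughputRps} records/s`);
   */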

  async stopPipeline(pipelineId) {
    console.log(`Stopping pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      // Stop change streams
      for (const [streamKey, changeStream] of this.changeStreams.entries()) {
        if (streamKey.startsWith(pipelineId)) {
          await changeStream.close();
          this.changeStreams.delete(streamKey);
        }
      }

      // Update pipeline status
      pipeline.status = 'stopped';

      // Update current run if exists
      if (pipeline.lastRun && pipeline.lastRun.status === 'running') {
        await this.collections.pipelineRuns.updateOne(
          { runId: pipeline.lastRun.runId },
          {
            $set: {
              status: 'stopped',
              endTime: new Date(),
              processingTimeMs: Date.now() - pipeline.lastRun.startTime.getTime()
            }
          }
        );
      }

      this.emit('pipelineStopped', {
        pipelineId: pipelineId,
        stopTime: new Date()
      });

      return {
        success: true,
        pipelineId: pipelineId,
        status: 'stopped'
      };

    } catch (error) {
      console.error(`Error stopping pipeline ${pipelineId}:`, error);
      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  // Utility methods for data processing

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current && current[key], obj);
  }

  setNestedValue(obj, path, value) {
    const keys = path.split('.');
    const lastKey = keys.pop();
    const target = keys.reduce((current, key) => {
      if (!current[key]) current[key] = {};
      return current[key];
    }, obj);
    target[lastKey] = value;
  }

  projectFields(data, projection) {
    const result = {};
    Object.entries(projection).forEach(([field, include]) => {
      if (include) {
        const value = this.getNestedValue(data, field);
        if (value !== undefined) {
          this.setNestedValue(result, field, value);
        }
      }
    });
    return result;
  }

  matchFilter(data, filter) {
    // Simplified match implementation
    // In production, would implement full MongoDB query matching
    for (const [field, condition] of Object.entries(filter)) {
      const value = this.getNestedValue(data, field);

      if (typeof condition === 'object' && condition !== null) {
        // Handle operators like $eq, $ne, $gt, etc.
        for (const [operator, operand] of Object.entries(condition)) {
          switch (operator) {
            case '$eq':
              if (value !== operand) return null;
              break;
            case '$ne':
              if (value === operand) return null;
              break;
            case '$gt':
              if (value <= operand) return null;
              break;
            case '$gte':
              if (value < operand) return null;
              break;
            case '$lt':
              if (value >= operand) return null;
              break;
            case '$lte':
              if (value > operand) return null;
              break;
            case '$in':
              if (!operand.includes(value)) return null;
              break;
            case '$nin':
              if (operand.includes(value)) return null;
              break;
          }
        }
      } else {
        // Direct value comparison
        if (value !== condition) return null;
      }
    }

    return data;
  }

  addFields(data, fieldsToAdd) {
    const result = { ...data };

    Object.entries(fieldsToAdd).forEach(([field, expression]) => {
      // Simplified field addition
      // In production, would implement full MongoDB expression evaluation
      if (typeof expression === 'string' && expression.startsWith('$')) {
        // Reference to another field
        const referencedValue = this.getNestedValue(data, expression.slice(1));
        this.setNestedValue(result, field, referencedValue);
      } else {
        // Literal value
        this.setNestedValue(result, field, expression);
      }
    });

    return result;
  }

  generatePipelineId(name) {
    return `pipeline_${name.toLowerCase().replace(/\s+/g, '_')}_${Date.now()}`;
  }

  generateRunId() {
    return `run_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  async validatePipelineDefinition(definition) {
    // Validate required fields
    if (!definition.name) {
      throw new Error('Pipeline name is required');
    }

    if (!definition.dataSources || definition.dataSources.length === 0) {
      throw new Error('At least one data source is required');
    }

    if (!definition.dataTargets || definition.dataTargets.length === 0) {
      throw new Error('At least one data target is required');
    }

    // Add more validation as needed
    return definition;
  }

  async updateRunMetrics(runRecord, metrics) {
    try {
      const updateData = {};

      if (metrics.recordsProcessed) {
        updateData.$inc = { recordsProcessed: metrics.recordsProcessed };
      }

      if (metrics.recordsSuccessful) {
        updateData.$inc = { ...updateData.$inc, recordsSuccessful: metrics.recordsSuccessful };
      }

      if (metrics.recordsFailed) {
        updateData.$inc = { ...updateData.$inc, recordsFailed: metrics.recordsFailed };
      }

      if (metrics.processingTimeMs) {
        updateData.$set = { 
          lastProcessingTime: metrics.processingTimeMs,
          lastUpdateTime: new Date()
        };
      }

      if (Object.keys(updateData).length > 0) {
        await this.collections.pipelineRuns.updateOne(
          { runId: runRecord.runId },
          updateData
        );
      }

    } catch (error) {
      console.error('Error updating run metrics:', error);
    }
  }

  async handlePipelineError(pipelineId, runRecord, error) {
    console.error(`Pipeline error for ${pipelineId}:`, error);

    try {
      // Record error
      const errorRecord = {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        errorTime: new Date(),
        errorType: error.constructor.name,
        errorMessage: error.message,
        errorStack: error.stack,

        // Context information
        processingContext: {
          recordsProcessedBeforeError: runRecord.recordsProcessed,
          runDuration: Date.now() - runRecord.startTime.getTime()
        }
      };

      await this.collections.pipelineErrors.insertOne(errorRecord);

      // Update run status
      await this.collections.pipelineRuns.updateOne(
        { runId: runRecord.runId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            errorMessage: error.message
          },
          $push: { errors: errorRecord }
        }
      );

      // Update pipeline status
      const pipeline = this.pipelines.get(pipelineId);
      if (pipeline) {
        pipeline.status = 'error';
        pipeline.statistics.failedRuns++;
      }

      this.emit('pipelineError', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        error: errorRecord
      });

    } catch (recordingError) {
      console.error('Error recording pipeline error:', recordingError);
    }
  }

  async validateData(data, qualityRules) {
    // Implement data quality validation logic
    const validationResult = {
      isValid: true,
      errors: [],
      warnings: []
    };

    // Apply quality rules
    for (const rule of qualityRules) {
      try {
        const ruleResult = await this.applyQualityRule(data, rule);
        if (!ruleResult.passed) {
          validationResult.isValid = false;
          validationResult.errors.push({
            rule: rule.name,
            message: ruleResult.message,
            field: rule.field,
            value: this.getNestedValue(data, rule.field)
          });
        }
      } catch (error) {
        validationResult.warnings.push({
          rule: rule.name,
          message: `Rule validation failed: ${error.message}`
        });
      }
    }

    return validationResult;
  }

  async applyQualityRule(data, rule) {
    // Implement specific quality rule logic
    switch (rule.type) {
      case 'required':
        const value = this.getNestedValue(data, rule.field);
        const isPresent = value !== null && value !== undefined && value !== '';
        return {
          passed: isPresent,
          message: isPresent ? 'Field is present' : `Required field '${rule.field}' is missing`
        };

      case 'type':
        const fieldValue = this.getNestedValue(data, rule.field);
        const actualType = typeof fieldValue;
        return {
          passed: actualType === rule.expectedType,
          message: actualType === rule.expectedType ? 
            'Type validation passed' : 
            `Expected type '${rule.expectedType}' but got '${actualType}'`
        };

      case 'range':
        const numericValue = this.getNestedValue(data, rule.field);
        const inRange = numericValue >= rule.min && numericValue <= rule.max;
        return {
          passed: inRange,
          message: inRange ? 
            'Value is within range' : 
            `Value ${numericValue} is outside range [${rule.min}, ${rule.max}]`
        };

      default:
        return { passed: true, message: 'Unknown rule type' };
    }
  }
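
  // Hypothetical helper showing how validateData() consumes quality rules; the
  // rule set and the dead letter collection name are illustrative assumptions.
  async validateCustomerRecord(record) {
    const rules = [
      { name: 'customer_id_present', type: 'required', field: 'customer_id' },
      { name: 'email_is_string', type: 'type', field: 'email', expectedType: 'string' },
      { name: 'ltv_in_range', type: 'range', field: 'lifetime_value', min: 0, max: 50000 }
    ];

    const result = await this.validateData(record, rules);
    if (!result.isValid) {
      // Park invalid records instead of writing them to the target collection
      await this.db.collection('pipeline_validation_failures').insertOne({
        record: record,
        errors: result.errors,
        failedAt: new Date()
      });
    }

    return result;
  }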

  async recordDataLineage(pipelineId, originalData, processedData, definition) {
    try {
      const lineageRecord = {
        pipelineId: pipelineId,
        timestamp: new Date(),

        // Data sources
        dataSources: definition.dataSources.map(source => ({
          collection: source.collection,
          database: source.database || this.db.databaseName
        })),

        // Data targets
        dataTargets: definition.dataTargets.map(target => ({
          collection: target.collection,
          database: target.database || this.db.databaseName,
          type: target.type
        })),

        // Transformation metadata
        transformations: definition.transformationStages.map(stage => ({
          type: stage.type,
          applied: true
        })),

        // Data checksums for integrity verification
        originalDataChecksum: this.calculateChecksum(originalData),
        processedDataChecksum: this.calculateChecksum(processedData),

        // Record identifiers
        originalRecordId: originalData._id || originalData.id,
        processedRecordId: processedData._id || processedData.id
      };

      await this.collections.dataLineage.insertOne(lineageRecord);

    } catch (error) {
      console.error('Error recording data lineage:', error);
      // Don't throw - lineage recording shouldn't stop pipeline execution
    }
  }

  calculateChecksum(data) {
    // Simple checksum calculation for demonstration
    // In production, would use proper hashing algorithm
    const dataString = JSON.stringify(data, Object.keys(data).sort());
    let hash = 0;
    for (let i = 0; i < dataString.length; i++) {
      const char = dataString.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32bit integer
    }
    return hash.toString(36);
  }
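
  // The checksum above is for demonstration only. A hedged alternative using Node's
  // built-in crypto module (the method name is hypothetical, not part of the original class):
  calculateSha256Checksum(data) {
    const { createHash } = require('crypto');
    // Sort top-level keys so logically equal documents produce the same hash
    const dataString = JSON.stringify(data, Object.keys(data).sort());
    return createHash('sha256').update(dataString).digest('hex');
  }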

  async shutdown() {
    console.log('Shutting down data pipeline manager...');

    try {
      // Stop all running pipelines
      for (const [pipelineId, pipeline] of this.pipelines.entries()) {
        if (pipeline.status === 'running') {
          await this.stopPipeline(pipelineId);
        }
      }

      // Close all change streams
      for (const [streamKey, changeStream] of this.changeStreams.entries()) {
        await changeStream.close();
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Data pipeline manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }

  // Additional methods would include implementations for:
  // - setupChangeStreams() (a minimal sketch follows this list)
  // - startPipelineMonitoring()
  // - startScheduledPipeline()
  // - startBatchPipeline()
  // - applyLookupEnrichment()
  // - applyCalculationEnrichment()
  // - applyExternalApiEnrichment()
  // - applyFiltering()
  // - applyNormalization()
  // - applyCustomFunction()
  // - writeToFile()
  // - writeToExternalAPI()
  // - writeToMessageQueue()
  // - handleValidationError()
  // - handleProcessingError()
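
  // As one hedged illustration of the omitted helpers above, a minimal
  // setupChangeStreams() could open one change stream per configured data source.
  // The 'sourceChange' event name and the stream key format are assumptions for
  // illustration, not the full implementation.
  async setupChangeStreams(pipelineId, definition) {
    for (const source of definition.dataSources) {
      const collection = this.db.collection(source.collection);

      // Watch only the operations this source cares about
      const matchStage = source.filter ? [{ $match: source.filter }] : [];
      const changeStream = collection.watch(matchStage, { fullDocument: 'updateLookup' });

      changeStream.on('change', (changeEvent) => {
        // Hand the raw change event to downstream per-record processing
        this.emit('sourceChange', { pipelineId, source: source.name, changeEvent });
      });

      // Keyed by pipelineId so stopPipeline() can locate and close these streams
      this.changeStreams.set(`${pipelineId}:${source.collection}`, changeStream);
    }
  }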
}

// Benefits of MongoDB Advanced Data Pipeline Management:
// - Real-time stream processing with Change Streams
// - Sophisticated data transformation and enrichment capabilities  
// - Comprehensive error handling and recovery mechanisms
// - Built-in data quality validation and monitoring
// - Automatic scalability and performance optimization
// - Data lineage tracking and audit capabilities
// - Flexible pipeline orchestration and scheduling
// - SQL-compatible operations through QueryLeaf integration
// - Production-ready monitoring and observability features
// - Enterprise-grade reliability and fault tolerance

module.exports = {
  AdvancedDataPipelineManager
};
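
A minimal usage sketch for the manager above (the module path, connection string, and configuration key are illustrative assumptions; only shutdown() is taken directly from the class):

// Hypothetical wiring for the pipeline manager exported above
const { AdvancedDataPipelineManager } = require('./advanced-data-pipeline-manager');

const manager = new AdvancedDataPipelineManager('mongodb://localhost:27017', {
  enableDataLineage: true // assumed configuration flag
});

// Close change streams, running pipelines, and the client on process shutdown
process.on('SIGTERM', async () => {
  await manager.shutdown();
  process.exit(0);
});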

Advanced Stream Processing Patterns

Real-Time Data Transformation and Analytics

Implement sophisticated stream processing for real-time data analytics:

// Advanced real-time stream processing and analytics
class RealTimeStreamProcessor extends AdvancedDataPipelineManager {
  constructor(mongoUri, streamConfig) {
    super(mongoUri, streamConfig);

    this.streamConfig = {
      ...streamConfig,
      enableWindowedProcessing: true,
      enableEventTimeProcessing: true,
      enableComplexEventProcessing: true,
      enableStreamAggregation: true
    };

    this.windowManager = new Map();
    this.eventPatterns = new Map();
    this.streamState = new Map();

    this.setupStreamProcessing();
  }

  async processEventStream(streamDefinition) {
    console.log('Setting up advanced event stream processing...');

    try {
      const streamProcessor = {
        streamId: this.generateStreamId(streamDefinition.name),
        definition: streamDefinition,

        // Windowing configuration
        windowConfig: {
          type: streamDefinition.windowType || 'tumbling', // tumbling, hopping, sliding
          size: streamDefinition.windowSize || 60000, // 1 minute
          advance: streamDefinition.windowAdvance || 30000 // 30 seconds
        },

        // Processing configuration
        processingConfig: {
          enableLateEvents: streamDefinition.enableLateEvents || false, // accept events that arrive after the watermark delay
          watermarkDelay: streamDefinition.watermarkDelay || 5000,
          enableExactlyOnceProcessing: streamDefinition.enableExactlyOnceProcessing || false
        },

        // Analytics configuration
        analyticsConfig: {
          enableAggregation: streamDefinition.enableAggregation !== false,
          enablePatternDetection: streamDefinition.enablePatternDetection || false,
          enableAnomalyDetection: streamDefinition.enableAnomalyDetection || false,
          enableTrendAnalysis: streamDefinition.enableTrendAnalysis || false
        }
      };

      return await this.deployStreamProcessor(streamProcessor);

    } catch (error) {
      console.error('Error processing event stream:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async deployStreamProcessor(streamProcessor) {
    console.log(`Deploying stream processor: ${streamProcessor.streamId}`);

    try {
      // Setup windowed processing
      if (this.streamConfig.enableWindowedProcessing) {
        await this.setupWindowedProcessing(streamProcessor);
      }

      // Setup complex event processing
      if (this.streamConfig.enableComplexEventProcessing) {
        await this.setupComplexEventProcessing(streamProcessor);
      }

      // Setup stream aggregation
      if (this.streamConfig.enableStreamAggregation) {
        await this.setupStreamAggregation(streamProcessor);
      }

      return {
        success: true,
        streamId: streamProcessor.streamId,
        processorConfig: streamProcessor
      };

    } catch (error) {
      console.error(`Error deploying stream processor ${streamProcessor.streamId}:`, error);
      return {
        success: false,
        streamId: streamProcessor.streamId,
        error: error.message
      };
    }
  }
}
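
As a usage sketch, an event stream definition can be handed to processEventStream() as shown below. The definition values are illustrative, and the example assumes the helper methods the class references (such as setupStreamProcessing()) are implemented:

// Hypothetical deployment of an event stream processor
const processor = new RealTimeStreamProcessor('mongodb://localhost:27017', {
  enableAnomalyDetection: true // assumed configuration flag
});

(async () => {
  const result = await processor.processEventStream({
    name: 'order_events',   // used to derive the stream id
    windowType: 'tumbling', // tumbling, hopping, or sliding
    windowSize: 60000,      // 1-minute windows
    windowAdvance: 30000,
    watermarkDelay: 5000,
    enableAggregation: true,
    enablePatternDetection: true
  });

  if (!result.success) {
    console.error('Stream deployment failed:', result.error);
  }
})();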

SQL-Style Data Pipeline Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB data pipeline management:

-- QueryLeaf advanced data pipeline operations with SQL-familiar syntax for MongoDB

-- Pipeline definition and configuration
CREATE OR REPLACE PIPELINE customer_data_enrichment_pipeline
AS
WITH pipeline_config AS (
    -- Pipeline metadata and configuration
    SELECT 
        'customer_data_enrichment' as pipeline_name,
        'stream' as processing_mode,
        'change_stream' as trigger_type,
        true as auto_start,

        -- Performance configuration
        1000 as batch_size,
        5 as max_concurrency,
        300000 as timeout_ms,

        -- Quality configuration
        true as enable_data_validation,
        true as enable_schema_evolution,
        true as enable_data_lineage,

        -- Error handling
        'exponential_backoff' as retry_strategy,
        3 as max_retries,
        true as dead_letter_queue
),

data_sources AS (
    -- Define data sources for pipeline
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'customer_changes',
            'collection', 'customers',
            'database', 'ecommerce',
            'filter', JSON_BUILD_OBJECT(
                'operationType', JSON_BUILD_OBJECT('$in', ARRAY['insert', 'update'])
            ),
            'change_stream_options', JSON_BUILD_OBJECT(
                'fullDocument', 'updateLookup',
                'fullDocumentBeforeChange', 'whenAvailable'
            )
        ),
        JSON_BUILD_OBJECT(
            'name', 'order_changes',
            'collection', 'orders',
            'database', 'ecommerce',
            'filter', JSON_BUILD_OBJECT(
                'fullDocument.customer_id', JSON_BUILD_OBJECT('$exists', true)
            )
        )
    ] as sources
),

transformation_stages AS (
    -- Define transformation pipeline stages
    SELECT ARRAY[
        -- Stage 1: Data enrichment with external lookups
        JSON_BUILD_OBJECT(
            'type', 'data_enrichment',
            'name', 'customer_profile_enrichment',
            'enrichment_config', JSON_BUILD_OBJECT(
                'enrichments', ARRAY[
                    JSON_BUILD_OBJECT(
                        'type', 'lookup',
                        'lookup_collection', 'customer_profiles',
                        'lookup_field', 'customer_id',
                        'source_field', 'fullDocument.customer_id',
                        'target_field', 'customer_profile'
                    ),
                    JSON_BUILD_OBJECT(
                        'type', 'calculation',
                        'calculations', ARRAY[
                            JSON_BUILD_OBJECT(
                                'field', 'customer_lifetime_value',
                                'expression', 'customer_profile.total_orders * customer_profile.avg_order_value'
                            ),
                            JSON_BUILD_OBJECT(
                                'field', 'customer_segment',
                                'expression', 'CASE WHEN customer_lifetime_value > 1000 THEN "premium" WHEN customer_lifetime_value > 500 THEN "standard" ELSE "basic" END'
                            )
                        ]
                    )
                ]
            )
        ),

        -- Stage 2: Field mapping and normalization
        JSON_BUILD_OBJECT(
            'type', 'field_mapping',
            'name', 'customer_data_mapping',
            'field_mappings', JSON_BUILD_OBJECT(
                'customer_id', 'fullDocument.customer_id',
                'customer_email', 'fullDocument.email',
                'customer_name', 'fullDocument.full_name',
                'customer_phone', 'fullDocument.phone_number',
                'registration_date', 'fullDocument.created_at',
                'last_login', 'fullDocument.last_login_at',
                'profile_completion', 'customer_profile.completion_percentage',
                'lifetime_value', 'customer_lifetime_value',
                'segment', 'customer_segment',
                'change_type', 'operationType',
                'change_timestamp', 'clusterTime'
            )
        ),

        -- Stage 3: Data validation and quality checks
        JSON_BUILD_OBJECT(
            'type', 'data_validation',
            'name', 'customer_data_validation',
            'validation_rules', ARRAY[
                JSON_BUILD_OBJECT(
                    'field', 'customer_email',
                    'type', 'email',
                    'required', true
                ),
                JSON_BUILD_OBJECT(
                    'field', 'customer_phone',
                    'type', 'phone',
                    'required', false
                ),
                JSON_BUILD_OBJECT(
                    'field', 'lifetime_value',
                    'type', 'numeric',
                    'min_value', 0,
                    'max_value', 100000
                )
            ]
        ),

        -- Stage 4: Aggregation for analytics
        JSON_BUILD_OBJECT(
            'type', 'aggregation',
            'name', 'customer_analytics_aggregation',
            'aggregation_pipeline', ARRAY[
                JSON_BUILD_OBJECT(
                    '$addFields', JSON_BUILD_OBJECT(
                        'processing_date', '$$NOW',
                        'data_freshness_score', JSON_BUILD_OBJECT(
                            '$subtract', ARRAY[100, JSON_BUILD_OBJECT(
                                '$divide', ARRAY[
                                    JSON_BUILD_OBJECT('$subtract', ARRAY['$$NOW', '$change_timestamp']),
                                    3600000  -- Convert to hours
                                ]
                            )]
                        ),
                        'engagement_score', JSON_BUILD_OBJECT(
                            '$multiply', ARRAY[
                                '$profile_completion',
                                JSON_BUILD_OBJECT('$cond', ARRAY[
                                    JSON_BUILD_OBJECT('$ne', ARRAY['$last_login', NULL]),
                                    1.2,  -- Boost for active users
                                    1.0
                                ])
                            ]
                        )
                    )
                ),
                JSON_BUILD_OBJECT(
                    '$addFields', JSON_BUILD_OBJECT(
                        'customer_score', JSON_BUILD_OBJECT(
                            '$add', ARRAY[
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$lifetime_value', 0.4]),
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$engagement_score', 0.3]),
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$data_freshness_score', 0.3])
                            ]
                        )
                    )
                )
            ]
        )
    ] as stages
),

data_targets AS (
    -- Define output destinations
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'enriched_customers',
            'type', 'mongodb_collection',
            'collection', 'enriched_customers',
            'database', 'analytics',
            'write_mode', 'upsert',
            'upsert_filter', JSON_BUILD_OBJECT('customer_id', '$customer_id')
        ),
        JSON_BUILD_OBJECT(
            'name', 'customer_analytics_stream',
            'type', 'message_queue',
            'queue_name', 'customer_analytics',
            'format', 'json',
            'partition_key', 'customer_segment'
        ),
        JSON_BUILD_OBJECT(
            'name', 'data_warehouse_export',
            'type', 'file',
            'file_path', '/data/exports/customer_enrichment',
            'format', 'parquet',
            'partition_by', ARRAY['segment', 'processing_date']
        )
    ] as targets
),

data_quality_rules AS (
    -- Define comprehensive data quality rules
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'required_customer_id',
            'type', 'required',
            'field', 'customer_id',
            'severity', 'critical'
        ),
        JSON_BUILD_OBJECT(
            'name', 'valid_email_format',
            'type', 'regex',
            'field', 'customer_email',
            'pattern', '^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$',
            'severity', 'high'
        ),
        JSON_BUILD_OBJECT(
            'name', 'reasonable_lifetime_value',
            'type', 'range',
            'field', 'lifetime_value',
            'min', 0,
            'max', 50000,
            'severity', 'medium'
        ),
        JSON_BUILD_OBJECT(
            'name', 'valid_customer_segment',
            'type', 'enum',
            'field', 'segment',
            'allowed_values', ARRAY['premium', 'standard', 'basic'],
            'severity', 'high'
        )
    ] as rules
)

-- Create the pipeline with comprehensive configuration
SELECT 
    'customer_data_enrichment_pipeline' as pipeline_name,
    pipeline_config.*,
    data_sources.sources,
    transformation_stages.stages,
    data_targets.targets,
    data_quality_rules.rules,

    -- Pipeline scheduling
    JSON_BUILD_OBJECT(
        'schedule_type', 'real_time',
        'trigger_conditions', ARRAY[
            'customer_data_change',
            'order_completion',
            'profile_update'
        ]
    ) as scheduling_config,

    -- Monitoring configuration  
    JSON_BUILD_OBJECT(
        'enable_metrics', true,
        'metrics_interval_seconds', 60,
        'alert_thresholds', JSON_BUILD_OBJECT(
            'error_rate_percent', 5,
            'processing_latency_ms', 5000,
            'throughput_records_per_second', 100
        ),
        'notification_channels', ARRAY[
            'email:[email protected]',
            'slack:#data-pipelines',
            'webhook:https://monitoring.company.com/alerts'
        ]
    ) as monitoring_config

FROM pipeline_config, data_sources, transformation_stages, data_targets, data_quality_rules;

-- Pipeline execution and monitoring queries

-- Real-time pipeline performance monitoring
WITH pipeline_performance AS (
    SELECT 
        pipeline_id,
        pipeline_name,
        run_id,
        start_time,
        end_time,
        status,

        -- Processing metrics
        records_processed,
        records_successful,
        records_failed,

        -- Performance calculations
        EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time)) * 1000 as duration_ms,

        -- Throughput calculation
        CASE 
            WHEN EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time)) > 0 THEN
                records_processed / EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time))
            ELSE 0
        END as throughput_records_per_second,

        -- Success rate
        CASE 
            WHEN records_processed > 0 THEN 
                (records_successful * 100.0) / records_processed
            ELSE 0
        END as success_rate_percent,

        -- Resource utilization
        memory_usage_mb,
        cpu_usage_percent,

        -- Current processing lag
        CASE 
            WHEN status = 'running' THEN 
                EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_processed_timestamp))
            ELSE NULL
        END as current_lag_seconds

    FROM pipeline_runs
    WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
),

performance_summary AS (
    SELECT 
        pipeline_name,
        COUNT(*) as total_runs,
        COUNT(*) FILTER (WHERE status = 'completed') as successful_runs,
        COUNT(*) FILTER (WHERE status = 'failed') as failed_runs,
        COUNT(*) FILTER (WHERE status = 'running') as active_runs,

        -- Aggregate performance metrics
        SUM(records_processed) as total_records_processed,
        SUM(records_successful) as total_records_successful,
        SUM(records_failed) as total_records_failed,

        -- Performance statistics
        AVG(duration_ms) as avg_duration_ms,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration_ms,
        AVG(throughput_records_per_second) as avg_throughput_rps,
        MAX(throughput_records_per_second) as max_throughput_rps,

        -- Quality metrics
        AVG(success_rate_percent) as avg_success_rate,
        MIN(success_rate_percent) as min_success_rate,

        -- Resource usage
        AVG(memory_usage_mb) as avg_memory_usage_mb,
        MAX(memory_usage_mb) as max_memory_usage_mb,
        AVG(cpu_usage_percent) as avg_cpu_usage,
        MAX(cpu_usage_percent) as max_cpu_usage,

        -- Lag analysis
        AVG(current_lag_seconds) as avg_processing_lag_seconds,
        MAX(current_lag_seconds) as max_processing_lag_seconds

    FROM pipeline_performance
    GROUP BY pipeline_name
)

SELECT 
    pipeline_name,
    total_runs,
    successful_runs,
    failed_runs,
    active_runs,

    -- Overall health assessment
    CASE 
        WHEN failed_runs > total_runs * 0.1 THEN 'critical'
        WHEN avg_success_rate < 95 THEN 'warning'
        WHEN avg_processing_lag_seconds > 300 THEN 'warning'  -- 5 minutes lag
        WHEN max_cpu_usage > 90 OR max_memory_usage_mb > 4096 THEN 'warning'
        ELSE 'healthy'
    END as health_status,

    -- Processing statistics
    total_records_processed,
    total_records_successful,
    total_records_failed,

    -- Performance metrics
    ROUND(avg_duration_ms, 0) as avg_duration_ms,
    ROUND(p95_duration_ms, 0) as p95_duration_ms,
    ROUND(avg_throughput_rps, 2) as avg_throughput_rps,
    ROUND(max_throughput_rps, 2) as max_throughput_rps,

    -- Quality and reliability
    ROUND(avg_success_rate, 2) as avg_success_rate_percent,
    ROUND(min_success_rate, 2) as min_success_rate_percent,

    -- Resource utilization
    ROUND(avg_memory_usage_mb, 0) as avg_memory_usage_mb,
    ROUND(max_memory_usage_mb, 0) as max_memory_usage_mb,
    ROUND(avg_cpu_usage, 1) as avg_cpu_usage_percent,
    ROUND(max_cpu_usage, 1) as max_cpu_usage_percent,

    -- Processing lag indicators
    COALESCE(ROUND(avg_processing_lag_seconds, 0), 0) as avg_lag_seconds,
    COALESCE(ROUND(max_processing_lag_seconds, 0), 0) as max_lag_seconds,

    -- Operational recommendations
    CASE 
        WHEN failed_runs > total_runs * 0.05 THEN 'investigate_errors'
        WHEN avg_throughput_rps < 50 THEN 'optimize_performance'
        WHEN max_cpu_usage > 80 THEN 'scale_up_resources'
        WHEN avg_processing_lag_seconds > 120 THEN 'reduce_processing_latency'
        ELSE 'monitor_continued'
    END as recommendation,

    -- Capacity planning
    CASE 
        WHEN max_throughput_rps / avg_throughput_rps < 1.5 THEN 'add_capacity'
        WHEN max_memory_usage_mb > 3072 THEN 'increase_memory'
        WHEN active_runs > 1 THEN 'check_concurrency_limits'
        ELSE 'capacity_sufficient'
    END as capacity_recommendation

FROM performance_summary
ORDER BY 
    CASE health_status 
        WHEN 'critical' THEN 1 
        WHEN 'warning' THEN 2 
        ELSE 3 
    END,
    total_records_processed DESC;

-- Data lineage and quality tracking
WITH data_lineage_analysis AS (
    SELECT 
        pipeline_id,
        DATE_TRUNC('hour', timestamp) as processing_hour,

        -- Source and target tracking
        JSONB_ARRAY_ELEMENTS(data_sources) ->> 'collection' as source_collection,
        JSONB_ARRAY_ELEMENTS(data_targets) ->> 'collection' as target_collection,

        -- Data quality metrics
        COUNT(*) as total_transformations,
        COUNT(*) FILTER (WHERE original_data_checksum != processed_data_checksum) as data_modified,
        COUNT(DISTINCT original_record_id) as unique_source_records,
        COUNT(DISTINCT processed_record_id) as unique_target_records,

        -- Transformation tracking
        JSONB_ARRAY_ELEMENTS(transformations) ->> 'type' as transformation_type,
        COUNT(*) FILTER (WHERE (JSONB_ARRAY_ELEMENTS(transformations) ->> 'applied')::boolean = true) as transformations_applied,

        -- Data integrity checks
        COUNT(*) FILTER (WHERE original_data_checksum IS NOT NULL AND processed_data_checksum IS NOT NULL) as checksum_validations,

        -- Processing metadata
        MIN(timestamp) as first_processing,
        MAX(timestamp) as last_processing

    FROM data_lineage
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 
        pipeline_id, 
        DATE_TRUNC('hour', timestamp),
        JSONB_ARRAY_ELEMENTS(data_sources) ->> 'collection',
        JSONB_ARRAY_ELEMENTS(data_targets) ->> 'collection',
        JSONB_ARRAY_ELEMENTS(transformations) ->> 'type'
),

quality_summary AS (
    SELECT 
        pipeline_id,
        source_collection,
        target_collection,
        transformation_type,

        -- Aggregated metrics
        SUM(total_transformations) as total_transformations,
        SUM(data_modified) as total_data_modified,
        SUM(unique_source_records) as total_source_records,
        SUM(unique_target_records) as total_target_records,
        SUM(transformations_applied) as total_transformations_applied,
        SUM(checksum_validations) as total_checksum_validations,

        -- Data quality calculations
        CASE 
            WHEN SUM(total_transformations) > 0 THEN
                (SUM(transformations_applied) * 100.0) / SUM(total_transformations)
            ELSE 0
        END as transformation_success_rate,

        CASE 
            WHEN SUM(unique_source_records) > 0 THEN
                (SUM(unique_target_records) * 100.0) / SUM(unique_source_records)
            ELSE 0
        END as record_completeness_rate,

        -- Data modification analysis
        CASE 
            WHEN SUM(total_transformations) > 0 THEN
                (SUM(data_modified) * 100.0) / SUM(total_transformations)
            ELSE 0
        END as data_modification_rate,

        -- Processing consistency
        COUNT(DISTINCT processing_hour) as processing_hours_active,
        AVG(EXTRACT(EPOCH FROM (last_processing - first_processing)) / 60) as avg_processing_window_minutes

    FROM data_lineage_analysis
    GROUP BY pipeline_id, source_collection, target_collection, transformation_type
)

SELECT 
    pipeline_id,
    source_collection,
    target_collection,
    transformation_type,

    -- Volume metrics
    total_source_records,
    total_target_records,
    total_transformations,
    total_transformations_applied,

    -- Quality scores
    ROUND(transformation_success_rate, 2) as transformation_success_percent,
    ROUND(record_completeness_rate, 2) as record_completeness_percent,
    ROUND(data_modification_rate, 2) as data_modification_percent,

    -- Data integrity assessment
    total_checksum_validations,
    CASE 
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 98 THEN 'excellent'
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 95 THEN 'good'
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 90 THEN 'acceptable'
        ELSE 'needs_attention'
    END as data_quality_rating,

    -- Processing consistency
    processing_hours_active,
    ROUND(avg_processing_window_minutes, 1) as avg_processing_window_minutes,

    -- Operational insights
    CASE 
        WHEN record_completeness_rate < 98 THEN 'investigate_data_loss'
        WHEN transformation_success_rate < 95 THEN 'review_transformation_logic'
        WHEN data_modification_rate > 80 THEN 'validate_transformation_accuracy'
        WHEN avg_processing_window_minutes > 60 THEN 'optimize_processing_speed'
        ELSE 'quality_acceptable'
    END as quality_recommendation,

    -- Data flow health
    CASE 
        WHEN record_completeness_rate > 99 AND transformation_success_rate > 98 THEN 'healthy'
        WHEN record_completeness_rate > 95 AND transformation_success_rate > 95 THEN 'stable'
        WHEN record_completeness_rate > 90 AND transformation_success_rate > 90 THEN 'concerning'
        ELSE 'critical'
    END as data_flow_health

FROM quality_summary
WHERE total_transformations > 0
ORDER BY 
    CASE data_flow_health 
        WHEN 'critical' THEN 1 
        WHEN 'concerning' THEN 2 
        WHEN 'stable' THEN 3 
        ELSE 4 
    END,
    total_source_records DESC;

-- Error analysis and troubleshooting
SELECT 
    pe.pipeline_id,
    pe.run_id,
    pe.error_time,
    pe.error_type,
    pe.error_message,

    -- Error context
    pe.processing_context ->> 'recordsProcessedBeforeError' as records_before_error,
    pe.processing_context ->> 'runDuration' as run_duration_before_error,

    -- Error frequency analysis
    COUNT(*) OVER (
        PARTITION BY pe.pipeline_id, pe.error_type 
        ORDER BY pe.error_time 
        RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as similar_errors_last_hour,

    -- Error pattern detection
    LAG(pe.error_time) OVER (
        PARTITION BY pe.pipeline_id, pe.error_type 
        ORDER BY pe.error_time
    ) as previous_similar_error,

    -- Pipeline run context
    pr.start_time as run_start_time,
    pr.records_processed as total_run_records,
    pr.status as run_status,

    -- Resolution tracking
    CASE 
        WHEN pe.error_type IN ('ValidationError', 'SchemaError') THEN 'data_quality_issue'
        WHEN pe.error_type IN ('ConnectionError', 'TimeoutError') THEN 'infrastructure_issue'
        WHEN pe.error_type IN ('TransformationError', 'ProcessingError') THEN 'logic_issue'
        ELSE 'unknown_category'
    END as error_category,

    -- Priority assessment
    CASE 
        WHEN COUNT(*) OVER (PARTITION BY pe.pipeline_id, pe.error_type ORDER BY pe.error_time RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW) > 10 THEN 'high'
        WHEN pe.error_type IN ('ConnectionError', 'TimeoutError') THEN 'high'
        WHEN pr.records_processed > 1000 THEN 'medium'
        ELSE 'low'
    END as error_priority,

    -- Suggested resolution
    CASE 
        WHEN pe.error_type = 'ValidationError' THEN 'Review data quality rules and source data format'
        WHEN pe.error_type = 'ConnectionError' THEN 'Check database connectivity and network stability'
        WHEN pe.error_type = 'TimeoutError' THEN 'Increase timeout values or optimize query performance'
        WHEN pe.error_type = 'TransformationError' THEN 'Review transformation logic and test with sample data'
        ELSE 'Investigate error stack trace and contact development team'
    END as suggested_resolution

FROM pipeline_errors pe
JOIN pipeline_runs pr ON pe.run_id = pr.run_id
WHERE pe.error_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY error_priority DESC, pe.error_time DESC;

-- QueryLeaf provides comprehensive MongoDB data pipeline capabilities:
-- 1. Real-time change stream processing with SQL-familiar syntax
-- 2. Advanced data transformation and enrichment operations
-- 3. Comprehensive data quality validation and monitoring
-- 4. Pipeline orchestration and scheduling capabilities
-- 5. Data lineage tracking and audit functionality
-- 6. Error handling and troubleshooting tools
-- 7. Performance monitoring and optimization features
-- 8. Stream processing and windowed analytics
-- 9. SQL-style pipeline definition and management
-- 10. Enterprise-grade reliability and fault tolerance

Best Practices for Production Data Pipelines

Pipeline Architecture and Design Principles

Essential principles for effective MongoDB data pipeline deployment:

  1. Stream Processing Design: Implement real-time change stream processing for low-latency data operations
  2. Data Quality Management: Establish comprehensive validation rules and monitoring for data integrity
  3. Error Handling Strategy: Design robust error handling with retry mechanisms and dead letter queues (a sketch of this pattern follows the list)
  4. Performance Optimization: Optimize pipeline throughput with appropriate batching and concurrency settings
  5. Monitoring Integration: Implement comprehensive monitoring for pipeline health and data quality metrics
  6. Schema Evolution: Plan for schema changes and backward compatibility in data transformations
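
The sketch below illustrates the retry and dead letter queue pattern from point 3; the collection name, helper names, and retry limits are assumptions rather than part of the pipeline manager shown earlier:

// Minimal retry-with-backoff and dead letter queue pattern (all names hypothetical)
async function processWithRetry(db, record, processRecord, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await processRecord(record);
    } catch (error) {
      if (attempt === maxRetries) {
        // Retries exhausted: park the record for offline inspection and replay
        await db.collection('pipeline_dead_letter_queue').insertOne({
          record: record,
          error: error.message,
          attempts: attempt,
          failedAt: new Date()
        });
        return null;
      }
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
}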

Scalability and Production Operations

Optimize data pipeline operations for enterprise-scale requirements:

  1. Resource Management: Configure appropriate resource limits and scaling policies for pipeline execution
  2. Data Lineage: Track data transformations and dependencies for auditing and troubleshooting
  3. Backup and Recovery: Implement pipeline state backup and recovery mechanisms
  4. Security Integration: Ensure pipeline operations meet security and compliance requirements
  5. Operational Integration: Integrate pipeline monitoring with existing alerting and operational workflows
  6. Cost Optimization: Monitor resource usage and optimize pipeline efficiency for cost-effective operations (a TTL-based retention sketch follows the list)
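
As a hedged sketch for point 6, TTL indexes can bound the storage consumed by pipeline bookkeeping collections; the database and collection names below are illustrative assumptions:

// Expire pipeline run history and error records automatically with TTL indexes
const { MongoClient } = require('mongodb');

async function applyRetentionPolicies(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('pipeline_ops'); // assumed database name

  // Keep run history for 30 days and error records for 90 days
  await db.collection('pipeline_runs').createIndex(
    { startTime: 1 },
    { expireAfterSeconds: 30 * 24 * 3600 }
  );
  await db.collection('pipeline_errors').createIndex(
    { errorTime: 1 },
    { expireAfterSeconds: 90 * 24 * 3600 }
  );

  await client.close();
}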

Conclusion

MongoDB data pipeline management provides sophisticated real-time data processing capabilities that enable modern applications to handle complex data transformation workflows, stream processing, and ETL operations with advanced monitoring, error handling, and scalability features. The native change stream support and aggregation framework ensure that data pipelines can process high-volume data streams efficiently while maintaining data quality and reliability.

Key MongoDB Data Pipeline benefits include:

  • Real-Time Processing: Native change stream support for immediate data processing and transformation
  • Advanced Transformations: Comprehensive data transformation capabilities with aggregation framework integration
  • Data Quality Management: Built-in validation, monitoring, and quality assessment tools
  • Stream Processing: Sophisticated stream processing patterns for complex event processing and analytics
  • Pipeline Orchestration: Flexible pipeline scheduling and orchestration with error handling and recovery
  • SQL Accessibility: Familiar SQL-style pipeline operations through QueryLeaf for accessible data pipeline management

Whether you're building real-time analytics systems, data warehousing pipelines, microservices data synchronization, or complex ETL workflows, MongoDB data pipeline management with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable data processing operations.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style pipeline operations into MongoDB's native change streams and aggregation pipelines, making advanced data processing functionality accessible to SQL-oriented development teams. Complex data transformations, stream processing operations, and pipeline orchestration are seamlessly handled through familiar SQL constructs, enabling sophisticated data workflows without requiring deep MongoDB pipeline expertise.

The combination of MongoDB's robust data pipeline capabilities with SQL-style pipeline management operations makes it an ideal platform for applications requiring both sophisticated real-time data processing and familiar database management patterns, ensuring your data pipelines can scale efficiently while maintaining reliability and performance as data volume and processing complexity grow.

MongoDB Time Series Data Storage and Optimization: Advanced Temporal Data Analytics and High-Performance Storage Strategies

Modern applications generate massive volumes of time-stamped data from IoT devices, system monitoring, financial markets, user analytics, and sensor networks. Managing temporal data efficiently requires specialized storage strategies that can handle high ingestion rates, optimize storage utilization, and provide fast analytical queries across time ranges. Traditional relational databases struggle with time series workloads due to inefficient storage patterns, limited compression capabilities, and poor query performance for temporal analytics.

MongoDB's time series collections provide purpose-built capabilities for temporal data management through advanced compression algorithms, optimized storage layouts, and specialized indexing strategies. Unlike traditional approaches that require complex partitioning schemes and manual optimization, MongoDB time series collections automatically optimize storage efficiency, query performance, and analytical capabilities while maintaining schema flexibility for diverse time-stamped data formats.

The Traditional Time Series Data Challenge

Conventional approaches to time series data management in relational databases face significant limitations:

-- Traditional PostgreSQL time series data handling - inefficient storage and limited optimization

-- Basic time series table with poor storage efficiency
CREATE TABLE sensor_readings (
    reading_id SERIAL,
    device_id VARCHAR(50) NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    location VARCHAR(100),
    timestamp TIMESTAMP NOT NULL,

    -- Measurements stored as separate columns (inflexible schema)
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2),
    battery_level INTEGER,
    signal_strength INTEGER,

    -- Limited metadata support
    device_metadata JSONB,

    -- Basic audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Partition key must be part of the primary key on a partitioned table
    PRIMARY KEY (reading_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Manual partitioning setup (complex maintenance overhead)
CREATE INDEX idx_sensor_readings_timestamp ON sensor_readings(timestamp DESC);
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings(device_id, timestamp DESC);
CREATE INDEX idx_sensor_readings_type_time ON sensor_readings(sensor_type, timestamp DESC);

-- Attempt at time-based partitioning (limited automation)
DO $$
DECLARE
    start_date DATE;
    end_date DATE;
    partition_name TEXT;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months');

    WHILE start_date <= DATE_TRUNC('month', CURRENT_DATE + INTERVAL '3 months') LOOP
        end_date := start_date + INTERVAL '1 month';
        partition_name := 'sensor_readings_' || TO_CHAR(start_date, 'YYYY_MM');

        EXECUTE format('
            CREATE TABLE IF NOT EXISTS %I PARTITION OF sensor_readings
            FOR VALUES FROM (%L) TO (%L)',
            partition_name, start_date, end_date);

        start_date := end_date;
    END LOOP;
END;
$$;

-- Time series aggregation queries (inefficient for large datasets)
WITH hourly_averages AS (
    SELECT 
        device_id,
        sensor_type,
        DATE_TRUNC('hour', timestamp) as hour_bucket,

        -- Basic aggregations (limited analytical functions)
        COUNT(*) as reading_count,
        AVG(temperature) as avg_temperature,
        AVG(humidity) as avg_humidity,
        AVG(pressure) as avg_pressure,
        MIN(temperature) as min_temperature,
        MAX(temperature) as max_temperature,

        -- Standard deviation calculations (expensive)
        STDDEV(temperature) as temp_stddev,
        STDDEV(humidity) as humidity_stddev,

        -- Battery and connectivity metrics
        AVG(battery_level) as avg_battery,
        AVG(signal_strength) as avg_signal_strength,

        -- Data quality metrics
        COUNT(*) FILTER (WHERE temperature IS NOT NULL) as valid_temp_readings,
        COUNT(*) FILTER (WHERE humidity IS NOT NULL) as valid_humidity_readings

    FROM sensor_readings sr
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND timestamp < CURRENT_TIMESTAMP
    GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

daily_summaries AS (
    SELECT 
        device_id,
        sensor_type,
        DATE_TRUNC('day', hour_bucket) as day_bucket,

        -- Aggregation of aggregations (double computation overhead)
        SUM(reading_count) as total_readings_per_day,
        AVG(avg_temperature) as daily_avg_temperature,
        MIN(min_temperature) as daily_min_temperature,
        MAX(max_temperature) as daily_max_temperature,
        AVG(avg_humidity) as daily_avg_humidity,
        AVG(avg_pressure) as daily_avg_pressure,

        -- Battery consumption analysis
        MIN(avg_battery) as daily_min_battery,
        AVG(avg_battery) as daily_avg_battery,

        -- Connectivity quality
        AVG(avg_signal_strength) as daily_avg_signal,

        -- Data completeness metrics
        ROUND(
            (SUM(valid_temp_readings) * 100.0) / NULLIF(SUM(reading_count), 0), 2
        ) as temperature_data_completeness_percent,

        ROUND(
            (SUM(valid_humidity_readings) * 100.0) / NULLIF(SUM(reading_count), 0), 2
        ) as humidity_data_completeness_percent

    FROM hourly_averages
    GROUP BY device_id, sensor_type, DATE_TRUNC('day', hour_bucket)
),

device_health_analysis AS (
    -- Complex analysis requiring multiple scans
    SELECT 
        ds.device_id,
        ds.sensor_type,
        COUNT(*) as analysis_days,

        -- Temperature trend analysis (limited analytical capabilities)
        AVG(ds.daily_avg_temperature) as overall_avg_temperature,
        STDDEV(ds.daily_avg_temperature) as temperature_variability,

        -- Battery degradation analysis
        CASE 
            WHEN COUNT(*) > 1 THEN
                -- Simple linear trend approximation
                (MAX(ds.daily_avg_battery) - MIN(ds.daily_avg_battery)) / NULLIF(COUNT(*) - 1, 0)
            ELSE NULL
        END as daily_battery_degradation_rate,

        -- Connectivity stability
        AVG(ds.daily_avg_signal) as avg_connectivity,
        STDDEV(ds.daily_avg_signal) as connectivity_stability,

        -- Data quality assessment
        AVG(ds.temperature_data_completeness_percent) as avg_data_completeness,

        -- Device status classification
        CASE 
            WHEN AVG(ds.daily_avg_battery) < 20 THEN 'low_battery'
            WHEN AVG(ds.daily_avg_signal) < 30 THEN 'poor_connectivity'  
            WHEN AVG(ds.temperature_data_completeness_percent) < 80 THEN 'unreliable_data'
            ELSE 'healthy'
        END as device_status,

        -- Alert generation
        ARRAY[
            CASE WHEN AVG(ds.daily_avg_battery) < 15 THEN 'CRITICAL_BATTERY' END,
            CASE WHEN AVG(ds.daily_avg_signal) < 20 THEN 'CRITICAL_CONNECTIVITY' END,
            CASE WHEN AVG(ds.temperature_data_completeness_percent) < 50 THEN 'DATA_QUALITY_ISSUE' END,
            CASE WHEN STDDEV(ds.daily_avg_temperature) > 10 THEN 'TEMPERATURE_ANOMALY' END
        ]::TEXT[] as active_alerts

    FROM daily_summaries ds
    WHERE ds.day_bucket >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY ds.device_id, ds.sensor_type
)
SELECT 
    device_id,
    sensor_type,
    analysis_days,

    -- Performance metrics
    ROUND(overall_avg_temperature, 2) as avg_temp,
    ROUND(temperature_variability, 2) as temp_variability,
    ROUND(daily_battery_degradation_rate, 4) as battery_degradation_per_day,
    ROUND(avg_connectivity, 1) as avg_signal_strength,
    ROUND(avg_data_completeness, 1) as data_completeness_percent,

    -- Status and alerts
    device_status,
    ARRAY_REMOVE(active_alerts, NULL) as alerts,

    -- Recommendations
    CASE device_status
        WHEN 'low_battery' THEN 'Schedule battery replacement or reduce sampling frequency'
        WHEN 'poor_connectivity' THEN 'Check network coverage or relocate device'
        WHEN 'unreliable_data' THEN 'Inspect device sensors and calibration'
        ELSE 'Device operating normally'
    END as recommendation

FROM device_health_analysis
ORDER BY 
    CASE device_status 
        WHEN 'low_battery' THEN 1
        WHEN 'poor_connectivity' THEN 2  
        WHEN 'unreliable_data' THEN 3
        ELSE 4
    END,
    overall_avg_temperature DESC;

-- Traditional approach problems:
-- 1. Inefficient storage - no automatic compression for time series patterns
-- 2. Manual partitioning overhead with limited automation
-- 3. Poor query performance for time range analytics
-- 4. Complex aggregation logic requiring multiple query stages
-- 5. Limited schema flexibility for diverse sensor data
-- 6. No built-in time series analytical functions
-- 7. Expensive index maintenance for time-based queries
-- 8. Poor compression ratios leading to high storage costs
-- 9. Complex retention policy implementation
-- 10. Limited support for high-frequency data ingestion

-- Attempt at high-frequency data insertion (poor performance)
INSERT INTO sensor_readings (
    device_id, sensor_type, location, timestamp,
    temperature, humidity, pressure, battery_level, signal_strength
)
VALUES 
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:00', 23.5, 45.2, 1013.2, 85, 75),
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:10', 23.6, 45.1, 1013.3, 85, 76),
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:20', 23.4, 45.3, 1013.1, 85, 74),
    ('device_002', 'environmental', 'warehouse_b', '2024-10-14 10:00:00', 24.1, 42.8, 1012.8, 90, 82),
    ('device_002', 'environmental', 'warehouse_b', '2024-10-14 10:00:10', 24.2, 42.9, 1012.9, 90, 83);
-- Individual inserts are extremely inefficient for high-frequency data

-- Range queries with limited optimization
SELECT 
    device_id,
    AVG(temperature) as avg_temp,
    COUNT(*) as reading_count
FROM sensor_readings
WHERE timestamp BETWEEN '2024-10-14 09:00:00' AND '2024-10-14 11:00:00'
    AND sensor_type = 'environmental'
GROUP BY device_id
ORDER BY avg_temp DESC;

-- Problems:
-- 1. Full table scan for time range queries despite indexing
-- 2. No automatic data compression reducing storage efficiency
-- 3. Poor aggregation performance for time-based analytics
-- 4. Limited analytical functions for time series analysis
-- 5. Complex retention and archival policy implementation
-- 6. No built-in support for irregular time intervals
-- 7. Inefficient handling of sparse data and missing measurements
-- 8. Manual optimization required for high ingestion rates
-- 9. Limited support for multi-metric time series analysis
-- 10. Complex downsampling and data summarization requirements

MongoDB provides sophisticated time series collection capabilities with automatic optimization:
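
Before the full manager class below, here is a minimal standalone sketch of the native call it builds on (database, collection, and field names are illustrative):

// Minimal native time series collection creation
const { MongoClient } = require('mongodb');

async function createSensorReadingsCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('iot_platform');

  await db.createCollection('sensor_readings', {
    timeseries: {
      timeField: 'timestamp', // BSON date present on every measurement
      metaField: 'metadata',  // groups measurements from the same device/sensor
      granularity: 'seconds'  // bucketing hint for high-frequency ingestion
    },
    expireAfterSeconds: 60 * 60 * 24 * 30 // automatically drop data older than 30 days
  });

  await client.close();
}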

// MongoDB Advanced Time Series Data Management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_time_series');

// Comprehensive MongoDB Time Series Manager
class AdvancedTimeSeriesManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Time series collection configuration
      defaultGranularity: config.defaultGranularity || 'seconds',
      defaultExpiration: config.defaultExpiration || 86400 * 30, // 30 days
      enableCompression: config.enableCompression !== false,

      // Bucketing and storage optimization
      bucketMaxSpanSeconds: config.bucketMaxSpanSeconds || 3600, // 1 hour
      bucketRoundingSeconds: config.bucketRoundingSeconds || 60, // 1 minute

      // Performance optimization
      enablePreAggregation: config.enablePreAggregation || false,
      aggregationLevels: config.aggregationLevels || ['hourly', 'daily'],
      enableAutomaticIndexing: config.enableAutomaticIndexing !== false,

      // Data retention and lifecycle
      enableAutomaticExpiration: config.enableAutomaticExpiration !== false,
      retentionPolicies: config.retentionPolicies || {
        raw: 7 * 24 * 3600,      // 7 days
        hourly: 90 * 24 * 3600,  // 90 days  
        daily: 365 * 24 * 3600   // 1 year
      },

      // Quality and monitoring
      enableDataQualityTracking: config.enableDataQualityTracking || false,
      enableAnomalyDetection: config.enableAnomalyDetection || false,
      alertingThresholds: config.alertingThresholds || {}
    };

    this.collections = new Map();
    this.aggregationPipelines = new Map();

    this.initializeTimeSeriesSystem();
  }

  async initializeTimeSeriesSystem() {
    console.log('Initializing advanced time series system...');

    try {
      // Setup time series collections with optimization
      await this.setupTimeSeriesCollections();

      // Configure automatic aggregation pipelines  
      if (this.config.enablePreAggregation) {
        await this.setupPreAggregationPipelines();
      }

      // Setup data quality monitoring
      if (this.config.enableDataQualityTracking) {
        await this.setupDataQualityMonitoring();
      }

      // Initialize retention policies
      if (this.config.enableAutomaticExpiration) {
        await this.setupRetentionPolicies();
      }

      console.log('Time series system initialized successfully');

    } catch (error) {
      console.error('Error initializing time series system:', error);
      throw error;
    }
  }

  async createTimeSeriesCollection(collectionName, options = {}) {
    console.log(`Creating optimized time series collection: ${collectionName}`);

    try {
      // MongoDB accepts either a granularity hint or explicit bucket boundaries,
      // but not both; bucketMaxSpanSeconds and bucketRoundingSeconds must be equal.
      const useCustomBuckets = options.bucketMaxSpanSeconds !== undefined;

      const timeSeriesOptions = {
        timeseries: {
          timeField: options.timeField || 'timestamp',
          metaField: options.metaField || 'metadata',

          ...(useCustomBuckets
            ? {
                // Advanced bucketing configuration
                bucketMaxSpanSeconds: options.bucketMaxSpanSeconds,
                bucketRoundingSeconds: options.bucketRoundingSeconds || options.bucketMaxSpanSeconds
              }
            : { granularity: options.granularity || this.config.defaultGranularity })
        },

        // Automatic expiration configuration
        expireAfterSeconds: options.expireAfterSeconds || this.config.defaultExpiration
      };

      // Optional storage engine tuning (WiredTiger compresses by default; zstd is opt-in here)
      if (options.enableCompression) {
        timeSeriesOptions.storageEngine = {
          wiredTiger: { configString: 'block_compressor=zstd' }
        };
      }

      // Create the time series collection
      const collection = await this.db.createCollection(collectionName, timeSeriesOptions);

      // Store collection reference for management
      this.collections.set(collectionName, {
        collection: collection,
        config: timeSeriesOptions,
        createdAt: new Date()
      });

      // Create optimized indexes for time series queries
      await this.createTimeSeriesIndexes(collection, options);

      console.log(`Time series collection '${collectionName}' created successfully`);

      return {
        success: true,
        collectionName: collectionName,
        configuration: timeSeriesOptions,
        indexesCreated: true
      };

    } catch (error) {
      console.error(`Error creating time series collection '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName
      };
    }
  }

  async createTimeSeriesIndexes(collection, options = {}) {
    console.log('Creating optimized indexes for time series collection...');

    try {
      const indexes = [
        // Compound index for time range queries with metadata
        {
          key: { 
            [`${options.metaField || 'metadata'}.device_id`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'device_time_idx',
          background: true
        },

        // Index for sensor type queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.sensor_type`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'sensor_time_idx',
          background: true
        },

        // Compound index for location-based queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.location`]: 1,
            [`${options.metaField || 'metadata'}.device_id`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'location_device_time_idx',
          background: true
        },

        // Index for data quality queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.data_quality`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'quality_time_idx',
          background: true,
          sparse: true
        }
      ];

      // Create all indexes
      await collection.createIndexes(indexes);

      console.log(`Created ${indexes.length} optimized indexes for time series collection`);

    } catch (error) {
      console.error('Error creating time series indexes:', error);
      throw error;
    }
  }

  async insertTimeSeriesData(collectionName, documents, options = {}) {
    console.log(`Inserting ${documents.length} time series documents into ${collectionName}...`);

    try {
      const collectionInfo = this.collections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection '${collectionName}' not found`);
      }

      const collection = collectionInfo.collection;

      // Prepare documents for time series insertion
      const preparedDocuments = documents.map(doc => this.prepareTimeSeriesDocument(doc, options));

      // Execute optimized bulk insertion
      const insertOptions = {
        ordered: options.ordered !== undefined ? options.ordered : false,
        writeConcern: options.writeConcern || { w: 'majority', j: true },
        ...options.insertOptions
      };

      const insertResult = await collection.insertMany(preparedDocuments, insertOptions);

      // Update data quality metrics if enabled
      if (this.config.enableDataQualityTracking) {
        await this.updateDataQualityMetrics(collectionName, preparedDocuments);
      }

      // Trigger anomaly detection if enabled
      if (this.config.enableAnomalyDetection) {
        await this.checkForAnomalies(collectionName, preparedDocuments);
      }

      return {
        success: true,
        collectionName: collectionName,
        documentsInserted: insertResult.insertedCount,
        insertedIds: insertResult.insertedIds,

        // Performance metrics
        averageDocumentSize: this.calculateAverageDocumentSize(preparedDocuments),
        compressionEnabled: Boolean(collectionInfo.config.storageEngine),

        // Data quality summary
        dataQualityScore: options.trackQuality ? this.calculateDataQualityScore(preparedDocuments) : null
      };

    } catch (error) {
      console.error(`Error inserting time series data into '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName
      };
    }
  }

  prepareTimeSeriesDocument(document, options = {}) {
    // Ensure proper time series document structure
    const prepared = {
      timestamp: document.timestamp || new Date(),

      // Organize metadata for optimal bucketing
      metadata: {
        device_id: document.device_id || document.metadata?.device_id,
        sensor_type: document.sensor_type || document.metadata?.sensor_type,
        location: document.location || document.metadata?.location,

        // Device-specific metadata
        device_model: document.device_model || document.metadata?.device_model,
        firmware_version: document.firmware_version || document.metadata?.firmware_version,

        // Data quality indicators
        data_quality: options.calculateQuality ? this.assessDataQuality(document) : undefined,

        // Additional metadata preservation
        ...document.metadata
      },

      // Measurements with proper data types
      measurements: {
        // Environmental measurements
        temperature: this.validateMeasurement(document.temperature, 'temperature'),
        humidity: this.validateMeasurement(document.humidity, 'humidity'),
        pressure: this.validateMeasurement(document.pressure, 'pressure'),

        // Device status measurements
        battery_level: this.validateMeasurement(document.battery_level, 'battery'),
        signal_strength: this.validateMeasurement(document.signal_strength, 'signal'),

        // Custom measurements
        ...this.extractCustomMeasurements(document)
      }
    };

    // Remove undefined values to optimize storage
    this.removeUndefinedValues(prepared);

    return prepared;
  }

  validateMeasurement(value, measurementType) {
    if (value === null || value === undefined) return undefined;

    // Type-specific validation and normalization
    const validationRules = {
      temperature: { min: -50, max: 100, precision: 2 },
      humidity: { min: 0, max: 100, precision: 1 },
      pressure: { min: 900, max: 1100, precision: 1 },
      battery: { min: 0, max: 100, precision: 0 },
      signal: { min: 0, max: 100, precision: 0 }
    };

    const rule = validationRules[measurementType];
    if (!rule) return value; // No validation rule, return as-is

    const numericValue = Number(value);
    if (isNaN(numericValue)) return undefined;

    // Apply bounds checking
    const boundedValue = Math.max(rule.min, Math.min(rule.max, numericValue));

    // Apply precision rounding
    return Number(boundedValue.toFixed(rule.precision));
  }

  async performTimeSeriesAggregation(collectionName, aggregationRequest) {
    console.log(`Performing time series aggregation on ${collectionName}...`);

    try {
      const collectionInfo = this.collections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection '${collectionName}' not found`);
      }

      const collection = collectionInfo.collection;

      // Build optimized aggregation pipeline
      const aggregationPipeline = this.buildTimeSeriesAggregationPipeline(aggregationRequest);

      // Execute aggregation with appropriate options
      const aggregationOptions = {
        allowDiskUse: true,
        maxTimeMS: aggregationRequest.maxTimeMS || 60000,
        hint: aggregationRequest.hint,
        comment: `time_series_aggregation_${Date.now()}`
      };

      const results = await collection.aggregate(aggregationPipeline, aggregationOptions).toArray();

      // Post-process results for enhanced analytics
      const processedResults = this.processAggregationResults(results, aggregationRequest);

      return {
        success: true,
        collectionName: collectionName,
        aggregationType: aggregationRequest.type,
        resultCount: results.length,

        // Aggregation results
        results: processedResults,

        // Execution metadata
        executionStats: {
          pipelineStages: aggregationPipeline.length,
          completedAt: new Date(),
          dataPointsAnalyzed: this.estimateDataPointsAnalyzed(aggregationRequest)
        }
      };

    } catch (error) {
      console.error(`Error performing time series aggregation on '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName,
        aggregationType: aggregationRequest.type
      };
    }
  }

  buildTimeSeriesAggregationPipeline(request) {
    const pipeline = [];

    // Time range filtering (essential first stage for performance)
    if (request.timeRange) {
      pipeline.push({
        $match: {
          timestamp: {
            $gte: new Date(request.timeRange.start),
            $lte: new Date(request.timeRange.end)
          }
        }
      });
    }

    // Metadata filtering
    if (request.filters) {
      const matchConditions = {};

      if (request.filters.device_ids) {
        matchConditions['metadata.device_id'] = { $in: request.filters.device_ids };
      }

      if (request.filters.sensor_types) {
        matchConditions['metadata.sensor_type'] = { $in: request.filters.sensor_types };
      }

      if (request.filters.locations) {
        matchConditions['metadata.location'] = { $in: request.filters.locations };
      }

      if (Object.keys(matchConditions).length > 0) {
        pipeline.push({ $match: matchConditions });
      }
    }

    // Time-based grouping and aggregation
    switch (request.type) {
      case 'time_bucket_aggregation':
        pipeline.push(...this.buildTimeBucketAggregation(request));
        break;
      case 'device_summary':
        pipeline.push(...this.buildDeviceSummaryAggregation(request));
        break;
      case 'trend_analysis':
        pipeline.push(...this.buildTrendAnalysisAggregation(request));
        break;
      case 'anomaly_detection':
        pipeline.push(...this.buildAnomalyDetectionAggregation(request));
        break;
      default:
        pipeline.push(...this.buildDefaultAggregation(request));
    }

    // Result limiting and sorting
    if (request.sort) {
      pipeline.push({ $sort: request.sort });
    }

    if (request.limit) {
      pipeline.push({ $limit: request.limit });
    }

    return pipeline;
  }

  buildTimeBucketAggregation(request) {
    const bucketSize = request.bucketSize || 'hour';
    const bucketFormat = this.getBucketDateFormat(bucketSize);

    return [
      {
        $group: {
          _id: {
            time_bucket: {
              $dateFromString: {
                dateString: {
                  $dateToString: {
                    date: '$timestamp',
                    format: bucketFormat
                  }
                }
              }
            },
            device_id: '$metadata.device_id',
            sensor_type: '$metadata.sensor_type'
          },

          // Statistical aggregations
          measurement_count: { $sum: 1 },

          // Temperature statistics
          avg_temperature: { $avg: '$measurements.temperature' },
          min_temperature: { $min: '$measurements.temperature' },
          max_temperature: { $max: '$measurements.temperature' },
          temp_variance: { $stdDevPop: '$measurements.temperature' },

          // Humidity statistics
          avg_humidity: { $avg: '$measurements.humidity' },
          min_humidity: { $min: '$measurements.humidity' },
          max_humidity: { $max: '$measurements.humidity' },

          // Pressure statistics
          avg_pressure: { $avg: '$measurements.pressure' },
          pressure_range: {
            $subtract: [
              { $max: '$measurements.pressure' },
              { $min: '$measurements.pressure' }
            ]
          },

          // Device health metrics
          avg_battery_level: { $avg: '$measurements.battery_level' },
          min_battery_level: { $min: '$measurements.battery_level' },
          avg_signal_strength: { $avg: '$measurements.signal_strength' },

          // Data quality metrics
          data_completeness: {
            $avg: {
              $cond: {
                if: {
                  $and: [
                    { $ne: ['$measurements.temperature', null] },
                    { $ne: ['$measurements.humidity', null] },
                    { $ne: ['$measurements.pressure', null] }
                  ]
                },
                then: 1,
                else: 0
              }
            }
          },

          // Time range within bucket
          earliest_reading: { $min: '$timestamp' },
          latest_reading: { $max: '$timestamp' }
        }
      },

      // Post-processing and enrichment
      {
        $addFields: {
          time_bucket: '$_id.time_bucket',
          device_id: '$_id.device_id',
          sensor_type: '$_id.sensor_type',

          // Calculate additional metrics
          temperature_stability: {
            $cond: {
              // avoid a $divide-by-zero error when the average temperature is 0
              if: { $and: [{ $gt: ['$temp_variance', 0] }, { $ne: ['$avg_temperature', 0] }] },
              then: { $divide: ['$temp_variance', '$avg_temperature'] },
              else: 0
            }
          },

          // Battery consumption rate (simplified)
          estimated_battery_consumption: {
            $subtract: [100, '$avg_battery_level']
          },

          // Data quality score
          data_quality_score: {
            $multiply: ['$data_completeness', 100]
          },

          // Bucket duration in minutes
          bucket_duration_minutes: {
            $divide: [
              { $subtract: ['$latest_reading', '$earliest_reading'] },
              60000
            ]
          }
        }
      },

      // Remove the grouped _id field
      {
        $project: { _id: 0 }
      }
    ];
  }

  buildDeviceSummaryAggregation(request) {
    return [
      {
        $group: {
          _id: '$metadata.device_id',

          // Basic metrics
          total_readings: { $sum: 1 },
          sensor_types: { $addToSet: '$metadata.sensor_type' },
          locations: { $addToSet: '$metadata.location' },

          // Time range
          first_reading: { $min: '$timestamp' },
          last_reading: { $max: '$timestamp' },

          // Environmental averages
          avg_temperature: { $avg: '$measurements.temperature' },
          avg_humidity: { $avg: '$measurements.humidity' },
          avg_pressure: { $avg: '$measurements.pressure' },

          // Environmental ranges
          temperature_range: {
            $subtract: [
              { $max: '$measurements.temperature' },
              { $min: '$measurements.temperature' }
            ]
          },
          humidity_range: {
            $subtract: [
              { $max: '$measurements.humidity' },
              { $min: '$measurements.humidity' }
            ]
          },

          // Device health metrics ($last follows pipeline order; add a $sort on timestamp
          // before this stage if a deterministic "latest" reading is required)
          current_battery_level: { $last: '$measurements.battery_level' },
          min_battery_level: { $min: '$measurements.battery_level' },
          avg_signal_strength: { $avg: '$measurements.signal_strength' },
          min_signal_strength: { $min: '$measurements.signal_strength' },

          // Data quality assessment
          complete_readings: {
            $sum: {
              $cond: {
                if: {
                  $and: [
                    { $ne: ['$measurements.temperature', null] },
                    { $ne: ['$measurements.humidity', null] },
                    { $ne: ['$measurements.pressure', null] }
                  ]
                },
                then: 1,
                else: 0
              }
            }
          }
        }
      },

      {
        $addFields: {
          device_id: '$_id',

          // Operational duration
          operational_duration_hours: {
            $divide: [
              { $subtract: ['$last_reading', '$first_reading'] },
              3600000
            ]
          },

          // Reading frequency
          avg_reading_interval_minutes: {
            $cond: {
              if: { $gt: ['$total_readings', 1] },
              then: {
                $divide: [
                  { $subtract: ['$last_reading', '$first_reading'] },
                  { $multiply: [{ $subtract: ['$total_readings', 1] }, 60000] }
                ]
              },
              else: null
            }
          },

          // Data completeness percentage
          data_completeness_percent: {
            $multiply: [
              { $divide: ['$complete_readings', '$total_readings'] },
              100
            ]
          },

          // Device health status
          device_health_status: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$current_battery_level', 15] },
                  then: 'critical_battery'
                },
                {
                  case: { $lt: ['$avg_signal_strength', 30] },
                  then: 'poor_connectivity'
                },
                {
                  case: {
                    $lt: [
                      { $divide: ['$complete_readings', '$total_readings'] },
                      0.8
                    ]
                  },
                  then: 'data_quality_issues'
                }
              ],
              default: 'healthy'
            }
          }
        }
      },

      {
        $project: { _id: 0 }
      }
    ];
  }

  getBucketDateFormat(bucketSize) {
    const formats = {
      'minute': '%Y-%m-%d %H:%M:00',
      'hour': '%Y-%m-%d %H:00:00',
      'day': '%Y-%m-%d 00:00:00',
      'week': '%Y-%U 00:00:00', // Year-Week; this string does not parse back through $dateFromString, so prefer $dateTrunc (MongoDB 5.0+) for true weekly buckets
      'month': '%Y-%m-01 00:00:00'
    };

    return formats[bucketSize] || formats['hour'];
  }

  async setupRetentionPolicies() {
    console.log('Setting up automatic data retention policies...');

    try {
      for (const [collectionName, collectionInfo] of this.collections.entries()) {
        // Time series collections manage expiration at the collection level rather than
        // through TTL indexes, so adjust expireAfterSeconds with collMod
        await this.db.command({
          collMod: collectionName,
          expireAfterSeconds: collectionInfo.config.expireAfterSeconds
        });

        console.log(`Retention policy configured for ${collectionName}: ${collectionInfo.config.expireAfterSeconds} seconds`);
      }

    } catch (error) {
      console.error('Error setting up retention policies:', error);
      throw error;
    }
  }

  async setupPreAggregationPipelines() {
    console.log('Setting up pre-aggregation pipelines...');

    // This would typically involve setting up MongoDB change streams
    // or scheduled aggregation jobs for common query patterns

    for (const level of this.config.aggregationLevels) {
      const pipelineName = `pre_aggregation_${level}`;

      // Store pipeline configuration for later execution
      this.aggregationPipelines.set(pipelineName, {
        level: level,
        schedule: this.getAggregationSchedule(level),
        pipeline: this.buildPreAggregationPipeline(level)
      });

      console.log(`Pre-aggregation pipeline configured for ${level} level`);
    }
  }

  // Utility methods for time series management

  calculateAverageDocumentSize(documents) {
    if (!documents || documents.length === 0) return 0;

    const totalSize = documents.reduce((size, doc) => {
      return size + JSON.stringify(doc).length;
    }, 0);

    return Math.round(totalSize / documents.length);
  }

  assessDataQuality(document) {
    let qualityScore = 0;
    let totalChecks = 0;

    // Check for presence of key measurements
    const measurements = ['temperature', 'humidity', 'pressure'];
    for (const measurement of measurements) {
      totalChecks++;
      if (document[measurement] !== null && document[measurement] !== undefined) {
        qualityScore++;
      }
    }

    // Check for reasonable value ranges
    if (document.temperature !== null && document.temperature >= -50 && document.temperature <= 100) {
      qualityScore += 0.5;
    }
    totalChecks += 0.5;

    if (document.humidity !== null && document.humidity >= 0 && document.humidity <= 100) {
      qualityScore += 0.5;
    }
    totalChecks += 0.5;

    return totalChecks > 0 ? qualityScore / totalChecks : 0;
  }

  extractCustomMeasurements(document) {
    const customMeasurements = {};
    const standardFields = ['timestamp', 'device_id', 'sensor_type', 'location', 'metadata', 'temperature', 'humidity', 'pressure', 'battery_level', 'signal_strength'];

    for (const [key, value] of Object.entries(document)) {
      if (!standardFields.includes(key) && typeof value === 'number') {
        customMeasurements[key] = value;
      }
    }

    return customMeasurements;
  }

  removeUndefinedValues(obj) {
    Object.keys(obj).forEach(key => {
      if (obj[key] === undefined) {
        delete obj[key];
      } else if (typeof obj[key] === 'object' && obj[key] !== null && !(obj[key] instanceof Date)) {
        this.removeUndefinedValues(obj[key]);

        // Remove empty objects (Date values are excluded above since they have no enumerable keys)
        if (Object.keys(obj[key]).length === 0) {
          delete obj[key];
        }
      }
    });
  }

  processAggregationResults(results, request) {
    // Add additional context and calculations to aggregation results
    return results.map(result => ({
      ...result,

      // Add computed fields based on aggregation type
      aggregation_metadata: {
        request_type: request.type,
        generated_at: new Date(),
        bucket_size: request.bucketSize,
        time_range: request.timeRange
      }
    }));
  }

  estimateDataPointsAnalyzed(request) {
    // Simplified estimation based on time range and expected frequency
    if (!request.timeRange) return 'unknown';

    const timeRangeMs = new Date(request.timeRange.end) - new Date(request.timeRange.start);
    const assumedFrequencyMs = 60000; // Assume 1 minute intervals

    return Math.round(timeRangeMs / assumedFrequencyMs);
  }

  getAggregationSchedule(level) {
    // Six-field cron expressions: second minute hour day-of-month month day-of-week
    const schedules = {
      'hourly': '0 0 * * * *',       // At minute 0 of every hour
      'daily': '0 0 0 * * *',        // Every day at midnight
      'weekly': '0 0 0 * * 0',       // Every Sunday at midnight
      'monthly': '0 0 0 1 * *'       // On the 1st of every month at midnight
    };

    return schedules[level] || schedules['daily'];
  }

  buildPreAggregationPipeline(level) {
    // Simplified pre-aggregation pipeline
    // In production, this would be much more sophisticated
    return [
      {
        $match: {
          timestamp: {
            $gte: new Date(Date.now() - this.getLevelTimeRange(level))
          }
        }
      },
      {
        $group: {
          _id: {
            device_id: '$metadata.device_id',
            time_bucket: this.getTimeBucketExpression(level)
          },
          avg_temperature: { $avg: '$measurements.temperature' },
          avg_humidity: { $avg: '$measurements.humidity' },
          count: { $sum: 1 }
        }
      }
    ];
  }

  getLevelTimeRange(level) {
    const ranges = {
      'hourly': 24 * 60 * 60 * 1000,      // 1 day
      'daily': 30 * 24 * 60 * 60 * 1000,  // 30 days
      'weekly': 12 * 7 * 24 * 60 * 60 * 1000, // 12 weeks
      'monthly': 12 * 30 * 24 * 60 * 60 * 1000 // 12 months
    };

    return ranges[level] || ranges['daily'];
  }

  getTimeBucketExpression(level) {
    const expressions = {
      'hourly': {
        $dateFromString: {
          dateString: {
            $dateToString: {
              date: '$timestamp',
              format: '%Y-%m-%d %H:00:00'
            }
          }
        }
      },
      'daily': {
        $dateFromString: {
          dateString: {
            $dateToString: {
              date: '$timestamp',
              format: '%Y-%m-%d 00:00:00'
            }
          }
        }
      }
    };

    return expressions[level] || expressions['hourly'];
  }
}

// Benefits of MongoDB Advanced Time Series Collections:
// - Purpose-built storage optimization with automatic compression
// - Intelligent bucketing for optimal query performance  
// - Built-in retention policies and automatic data expiration
// - Advanced indexing strategies optimized for temporal queries
// - Schema flexibility for diverse sensor and measurement data
// - Native aggregation capabilities for time series analytics
// - High ingestion performance for IoT and monitoring workloads
// - Built-in support for metadata organization and filtering
// - SQL-compatible time series operations through QueryLeaf integration

module.exports = {
  AdvancedTimeSeriesManager
};
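
A minimal usage sketch for the manager above, assuming a reachable MongoDB deployment and that the helper methods referenced but not shown here (for example setupTimeSeriesCollections) are implemented; the connection string, database name, module path, and configuration values are placeholders:

// Hypothetical usage of AdvancedTimeSeriesManager (paths and values are illustrative)
const { MongoClient } = require('mongodb');
const { AdvancedTimeSeriesManager } = require('./advanced-time-series-manager');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const manager = new AdvancedTimeSeriesManager(client.db('iot_platform'), {
    defaultGranularity: 'seconds',
    defaultExpiration: 30 * 24 * 60 * 60,   // 30 days
    enablePreAggregation: false,
    enableDataQualityTracking: false,
    enableAnomalyDetection: false
  });

  // Create a collection and ingest a single reading; timestamp defaults to "now"
  await manager.createTimeSeriesCollection('sensor_data', { granularity: 'seconds' });

  await manager.insertTimeSeriesData('sensor_data', [
    { device_id: 'device-001', sensor_type: 'environmental', temperature: 21.7, humidity: 48.2 }
  ]);

  await client.close();
}

main().catch(console.error);

Because the constructor kicks off initialization asynchronously, production code would typically expose and await an explicit initialization step before accepting writes.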

Understanding MongoDB Time Series Architecture

Advanced Temporal Data Management and Storage Optimization Strategies

Implement sophisticated time series patterns for production MongoDB deployments:

// Production-ready MongoDB time series with enterprise-grade optimization and monitoring
class ProductionTimeSeriesManager extends AdvancedTimeSeriesManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableDistributedCollection: true,
      enableRealTimeAggregation: true,
      enablePredictiveAnalytics: true,
      enableAutomaticScaling: true,
      enableComplianceTracking: true,
      enableAdvancedAlerting: true
    };

    this.setupProductionOptimizations();
    this.initializeRealTimeProcessing();
    this.setupPredictiveAnalytics();
  }

  async implementDistributedTimeSeriesProcessing(collections, distributionStrategy) {
    console.log('Implementing distributed time series processing across multiple collections...');

    const distributedStrategy = {
      // Temporal sharding strategies
      temporalSharding: {
        enableTimeBasedSharding: true,
        shardingGranularity: 'monthly',
        automaticShardRotation: true,
        optimizeForQueryPatterns: true
      },

      // Data lifecycle management
      lifecycleManagement: {
        hotDataRetention: '7d',
        warmDataRetention: '90d', 
        coldDataArchival: '1y',
        automaticTiering: true
      },

      // Performance optimization
      performanceOptimization: {
        compressionOptimization: true,
        indexingOptimization: true,
        bucketingOptimization: true,
        aggregationOptimization: true
      }
    };

    return await this.deployDistributedTimeSeriesArchitecture(collections, distributedStrategy);
  }

  async setupAdvancedTimeSeriesAnalytics() {
    console.log('Setting up advanced time series analytics and machine learning capabilities...');

    const analyticsCapabilities = {
      // Real-time analytics
      realTimeAnalytics: {
        streamingAggregation: true,
        anomalyDetection: true,
        trendAnalysis: true,
        alertingPipelines: true
      },

      // Predictive analytics
      predictiveAnalytics: {
        forecastingModels: true,
        patternRecognition: true,
        seasonalityDetection: true,
        capacityPlanning: true
      },

      // Advanced reporting
      reportingCapabilities: {
        automaticDashboards: true,
        customMetrics: true,
        correlationAnalysis: true,
        performanceReporting: true
      }
    };

    return await this.deployAdvancedAnalytics(analyticsCapabilities);
  }
}

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB time series operations and analytics:

-- QueryLeaf advanced time series operations with SQL-familiar syntax for MongoDB

-- Create optimized time series collection with advanced configuration
CREATE COLLECTION sensor_data AS TIME_SERIES (
  time_field = 'timestamp',
  meta_field = 'metadata',
  granularity = 'seconds',

  -- Storage optimization
  bucket_max_span_seconds = 3600,
  bucket_rounding_seconds = 60,
  expire_after_seconds = 2592000,  -- 30 days

  -- Compression settings
  enable_compression = true,
  compression_algorithm = 'zstd',

  -- Performance optimization
  enable_automatic_indexing = true,
  optimize_for_ingestion = true,
  optimize_for_analytics = true
);

-- Advanced time series data insertion with automatic optimization
INSERT INTO sensor_data (
  timestamp,
  metadata.device_id,
  metadata.sensor_type,
  metadata.location,
  metadata.data_quality,
  measurements.temperature,
  measurements.humidity,
  measurements.pressure,
  measurements.battery_level,
  measurements.signal_strength
)
SELECT 
  -- Time series specific timestamp handling
  CASE 
    WHEN source_timestamp IS NOT NULL THEN source_timestamp
    ELSE CURRENT_TIMESTAMP
  END as timestamp,

  -- Metadata organization for optimal bucketing
  device_identifier as "metadata.device_id",
  sensor_classification as "metadata.sensor_type", 
  installation_location as "metadata.location",

  -- Data quality assessment
  CASE 
    WHEN temp_reading IS NOT NULL AND humidity_reading IS NOT NULL AND pressure_reading IS NOT NULL THEN 'complete'
    WHEN temp_reading IS NOT NULL OR humidity_reading IS NOT NULL THEN 'partial'
    ELSE 'incomplete'
  END as "metadata.data_quality",

  -- Validated measurements
  CASE 
    WHEN temp_reading BETWEEN -50 AND 100 THEN ROUND(temp_reading, 2)
    ELSE NULL
  END as "measurements.temperature",

  CASE 
    WHEN humidity_reading BETWEEN 0 AND 100 THEN ROUND(humidity_reading, 1)
    ELSE NULL  
  END as "measurements.humidity",

  CASE 
    WHEN pressure_reading BETWEEN 900 AND 1100 THEN ROUND(pressure_reading, 1)
    ELSE NULL
  END as "measurements.pressure",

  -- Device health measurements
  GREATEST(0, LEAST(100, battery_percentage)) as "measurements.battery_level",
  GREATEST(0, LEAST(100, connectivity_strength)) as "measurements.signal_strength"

FROM staging_sensor_readings
WHERE ingestion_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND device_identifier IS NOT NULL
  AND source_timestamp IS NOT NULL

-- Time series bulk insert configuration
WITH (
  batch_size = 5000,
  ordered_operations = false,
  write_concern = 'majority',
  enable_compression = true,
  bypass_document_validation = false
);

-- Advanced time-bucket aggregation with comprehensive analytics
WITH time_bucket_analysis AS (
  SELECT 
    -- Time bucketing with flexible granularity
    DATE_TRUNC('hour', timestamp) as time_bucket,
    metadata.device_id,
    metadata.sensor_type,
    metadata.location,

    -- Volume metrics
    COUNT(*) as reading_count,
    COUNT(measurements.temperature) as temp_reading_count,
    COUNT(measurements.humidity) as humidity_reading_count,
    COUNT(measurements.pressure) as pressure_reading_count,

    -- Temperature analytics
    AVG(measurements.temperature) as avg_temperature,
    MIN(measurements.temperature) as min_temperature,
    MAX(measurements.temperature) as max_temperature,
    STDDEV_POP(measurements.temperature) as temp_stddev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measurements.temperature) as temp_median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY measurements.temperature) as temp_p95,

    -- Humidity analytics
    AVG(measurements.humidity) as avg_humidity,
    MIN(measurements.humidity) as min_humidity,
    MAX(measurements.humidity) as max_humidity,
    STDDEV_POP(measurements.humidity) as humidity_stddev,

    -- Pressure analytics  
    AVG(measurements.pressure) as avg_pressure,
    MIN(measurements.pressure) as min_pressure,
    MAX(measurements.pressure) as max_pressure,
    (MAX(measurements.pressure) - MIN(measurements.pressure)) as pressure_range,

    -- Device health analytics
    AVG(measurements.battery_level) as avg_battery,
    MIN(measurements.battery_level) as min_battery,
    AVG(measurements.signal_strength) as avg_signal,
    MIN(measurements.signal_strength) as min_signal,

    -- Data quality analytics
    (COUNT(measurements.temperature) * 100.0 / COUNT(*)) as temp_completeness_percent,
    (COUNT(measurements.humidity) * 100.0 / COUNT(*)) as humidity_completeness_percent,
    (COUNT(measurements.pressure) * 100.0 / COUNT(*)) as pressure_completeness_percent,

    -- Time range within bucket
    MIN(timestamp) as bucket_start_time,
    MAX(timestamp) as bucket_end_time,

    -- Advanced statistical measures
    (MAX(measurements.temperature) - MIN(measurements.temperature)) as temp_range,
    CASE 
      WHEN AVG(measurements.temperature) > 0 THEN 
        STDDEV_POP(measurements.temperature) / AVG(measurements.temperature) 
      ELSE NULL
    END as temp_coefficient_variation

  FROM sensor_data
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND timestamp < CURRENT_TIMESTAMP
    AND metadata.data_quality IN ('complete', 'partial')
  GROUP BY 
    DATE_TRUNC('hour', timestamp),
    metadata.device_id,
    metadata.sensor_type,
    metadata.location
),

anomaly_detection AS (
  SELECT 
    tba.*,

    -- Temperature anomaly detection
    CASE 
      WHEN temp_stddev > 0 THEN
        ABS(avg_temperature - LAG(avg_temperature) OVER (
          PARTITION BY device_id 
          ORDER BY time_bucket
        )) / temp_stddev
      ELSE 0
    END as temp_anomaly_score,

    -- Humidity anomaly detection  
    CASE 
      WHEN humidity_stddev > 0 THEN
        ABS(avg_humidity - LAG(avg_humidity) OVER (
          PARTITION BY device_id 
          ORDER BY time_bucket
        )) / humidity_stddev
      ELSE 0
    END as humidity_anomaly_score,

    -- Battery degradation analysis
    LAG(avg_battery) OVER (
      PARTITION BY device_id 
      ORDER BY time_bucket
    ) - avg_battery as battery_degradation,

    -- Signal strength trend
    avg_signal - LAG(avg_signal) OVER (
      PARTITION BY device_id 
      ORDER BY time_bucket
    ) as signal_trend,

    -- Data quality trend
    (temp_completeness_percent + humidity_completeness_percent + pressure_completeness_percent) / 3.0 as overall_completeness,

    -- Bucket characteristics
    (EXTRACT(EPOCH FROM (bucket_end_time - bucket_start_time)) / 60.0) as bucket_duration_minutes

  FROM time_bucket_analysis tba
),

device_health_assessment AS (
  SELECT 
    ad.device_id,
    ad.sensor_type,
    ad.location,
    COUNT(*) as analysis_periods,

    -- Environmental stability analysis
    AVG(ad.avg_temperature) as device_avg_temperature,
    STDDEV(ad.avg_temperature) as temperature_stability,
    AVG(ad.temp_coefficient_variation) as avg_temp_variability,

    -- Environmental range analysis
    MIN(ad.min_temperature) as absolute_min_temperature,
    MAX(ad.max_temperature) as absolute_max_temperature,
    AVG(ad.temp_range) as avg_hourly_temp_range,

    -- Humidity environment analysis
    AVG(ad.avg_humidity) as device_avg_humidity,
    STDDEV(ad.avg_humidity) as humidity_stability,
    AVG(ad.pressure_range) as avg_pressure_variation,

    -- Device health metrics
    MIN(ad.min_battery) as lowest_battery_level,
    AVG(ad.avg_battery) as avg_battery_level,
    MAX(ad.battery_degradation) as max_battery_drop_per_hour,

    -- Connectivity analysis
    AVG(ad.avg_signal) as avg_connectivity,
    MIN(ad.min_signal) as worst_connectivity,
    STDDEV(ad.avg_signal) as connectivity_stability,

    -- Data reliability metrics
    AVG(ad.overall_completeness) as avg_data_completeness,
    MIN(ad.overall_completeness) as worst_data_completeness,

    -- Anomaly frequency
    COUNT(*) FILTER (WHERE ad.temp_anomaly_score > 2) as temp_anomaly_count,
    COUNT(*) FILTER (WHERE ad.humidity_anomaly_score > 2) as humidity_anomaly_count,
    AVG(ad.temp_anomaly_score) as avg_temp_anomaly_score,

    -- Recent trends (last 6 hours vs previous)
    AVG(CASE WHEN ad.time_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_battery ELSE NULL END) - 
    AVG(CASE WHEN ad.time_bucket < CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_battery ELSE NULL END) as recent_battery_trend,

    AVG(CASE WHEN ad.time_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_signal ELSE NULL END) - 
    AVG(CASE WHEN ad.time_bucket < CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_signal ELSE NULL END) as recent_signal_trend

  FROM anomaly_detection ad
  GROUP BY ad.device_id, ad.sensor_type, ad.location
)

SELECT 
  dha.device_id,
  dha.sensor_type,
  dha.location,
  dha.analysis_periods,

  -- Environmental summary
  ROUND(dha.device_avg_temperature, 2) as avg_temperature,
  ROUND(dha.temperature_stability, 3) as temp_stability_stddev,
  ROUND(dha.avg_temp_variability, 3) as avg_temp_coefficient_variation,
  dha.absolute_min_temperature,
  dha.absolute_max_temperature,

  -- Environmental classification
  CASE 
    WHEN dha.temperature_stability > 5 THEN 'highly_variable'
    WHEN dha.temperature_stability > 2 THEN 'moderately_variable'  
    WHEN dha.temperature_stability > 1 THEN 'stable'
    ELSE 'very_stable'
  END as temperature_environment_classification,

  -- Device health summary
  ROUND(dha.avg_battery_level, 1) as avg_battery_level,
  dha.lowest_battery_level,
  ROUND(dha.max_battery_drop_per_hour, 2) as max_hourly_battery_consumption,
  ROUND(dha.avg_connectivity, 1) as avg_signal_strength,

  -- Device status assessment
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 'critical_battery'
    WHEN dha.avg_battery_level < 25 THEN 'low_battery'
    WHEN dha.avg_connectivity < 30 THEN 'connectivity_issues'
    WHEN dha.avg_data_completeness < 80 THEN 'data_quality_issues'
    WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.2 THEN 'environmental_anomalies'
    ELSE 'healthy'
  END as device_status,

  -- Data quality assessment
  ROUND(dha.avg_data_completeness, 1) as avg_data_completeness_percent,
  dha.worst_data_completeness,

  -- Anomaly summary
  dha.temp_anomaly_count,
  dha.humidity_anomaly_count,
  ROUND(dha.avg_temp_anomaly_score, 3) as avg_temp_anomaly_score,

  -- Recent trends
  ROUND(dha.recent_battery_trend, 2) as recent_battery_change,
  ROUND(dha.recent_signal_trend, 1) as recent_signal_change,

  -- Trend classification
  CASE 
    WHEN dha.recent_battery_trend < -2 THEN 'battery_degrading_fast'
    WHEN dha.recent_battery_trend < -0.5 THEN 'battery_degrading'
    WHEN dha.recent_battery_trend > 1 THEN 'battery_improving'  -- Could indicate replacement
    ELSE 'battery_stable'
  END as battery_trend_classification,

  CASE 
    WHEN dha.recent_signal_trend < -5 THEN 'connectivity_degrading'
    WHEN dha.recent_signal_trend > 5 THEN 'connectivity_improving'
    ELSE 'connectivity_stable'
  END as connectivity_trend_classification,

  -- Alert generation (unmatched conditions yield NULLs, which are stripped)
  ARRAY_REMOVE(ARRAY[
    CASE WHEN dha.lowest_battery_level < 10 THEN 'CRITICAL: Battery critically low' END,
    CASE WHEN dha.avg_connectivity < 25 THEN 'WARNING: Poor connectivity detected' END,
    CASE WHEN dha.avg_data_completeness < 70 THEN 'WARNING: Low data quality' END,
    CASE WHEN dha.recent_battery_trend < -3 THEN 'ALERT: Rapid battery degradation' END,
    CASE WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.3 THEN 'ALERT: Frequent temperature anomalies' END
  ]::TEXT[], NULL) as active_alerts,

  -- Recommendations
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 'Schedule immediate battery replacement'
    WHEN dha.avg_connectivity < 30 THEN 'Check network coverage and device positioning'  
    WHEN dha.avg_data_completeness < 80 THEN 'Inspect sensors and perform calibration'
    WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.2 THEN 'Investigate environmental factors'
    ELSE 'Device operating within normal parameters'
  END as maintenance_recommendation

FROM device_health_assessment dha
ORDER BY 
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 1
    WHEN dha.avg_connectivity < 30 THEN 2
    WHEN dha.avg_data_completeness < 80 THEN 3
    ELSE 4
  END,
  dha.device_id;

-- Advanced time series trend analysis with seasonality detection
WITH daily_aggregates AS (
  SELECT 
    DATE_TRUNC('day', timestamp) as date_bucket,
    metadata.location,
    metadata.sensor_type,

    -- Daily environmental summaries
    AVG(measurements.temperature) as daily_avg_temp,
    MIN(measurements.temperature) as daily_min_temp,
    MAX(measurements.temperature) as daily_max_temp,
    AVG(measurements.humidity) as daily_avg_humidity,
    AVG(measurements.pressure) as daily_avg_pressure,

    -- Data volume and quality
    COUNT(*) as daily_reading_count,
    (COUNT(measurements.temperature) * 100.0 / COUNT(*)) as daily_completeness

  FROM sensor_data
  WHERE timestamp >= CURRENT_DATE - INTERVAL '90 days'
    AND timestamp < CURRENT_DATE
    AND metadata.location IS NOT NULL
  GROUP BY DATE_TRUNC('day', timestamp), metadata.location, metadata.sensor_type
),

weekly_patterns AS (
  SELECT 
    da.*,
    EXTRACT(DOW FROM da.date_bucket) as day_of_week,  -- 0=Sunday, 6=Saturday
    EXTRACT(WEEK FROM da.date_bucket) as week_number,

    -- Moving averages for trend analysis
    AVG(da.daily_avg_temp) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as temp_7day_avg,

    AVG(da.daily_avg_temp) OVER (
      PARTITION BY da.location, da.sensor_type  
      ORDER BY da.date_bucket
      ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) as temp_30day_avg,

    -- Trend detection
    da.daily_avg_temp - LAG(da.daily_avg_temp, 7) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
    ) as week_over_week_temp_change,

    -- Seasonality indicators
    LAG(da.daily_avg_temp, 7) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
    ) as same_day_last_week_temp,

    LAG(da.daily_avg_temp, 30) OVER (
      PARTITION BY da.location, da.sensor_type  
      ORDER BY da.date_bucket
    ) as same_day_last_month_temp

  FROM daily_aggregates da
),

trend_analysis AS (
  SELECT 
    wp.location,
    wp.sensor_type,
    COUNT(*) as analysis_days,

    -- Overall trend analysis
    AVG(wp.daily_avg_temp) as overall_avg_temp,
    STDDEV(wp.daily_avg_temp) as temp_variability,
    MIN(wp.daily_min_temp) as absolute_min_temp,
    MAX(wp.daily_max_temp) as absolute_max_temp,

    -- Seasonal pattern analysis  
    AVG(CASE WHEN wp.day_of_week IN (0,6) THEN wp.daily_avg_temp END) as weekend_avg_temp,
    AVG(CASE WHEN wp.day_of_week BETWEEN 1 AND 5 THEN wp.daily_avg_temp END) as weekday_avg_temp,

    -- Weekly cyclical patterns
    AVG(CASE WHEN wp.day_of_week = 0 THEN wp.daily_avg_temp END) as sunday_avg,
    AVG(CASE WHEN wp.day_of_week = 1 THEN wp.daily_avg_temp END) as monday_avg,
    AVG(CASE WHEN wp.day_of_week = 2 THEN wp.daily_avg_temp END) as tuesday_avg,
    AVG(CASE WHEN wp.day_of_week = 3 THEN wp.daily_avg_temp END) as wednesday_avg,
    AVG(CASE WHEN wp.day_of_week = 4 THEN wp.daily_avg_temp END) as thursday_avg,
    AVG(CASE WHEN wp.day_of_week = 5 THEN wp.daily_avg_temp END) as friday_avg,
    AVG(CASE WHEN wp.day_of_week = 6 THEN wp.daily_avg_temp END) as saturday_avg,

    -- Trend strength analysis
    AVG(wp.week_over_week_temp_change) as avg_weekly_change,
    STDDEV(wp.week_over_week_temp_change) as weekly_change_variability,

    -- Linear trend approximation (simplified)
    (MAX(wp.temp_30day_avg) - MIN(wp.temp_30day_avg)) / 
    NULLIF(EXTRACT(DAY FROM MAX(wp.date_bucket) - MIN(wp.date_bucket)), 0) as daily_trend_rate,

    -- Data quality trend
    AVG(wp.daily_completeness) as avg_data_completeness,
    MIN(wp.daily_completeness) as worst_daily_completeness

  FROM weekly_patterns wp
  WHERE wp.date_bucket >= CURRENT_DATE - INTERVAL '60 days'  -- Focus on last 60 days for trends
  GROUP BY wp.location, wp.sensor_type
)

SELECT 
  ta.location,
  ta.sensor_type,
  ta.analysis_days,

  -- Environmental summary
  ROUND(ta.overall_avg_temp, 2) as avg_temperature,
  ROUND(ta.temp_variability, 2) as temperature_variability,
  ta.absolute_min_temp,
  ta.absolute_max_temp,

  -- Seasonal patterns
  ROUND(COALESCE(ta.weekday_avg_temp, 0), 2) as weekday_avg_temp,
  ROUND(COALESCE(ta.weekend_avg_temp, 0), 2) as weekend_avg_temp,
  ROUND(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0), 2) as weekend_weekday_diff,

  -- Weekly pattern analysis (day of week variations)
  JSON_OBJECT(
    'sunday', ROUND(COALESCE(ta.sunday_avg, 0), 2),
    'monday', ROUND(COALESCE(ta.monday_avg, 0), 2),
    'tuesday', ROUND(COALESCE(ta.tuesday_avg, 0), 2),
    'wednesday', ROUND(COALESCE(ta.wednesday_avg, 0), 2),
    'thursday', ROUND(COALESCE(ta.thursday_avg, 0), 2),
    'friday', ROUND(COALESCE(ta.friday_avg, 0), 2),
    'saturday', ROUND(COALESCE(ta.saturday_avg, 0), 2)
  ) as daily_temperature_pattern,

  -- Trend analysis
  ROUND(ta.avg_weekly_change, 3) as avg_weekly_temperature_change,
  ROUND(ta.daily_trend_rate * 30, 3) as monthly_trend_rate,

  -- Trend classification
  CASE 
    WHEN ta.daily_trend_rate > 0.1 THEN 'warming_trend'
    WHEN ta.daily_trend_rate < -0.1 THEN 'cooling_trend'
    ELSE 'stable'
  END as temperature_trend_classification,

  -- Seasonal pattern classification
  CASE 
    WHEN ABS(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0)) > 2 THEN 'strong_weekly_pattern'
    WHEN ABS(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0)) > 1 THEN 'moderate_weekly_pattern'
    ELSE 'minimal_weekly_pattern'
  END as weekly_seasonality,

  -- Variability assessment
  CASE 
    WHEN ta.temp_variability > 5 THEN 'highly_variable'
    WHEN ta.temp_variability > 2 THEN 'moderately_variable'
    ELSE 'stable_environment'
  END as environment_stability,

  -- Data quality assessment
  ROUND(ta.avg_data_completeness, 1) as avg_data_completeness_percent,

  -- Insights and recommendations
  CASE 
    WHEN ABS(ta.daily_trend_rate) > 0.1 THEN 'Monitor for environmental changes'
    WHEN ta.temp_variability > 5 THEN 'High variability - check for external factors'
    WHEN ta.avg_data_completeness < 90 THEN 'Improve sensor reliability'
    ELSE 'Environment stable, monitoring nominal'
  END as analysis_recommendation

FROM trend_analysis ta
WHERE ta.analysis_days >= 30  -- Require at least 30 days for meaningful trend analysis
ORDER BY 
  ABS(ta.daily_trend_rate) DESC,  -- Show locations with strongest trends first
  ta.temp_variability DESC,
  ta.location, 
  ta.sensor_type;

-- Time series data retention and archival with automated lifecycle management
WITH retention_analysis AS (
  SELECT 
    -- Analyze data age distribution
    DATE_TRUNC('day', timestamp) as date_bucket,
    metadata.location,
    COUNT(*) as daily_record_count,
    AVG(measurements.temperature) as daily_avg_temp,

    -- Data age categories (based on the grouped day bucket so the expression is valid under GROUP BY)
    CASE 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'hot_data'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'warm_data' 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'cold_data'
      ELSE 'archive_candidate'
    END as data_tier,

    -- Storage impact estimation
    COUNT(*) * 500 as estimated_storage_bytes,  -- Assume ~500 bytes per document

    -- Access pattern analysis (simplified)
    CURRENT_DATE - DATE_TRUNC('day', timestamp)::DATE as days_old

  FROM sensor_data
  WHERE timestamp >= CURRENT_DATE - INTERVAL '180 days'  -- Analyze last 6 months
  GROUP BY DATE_TRUNC('day', timestamp), metadata.location
),

archival_candidates AS (
  SELECT 
    ra.location,
    ra.data_tier,
    COUNT(*) as total_days,
    SUM(ra.daily_record_count) as total_records,
    SUM(ra.estimated_storage_bytes) as total_estimated_bytes,
    MIN(ra.days_old) as newest_data_age_days,
    MAX(ra.days_old) as oldest_data_age_days,
    AVG(ra.daily_avg_temp) as avg_temperature_for_tier

  FROM retention_analysis ra
  GROUP BY ra.location, ra.data_tier
),

archival_recommendations AS (
  SELECT 
    ac.location,
    ac.data_tier,
    ac.total_records,
    ROUND(ac.total_estimated_bytes / 1024.0 / 1024.0, 2) as estimated_storage_mb,
    ac.oldest_data_age_days,

    -- Archival recommendations
    CASE ac.data_tier
      WHEN 'archive_candidate' THEN 'ARCHIVE: Move to cold storage or delete'
      WHEN 'cold_data' THEN 'CONSIDER: Compress or move to slower storage'
      WHEN 'warm_data' THEN 'OPTIMIZE: Apply compression if not already done'
      ELSE 'KEEP: Hot data for active queries'
    END as retention_recommendation,

    -- Priority scoring for archival actions
    CASE 
      WHEN ac.data_tier = 'archive_candidate' AND ac.total_estimated_bytes > 100*1024*1024 THEN 'high_priority'
      WHEN ac.data_tier = 'cold_data' AND ac.total_estimated_bytes > 50*1024*1024 THEN 'medium_priority'
      WHEN ac.data_tier IN ('archive_candidate', 'cold_data') THEN 'low_priority'
      ELSE 'no_action_needed'
    END as archival_priority,

    -- Estimated storage savings
    CASE ac.data_tier
      WHEN 'archive_candidate' THEN ac.total_estimated_bytes * 0.9  -- 90% savings from deletion
      WHEN 'cold_data' THEN ac.total_estimated_bytes * 0.6  -- 60% savings from compression
      ELSE 0
    END as potential_storage_savings_bytes

  FROM archival_candidates ac
)

SELECT 
  ar.location,
  ar.data_tier,
  ar.total_records,
  ar.estimated_storage_mb,
  ar.oldest_data_age_days,
  ar.retention_recommendation,
  ar.archival_priority,
  ROUND(ar.potential_storage_savings_bytes / 1024.0 / 1024.0, 2) as potential_savings_mb,

  -- Specific actions
  CASE ar.data_tier
    WHEN 'archive_candidate' THEN 
      FORMAT('DELETE FROM sensor_data WHERE timestamp < CURRENT_DATE - INTERVAL ''%s days'' AND metadata.location = ''%s''', 
             ar.oldest_data_age_days, ar.location)
    WHEN 'cold_data' THEN
      FORMAT('Consider enabling compression for location: %s', ar.location)
    ELSE 'No action required'
  END as suggested_action

FROM archival_recommendations ar
WHERE ar.archival_priority != 'no_action_needed'
ORDER BY 
  CASE ar.archival_priority
    WHEN 'high_priority' THEN 1
    WHEN 'medium_priority' THEN 2  
    WHEN 'low_priority' THEN 3
    ELSE 4
  END,
  ar.estimated_storage_mb DESC;

-- QueryLeaf provides comprehensive MongoDB time series capabilities:
-- 1. Purpose-built time series collections with automatic optimization
-- 2. Advanced temporal aggregation with statistical analysis
-- 3. Intelligent bucketing and compression for storage efficiency
-- 4. Built-in retention policies and lifecycle management
-- 5. Real-time analytics and anomaly detection
-- 6. Comprehensive trend analysis and seasonality detection
-- 7. SQL-familiar syntax for complex time series operations
-- 8. Automatic indexing and query optimization
-- 9. Production-ready time series analytics and reporting
-- 10. Integration with MongoDB's native time series optimizations

Best Practices for Production Time Series Applications

Storage Optimization and Performance Strategy

Essential principles for effective MongoDB time series application deployment:

  1. Collection Design: Configure appropriate time series granularity and bucketing strategies based on data ingestion patterns (see the sketch after this list)
  2. Index Strategy: Create compound indexes optimizing for common query patterns combining time ranges with metadata filters
  3. Compression Management: Enable appropriate compression algorithms to optimize storage efficiency for temporal data
  4. Retention Policies: Implement automatic data expiration and archival strategies aligned with business requirements
  5. Aggregation Optimization: Design aggregation pipelines that leverage time series collection optimizations
  6. Monitoring Integration: Track collection performance, storage utilization, and query patterns for continuous optimization
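
The following sketch ties the collection design, indexing, and retention principles above to concrete driver calls; the database name, collection name, granularity, and retention windows are illustrative assumptions rather than recommendations:

// Hypothetical provisioning script using the Node.js driver (names and values are placeholders)
const { MongoClient } = require('mongodb');

async function provisionMetricsCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('monitoring');

  // Collection design: granularity matched to the expected ingestion interval
  await db.createCollection('app_metrics', {
    timeseries: { timeField: 'timestamp', metaField: 'metadata', granularity: 'minutes' },
    expireAfterSeconds: 90 * 24 * 60 * 60  // retention policy: 90 days
  });

  // Index strategy: compound index covering the dominant query shape (metadata filter + time range)
  await db.collection('app_metrics').createIndex(
    { 'metadata.service': 1, timestamp: -1 },
    { name: 'service_time_idx' }
  );

  // Retention policies can be tightened later without recreating the collection
  await db.command({ collMod: 'app_metrics', expireAfterSeconds: 30 * 24 * 60 * 60 });

  await client.close();
}

provisionMetricsCollection().catch(console.error);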

Scalability and Production Deployment

Optimize time series operations for enterprise-scale requirements:

  1. Sharding Strategy: Design shard keys that support time-based distribution and query patterns (a sketch follows this list)
  2. Data Lifecycle Management: Implement tiered storage strategies for hot, warm, and cold time series data
  3. Real-Time Processing: Configure streaming aggregation and real-time analytics for time-sensitive applications
  4. Capacity Planning: Monitor ingestion rates, storage growth, and query performance for scaling decisions
  5. Disaster Recovery: Design backup and recovery strategies appropriate for time series data characteristics
  6. Integration Patterns: Implement integration with monitoring, alerting, and visualization platforms
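
As a sketch of the sharding principle only, assuming a sharded cluster on MongoDB 5.1 or newer and an illustrative metrics.app_metrics namespace; the shard key pairs the metaField (for distribution) with the timeField (for time-range pruning), and the command creates the time series collection if it does not already exist:

// Hypothetical sharding setup for a time series collection (namespace and key are placeholders)
const { MongoClient } = require('mongodb');

async function shardTimeSeriesCollection() {
  const client = new MongoClient('mongodb://localhost:27017');  // connect through mongos
  await client.connect();

  await client.db('admin').command({
    shardCollection: 'metrics.app_metrics',
    key: { 'metadata.service': 1, timestamp: 1 },
    timeseries: { timeField: 'timestamp', metaField: 'metadata', granularity: 'minutes' }
  });

  await client.close();
}

shardTimeSeriesCollection().catch(console.error);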

Conclusion

MongoDB time series collections provide comprehensive temporal data management capabilities that enable efficient storage, high-performance analytics, and scalable ingestion for IoT, monitoring, and analytical applications through purpose-built storage optimization, advanced compression, and specialized indexing strategies. The native time series support ensures that temporal workloads benefit from MongoDB's storage efficiency, query optimization, and analytical capabilities.

Key MongoDB Time Series benefits include:

  • Storage Optimization: Automatic compression and bucketing strategies optimized for temporal data patterns
  • Query Performance: Specialized indexing and aggregation capabilities for time-range and analytical queries
  • Ingestion Efficiency: High-throughput data insertion with minimal overhead and optimal storage utilization
  • Analytical Capabilities: Built-in aggregation functions designed for time series analytics and trend analysis
  • Lifecycle Management: Automatic retention policies and data expiration for operational efficiency
  • SQL Accessibility: Familiar SQL-style time series operations through QueryLeaf for accessible temporal data management

Whether you're building IoT platforms, system monitoring solutions, financial analytics applications, or sensor data management systems, MongoDB time series collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, scalable temporal data management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB time series operations while providing SQL-familiar syntax for temporal data management, aggregation, and analytics. Advanced time series patterns, compression strategies, and analytical functions are seamlessly handled through familiar SQL constructs, making sophisticated time series applications accessible to SQL-oriented development teams.

The combination of MongoDB's robust time series capabilities with SQL-style temporal operations makes it an ideal platform for applications requiring both high-performance time series storage and familiar database management patterns, ensuring your temporal data operations can scale efficiently while maintaining query performance and storage optimization as data volume and analytical complexity grow.

MongoDB Connection Pooling and Performance Optimization: Advanced Database Connection Management for Production Applications

Production database applications require sophisticated connection management strategies that can handle fluctuating workloads, maintain optimal performance under high concurrency, and prevent resource exhaustion while ensuring consistent response times. Traditional database connection approaches often struggle with connection overhead, resource contention, and inefficient connection lifecycle management, leading to performance bottlenecks, connection timeouts, and application instability in production environments.

MongoDB provides comprehensive connection pooling capabilities that enable efficient database connection management through intelligent pooling strategies, automatic connection lifecycle management, and advanced monitoring features. Unlike traditional databases that require complex connection management libraries or manual connection handling, MongoDB's built-in connection pooling offers optimized resource utilization, automatic scaling, and production-ready connection management with minimal configuration overhead.
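
For example, with the Node.js driver the connection pool is configured directly on the MongoClient and managed automatically; the values below are illustrative starting points rather than recommendations:

// Hypothetical pool configuration (URI, database, and thresholds are placeholders)
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017', {
  maxPoolSize: 100,          // upper bound on concurrent connections per host
  minPoolSize: 10,           // connections kept warm during idle periods
  maxIdleTimeMS: 60000,      // retire connections idle longer than 60 seconds
  waitQueueTimeoutMS: 5000,  // fail fast when no pooled connection becomes available
  maxConnecting: 2           // limit simultaneous connection establishment
});

async function run() {
  await client.connect();
  // All operations share the pool; no per-request connection management is needed
  const pending = await client.db('app').collection('orders').countDocuments({ status: 'pending' });
  console.log(`Pending orders: ${pending}`);
  await client.close();
}

run().catch(console.error);

Equivalent settings can also be supplied as connection string parameters (for example ?maxPoolSize=100), which is often convenient when the same configuration must be shared across services.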

The Traditional Connection Management Challenge

Conventional database connection management approaches face significant limitations in production environments:

-- Traditional PostgreSQL connection management - manual and resource-intensive approaches

-- Basic connection handling with poor resource management
CREATE TABLE connection_usage_log (
    log_id SERIAL PRIMARY KEY,
    connection_id VARCHAR(100) NOT NULL,
    application_name VARCHAR(100),
    database_name VARCHAR(100),
    username VARCHAR(100),

    -- Connection lifecycle tracking
    connection_established TIMESTAMP NOT NULL,
    connection_closed TIMESTAMP,
    connection_duration_seconds INTEGER,

    -- Resource utilization tracking (limited visibility)
    queries_executed INTEGER DEFAULT 0,
    transactions_executed INTEGER DEFAULT 0,
    bytes_transferred BIGINT DEFAULT 0,

    -- Basic performance metrics
    avg_query_time_ms DECIMAL(10,3),
    max_query_time_ms DECIMAL(10,3),
    total_wait_time_ms BIGINT DEFAULT 0,

    -- Connection status tracking
    connection_status VARCHAR(50) DEFAULT 'active',
    last_activity TIMESTAMP,
    idle_time_seconds INTEGER DEFAULT 0,

    -- Error tracking (basic)
    error_count INTEGER DEFAULT 0,
    last_error_message TEXT,
    last_error_timestamp TIMESTAMP
);

-- Manual connection pool simulation (extremely limited functionality)
CREATE TABLE connection_pool_config (
    pool_name VARCHAR(100) PRIMARY KEY,
    database_name VARCHAR(100) NOT NULL,
    min_connections INTEGER DEFAULT 5,
    max_connections INTEGER DEFAULT 25,
    connection_timeout_seconds INTEGER DEFAULT 30,
    idle_timeout_seconds INTEGER DEFAULT 300,

    -- Basic pool settings
    validate_connections BOOLEAN DEFAULT true,
    test_query VARCHAR(200) DEFAULT 'SELECT 1',
    max_lifetime_seconds INTEGER DEFAULT 1800,

    -- Pool status (manually maintained)
    pool_enabled BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Simple connection allocation function (no real pooling)
CREATE OR REPLACE FUNCTION allocate_connection(
    pool_name_param VARCHAR(100),
    application_name_param VARCHAR(100)
) RETURNS TABLE (
    connection_id VARCHAR(100),
    allocation_success BOOLEAN,
    wait_time_ms INTEGER,
    error_message TEXT
) AS $$
DECLARE
    available_connections INTEGER;
    pool_config RECORD;
    new_connection_id VARCHAR(100);
    start_time TIMESTAMP := clock_timestamp();
    wait_timeout TIMESTAMP;
BEGIN

    -- Get pool configuration (basic validation)
    SELECT * INTO pool_config
    FROM connection_pool_config
    WHERE pool_name = pool_name_param AND pool_enabled = true;

    IF NOT FOUND THEN
        RETURN QUERY SELECT 
            NULL::VARCHAR(100),
            false,
            0,
            'Connection pool not found or disabled';
        RETURN;
    END IF;

    -- Count "available" connections (very basic simulation)
    SELECT COUNT(*) INTO available_connections
    FROM connection_usage_log
    WHERE connection_status = 'active'
    AND last_activity > CURRENT_TIMESTAMP - (pool_config.idle_timeout_seconds * INTERVAL '1 second');

    -- Simple allocation logic (doesn't actually create real connections)
    IF available_connections >= pool_config.max_connections THEN
        -- Wait for connection with timeout (simulated blocking)
        wait_timeout := start_time + (pool_config.connection_timeout_seconds * INTERVAL '1 second');

        WHILE clock_timestamp() < wait_timeout LOOP
            -- Check for available connections again
            SELECT COUNT(*) INTO available_connections
            FROM connection_usage_log
            WHERE connection_status = 'active'
            AND last_activity > CURRENT_TIMESTAMP - (pool_config.idle_timeout_seconds * INTERVAL '1 second');

            IF available_connections < pool_config.max_connections THEN
                EXIT;
            END IF;

            -- Busy-wait with pg_sleep (blocks the calling backend while polling)
            PERFORM pg_sleep(0.1);
        END LOOP;

        IF available_connections >= pool_config.max_connections THEN
            RETURN QUERY SELECT 
                NULL::VARCHAR(100),
                false,
                (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::INTEGER,
                'Connection pool exhausted - timeout waiting for available connection';
            RETURN;
        END IF;
    END IF;

    -- "Allocate" a connection (just create a log entry)
    new_connection_id := 'conn_' || extract(epoch from now())::BIGINT || '_' || (random() * 10000)::INTEGER;

    INSERT INTO connection_usage_log (
        connection_id,
        application_name,
        database_name,
        username,
        connection_established,
        connection_status,
        last_activity
    ) VALUES (
        new_connection_id,
        application_name_param,
        pool_config.database_name,
        current_user,
        clock_timestamp(),
        'active',
        clock_timestamp()
    );

    RETURN QUERY SELECT 
        new_connection_id,
        true,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::INTEGER,
        NULL::TEXT;

EXCEPTION WHEN OTHERS THEN
    RETURN QUERY SELECT 
        NULL::VARCHAR(100),
        false,
        (EXTRACT(EPOCH FROM clock_timestamp() - start_time) * 1000)::INTEGER,
        SQLERRM;
END;
$$ LANGUAGE plpgsql;

-- Manual connection health checking (limited effectiveness)
CREATE OR REPLACE FUNCTION check_connection_health() 
RETURNS TABLE (
    pool_name VARCHAR(100),
    total_connections INTEGER,
    active_connections INTEGER,
    idle_connections INTEGER,
    stale_connections INTEGER,
    avg_connection_age_minutes INTEGER,
    pool_utilization_percent DECIMAL(5,2),
    health_status VARCHAR(50)
) AS $$
BEGIN
    RETURN QUERY
    WITH connection_stats AS (
        SELECT 
            cpc.pool_name,
            COUNT(cul.connection_id)::INTEGER as total_connections,
            COUNT(*) FILTER (WHERE cul.connection_status = 'active' 
                           AND cul.last_activity > CURRENT_TIMESTAMP - INTERVAL '5 minutes')::INTEGER as active_connections,
            COUNT(*) FILTER (WHERE cul.connection_status = 'active' 
                           AND cul.last_activity <= CURRENT_TIMESTAMP - INTERVAL '5 minutes')::INTEGER as idle_connections,
            COUNT(*) FILTER (WHERE cul.last_activity < CURRENT_TIMESTAMP - INTERVAL '30 minutes')::INTEGER as stale_connections,
            AVG(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - cul.connection_established) / 60)::INTEGER as avg_age_minutes,
            cpc.max_connections

        FROM connection_pool_config cpc
        LEFT JOIN connection_usage_log cul ON cul.database_name = cpc.database_name
        WHERE cpc.pool_enabled = true
        GROUP BY cpc.pool_name, cpc.max_connections
    )
    SELECT 
        cs.pool_name,
        cs.total_connections,
        cs.active_connections,
        cs.idle_connections,
        cs.stale_connections,
        cs.avg_age_minutes,
        ROUND((cs.total_connections::DECIMAL / cs.max_connections) * 100, 2) as utilization_percent,

        -- Basic health assessment
        CASE 
            WHEN cs.stale_connections > cs.max_connections * 0.5 THEN 'unhealthy'
            WHEN cs.total_connections > cs.max_connections * 0.9 THEN 'stressed'
            WHEN cs.active_connections < 2 THEN 'underutilized'
            ELSE 'healthy'
        END as health_status

    FROM connection_stats cs;
END;
$$ LANGUAGE plpgsql;

-- Basic connection cleanup (manual process)
WITH stale_connections AS (
    SELECT 
        connection_id,
        connection_established,
        last_activity,
        EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - last_activity) / 60 as idle_minutes

    FROM connection_usage_log
    WHERE connection_status = 'active'
    AND (
        last_activity < CURRENT_TIMESTAMP - INTERVAL '30 minutes'  -- Long idle
        OR connection_established < CURRENT_TIMESTAMP - INTERVAL '2 hours'  -- Old connections
        OR error_count > 5  -- Error-prone connections
    )
),
cleanup_summary AS (
    UPDATE connection_usage_log
    SET 
        connection_status = 'closed',
        connection_closed = CURRENT_TIMESTAMP,
        connection_duration_seconds = EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - connection_established)::INTEGER
    WHERE connection_id IN (SELECT connection_id FROM stale_connections)
    RETURNING connection_id, connection_duration_seconds
)
SELECT 
    COUNT(*) as connections_cleaned,
    AVG(connection_duration_seconds) as avg_connection_lifetime_seconds,
    SUM(connection_duration_seconds) as total_connection_time_freed,

    -- Limited cleanup statistics
    COUNT(*) FILTER (WHERE connection_duration_seconds > 3600) as long_lived_connections_cleaned,
    COUNT(*) FILTER (WHERE connection_duration_seconds < 300) as short_lived_connections_cleaned

FROM cleanup_summary;

-- Problems with traditional connection management:
-- 1. No real connection pooling - just tracking and simulation
-- 2. Manual resource management with high maintenance overhead  
-- 3. Limited connection lifecycle automation and optimization
-- 4. No built-in load balancing or intelligent connection distribution
-- 5. Poor visibility into connection performance and resource utilization
-- 6. Manual scaling and configuration adjustments required
-- 7. No automatic connection validation or health checking
-- 8. Limited error handling and recovery mechanisms
-- 9. No support for advanced features like read preference or write concern optimization
-- 10. Complex application-level connection management logic required

-- Attempt at connection performance monitoring (very limited)
WITH connection_performance AS (
    SELECT 
        DATE_TRUNC('hour', cul.last_activity) as hour_bucket,
        cul.application_name,
        COUNT(*) as total_connections_used,
        AVG(cul.connection_duration_seconds) as avg_connection_lifetime,
        SUM(cul.queries_executed) as total_queries,

        -- Basic performance calculations
        CASE 
            WHEN SUM(cul.queries_executed) > 0 THEN
                AVG(cul.avg_query_time_ms)
            ELSE NULL
        END as overall_avg_query_time,

        COUNT(*) FILTER (WHERE cul.error_count > 0) as connections_with_errors,
        SUM(cul.error_count) as total_errors

    FROM connection_usage_log cul
    WHERE cul.last_activity >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND cul.connection_status = 'closed'  -- Only analyze completed connections
    GROUP BY DATE_TRUNC('hour', cul.last_activity), cul.application_name
),
hourly_trends AS (
    SELECT 
        cp.*,
        LAG(total_connections_used) OVER (
            PARTITION BY application_name 
            ORDER BY hour_bucket
        ) as prev_hour_connections,
        LAG(overall_avg_query_time) OVER (
            PARTITION BY application_name 
            ORDER BY hour_bucket
        ) as prev_hour_query_time
    FROM connection_performance cp
)
SELECT 
    hour_bucket,
    application_name,
    total_connections_used,
    ROUND(avg_connection_lifetime::NUMERIC, 1) as avg_lifetime_seconds,
    total_queries,
    ROUND(overall_avg_query_time::NUMERIC, 2) as avg_query_time_ms,

    -- Error rates
    connections_with_errors,
    CASE 
        WHEN total_connections_used > 0 THEN
            ROUND((connections_with_errors::DECIMAL / total_connections_used) * 100, 2)
        ELSE 0
    END as connection_error_rate_percent,

    -- Trend indicators (very basic)
    CASE 
        WHEN prev_hour_connections IS NOT NULL AND prev_hour_connections > 0 THEN
            ROUND(((total_connections_used - prev_hour_connections)::DECIMAL / prev_hour_connections) * 100, 1)
        ELSE NULL
    END as connection_usage_change_percent,

    -- Performance assessment
    CASE 
        WHEN overall_avg_query_time > 1000 THEN 'slow'
        WHEN overall_avg_query_time > 500 THEN 'moderate'
        WHEN overall_avg_query_time < 100 THEN 'fast'
        ELSE 'normal'
    END as performance_assessment

FROM hourly_trends
WHERE total_connections_used > 0
ORDER BY hour_bucket DESC, application_name;

-- Traditional limitations:
-- 1. No actual connection pooling implementation - only monitoring and simulation
-- 2. Manual configuration and tuning without automatic optimization
-- 3. Limited connection health monitoring and automated recovery
-- 4. No intelligent load balancing or connection routing
-- 5. Poor integration with application frameworks and ORMs
-- 6. Limited scalability for high-concurrency applications
-- 7. No advanced features like connection warming or pre-allocation
-- 8. Manual error handling and connection recovery logic
-- 9. No built-in support for distributed database topologies
-- 10. Complex troubleshooting and performance analysis requirements

MongoDB provides sophisticated connection pooling capabilities with automatic optimization and management:

// MongoDB Advanced Connection Pooling - comprehensive connection management with intelligent optimization
const { MongoClient, ServerApiVersion } = require('mongodb');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Connection Pool Manager
class AdvancedConnectionPoolManager extends EventEmitter {
  constructor(connectionString, config = {}) {
    super();
    this.connectionString = connectionString;
    this.pools = new Map();
    this.monitoringIntervals = new Map();

    // Advanced connection pooling configuration
    this.config = {
      // Connection pool size management
      minPoolSize: config.minPoolSize || 5,
      maxPoolSize: config.maxPoolSize || 50,
      maxIdleTimeMS: config.maxIdleTimeMS || 300000, // 5 minutes
      maxConnecting: config.maxConnecting || 10,

      // Connection lifecycle management
      connectTimeoutMS: config.connectTimeoutMS || 30000,
      socketTimeoutMS: config.socketTimeoutMS || 120000,
      serverSelectionTimeoutMS: config.serverSelectionTimeoutMS || 30000,
      heartbeatFrequencyMS: config.heartbeatFrequencyMS || 10000,

      // Performance optimization
      maxStalenessSeconds: config.maxStalenessSeconds || 90,
      readPreference: config.readPreference || 'primaryPreferred',
      readConcern: config.readConcern || { level: 'majority' },
      writeConcern: config.writeConcern || { w: 'majority', j: true },

      // Connection pool monitoring
      enableMonitoring: config.enableMonitoring !== false,
      monitoringInterval: config.monitoringInterval || 30000,
      enableConnectionEvents: config.enableConnectionEvents !== false,
      enablePerformanceTracking: config.enablePerformanceTracking !== false,

      // Advanced features
      enableLoadBalancing: config.enableLoadBalancing || false,
      enableAutomaticFailover: config.enableAutomaticFailover !== false,
      enableConnectionWarming: config.enableConnectionWarming || false,
      enableIntelligentRouting: config.enableIntelligentRouting || false,

      // Error handling and recovery
      maxRetries: config.maxRetries || 3,
      retryDelayMS: config.retryDelayMS || 1000,
      enableCircuitBreaker: config.enableCircuitBreaker || false,
      circuitBreakerThreshold: config.circuitBreakerThreshold || 10,
      circuitBreakerTimeout: config.circuitBreakerTimeout || 60000,

      // Security and authentication
      authSource: config.authSource || 'admin',
      authMechanism: config.authMechanism || 'SCRAM-SHA-256',
      tlsAllowInvalidCertificates: config.tlsAllowInvalidCertificates || false,
      tlsAllowInvalidHostnames: config.tlsAllowInvalidHostnames || false
    };

    // Performance and monitoring metrics
    this.metrics = {
      connectionPools: new Map(),
      performanceHistory: [],
      errorStats: new Map(),
      operationStats: new Map()
    };

    // Circuit breaker state for error handling
    this.circuitBreaker = {
      state: 'closed', // closed, open, half-open
      failures: 0,
      lastFailureTime: null,
      nextRetryTime: null
    };

    this.initializeConnectionPooling();
  }

  async initializeConnectionPooling() {
    console.log('Initializing advanced MongoDB connection pooling system...');

    try {
      // Setup primary connection pool
      await this.createConnectionPool('primary', this.connectionString);

      // Initialize connection monitoring
      if (this.config.enableMonitoring) {
        await this.setupConnectionMonitoring();
      }

      // Setup connection event handlers
      if (this.config.enableConnectionEvents) {
        await this.setupConnectionEventHandlers();
      }

      // Initialize performance tracking
      if (this.config.enablePerformanceTracking) {
        await this.setupPerformanceTracking();
      }

      // Enable connection warming if configured
      if (this.config.enableConnectionWarming) {
        await this.warmupConnectionPool('primary');
      }

      console.log('Advanced connection pooling system initialized successfully');

    } catch (error) {
      console.error('Error initializing connection pooling:', error);
      throw error;
    }
  }

  async createConnectionPool(poolName, connectionString, poolConfig = {}) {
    console.log(`Creating connection pool: ${poolName}`);

    try {
      // Merge pool-specific configuration with global configuration
      const mergedConfig = {
        ...this.config,
        ...poolConfig
      };

      // Configure MongoDB client options with advanced connection pooling
      const clientOptions = {
        // Connection pool configuration
        minPoolSize: mergedConfig.minPoolSize,
        maxPoolSize: mergedConfig.maxPoolSize,
        maxIdleTimeMS: mergedConfig.maxIdleTimeMS,
        maxConnecting: mergedConfig.maxConnecting,

        // Connection timeout configuration
        connectTimeoutMS: mergedConfig.connectTimeoutMS,
        socketTimeoutMS: mergedConfig.socketTimeoutMS,
        serverSelectionTimeoutMS: mergedConfig.serverSelectionTimeoutMS,
        heartbeatFrequencyMS: mergedConfig.heartbeatFrequencyMS,

        // Read and write preferences
        readPreference: mergedConfig.readPreference,
        readConcern: mergedConfig.readConcern,
        writeConcern: mergedConfig.writeConcern,
        maxStalenessSeconds: mergedConfig.maxStalenessSeconds,

        // Advanced features
        loadBalanced: mergedConfig.enableLoadBalancing,
        retryWrites: true,
        retryReads: true,

        // Security configuration
        authSource: mergedConfig.authSource,
        authMechanism: mergedConfig.authMechanism,
        tls: connectionString.includes('ssl=true') || connectionString.includes('+srv'),
        tlsAllowInvalidCertificates: mergedConfig.tlsAllowInvalidCertificates,
        tlsAllowInvalidHostnames: mergedConfig.tlsAllowInvalidHostnames,

        // Monitoring and logging
        monitorCommands: mergedConfig.enablePerformanceTracking,
        serverApi: {
          version: ServerApiVersion.v1,
          strict: true,
          deprecationErrors: false
        }
      };

      // Create MongoDB client with connection pooling
      const client = new MongoClient(connectionString, clientOptions);

      // Establish initial connection
      await client.connect();

      // Store pool information
      const poolInfo = {
        client: client,
        name: poolName,
        connectionString: connectionString,
        configuration: mergedConfig,
        createdAt: new Date(),

        // Connection pool statistics
        stats: {
          connectionsCreated: 0,
          connectionsClosed: 0,
          connectionsInUse: 0,
          connectionsAvailable: 0,
          totalOperations: 0,
          failedOperations: 0,
          averageOperationTime: 0,
          lastConnectionActivity: new Date()
        },

        // Health status
        healthStatus: 'healthy',
        lastHealthCheck: new Date()
      };

      this.pools.set(poolName, poolInfo);

      console.log(`Connection pool '${poolName}' created successfully with ${mergedConfig.maxPoolSize} max connections`);

      return poolInfo;

    } catch (error) {
      console.error(`Error creating connection pool '${poolName}':`, error);
      throw error;
    }
  }

  async setupConnectionEventHandlers() {
    console.log('Setting up comprehensive connection event handlers...');

    for (const [poolName, poolInfo] of this.pools.entries()) {
      const client = poolInfo.client;

      // Connection pool events
      client.on('connectionPoolCreated', (event) => {
        console.log(`Connection pool created: ${poolName}`, {
          address: event.address,
          options: event.options
        });

        this.metrics.connectionPools.set(poolName, {
          ...this.metrics.connectionPools.get(poolName),
          poolCreated: new Date(),
          address: event.address
        });

        this.emit('poolCreated', { poolName, event });
      });

      client.on('connectionCreated', (event) => {
        console.log(`Connection created in pool '${poolName}':`, {
          connectionId: event.connectionId,
          address: event.address
        });

        poolInfo.stats.connectionsCreated++;
        poolInfo.stats.lastConnectionActivity = new Date();

        this.emit('connectionCreated', { poolName, event });
      });

      client.on('connectionReady', (event) => {
        console.log(`Connection ready in pool '${poolName}':`, {
          connectionId: event.connectionId,
          address: event.address
        });

        this.emit('connectionReady', { poolName, event });
      });

      client.on('connectionClosed', (event) => {
        console.log(`Connection closed in pool '${poolName}':`, {
          connectionId: event.connectionId,
          reason: event.reason,
          address: event.address
        });

        poolInfo.stats.connectionsClosed++;
        poolInfo.stats.lastConnectionActivity = new Date();

        this.emit('connectionClosed', { poolName, event });
      });

      client.on('connectionCheckOutStarted', (event) => {
        this.emit('connectionCheckOutStarted', { poolName, event });
      });

      client.on('connectionCheckedOut', (event) => {
        poolInfo.stats.connectionsInUse++;
        poolInfo.stats.connectionsAvailable--;

        this.emit('connectionCheckedOut', { poolName, event });
      });

      client.on('connectionCheckOutFailed', (event) => {
        console.error(`Connection checkout failed in pool '${poolName}':`, {
          reason: event.reason,
          address: event.address
        });

        poolInfo.stats.failedOperations++;

        this.emit('connectionCheckOutFailed', { poolName, event });
      });

      client.on('connectionCheckedIn', (event) => {
        poolInfo.stats.connectionsInUse--;
        poolInfo.stats.connectionsAvailable++;

        this.emit('connectionCheckedIn', { poolName, event });
      });

      // Server discovery and monitoring events
      client.on('serverDescriptionChanged', (event) => {
        console.log(`Server description changed for pool '${poolName}':`, {
          address: event.address,
          newDescription: event.newDescription.type,
          previousDescription: event.previousDescription.type
        });

        this.emit('serverDescriptionChanged', { poolName, event });
      });

      client.on('topologyDescriptionChanged', (event) => {
        console.log(`Topology changed for pool '${poolName}':`, {
          newTopologyType: event.newDescription.type,
          previousTopologyType: event.previousDescription.type
        });

        this.emit('topologyDescriptionChanged', { poolName, event });
      });

      // Command monitoring events (if performance tracking enabled)
      if (this.config.enablePerformanceTracking) {
        client.on('commandStarted', (event) => {
          this.trackCommandStart(poolName, event);
        });

        client.on('commandSucceeded', (event) => {
          this.trackCommandSuccess(poolName, event);
        });

        client.on('commandFailed', (event) => {
          this.trackCommandFailure(poolName, event);
        });
      }
    }
  }

  trackCommandStart(poolName, event) {
    // Store command start time for performance tracking
    if (!this.metrics.operationStats.has(poolName)) {
      this.metrics.operationStats.set(poolName, new Map());
    }

    const poolStats = this.metrics.operationStats.get(poolName);
    poolStats.set(event.requestId, {
      command: event.commandName,
      startTime: Date.now(),
      connectionId: event.connectionId
    });
  }

  trackCommandSuccess(poolName, event) {
    const poolStats = this.metrics.operationStats.get(poolName);
    if (!poolStats) return; // No tracked commands for this pool (e.g., tracking enabled mid-flight)
    const commandInfo = poolStats.get(event.requestId);

    if (commandInfo) {
      const duration = Date.now() - commandInfo.startTime;

      // Update pool statistics
      const poolInfo = this.pools.get(poolName);
      poolInfo.stats.totalOperations++;

      // Update average operation time
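      // Incremental (running) average: newAvg = (oldAvg * (n - 1) + duration) / n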
      const currentAvg = poolInfo.stats.averageOperationTime;
      const totalOps = poolInfo.stats.totalOperations;
      poolInfo.stats.averageOperationTime = ((currentAvg * (totalOps - 1)) + duration) / totalOps;

      // Store performance history
      this.metrics.performanceHistory.push({
        poolName: poolName,
        command: event.commandName,
        duration: duration,
        timestamp: new Date(),
        success: true
      });

      // Clean up tracking
      poolStats.delete(event.requestId);

      this.emit('commandCompleted', {
        poolName,
        command: event.commandName,
        duration,
        success: true
      });
    }
  }

  trackCommandFailure(poolName, event) {
    const poolStats = this.metrics.operationStats.get(poolName);
    if (!poolStats) return; // Guard against commands that were never tracked for this pool
    const commandInfo = poolStats.get(event.requestId);

    if (commandInfo) {
      const duration = Date.now() - commandInfo.startTime;

      // Update failure statistics
      const poolInfo = this.pools.get(poolName);
      poolInfo.stats.failedOperations++;

      // Update error statistics
      if (!this.metrics.errorStats.has(poolName)) {
        this.metrics.errorStats.set(poolName, new Map());
      }

      const errorStats = this.metrics.errorStats.get(poolName);
      const errorKey = `${event.failure.codeName || 'UnknownError'}`;
      const currentCount = errorStats.get(errorKey) || 0;
      errorStats.set(errorKey, currentCount + 1);

      // Store performance history
      this.metrics.performanceHistory.push({
        poolName: poolName,
        command: event.commandName,
        duration: duration,
        timestamp: new Date(),
        success: false,
        error: event.failure
      });

      // Update circuit breaker if enabled
      if (this.config.enableCircuitBreaker) {
        this.updateCircuitBreaker(event.failure);
      }

      // Clean up tracking
      poolStats.delete(event.requestId);

      this.emit('commandFailed', {
        poolName,
        command: event.commandName,
        duration,
        error: event.failure
      });
    }
  }

  async setupConnectionMonitoring() {
    console.log('Setting up connection pool monitoring...');

    const monitoringInterval = setInterval(async () => {
      try {
        await this.performHealthChecks();
        await this.collectConnectionMetrics();
        await this.optimizeConnectionPools();

      } catch (error) {
        console.error('Error during connection monitoring:', error);
      }
    }, this.config.monitoringInterval);

    this.monitoringIntervals.set('health_monitoring', monitoringInterval);
  }

  async performHealthChecks() {
    for (const [poolName, poolInfo] of this.pools.entries()) {
      try {
        const client = poolInfo.client;

        // Perform ping to check connection health
        const pingStart = Date.now();
        await client.db('admin').admin().ping();
        const pingDuration = Date.now() - pingStart;

        // Update health status
        poolInfo.healthStatus = pingDuration < 1000 ? 'healthy' : 
                              pingDuration < 5000 ? 'degraded' : 'unhealthy';
        poolInfo.lastHealthCheck = new Date();
        poolInfo.lastPingDuration = pingDuration;

        // Emit health status event
        this.emit('healthCheck', {
          poolName,
          healthStatus: poolInfo.healthStatus,
          pingDuration
        });

      } catch (error) {
        console.error(`Health check failed for pool '${poolName}':`, error);

        poolInfo.healthStatus = 'unhealthy';
        poolInfo.lastHealthCheck = new Date();
        poolInfo.lastError = error;

        this.emit('healthCheckFailed', { poolName, error });
      }
    }
  }

  async collectConnectionMetrics() {
    for (const [poolName, poolInfo] of this.pools.entries()) {
      try {
        const client = poolInfo.client;

        // Get server status for connection metrics
        const serverStatus = await client.db('admin').admin().serverStatus();
        const connections = serverStatus.connections;

        // Update pool metrics
        if (!this.metrics.connectionPools.has(poolName)) {
          this.metrics.connectionPools.set(poolName, {});
        }

        const poolMetrics = this.metrics.connectionPools.get(poolName);
        Object.assign(poolMetrics, {
          current: connections.current,
          available: connections.available,
          totalCreated: connections.totalCreated,
          active: connections.active || 0,

          // Pool-specific statistics
          poolUtilization: (poolInfo.stats.connectionsInUse / this.config.maxPoolSize) * 100,
          averageOperationTime: poolInfo.stats.averageOperationTime,
          operationsPerSecond: this.calculateOperationsPerSecond(poolName),
          errorRate: this.calculateErrorRate(poolName),

          lastUpdated: new Date()
        });

      } catch (error) {
        console.warn(`Error collecting metrics for pool '${poolName}':`, error.message);
      }
    }
  }

  calculateOperationsPerSecond(poolName) {
    const now = Date.now();
    const oneSecondAgo = now - 1000;

    const recentOperations = this.metrics.performanceHistory.filter(
      op => op.poolName === poolName && 
            op.timestamp.getTime() > oneSecondAgo
    );

    return recentOperations.length;
  }

  calculateErrorRate(poolName) {
    const now = Date.now();
    const oneMinuteAgo = now - 60000;

    const recentOperations = this.metrics.performanceHistory.filter(
      op => op.poolName === poolName && 
            op.timestamp.getTime() > oneMinuteAgo
    );

    if (recentOperations.length === 0) return 0;

    const failedOperations = recentOperations.filter(op => !op.success);
    return (failedOperations.length / recentOperations.length) * 100;
  }

  async optimizeConnectionPools() {
    for (const [poolName, poolInfo] of this.pools.entries()) {
      try {
        const poolMetrics = this.metrics.connectionPools.get(poolName);
        if (!poolMetrics) continue;

        // Analyze pool performance and suggest optimizations
        const optimizationRecommendations = this.analyzePoolPerformance(poolName, poolMetrics);

        // Apply automatic optimizations if enabled
        if (optimizationRecommendations.length > 0) {
          console.log(`Optimization recommendations for pool '${poolName}':`, optimizationRecommendations);

          this.emit('optimizationRecommendations', {
            poolName,
            recommendations: optimizationRecommendations
          });
        }

      } catch (error) {
        console.warn(`Error optimizing pool '${poolName}':`, error.message);
      }
    }
  }

  analyzePoolPerformance(poolName, poolMetrics) {
    const recommendations = [];

    // High utilization check
    if (poolMetrics.poolUtilization > 90) {
      recommendations.push({
        type: 'scale_up',
        priority: 'high',
        message: 'Pool utilization is very high, consider increasing maxPoolSize',
        currentValue: this.config.maxPoolSize,
        suggestedValue: Math.min(this.config.maxPoolSize * 1.5, 100)
      });
    }

    // Low utilization check
    if (poolMetrics.poolUtilization < 10 && this.config.maxPoolSize > 10) {
      recommendations.push({
        type: 'scale_down',
        priority: 'low',
        message: 'Pool utilization is low, consider decreasing maxPoolSize to save resources',
        currentValue: this.config.maxPoolSize,
        suggestedValue: Math.max(this.config.maxPoolSize * 0.8, 5)
      });
    }

    // High error rate check
    if (poolMetrics.errorRate > 5) {
      recommendations.push({
        type: 'investigate_errors',
        priority: 'high',
        message: 'High error rate detected, investigate connection issues',
        errorRate: poolMetrics.errorRate
      });
    }

    // Slow operations check
    if (poolMetrics.averageOperationTime > 1000) {
      recommendations.push({
        type: 'performance_tuning',
        priority: 'medium',
        message: 'Average operation time is high, consider query optimization or read preference tuning',
        averageTime: poolMetrics.averageOperationTime
      });
    }

    return recommendations;
  }

  async warmupConnectionPool(poolName) {
    console.log(`Warming up connection pool: ${poolName}`);

    try {
      const poolInfo = this.pools.get(poolName);
      if (!poolInfo) {
        throw new Error(`Connection pool '${poolName}' not found`);
      }

      const client = poolInfo.client;
      const minConnections = this.config.minPoolSize;

      // Pre-create minimum number of connections
      const warmupPromises = [];
      for (let i = 0; i < minConnections; i++) {
        warmupPromises.push(
          client.db('admin').admin().ping().catch(error => {
            console.warn(`Warmup connection ${i} failed:`, error.message);
          })
        );
      }

      await Promise.allSettled(warmupPromises);

      console.log(`Connection pool '${poolName}' warmed up successfully`);

      this.emit('poolWarmedUp', { poolName, minConnections });

    } catch (error) {
      console.error(`Error warming up pool '${poolName}':`, error);
      throw error;
    }
  }

  async getConnectionPoolStats(poolName = null) {
    const stats = {};

    const poolsToCheck = poolName ? [poolName] : Array.from(this.pools.keys());

    for (const name of poolsToCheck) {
      const poolInfo = this.pools.get(name);
      const poolMetrics = this.metrics.connectionPools.get(name);

      if (poolInfo) {
        stats[name] = {
          // Basic pool information
          configuration: {
            minPoolSize: this.config.minPoolSize,
            maxPoolSize: this.config.maxPoolSize,
            maxIdleTimeMS: this.config.maxIdleTimeMS,
            readPreference: this.config.readPreference
          },

          // Current pool statistics
          current: poolMetrics ? {
            connections: {
              current: poolMetrics.current || 0,
              available: poolMetrics.available || 0,
              inUse: poolInfo.stats.connectionsInUse,
              created: poolInfo.stats.connectionsCreated,
              closed: poolInfo.stats.connectionsClosed
            },

            performance: {
              utilization: poolMetrics.poolUtilization || 0,
              averageOperationTime: poolInfo.stats.averageOperationTime,
              operationsPerSecond: poolMetrics.operationsPerSecond || 0,
              totalOperations: poolInfo.stats.totalOperations,
              failedOperations: poolInfo.stats.failedOperations,
              errorRate: poolMetrics.errorRate || 0
            },

            health: {
              status: poolInfo.healthStatus,
              lastHealthCheck: poolInfo.lastHealthCheck,
              lastPingDuration: poolInfo.lastPingDuration || null,
              lastError: poolInfo.lastError ? poolInfo.lastError.message : null
            }
          } : null,

          // Historical data
          recentPerformance: this.getRecentPerformanceData(name),
          errorBreakdown: this.getErrorBreakdown(name)
        };
      }
    }

    return poolName ? stats[poolName] : stats;
  }

  getRecentPerformanceData(poolName, minutes = 10) {
    const cutoff = Date.now() - (minutes * 60 * 1000);

    return this.metrics.performanceHistory
      .filter(op => op.poolName === poolName && op.timestamp.getTime() > cutoff)
      .map(op => ({
        command: op.command,
        duration: op.duration,
        timestamp: op.timestamp,
        success: op.success
      }));
  }

  getErrorBreakdown(poolName) {
    const errorStats = this.metrics.errorStats.get(poolName);
    if (!errorStats) return {};

    const breakdown = {};
    for (const [errorType, count] of errorStats.entries()) {
      breakdown[errorType] = count;
    }

    return breakdown;
  }

  updateCircuitBreaker(error) {
    const now = Date.now();

    this.circuitBreaker.failures++;
    this.circuitBreaker.lastFailureTime = now;

    if (this.circuitBreaker.failures >= this.config.circuitBreakerThreshold) {
      if (this.circuitBreaker.state !== 'open') {
        console.warn('Circuit breaker opened due to high failure rate');
        this.circuitBreaker.state = 'open';
        this.circuitBreaker.nextRetryTime = now + this.config.circuitBreakerTimeout;

        this.emit('circuitBreakerOpened', {
          failures: this.circuitBreaker.failures,
          threshold: this.config.circuitBreakerThreshold
        });
      }
    }
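    // Note: transitioning back to 'half-open'/'closed' after circuitBreakerTimeout elapses is not
    // shown here; callers are assumed to consult circuitBreaker.nextRetryTime before retrying.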
  }

  async closeConnectionPool(poolName) {
    console.log(`Closing connection pool: ${poolName}`);

    try {
      const poolInfo = this.pools.get(poolName);
      if (!poolInfo) {
        throw new Error(`Connection pool '${poolName}' not found`);
      }

      // Close the MongoDB client
      await poolInfo.client.close();

      // Remove from pools
      this.pools.delete(poolName);

      // Clean up metrics
      this.metrics.connectionPools.delete(poolName);
      this.metrics.errorStats.delete(poolName);
      this.metrics.operationStats.delete(poolName);

      console.log(`Connection pool '${poolName}' closed successfully`);

      this.emit('poolClosed', { poolName });

    } catch (error) {
      console.error(`Error closing connection pool '${poolName}':`, error);
      throw error;
    }
  }

  async closeAllPools() {
    console.log('Closing all connection pools...');

    const closePromises = [];
    for (const poolName of this.pools.keys()) {
      closePromises.push(this.closeConnectionPool(poolName));
    }

    await Promise.allSettled(closePromises);

    // Clear monitoring intervals
    for (const [name, interval] of this.monitoringIntervals.entries()) {
      clearInterval(interval);
    }
    this.monitoringIntervals.clear();

    console.log('All connection pools closed');
    this.emit('allPoolsClosed');
  }
}

// Benefits of MongoDB Advanced Connection Pooling:
// - Intelligent connection pool sizing with automatic optimization
// - Comprehensive connection lifecycle management and monitoring
// - Advanced performance tracking and metrics collection
// - Built-in error handling with circuit breaker patterns
// - Automatic failover and recovery mechanisms
// - Health monitoring with proactive connection management
// - Load balancing and intelligent connection routing
// - Production-ready connection pooling with minimal configuration
// - Real-time performance analysis and optimization recommendations
// - SQL-compatible connection management through QueryLeaf integration

module.exports = {
  AdvancedConnectionPoolManager
};
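
For orientation, a usage sketch of the manager defined above might look like the following; the URI, pool name, and collection names are placeholder assumptions, and the pool is created explicitly so the example does not race the constructor's asynchronous initialization:

// Usage sketch for AdvancedConnectionPoolManager (URIs, file path, and names are illustrative).
const { AdvancedConnectionPoolManager } = require('./advanced-connection-pool-manager');

async function run() {
  const manager = new AdvancedConnectionPoolManager('mongodb://localhost:27017', {
    minPoolSize: 10,
    maxPoolSize: 50,
    enablePerformanceTracking: true
  });

  // Surface optimization hints produced by the monitoring loop
  manager.on('optimizationRecommendations', ({ poolName, recommendations }) => {
    console.log(`Recommendations for ${poolName}:`, recommendations);
  });

  // Create a named pool explicitly rather than relying on the 'primary' pool
  // that the constructor initializes in the background
  const { client } = await manager.createConnectionPool('reporting', 'mongodb://localhost:27017');
  await client.db('app').collection('events').insertOne({ createdAt: new Date() });

  // Inspect pool statistics, then shut everything down cleanly
  console.log(await manager.getConnectionPoolStats('reporting'));
  await manager.closeAllPools();
}

run().catch(console.error);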

Understanding MongoDB Connection Pooling Architecture

Advanced Connection Management and Performance Optimization Strategies

Implement sophisticated connection pooling patterns for production MongoDB deployments:

// Production-ready MongoDB connection pooling with enterprise-grade optimization
class ProductionConnectionManager extends AdvancedConnectionPoolManager {
  constructor(connectionString, productionConfig) {
    super(connectionString, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableDistributedPooling: true,
      enableLoadBalancingOptimization: true,
      enableCapacityPlanning: true,
      enableAutomaticScaling: true,
      enableComplianceAuditing: true,
      enableSecurityMonitoring: true
    };

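    // The helper methods referenced below (setupProductionOptimizations, initializeDistributedPooling,
    // setupCapacityPlanning, deployDistributedPooling, deployOptimizationStrategies) are assumed to be
    // supplied by the surrounding application; they are sketched conceptually rather than implemented here.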
    this.setupProductionOptimizations();
    this.initializeDistributedPooling();
    this.setupCapacityPlanning();
  }

  async implementDistributedConnectionPooling() {
    console.log('Implementing distributed connection pooling across multiple nodes...');

    const distributedStrategy = {
      // Multi-node connection distribution
      nodeDistribution: {
        enableGeoAware: true,
        preferLocalConnections: true,
        enableFailoverRouting: true,
        optimizeForLatency: true
      },

      // Load balancing strategies
      loadBalancing: {
        roundRobinOptimization: true,
        weightedDistribution: true,
        connectionAffinityOptimization: true,
        realTimeLoadAdjustment: true
      },

      // Performance optimization
      performanceOptimization: {
        connectionPoolSharding: true,
        intelligentConnectionRouting: true,
        predictiveScaling: true,
        resourceUtilizationOptimization: true
      }
    };

    return await this.deployDistributedPooling(distributedStrategy);
  }

  async setupAdvancedConnectionOptimization() {
    console.log('Setting up advanced connection optimization strategies...');

    const optimizationStrategies = {
      // Connection lifecycle optimization
      lifecycleOptimization: {
        connectionWarmupStrategies: true,
        intelligentIdleManagement: true,
        predictiveConnectionCreation: true,
        optimizedConnectionReuse: true
      },

      // Performance monitoring and tuning
      performanceTuning: {
        realTimePerformanceAnalysis: true,
        automaticParameterTuning: true,
        connectionLatencyOptimization: true,
        throughputMaximization: true
      },

      // Resource utilization optimization
      resourceOptimization: {
        memoryPoolingOptimization: true,
        networkBandwidthOptimization: true,
        cpuUtilizationOptimization: true,
        diskIOOptimization: true
      }
    };

    return await this.deployOptimizationStrategies(optimizationStrategies);
  }
}

SQL-Style Connection Pooling with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB connection pooling and performance optimization:

-- QueryLeaf advanced connection pooling with SQL-familiar syntax for MongoDB

-- Configure connection pool settings with comprehensive optimization
CONFIGURE CONNECTION_POOL 
SET pool_name = 'production_pool',
    min_connections = 10,
    max_connections = 100,
    connection_timeout_ms = 30000,
    idle_timeout_ms = 300000,
    max_idle_time_ms = 600000,
    socket_timeout_ms = 120000,
    server_selection_timeout_ms = 30000,
    heartbeat_frequency_ms = 10000,

    -- Read and write preferences
    read_preference = 'primaryPreferred',
    read_concern = 'majority',
    write_concern = 'majority',
    max_staleness_seconds = 90,

    -- Performance optimization
    enable_load_balancing = true,
    enable_connection_warming = true,
    enable_intelligent_routing = true,
    enable_automatic_failover = true,

    -- Monitoring and health checks
    enable_monitoring = true,
    monitoring_interval_ms = 30000,
    enable_performance_tracking = true,
    enable_connection_events = true,

    -- Error handling and circuit breaker
    enable_circuit_breaker = true,
    circuit_breaker_threshold = 10,
    circuit_breaker_timeout_ms = 60000,
    max_retries = 3,
    retry_delay_ms = 1000;

-- Advanced connection pool monitoring and analytics
WITH connection_pool_metrics AS (
  SELECT 
    pool_name,
    DATE_TRUNC('minute', event_timestamp) as time_bucket,

    -- Connection utilization metrics
    AVG(connections_active) as avg_active_connections,
    AVG(connections_available) as avg_available_connections,
    MAX(connections_active) as peak_active_connections,
    AVG(pool_utilization_percent) as avg_pool_utilization,

    -- Performance metrics
    COUNT(*) FILTER (WHERE event_type = 'connection_created') as connections_created,
    COUNT(*) FILTER (WHERE event_type = 'connection_closed') as connections_closed,
    COUNT(*) FILTER (WHERE event_type = 'connection_checkout_failed') as checkout_failures,

    -- Operation performance
    AVG(operation_duration_ms) as avg_operation_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY operation_duration_ms) as p95_operation_duration,
    COUNT(*) FILTER (WHERE operation_success = true) as successful_operations,
    COUNT(*) FILTER (WHERE operation_success = false) as failed_operations,

    -- Connection lifecycle analysis
    AVG(connection_lifetime_seconds) as avg_connection_lifetime,
    AVG(connection_idle_time_seconds) as avg_connection_idle_time,
    MAX(connection_wait_time_ms) as max_checkout_wait_time,

    -- Error analysis
    COUNT(*) FILTER (WHERE error_category = 'timeout') as timeout_errors,
    COUNT(*) FILTER (WHERE error_category = 'network') as network_errors,
    COUNT(*) FILTER (WHERE error_category = 'authentication') as auth_errors,
    COUNT(*) FILTER (WHERE error_category = 'server_selection') as server_selection_errors

  FROM connection_pool_events
  WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY pool_name, DATE_TRUNC('minute', event_timestamp)
),

performance_analysis AS (
  SELECT 
    cpm.*,

    -- Utilization efficiency calculations
    CASE 
      WHEN avg_pool_utilization > 90 THEN 'over_utilized'
      WHEN avg_pool_utilization > 70 THEN 'well_utilized'
      WHEN avg_pool_utilization > 30 THEN 'under_utilized'
      ELSE 'severely_under_utilized'
    END as utilization_status,

    -- Performance classification
    CASE 
      WHEN avg_operation_duration < 100 THEN 'excellent'
      WHEN avg_operation_duration < 500 THEN 'good'
      WHEN avg_operation_duration < 1000 THEN 'acceptable'
      ELSE 'poor'
    END as performance_classification,

    -- Error rate calculations
    CASE 
      WHEN successful_operations + failed_operations > 0 THEN
        ROUND((failed_operations * 100.0) / (successful_operations + failed_operations), 2)
      ELSE 0
    END as error_rate_percent,

    -- Connection efficiency metrics
    CASE 
      WHEN connections_created > 0 THEN
        ROUND(avg_connection_lifetime / 60.0, 1)  -- Average lifetime in minutes
      ELSE 0
    END as avg_connection_lifetime_minutes,

    -- Checkout performance assessment
    CASE 
      WHEN checkout_failures > 0 AND connections_created > 0 THEN
        ROUND((checkout_failures * 100.0) / connections_created, 2)
      ELSE 0
    END as checkout_failure_rate_percent,

    -- Trend analysis
    LAG(avg_active_connections) OVER (
      PARTITION BY pool_name 
      ORDER BY time_bucket
    ) as prev_avg_active_connections,

    LAG(avg_operation_duration) OVER (
      PARTITION BY pool_name 
      ORDER BY time_bucket
    ) as prev_avg_operation_duration

  FROM connection_pool_metrics cpm
),

optimization_recommendations AS (
  SELECT 
    pa.*,

    -- Generate optimization recommendations based on analysis
    ARRAY[
      CASE WHEN utilization_status = 'over_utilized' 
           THEN 'Increase max_connections to handle higher load' END,
      CASE WHEN utilization_status = 'severely_under_utilized' 
           THEN 'Decrease max_connections to save resources' END,
      CASE WHEN performance_classification = 'poor' 
           THEN 'Investigate slow operations and consider index optimization' END,
      CASE WHEN error_rate_percent > 5 
           THEN 'High error rate detected - investigate connection issues' END,
      CASE WHEN checkout_failure_rate_percent > 10 
           THEN 'High checkout failure rate - increase connection pool size or timeout' END,
      CASE WHEN max_checkout_wait_time > 5000 
           THEN 'Long checkout wait times - optimize connection allocation' END,
      CASE WHEN avg_connection_lifetime_minutes < 1 
           THEN 'Short connection lifetimes - investigate connection recycling' END,
      CASE WHEN timeout_errors > 10 
           THEN 'High timeout errors - increase timeout values or optimize queries' END,
      CASE WHEN network_errors > 5 
           THEN 'Network connectivity issues detected - check network stability' END
    ]::TEXT[] as optimization_recommendations,

    -- Performance trend indicators
    CASE 
      WHEN prev_avg_active_connections IS NOT NULL AND prev_avg_active_connections > 0 THEN
        ROUND(((avg_active_connections - prev_avg_active_connections) / prev_avg_active_connections) * 100, 1)
      ELSE NULL
    END as connection_usage_trend_percent,

    CASE 
      WHEN prev_avg_operation_duration IS NOT NULL AND prev_avg_operation_duration > 0 THEN
        ROUND(((avg_operation_duration - prev_avg_operation_duration) / prev_avg_operation_duration) * 100, 1)
      ELSE NULL
    END as performance_trend_percent,

    -- Capacity planning indicators
    CASE 
      WHEN peak_active_connections / NULLIF(CAST(CURRENT_SETTING('max_connections') AS INTEGER), 0) > 0.8 
      THEN 'approaching_capacity_limit'
      WHEN peak_active_connections / NULLIF(CAST(CURRENT_SETTING('max_connections') AS INTEGER), 0) > 0.6 
      THEN 'moderate_capacity_usage'
      ELSE 'sufficient_capacity'
    END as capacity_status

  FROM performance_analysis pa
)

SELECT 
  pool_name,
  time_bucket,

  -- Connection pool utilization summary
  ROUND(avg_active_connections, 1) as avg_active_connections,
  ROUND(avg_available_connections, 1) as avg_available_connections,
  peak_active_connections,
  ROUND(avg_pool_utilization, 1) as pool_utilization_percent,
  utilization_status,

  -- Performance summary
  ROUND(avg_operation_duration, 1) as avg_operation_time_ms,
  ROUND(p95_operation_duration, 1) as p95_operation_time_ms,
  performance_classification,

  -- Connection lifecycle summary
  connections_created,
  connections_closed,
  ROUND(avg_connection_lifetime_minutes, 1) as avg_connection_lifetime_minutes,
  max_checkout_wait_time as max_wait_time_ms,

  -- Error and reliability summary
  successful_operations,
  failed_operations,
  error_rate_percent,
  checkout_failure_rate_percent,

  -- Error breakdown
  timeout_errors,
  network_errors,
  auth_errors,
  server_selection_errors,

  -- Trend analysis
  connection_usage_trend_percent,
  performance_trend_percent,

  -- Capacity and recommendations
  capacity_status,
  ARRAY_REMOVE(optimization_recommendations, NULL) as recommendations,

  -- Health indicator
  CASE 
    WHEN error_rate_percent > 10 OR checkout_failure_rate_percent > 20 THEN 'critical'
    WHEN error_rate_percent > 5 OR performance_classification = 'poor' THEN 'warning'
    WHEN utilization_status = 'over_utilized' THEN 'stressed'
    ELSE 'healthy'
  END as overall_health_status

FROM optimization_recommendations
WHERE ARRAY_LENGTH(ARRAY_REMOVE(optimization_recommendations, NULL), 1) > 0 
   OR error_rate_percent > 1 
   OR utilization_status IN ('over_utilized', 'severely_under_utilized')
ORDER BY time_bucket DESC, pool_name;

-- Advanced connection pool configuration optimization
WITH current_pool_configuration AS (
  SELECT 
    pool_name,
    current_min_connections,
    current_max_connections,
    current_connection_timeout_ms,
    current_idle_timeout_ms,
    current_socket_timeout_ms,

    -- Historical performance metrics
    AVG(pool_utilization_percent) as avg_historical_utilization,
    MAX(peak_active_connections) as historical_peak_connections,
    AVG(avg_operation_duration_ms) as avg_historical_operation_time,
    AVG(error_rate_percent) as avg_historical_error_rate,
    COUNT(*) as historical_data_points

  FROM connection_pool_performance_history
  WHERE measurement_date >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY pool_name, current_min_connections, current_max_connections,
           current_connection_timeout_ms, current_idle_timeout_ms, current_socket_timeout_ms
),

workload_analysis AS (
  SELECT 
    pool_name,
    DATE_TRUNC('hour', operation_timestamp) as hour_bucket,

    -- Workload pattern analysis
    COUNT(*) as operations_per_hour,
    AVG(concurrent_operations) as avg_concurrency,
    MAX(concurrent_operations) as peak_concurrency,
    AVG(operation_duration_ms) as avg_operation_duration,

    -- Connection demand patterns
    AVG(active_connections) as avg_connections_needed,
    MAX(active_connections) as peak_connections_needed,
    AVG(connection_wait_time_ms) as avg_wait_time,
    COUNT(*) FILTER (WHERE connection_wait_time_ms > 1000) as long_wait_operations,

    -- Error patterns
    COUNT(*) FILTER (WHERE operation_result = 'timeout') as timeout_operations,
    COUNT(*) FILTER (WHERE operation_result = 'connection_error') as connection_errors

  FROM connection_operation_log
  WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY pool_name, DATE_TRUNC('hour', operation_timestamp)
),

capacity_planning AS (
  SELECT 
    cpc.pool_name,

    -- Current configuration assessment
    cpc.current_min_connections,
    cpc.current_max_connections,
    cpc.avg_historical_utilization,
    cpc.historical_peak_connections,

    -- Workload-based recommendations
    CEIL(AVG(wa.peak_connections_needed) * 1.2) as recommended_max_connections,
    CEIL(AVG(wa.avg_connections_needed) * 0.8) as recommended_min_connections,

    -- Performance-based timeout recommendations
    CASE 
      WHEN AVG(wa.avg_wait_time) > 2000 THEN LEAST(cpc.current_connection_timeout_ms * 1.5, 60000)
      WHEN AVG(wa.avg_wait_time) < 100 THEN GREATEST(cpc.current_connection_timeout_ms * 0.8, 10000)
      ELSE cpc.current_connection_timeout_ms
    END as recommended_connection_timeout_ms,

    -- Idle timeout optimization
    CASE 
      WHEN AVG(wa.operations_per_hour) > 1000 THEN 180000  -- 3 minutes for high-traffic
      WHEN AVG(wa.operations_per_hour) > 100 THEN 300000   -- 5 minutes for medium-traffic
      ELSE 600000  -- 10 minutes for low-traffic
    END as recommended_idle_timeout_ms,

    -- Configuration change justification
    CASE 
      WHEN CEIL(AVG(wa.peak_connections_needed) * 1.2) > cpc.current_max_connections THEN 
        'Increase max connections to handle peak load'
      WHEN CEIL(AVG(wa.peak_connections_needed) * 1.2) < cpc.current_max_connections * 0.7 THEN 
        'Decrease max connections to optimize resource usage'
      ELSE 'Current max connections are appropriately sized'
    END as max_connections_justification,

    CASE 
      WHEN CEIL(AVG(wa.avg_connections_needed) * 0.8) > cpc.current_min_connections THEN 
        'Increase min connections to reduce startup latency'
      WHEN CEIL(AVG(wa.avg_connections_needed) * 0.8) < cpc.current_min_connections * 0.5 THEN 
        'Decrease min connections to reduce resource overhead'
      ELSE 'Current min connections are appropriately sized'
    END as min_connections_justification,

    -- Performance impact assessment
    AVG(wa.avg_operation_duration) as avg_operation_performance,
    SUM(wa.long_wait_operations) as total_long_wait_operations,
    AVG(wa.timeout_operations) as avg_timeout_operations_per_hour,

    -- Resource efficiency metrics
    ROUND(
      (AVG(wa.avg_connections_needed) / NULLIF(cpc.current_max_connections, 0)) * 100, 
      2
    ) as resource_efficiency_percent

  FROM current_pool_configuration cpc
  JOIN workload_analysis wa ON cpc.pool_name = wa.pool_name
  GROUP BY cpc.pool_name, cpc.current_min_connections, cpc.current_max_connections,
           cpc.current_connection_timeout_ms, cpc.current_idle_timeout_ms,
           cpc.avg_historical_utilization, cpc.historical_peak_connections
)

SELECT 
  pool_name,

  -- Current configuration
  current_min_connections,
  current_max_connections,
  ROUND(resource_efficiency_percent, 1) as current_efficiency_percent,

  -- Recommended configuration
  recommended_min_connections,
  recommended_max_connections,
  recommended_connection_timeout_ms,
  recommended_idle_timeout_ms,

  -- Configuration change analysis
  (recommended_max_connections - current_max_connections) as max_connections_change,
  (recommended_min_connections - current_min_connections) as min_connections_change,
  max_connections_justification,
  min_connections_justification,

  -- Performance impact prediction
  ROUND(avg_operation_performance, 1) as current_avg_operation_ms,
  total_long_wait_operations,
  ROUND(avg_timeout_operations_per_hour, 1) as avg_timeouts_per_hour,

  -- Expected improvements
  CASE 
    WHEN recommended_max_connections > current_max_connections THEN
      'Expect reduced wait times and fewer timeout errors'
    WHEN recommended_max_connections < current_max_connections THEN
      'Expect reduced resource usage with minimal performance impact'
    ELSE 'Configuration is optimal for current workload'
  END as expected_performance_impact,

  -- Implementation priority
  CASE 
    WHEN total_long_wait_operations > 100 OR avg_timeout_operations_per_hour > 5 THEN 'high'
    WHEN ABS(recommended_max_connections - current_max_connections) > current_max_connections * 0.2 THEN 'medium'
    WHEN resource_efficiency_percent < 30 OR resource_efficiency_percent > 90 THEN 'medium'
    ELSE 'low'
  END as implementation_priority,

  -- Recommended action (the priority conditions are repeated because a column alias
  -- defined in this SELECT list cannot be referenced by another expression in the same list)
  CASE 
    WHEN total_long_wait_operations > 100 OR avg_timeout_operations_per_hour > 5 THEN
      FORMAT('IMMEDIATE ACTION: Update pool configuration - Max: %s->%s, Min: %s->%s',
             current_max_connections, recommended_max_connections,
             current_min_connections, recommended_min_connections)
    WHEN ABS(recommended_max_connections - current_max_connections) > current_max_connections * 0.2
         OR resource_efficiency_percent < 30 OR resource_efficiency_percent > 90 THEN
      'SCHEDULE UPDATE: Adjust pool settings during maintenance window'
    ELSE
      'MONITOR: Current configuration is adequate'
  END as recommended_action

FROM capacity_planning
ORDER BY 
  CASE implementation_priority 
    WHEN 'high' THEN 1 
    WHEN 'medium' THEN 2 
    ELSE 3 
  END,
  total_long_wait_operations DESC,
  pool_name;

-- Real-time connection pool health dashboard
CREATE VIEW connection_pool_health_dashboard AS
WITH real_time_metrics AS (
  SELECT 
    -- Current timestamp for real-time display
    CURRENT_TIMESTAMP as dashboard_time,

    -- Pool status overview
    (SELECT COUNT(*) FROM active_connection_pools) as total_active_pools,
    (SELECT COUNT(*) FROM active_connection_pools WHERE health_status = 'healthy') as healthy_pools,
    (SELECT COUNT(*) FROM active_connection_pools WHERE health_status = 'warning') as warning_pools,
    (SELECT COUNT(*) FROM active_connection_pools WHERE health_status = 'critical') as critical_pools,

    -- Connection utilization across all pools
    (SELECT SUM(current_active_connections) FROM active_connection_pools) as total_active_connections,
    (SELECT SUM(current_available_connections) FROM active_connection_pools) as total_available_connections,
    (SELECT SUM(max_pool_size) FROM active_connection_pools) as total_max_connections,

    -- Performance indicators
    (SELECT AVG(avg_operation_duration_ms) 
     FROM connection_pool_performance 
     WHERE measurement_time >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as system_avg_operation_time,

    (SELECT SUM(operations_per_second) 
     FROM connection_pool_performance 
     WHERE measurement_time >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as system_operations_per_second,

    -- Error indicators
    (SELECT COUNT(*) 
     FROM connection_errors 
     WHERE error_timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as recent_errors,

    (SELECT COUNT(*) 
     FROM connection_timeouts 
     WHERE timeout_timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as recent_timeouts,

    -- Capacity indicators
    (SELECT COUNT(*) 
     FROM connection_pools 
     WHERE current_utilization_percent > 80) as pools_near_capacity,

    (SELECT COUNT(*) 
     FROM connection_pools 
     WHERE checkout_queue_length > 5) as pools_with_queues
),

pool_details AS (
  SELECT 
    pool_name,
    health_status,
    current_active_connections,
    max_pool_size,
    ROUND(current_utilization_percent, 1) as utilization_percent,
    avg_operation_duration_ms,
    operations_per_second,
    error_rate_percent,
    last_health_check,

    -- Status indicators
    CASE 
      WHEN health_status = 'critical' THEN '🔴'
      WHEN health_status = 'warning' THEN '🟡'
      ELSE '🟢'
    END as status_indicator

  FROM active_connection_pools
  ORDER BY 
    CASE health_status 
      WHEN 'critical' THEN 1 
      WHEN 'warning' THEN 2 
      ELSE 3 
    END,
    current_utilization_percent DESC
)

SELECT 
  dashboard_time,

  -- System overview
  total_active_pools,
  FORMAT('%s healthy, %s warning, %s critical', 
         healthy_pools, warning_pools, critical_pools) as pool_health_summary,

  -- Connection utilization
  total_active_connections,
  total_available_connections,
  total_max_connections,
  ROUND((total_active_connections::DECIMAL / NULLIF(total_max_connections, 0)) * 100, 1) as system_utilization_percent,

  -- Performance indicators
  ROUND(system_avg_operation_time::NUMERIC, 1) as avg_operation_time_ms,
  ROUND(system_operations_per_second::NUMERIC, 0) as operations_per_second,

  -- Health indicators
  recent_errors,
  recent_timeouts,
  pools_near_capacity,
  pools_with_queues,

  -- Overall system health
  CASE 
    WHEN critical_pools > 0 OR recent_errors > 20 OR pools_near_capacity > total_active_pools * 0.5 THEN 'CRITICAL'
    WHEN warning_pools > 0 OR recent_errors > 5 OR pools_near_capacity > 0 THEN 'WARNING'
    ELSE 'HEALTHY'
  END as system_health_status,

  -- Active alerts (strip NULLs produced by non-matching conditions)
  ARRAY_REMOVE(ARRAY[
    CASE WHEN critical_pools > 0 THEN FORMAT('%s pools in critical state', critical_pools) END,
    CASE WHEN pools_near_capacity > 2 THEN FORMAT('%s pools near capacity limit', pools_near_capacity) END,
    CASE WHEN recent_errors > 10 THEN FORMAT('%s connection errors in last 5 minutes', recent_errors) END,
    CASE WHEN recent_timeouts > 5 THEN FORMAT('%s connection timeouts detected', recent_timeouts) END,
    CASE WHEN system_avg_operation_time > 1000 THEN 'High average operation latency detected' END
  ]::TEXT[], NULL) as active_alerts,

  -- Pool details for monitoring dashboard
  (SELECT JSON_AGG(
    JSON_BUILD_OBJECT(
      'pool_name', pool_name,
      'status', status_indicator || ' ' || health_status,
      'connections', current_active_connections || '/' || max_pool_size,
      'utilization', utilization_percent || '%',
      'performance', avg_operation_duration_ms || 'ms',
      'throughput', operations_per_second || ' ops/s',
      'error_rate', error_rate_percent || '%'
    )
  ) FROM pool_details) as pool_status_details

FROM real_time_metrics;

-- QueryLeaf provides comprehensive connection pooling capabilities:
-- 1. SQL-familiar syntax for MongoDB connection pool configuration
-- 2. Advanced performance monitoring and optimization recommendations
-- 3. Intelligent connection lifecycle management with automatic scaling
-- 4. Comprehensive error handling and circuit breaker patterns
-- 5. Real-time health monitoring with proactive alerting
-- 6. Capacity planning and workload analysis for optimal sizing
-- 7. Production-ready connection management with minimal configuration
-- 8. Integration with MongoDB's native connection pooling optimizations
-- 9. Advanced analytics for connection performance and resource utilization
-- 10. Automated optimization recommendations based on workload patterns

Best Practices for Production Connection Pooling

Connection Pool Strategy Design

Essential principles for effective MongoDB connection pooling deployment (a driver-level configuration sketch follows the list):

  1. Pool Sizing Strategy: Configure optimal pool sizes based on application concurrency patterns and system resources
  2. Timeout Management: Set appropriate connection, socket, and operation timeouts for reliable connection handling
  3. Health Monitoring: Implement comprehensive connection health monitoring with proactive alerting and recovery
  4. Load Balancing: Design intelligent connection distribution strategies for optimal resource utilization
  5. Error Handling: Configure robust error handling with retry logic, circuit breakers, and graceful degradation
  6. Performance Optimization: Monitor connection performance metrics and implement automatic optimization strategies

Scalability and Production Deployment

Optimize connection pooling for enterprise-scale requirements (a monitoring sketch follows the list):

  1. Distributed Pooling: Implement distributed connection pooling strategies for multi-node deployments
  2. Capacity Planning: Monitor historical patterns and predict connection requirements for scaling decisions
  3. Security Integration: Ensure connection pooling meets security, authentication, and compliance requirements
  4. Operational Integration: Integrate connection monitoring with existing alerting and operational workflows
  5. Disaster Recovery: Design connection pooling with failover capabilities and automatic recovery mechanisms
  6. Resource Optimization: Monitor and optimize connection resource usage for cost-effective operations
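For the operational-integration and monitoring items above, the Node.js driver emits connection pool (CMAP) events on the client that can be forwarded into an existing alerting pipeline. The handlers below are a minimal sketch; recordPoolMetric is a hypothetical stand-in for whatever metrics client you already run.

// Minimal sketch: forwarding connection pool (CMAP) events to a metrics pipeline.
// recordPoolMetric() is a hypothetical helper - replace it with your
// StatsD/Prometheus/CloudWatch client of choice.
const { MongoClient } = require('mongodb');

const monitoredClient = new MongoClient('mongodb://localhost:27017', { maxPoolSize: 100 });

function recordPoolMetric(name, payload) {
  console.log(name, JSON.stringify(payload));
}

monitoredClient.on('connectionPoolCreated', event =>
  recordPoolMetric('pool.created', { address: event.address }));

monitoredClient.on('connectionCheckedOut', event =>
  recordPoolMetric('pool.checkout', { address: event.address }));

monitoredClient.on('connectionCheckOutFailed', event =>
  recordPoolMetric('pool.checkout_failed', { address: event.address, reason: event.reason }));

monitoredClient.on('connectionPoolCleared', event =>
  recordPoolMetric('pool.cleared', { address: event.address }));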

Conclusion

MongoDB connection pooling provides sophisticated connection management that delivers optimal performance, efficient resource utilization, and reliability for production applications through intelligent connection lifecycle management, advanced monitoring, and automatic optimization. Because pooling is built into the drivers and the server, applications benefit from MongoDB's optimized connection handling with minimal configuration overhead.

Key MongoDB Connection Pooling benefits include:

  • Intelligent Resource Management: Automated connection pool sizing and lifecycle management based on application workload patterns
  • Advanced Performance Monitoring: Comprehensive connection performance tracking with real-time optimization recommendations
  • Production Reliability: Built-in error handling, circuit breaker patterns, and automatic failover capabilities
  • Scalable Architecture: Distributed connection pooling strategies that scale efficiently with application growth
  • Operational Excellence: Enterprise-ready monitoring, alerting, and diagnostic capabilities for production environments
  • SQL Accessibility: Familiar SQL-style connection management operations through QueryLeaf for accessible database connection optimization

Whether you're building high-concurrency web applications, microservices architectures, data processing pipelines, or enterprise database systems, MongoDB connection pooling with QueryLeaf's familiar SQL interface provides the foundation for efficient, reliable, and scalable database connection management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB connection pooling while providing SQL-familiar syntax for connection management, performance monitoring, and optimization strategies. Advanced connection pooling patterns, health monitoring, and capacity planning are seamlessly handled through familiar SQL constructs, making sophisticated database connection management accessible to SQL-oriented development teams.

The combination of MongoDB's robust connection pooling capabilities with SQL-style connection management operations makes MongoDB an ideal platform for applications requiring both high-performance database connectivity and familiar database management patterns, ensuring your applications can maintain optimal performance and reliability as they scale and evolve.

MongoDB Geospatial Queries and Location Data Management: Advanced Geographic Indexing and Spatial Analysis for Modern Applications

Modern applications increasingly rely on location-aware functionality to provide contextual services, from ride-sharing and delivery apps to social networks and real estate platforms. Managing geographic data efficiently requires sophisticated spatial indexing, proximity calculations, and complex geospatial queries that traditional databases struggle to handle effectively. MongoDB's comprehensive geospatial capabilities provide advanced geographic indexing and spatial analysis features that enable location-based applications to scale efficiently.

MongoDB's geospatial features support both 2D and spherical geometry operations, enabling applications to perform complex spatial queries including proximity searches, geofencing, route optimization, and area calculations. Unlike traditional approaches that require specialized GIS extensions or complex spatial calculations in application code, MongoDB integrates geospatial functionality directly into the database with optimized indexing strategies and query operators.
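As a minimal sketch of what this looks like in practice, locations are stored as GeoJSON documents, indexed with a 2dsphere index, and queried with operators such as $near. The database, collection, and field names below are illustrative.

// Minimal sketch: GeoJSON storage, 2dsphere indexing, and a proximity query.
// Names and coordinates are illustrative.
const { MongoClient } = require('mongodb');

async function proximityExample() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const places = client.db('geo_demo').collection('places');

  // GeoJSON uses [longitude, latitude] ordering.
  await places.insertOne({
    name: 'Coffee Shop',
    category: 'restaurant',
    location: { type: 'Point', coordinates: [-74.0060, 40.7128] }
  });
  await places.createIndex({ location: '2dsphere' });

  // Find places within 2 km of the point, nearest first.
  const nearby = await places.find({
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates: [-74.0060, 40.7128] },
        $maxDistance: 2000 // meters
      }
    }
  }).toArray();

  console.log(nearby);
  await client.close();
}

proximityExample().catch(console.error);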

The Traditional Geographic Data Challenge

Conventional approaches to managing location data in relational databases face significant limitations:

-- Traditional PostgreSQL geographic data handling - complex and limited functionality

-- Basic location storage with separate latitude/longitude columns
CREATE TABLE locations (
    location_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    address TEXT,
    latitude DECIMAL(10, 8) NOT NULL,
    longitude DECIMAL(11, 8) NOT NULL,
    category VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Basic constraints for valid coordinates
    CONSTRAINT valid_latitude CHECK (latitude >= -90 AND latitude <= 90),
    CONSTRAINT valid_longitude CHECK (longitude >= -180 AND longitude <= 180)
);

-- Create basic index for coordinate lookups (limited efficiency)
CREATE INDEX idx_locations_lat_lng ON locations(latitude, longitude);

-- Simple proximity query using Haversine formula (inefficient for large datasets)
CREATE OR REPLACE FUNCTION calculate_distance(
    lat1 DECIMAL, lng1 DECIMAL,
    lat2 DECIMAL, lng2 DECIMAL
) RETURNS DECIMAL AS $$
DECLARE
    earth_radius DECIMAL := 6371; -- Earth radius in kilometers
    lat1_rad DECIMAL;
    lng1_rad DECIMAL;
    lat2_rad DECIMAL;
    lng2_rad DECIMAL;
    dlat DECIMAL;
    dlng DECIMAL;
    a DECIMAL;
    c DECIMAL;
BEGIN
    -- Convert degrees to radians
    lat1_rad := radians(lat1);
    lng1_rad := radians(lng1);
    lat2_rad := radians(lat2);
    lng2_rad := radians(lng2);

    -- Haversine formula
    dlat := lat2_rad - lat1_rad;
    dlng := lng2_rad - lng1_rad;

    a := sin(dlat/2) * sin(dlat/2) + 
         cos(lat1_rad) * cos(lat2_rad) * 
         sin(dlng/2) * sin(dlng/2);
    c := 2 * atan2(sqrt(a), sqrt(1-a));

    RETURN earth_radius * c;
END;
$$ LANGUAGE plpgsql;

-- Find nearby locations (slow for large datasets)
WITH nearby_locations AS (
    SELECT 
        l.*,
        calculate_distance(
            40.7128, -74.0060,  -- New York City coordinates
            l.latitude, l.longitude
        ) as distance_km
    FROM locations l
    WHERE 
        -- Basic bounding box filter (rectangular approximation)
        latitude BETWEEN 40.7128 - 0.1 AND 40.7128 + 0.1
        AND longitude BETWEEN -74.0060 - 0.1 AND -74.0060 + 0.1
)
SELECT 
    location_id,
    name,
    address,
    latitude,
    longitude,
    distance_km,

    -- Categories for results
    CASE 
        WHEN distance_km <= 1 THEN 'very_close'
        WHEN distance_km <= 5 THEN 'nearby'
        WHEN distance_km <= 10 THEN 'moderate_distance'
        ELSE 'far'
    END as proximity_category

FROM nearby_locations
WHERE distance_km <= 10  -- Within 10km
ORDER BY distance_km
LIMIT 50;

-- Geofencing implementation (complex and inefficient)
CREATE TABLE geofences (
    geofence_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    center_latitude DECIMAL(10, 8) NOT NULL,
    center_longitude DECIMAL(11, 8) NOT NULL,
    radius_meters INTEGER NOT NULL,
    fence_type VARCHAR(50) DEFAULT 'circular',
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Check if location is within geofence (expensive operation)
CREATE OR REPLACE FUNCTION point_in_geofence(
    point_lat DECIMAL, point_lng DECIMAL,
    fence_id INTEGER
) RETURNS BOOLEAN AS $$
DECLARE
    fence_record geofences%ROWTYPE;
    distance_m DECIMAL;
BEGIN
    SELECT * INTO fence_record 
    FROM geofences 
    WHERE geofence_id = fence_id AND active = true;

    IF NOT FOUND THEN
        RETURN false;
    END IF;

    -- Calculate distance in meters
    distance_m := calculate_distance(
        point_lat, point_lng,
        fence_record.center_latitude, fence_record.center_longitude
    ) * 1000;

    RETURN distance_m <= fence_record.radius_meters;
END;
$$ LANGUAGE plpgsql;

-- Area-based queries (extremely limited without proper GIS support)
WITH service_areas AS (
    SELECT 
        sa.area_id,
        sa.area_name,
        -- Simple rectangular area definition (very limited)
        sa.min_latitude,
        sa.max_latitude,
        sa.min_longitude,
        sa.max_longitude,
        sa.service_type
    FROM service_areas sa
    WHERE sa.active = true
),
area_coverage AS (
    SELECT 
        sa.*,
        COUNT(l.location_id) as locations_in_area,
        AVG(l.latitude) as avg_latitude,
        AVG(l.longitude) as avg_longitude
    FROM service_areas sa
    LEFT JOIN locations l ON (
        l.latitude BETWEEN sa.min_latitude AND sa.max_latitude
        AND l.longitude BETWEEN sa.min_longitude AND sa.max_longitude
    )
    GROUP BY sa.area_id, sa.area_name, sa.min_latitude, sa.max_latitude, 
             sa.min_longitude, sa.max_longitude, sa.service_type
)
SELECT 
    area_id,
    area_name,
    service_type,
    locations_in_area,

    -- Approximate area (very rough flat-earth rectangle: ~111,000 m per degree,
    -- ignoring longitude shrinkage with latitude)
    (max_latitude - min_latitude) * 111000 * (max_longitude - min_longitude) * 111000 as approx_area_sqm,

    -- Service density (locations per square kilometer)
    CASE 
        WHEN locations_in_area > 0 THEN 
            locations_in_area::DECIMAL / 
            NULLIF((max_latitude - min_latitude) * 111 * (max_longitude - min_longitude) * 111, 0)
        ELSE 0
    END as service_density_per_sqkm

FROM area_coverage
ORDER BY locations_in_area DESC;

-- Route planning (extremely basic and inefficient)
WITH route_waypoints AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY waypoint_order) as sequence,
        latitude,
        longitude,
        location_name
    FROM route_points
    WHERE route_id = :route_id
    ORDER BY waypoint_order
),
route_segments AS (
    SELECT 
        rw1.sequence as from_seq,
        rw1.location_name as from_location,
        rw1.latitude as from_lat,
        rw1.longitude as from_lng,
        rw2.sequence as to_seq,
        rw2.location_name as to_location,
        rw2.latitude as to_lat,
        rw2.longitude as to_lng,

        -- Calculate segment distance
        calculate_distance(
            rw1.latitude, rw1.longitude,
            rw2.latitude, rw2.longitude
        ) as segment_distance_km

    FROM route_waypoints rw1
    JOIN route_waypoints rw2 ON rw2.sequence = rw1.sequence + 1
)
SELECT 
    from_location,
    to_location,
    segment_distance_km,

    -- Running total distance
    SUM(segment_distance_km) OVER (
        ORDER BY from_seq 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_distance_km,

    -- Estimated time (very basic calculation)
    segment_distance_km * 60 / 50 as estimated_minutes  -- Assume 50 km/h average

FROM route_segments
ORDER BY from_seq;

-- Problems with traditional geographic approaches:
-- 1. No native spatial indexing - queries are slow on large datasets
-- 2. Limited geometric operations - only basic distance calculations
-- 3. No support for complex shapes or polygons
-- 4. Inefficient bounding box calculations
-- 5. No proper coordinate system support
-- 6. Manual implementation of spatial algorithms
-- 7. Limited geospatial query operators
-- 8. Poor performance for proximity searches
-- 9. No native support for geographic data types
-- 10. Complex implementation for basic geospatial functionality

MongoDB provides comprehensive geospatial capabilities with advanced indexing and query operators:

// MongoDB Advanced Geospatial Data Management
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('geospatial_applications');

// Comprehensive MongoDB Geospatial Manager
class AdvancedGeospatialManager {
  constructor(db, config = {}) {
    this.db = db;
    this.collections = {
      locations: db.collection('locations'),
      geofences: db.collection('geofences'),
      routes: db.collection('routes'),
      serviceAreas: db.collection('service_areas'),
      trackingData: db.collection('tracking_data'),
      spatialAnalytics: db.collection('spatial_analytics')
    };

    // Advanced geospatial configuration
    this.config = {
      // Coordinate system settings
      coordinateSystem: config.coordinateSystem || 'WGS84',
      defaultSRID: config.defaultSRID || 4326,

      // Index optimization settings
      enable2dSphereIndexes: config.enable2dSphereIndexes !== false,
      enableGeoHashIndexes: config.enableGeoHashIndexes || false,
      indexPrecision: config.indexPrecision || 26,

      // Query optimization settings
      defaultDistanceUnit: config.defaultDistanceUnit || 'meters',
      maxProximityDistance: config.maxProximityDistance || 50000, // 50km
      defaultResultLimit: config.defaultResultLimit || 100,

      // Performance settings
      enableSpatialCaching: config.enableSpatialCaching || false,
      cacheExpirySeconds: config.cacheExpirySeconds || 300,
      enableParallelQueries: config.enableParallelQueries || false
    };

    this.initializeGeospatialSystem();
  }

  async initializeGeospatialSystem() {
    console.log('Initializing advanced geospatial system...');

    try {
      // Create optimized geospatial indexes
      await this.setupGeospatialIndexes();

      // Initialize spatial analysis capabilities
      await this.setupSpatialAnalytics();

      // Setup geospatial data validation
      await this.setupGeospatialValidation();

      console.log('Advanced geospatial system initialized successfully');

    } catch (error) {
      console.error('Error initializing geospatial system:', error);
      throw error;
    }
  }

  async setupGeospatialIndexes() {
    console.log('Setting up optimized geospatial indexes...');

    try {
      // Locations collection - 2dsphere index for spherical geometry
      await this.collections.locations.createIndex(
        { location: '2dsphere' },
        { 
          name: 'location_2dsphere_idx',
          background: true,
          '2dsphereIndexVersion': 3
        }
      );

      // Compound index for location + category queries
      await this.collections.locations.createIndex(
        { location: '2dsphere', category: 1, active: 1 },
        { 
          name: 'location_category_active_idx',
          background: true 
        }
      );

      // Geofences collection - optimized for area queries
      await this.collections.geofences.createIndex(
        { geometry: '2dsphere' },
        { 
          name: 'geofence_geometry_idx',
          background: true 
        }
      );

      // Tracking data - time-series with geospatial
      await this.collections.trackingData.createIndex(
        { location: '2dsphere', timestamp: -1, user_id: 1 },
        { 
          name: 'tracking_location_time_user_idx',
          background: true 
        }
      );

      // Service areas - polygon-based spatial index
      await this.collections.serviceAreas.createIndex(
        { coverage_area: '2dsphere', service_type: 1, active: 1 },
        { 
          name: 'service_area_coverage_idx',
          background: true 
        }
      );

      console.log('Geospatial indexes created successfully');

    } catch (error) {
      console.error('Error creating geospatial indexes:', error);
      throw error;
    }
  }

  async findNearbyLocations(centerPoint, radiusMeters, options = {}) {
    console.log(`Finding locations within ${radiusMeters}m of point [${centerPoint.coordinates}]...`);

    try {
      const query = {
        location: {
          $near: {
            $geometry: centerPoint,
            $maxDistance: radiusMeters
          }
        }
      };

      // Add additional filters
      if (options.category) {
        query.category = options.category;
      }

      if (options.active !== undefined) {
        query.active = options.active;
      }

      if (options.excludeIds) {
        query._id = { $nin: options.excludeIds };
      }

      // Execute proximity query with optimization
      const nearbyLocations = await this.collections.locations
        .find(query)
        .limit(options.limit || this.config.defaultResultLimit)
        .toArray();

      // Calculate precise distances and additional metadata
      const enrichedResults = nearbyLocations.map(location => {
        const distance = this.calculateDistance(centerPoint, location.location);

        return {
          ...location,
          distance: {
            meters: Math.round(distance),
            kilometers: Math.round(distance / 1000 * 100) / 100,
            miles: Math.round(distance * 0.000621371 * 100) / 100
          },
          proximityCategory: this.categorizeDistance(distance),
          bearing: this.calculateBearing(centerPoint, location.location)
        };
      });

      // Sort by distance (MongoDB $near already does this, but ensure precision)
      enrichedResults.sort((a, b) => a.distance.meters - b.distance.meters);

      return {
        success: true,
        centerPoint: centerPoint,
        searchRadius: radiusMeters,
        totalResults: enrichedResults.length,
        locations: enrichedResults,
        searchMetadata: {
          queryOptions: options,
          executionTime: Date.now(),
          coordinateSystem: this.config.coordinateSystem
        }
      };

    } catch (error) {
      console.error('Error finding nearby locations:', error);
      return {
        success: false,
        error: error.message,
        centerPoint: centerPoint,
        searchRadius: radiusMeters
      };
    }
  }

  async implementAdvancedGeofencing(geofenceData, monitoringOptions = {}) {
    console.log('Implementing advanced geofencing system...');

    try {
      // Create comprehensive geofence document
      const geofenceDocument = {
        _id: this.generateGeofenceId(),
        name: geofenceData.name,
        description: geofenceData.description,

        // Geofence geometry (supports various shapes)
        geometry: this.normalizeGeometry(geofenceData.geometry),

        // Geofence properties
        properties: {
          type: geofenceData.type || 'monitoring',
          priority: geofenceData.priority || 'normal',
          active: geofenceData.active !== false,

          // Trigger conditions
          triggerEvents: geofenceData.triggerEvents || ['enter', 'exit'],
          dwellTime: geofenceData.dwellTime || 0, // Minimum time in seconds

          // Notification settings
          notifications: {
            enabled: monitoringOptions.enableNotifications || false,
            webhookUrl: monitoringOptions.webhookUrl,
            emailRecipients: monitoringOptions.emailRecipients || []
          },

          // Analytics settings
          analytics: {
            trackDwellTime: monitoringOptions.trackDwellTime || false,
            trackEntryExitPatterns: monitoringOptions.trackEntryExitPatterns || false,
            aggregateStatistics: monitoringOptions.aggregateStatistics || false
          }
        },

        // Metadata
        createdAt: new Date(),
        updatedAt: new Date(),
        createdBy: geofenceData.createdBy,

        // Performance optimization hints
        indexingHints: {
          expectedQueryVolume: monitoringOptions.expectedQueryVolume || 'medium',
          primaryUseCase: monitoringOptions.primaryUseCase || 'point_in_polygon'
        }
      };

      // Insert geofence with validation
      const insertResult = await this.collections.geofences.insertOne(geofenceDocument);

      if (!insertResult.acknowledged) {
        throw new Error('Failed to create geofence');
      }

      console.log(`Geofence created successfully: ${geofenceDocument.name}`);

      return {
        success: true,
        geofenceId: geofenceDocument._id,
        geometry: geofenceDocument.geometry,
        properties: geofenceDocument.properties
      };

    } catch (error) {
      console.error('Error implementing geofencing:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async checkGeofenceViolations(pointLocation, userContext = {}) {
    console.log('Checking geofence violations for location...');

    try {
      // Find all active geofences that contain the point
      const geofenceQuery = {
        'properties.active': true,
        geometry: {
          $geoIntersects: {
            $geometry: pointLocation
          }
        }
      };

      const violatedGeofences = await this.collections.geofences
        .find(geofenceQuery)
        .toArray();

      const violations = [];

      for (const geofence of violatedGeofences) {
        // Check if this is a new entry or continued presence
        const previousStatus = await this.getPreviousGeofenceStatus(
          userContext.userId, 
          geofence._id
        );

        const violationData = {
          geofenceId: geofence._id,
          geofenceName: geofence.name,
          violationType: this.determineViolationType(previousStatus, geofence),
          timestamp: new Date(),
          location: pointLocation,
          userContext: userContext,

          // Geofence-specific data
          geofenceType: geofence.properties.type,
          priority: geofence.properties.priority,
          triggerEvents: geofence.properties.triggerEvents,

          // Additional context
          dwellTimeRequired: geofence.properties.dwellTime,
          currentDwellTime: this.calculateCurrentDwellTime(userContext.userId, geofence._id)
        };

        violations.push(violationData);

        // Update geofence status tracking
        await this.updateGeofenceStatus(userContext.userId, geofence._id, violationData);

        // Trigger notifications if configured
        if (geofence.properties.notifications.enabled) {
          await this.triggerGeofenceNotification(geofence, violationData);
        }

        // Record analytics if enabled
        if (geofence.properties.analytics.trackEntryExitPatterns) {
          await this.recordGeofenceAnalytics(geofence, violationData);
        }
      }

      // Check for geofence exits (locations user was in but no longer in)
      const exitViolations = await this.checkGeofenceExits(
        userContext.userId, 
        violatedGeofences.map(g => g._id),
        pointLocation
      );

      violations.push(...exitViolations);

      return {
        success: true,
        location: pointLocation,
        totalViolations: violations.length,
        violations: violations,
        userContext: userContext
      };

    } catch (error) {
      console.error('Error checking geofence violations:', error);
      return {
        success: false,
        error: error.message,
        location: pointLocation
      };
    }
  }

  async performSpatialAnalysis(analysisType, parameters) {
    console.log(`Performing spatial analysis: ${analysisType}...`);

    try {
      let analysisResult = {};

      switch (analysisType) {
        case 'density_heatmap':
          analysisResult = await this.generateDensityHeatmap(parameters);
          break;

        case 'service_coverage':
          analysisResult = await this.analyzeServiceCoverage(parameters);
          break;

        case 'route_optimization':
          analysisResult = await this.optimizeRoutes(parameters);
          break;

        case 'clustering_analysis':
          analysisResult = await this.performClusteringAnalysis(parameters);
          break;

        case 'accessibility_analysis':
          analysisResult = await this.analyzeAccessibility(parameters);
          break;

        default:
          throw new Error(`Unsupported analysis type: ${analysisType}`);
      }

      // Store analysis results for future reference
      const analysisRecord = {
        analysisType: analysisType,
        parameters: parameters,
        results: analysisResult,
        executedAt: new Date(),
        executionTime: parameters.startTime ? Date.now() - parameters.startTime : null
      };

      await this.collections.spatialAnalytics.insertOne(analysisRecord);

      return {
        success: true,
        analysisType: analysisType,
        ...analysisResult
      };

    } catch (error) {
      console.error(`Error performing spatial analysis (${analysisType}):`, error);
      return {
        success: false,
        analysisType: analysisType,
        error: error.message
      };
    }
  }

  async generateDensityHeatmap(parameters) {
    console.log('Generating location density heatmap...');

    const { bounds, gridSize, category } = parameters;

    // Create grid cells for heatmap
    const gridCells = this.createSpatialGrid(bounds, gridSize);
    const heatmapData = [];

    for (const cell of gridCells) {
      // Count locations in each grid cell
      const cellQuery = {
        location: {
          $geoWithin: {
            $geometry: cell.geometry
          }
        }
      };

      if (category) {
        cellQuery.category = category;
      }

      const locationCount = await this.collections.locations.countDocuments(cellQuery);

      if (locationCount > 0) {
        heatmapData.push({
          cellId: cell.id,
          geometry: cell.geometry,
          center: cell.center,
          locationCount: locationCount,
          density: locationCount / cell.area // locations per square meter
        });
      }
    }

    // Calculate density statistics (guard against an empty heatmap to avoid NaN/-Infinity)
    const densities = heatmapData.map(cell => cell.density);
    const maxDensity = densities.length > 0 ? Math.max(...densities) : 0;
    const avgDensity = densities.length > 0
      ? densities.reduce((sum, d) => sum + d, 0) / densities.length
      : 0;

    // Normalize density values for heatmap visualization
    const normalizedHeatmap = heatmapData.map(cell => ({
      ...cell,
      normalizedDensity: maxDensity > 0 ? cell.density / maxDensity : 0,
      intensityLevel: this.categorizeDensity(cell.density, maxDensity, avgDensity)
    }));

    return {
      heatmapData: normalizedHeatmap,
      statistics: {
        totalCells: gridCells.length,
        activeCells: heatmapData.length,
        maxDensity: maxDensity,
        averageDensity: avgDensity,
        totalLocations: heatmapData.reduce((sum, cell) => sum + cell.locationCount, 0)
      },
      metadata: {
        bounds: bounds,
        gridSize: gridSize,
        category: category,
        generatedAt: new Date()
      }
    };
  }

  async analyzeServiceCoverage(parameters) {
    console.log('Analyzing service coverage areas...');

    const { serviceType, analysisArea, coverageRadius } = parameters;

    // Get all service locations of the specified type
    const serviceLocations = await this.collections.locations
      .find({
        category: serviceType,
        active: true,
        location: {
          $geoWithin: {
            $geometry: analysisArea
          }
        }
      })
      .toArray();

    // Create coverage areas around each service location
    const coverageAreas = serviceLocations.map(location => ({
      serviceLocation: location,
      coverageArea: {
        type: 'Polygon',
        coordinates: [this.createCircleCoordinates(location.location, coverageRadius)]
      },
      radius: coverageRadius
    }));

    // Calculate union of all coverage areas
    const totalCoverage = await this.calculateCoverageUnion(coverageAreas);

    // Calculate coverage metrics
    const analysisAreaSize = this.calculatePolygonArea(analysisArea);
    const coveredAreaSize = this.calculatePolygonArea(totalCoverage);
    const coveragePercentage = (coveredAreaSize / analysisAreaSize) * 100;

    // Find coverage gaps
    const coverageGaps = await this.findCoverageGaps(analysisArea, totalCoverage);

    // Identify optimal locations for new services
    const optimalNewLocations = await this.findOptimalServiceLocations(
      coverageGaps, 
      serviceLocations, 
      coverageRadius
    );

    return {
      serviceType: serviceType,
      analysisArea: analysisArea,
      serviceLocations: serviceLocations,
      coverageAreas: coverageAreas,
      totalCoverage: totalCoverage,

      // Coverage metrics
      metrics: {
        totalServiceLocations: serviceLocations.length,
        analysisAreaSqKm: Math.round(analysisAreaSize / 1000000 * 100) / 100,
        coveredAreaSqKm: Math.round(coveredAreaSize / 1000000 * 100) / 100,
        coveragePercentage: Math.round(coveragePercentage * 100) / 100,
        gapCount: coverageGaps.length
      },

      // Recommendations
      recommendations: {
        coverageGaps: coverageGaps,
        optimalNewLocations: optimalNewLocations,
        serviceEfficiency: this.calculateServiceEfficiency(serviceLocations, totalCoverage)
      }
    };
  }

  async optimizeRoutes(parameters) {
    console.log('Optimizing routes for multiple stops...');

    const { startPoint, waypoints, endPoint, optimizationCriteria } = parameters;

    // Prepare all points for route optimization
    const allPoints = [startPoint, ...waypoints];
    if (endPoint && !this.pointsEqual(startPoint, endPoint)) {
      allPoints.push(endPoint);
    }

    // Calculate distance matrix between all points
    const distanceMatrix = await this.calculateDistanceMatrix(allPoints);

    // Apply route optimization algorithm based on criteria
    let optimizedRoute = {};

    switch (optimizationCriteria) {
      case 'shortest_distance':
        optimizedRoute = await this.optimizeForShortestDistance(allPoints, distanceMatrix);
        break;

      case 'fastest_time':
        optimizedRoute = await this.optimizeForFastestTime(allPoints, distanceMatrix);
        break;

      case 'balanced':
        optimizedRoute = await this.optimizeBalanced(allPoints, distanceMatrix);
        break;

      default:
        optimizedRoute = await this.optimizeForShortestDistance(allPoints, distanceMatrix);
    }

    // Calculate route statistics
    const routeStatistics = this.calculateRouteStatistics(optimizedRoute, distanceMatrix);

    // Generate turn-by-turn directions
    const directions = await this.generateRouteDirections(optimizedRoute.orderedPoints);

    return {
      originalPoints: {
        start: startPoint,
        waypoints: waypoints,
        end: endPoint
      },
      optimizedRoute: optimizedRoute,
      routeStatistics: routeStatistics,
      directions: directions,
      optimizationCriteria: optimizationCriteria,

      // Performance comparison
      improvement: {
        distanceReduction: routeStatistics.totalDistance < parameters.originalDistance 
          ? parameters.originalDistance - routeStatistics.totalDistance 
          : 0,
        timeReduction: routeStatistics.estimatedTime < parameters.originalTime 
          ? parameters.originalTime - routeStatistics.estimatedTime 
          : 0
      }
    };
  }

  // Utility methods for geospatial calculations

  calculateDistance(point1, point2) {
    // Use MongoDB's geospatial calculation or implement Haversine formula
    const R = 6371000; // Earth's radius in meters
    const lat1Rad = this.toRadians(point1.coordinates[1]);
    const lat2Rad = this.toRadians(point2.coordinates[1]);
    const deltaLatRad = this.toRadians(point2.coordinates[1] - point1.coordinates[1]);
    const deltaLngRad = this.toRadians(point2.coordinates[0] - point1.coordinates[0]);

    const a = Math.sin(deltaLatRad/2) * Math.sin(deltaLatRad/2) +
              Math.cos(lat1Rad) * Math.cos(lat2Rad) *
              Math.sin(deltaLngRad/2) * Math.sin(deltaLngRad/2);
    const c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));

    return R * c;
  }

  calculateBearing(point1, point2) {
    const lat1Rad = this.toRadians(point1.coordinates[1]);
    const lat2Rad = this.toRadians(point2.coordinates[1]);
    const deltaLngRad = this.toRadians(point2.coordinates[0] - point1.coordinates[0]);

    const y = Math.sin(deltaLngRad) * Math.cos(lat2Rad);
    const x = Math.cos(lat1Rad) * Math.sin(lat2Rad) - 
              Math.sin(lat1Rad) * Math.cos(lat2Rad) * Math.cos(deltaLngRad);

    const bearingRad = Math.atan2(y, x);
    return (this.toDegrees(bearingRad) + 360) % 360;
  }

  categorizeDistance(distanceMeters) {
    if (distanceMeters <= 100) return 'very_close';
    if (distanceMeters <= 500) return 'walking_distance';
    if (distanceMeters <= 2000) return 'nearby';
    if (distanceMeters <= 10000) return 'moderate_distance';
    return 'far';
  }

  normalizeGeometry(geometry) {
    // Ensure geometry follows GeoJSON specification
    if (!geometry || !geometry.type || !geometry.coordinates) {
      throw new Error('Invalid geometry format');
    }

    // Validate coordinate ranges
    this.validateCoordinates(geometry);

    return geometry;
  }

  validateCoordinates(geometry) {
    const validatePoint = (coords) => {
      if (!Array.isArray(coords) || coords.length < 2) {
        throw new Error('Invalid coordinate format');
      }

      const [lng, lat] = coords;
      if (lng < -180 || lng > 180 || lat < -90 || lat > 90) {
        throw new Error(`Invalid coordinates: [${lng}, ${lat}]`);
      }
    };

    switch (geometry.type) {
      case 'Point':
        validatePoint(geometry.coordinates);
        break;

      case 'LineString':
      case 'MultiPoint':
        geometry.coordinates.forEach(validatePoint);
        break;

      case 'Polygon':
      case 'MultiLineString':
        geometry.coordinates.forEach(ring => ring.forEach(validatePoint));
        break;

      case 'MultiPolygon':
        geometry.coordinates.forEach(polygon => 
          polygon.forEach(ring => ring.forEach(validatePoint))
        );
        break;
    }
  }

  createSpatialGrid(bounds, gridSize) {
    const grid = [];
    const { southwest, northeast } = bounds;

    const latStep = (northeast.coordinates[1] - southwest.coordinates[1]) / gridSize;
    const lngStep = (northeast.coordinates[0] - southwest.coordinates[0]) / gridSize;

    for (let i = 0; i < gridSize; i++) {
      for (let j = 0; j < gridSize; j++) {
        const sw = [
          southwest.coordinates[0] + (j * lngStep),
          southwest.coordinates[1] + (i * latStep)
        ];
        const ne = [
          southwest.coordinates[0] + ((j + 1) * lngStep),
          southwest.coordinates[1] + ((i + 1) * latStep)
        ];

        const cellGeometry = {
          type: 'Polygon',
          coordinates: [[
            sw,
            [ne[0], sw[1]],
            ne,
            [sw[0], ne[1]],
            sw
          ]]
        };

        grid.push({
          id: `cell_${i}_${j}`,
          geometry: cellGeometry,
          center: {
            type: 'Point',
            coordinates: [(sw[0] + ne[0]) / 2, (sw[1] + ne[1]) / 2]
          },
          area: this.calculatePolygonArea(cellGeometry)
        });
      }
    }

    return grid;
  }

  createCircleCoordinates(center, radiusMeters, points = 32) {
    const coords = [];
    const earthRadius = 6371000; // Earth radius in meters

    for (let i = 0; i < points; i++) {
      const angle = (i * 2 * Math.PI) / points;
      const lat = this.toRadians(center.coordinates[1]);
      const lng = this.toRadians(center.coordinates[0]);

      const newLat = Math.asin(
        Math.sin(lat) * Math.cos(radiusMeters / earthRadius) +
        Math.cos(lat) * Math.sin(radiusMeters / earthRadius) * Math.cos(angle)
      );

      const newLng = lng + Math.atan2(
        Math.sin(angle) * Math.sin(radiusMeters / earthRadius) * Math.cos(lat),
        Math.cos(radiusMeters / earthRadius) - Math.sin(lat) * Math.sin(newLat)
      );

      coords.push([this.toDegrees(newLng), this.toDegrees(newLat)]);
    }

    // Close the polygon
    coords.push(coords[0]);
    return coords;
  }

  calculatePolygonArea(polygon) {
    // Simplified area calculation for demonstration
    // In production, use proper spherical geometry calculations
    if (polygon.type !== 'Polygon') return 0;

    const coords = polygon.coordinates[0];
    let area = 0;

    for (let i = 0; i < coords.length - 1; i++) {
      area += coords[i][0] * coords[i + 1][1] - coords[i + 1][0] * coords[i][1];
    }

    return Math.abs(area / 2) * 111000 * 111000; // Rough conversion to square meters
  }

  toRadians(degrees) {
    return degrees * (Math.PI / 180);
  }

  toDegrees(radians) {
    return radians * (180 / Math.PI);
  }

  generateGeofenceId() {
    return `geofence_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  // Additional methods would include implementations for:
  // - setupSpatialAnalytics()
  // - setupGeospatialValidation()
  // - getPreviousGeofenceStatus()
  // - determineViolationType()
  // - updateGeofenceStatus()
  // - triggerGeofenceNotification()
  // - recordGeofenceAnalytics()
  // - checkGeofenceExits()
  // - calculateCoverageUnion()
  // - findCoverageGaps()
  // - findOptimalServiceLocations()
  // - calculateServiceEfficiency()
  // - calculateDistanceMatrix()
  // - optimizeForShortestDistance()
  // - optimizeForFastestTime()
  // - optimizeBalanced()
  // - calculateRouteStatistics()
  // - generateRouteDirections()
  // - performClusteringAnalysis()
  // - analyzeAccessibility()
  // - categorizeDensity()
  // - pointsEqual()
}

// Benefits of MongoDB Advanced Geospatial Operations:
// - Native 2dsphere indexing for efficient spherical geometry queries
// - Comprehensive geospatial operators for proximity, intersection, and containment
// - Support for complex geometric shapes and polygon operations
// - Optimized spatial indexing with configurable precision
// - Built-in coordinate system support and transformations
// - Advanced geofencing with real-time violation detection
// - Spatial aggregation and analytics capabilities
// - Route optimization and path planning functionality
// - High-performance location-based queries at scale
// - Integration with external mapping and routing services

module.exports = {
  AdvancedGeospatialManager
};

Advanced Geospatial Query Patterns

Location-Based Service Discovery

Implement sophisticated location-aware service discovery systems:

// Advanced location-based service discovery
class LocationBasedServiceDiscovery extends AdvancedGeospatialManager {
  constructor(db, serviceConfig) {
    super(db, serviceConfig);

    this.serviceConfig = {
      ...serviceConfig,
      enableServiceRanking: true,
      enableCapacityAwareness: true,
      enableRealTimeUpdates: true,
      enableServiceQuality: true
    };

    this.setupServiceDiscovery();
  }

  async findOptimalServices(userLocation, serviceRequest) {
    console.log('Finding optimal services based on location and requirements...');

    const { serviceType, maxDistance, requirements, preferences } = serviceRequest;

    try {
      // Multi-criteria service discovery
      const serviceDiscoveryPipeline = [
        // Geographic proximity filter
        {
          $geoNear: {
            near: userLocation,
            distanceField: 'distance',
            maxDistance: maxDistance,
            spherical: true,
            query: {
              serviceType: serviceType,
              active: true,

              // Service availability filter
              'availability.currentlyAvailable': true,
              'availability.capacity': { $gt: 0 }
            }
          }
        },

        // Requirements matching
        {
          $match: this.buildRequirementsFilter(requirements)
        },

        // Service quality and rating filtering
        {
          $addFields: {
            qualityScore: {
              $multiply: [
                '$ratings.averageRating',
                { $divide: ['$ratings.totalReviews', 100] }
              ]
            },

            // Proximity score (closer = higher score)
            proximityScore: {
              $subtract: [
                maxDistance,
                '$distance'
              ]
            },

            // Availability score
            availabilityScore: {
              $divide: [
                '$availability.capacity',
                '$availability.maxCapacity'
              ]
            }
          }
        },

        // Calculate composite service score
        {
          $addFields: {
            compositeScore: {
              $add: [
                { $multiply: ['$qualityScore', 0.3] },
                { $multiply: ['$proximityScore', 0.4] },
                { $multiply: ['$availabilityScore', 0.3] }
              ]
            }
          }
        },

        // Apply preference-based boosting
        {
          $addFields: {
            finalScore: this.applyPreferenceBoosts('$compositeScore', preferences)
          }
        },

        // Sort by final score and limit results
        { $sort: { finalScore: -1, distance: 1 } },
        { $limit: 20 }
      ];

      const optimalServices = await this.collections.locations
        .aggregate(serviceDiscoveryPipeline)
        .toArray();

      // Enrich results with additional context
      const enrichedServices = await Promise.all(
        optimalServices.map(async service => ({
          ...service,

          // Estimated arrival time
          estimatedArrivalTime: await this.calculateEstimatedArrival(
            userLocation, 
            service.location
          ),

          // Real-time availability
          realTimeAvailability: await this.getRealTimeAvailability(service._id),

          // Service-specific recommendations
          recommendations: await this.generateServiceRecommendations(
            service, 
            userLocation, 
            preferences
          ),

          // Booking options
          bookingOptions: await this.getBookingOptions(service._id)
        }))
      );

      return {
        success: true,
        userLocation: userLocation,
        serviceRequest: serviceRequest,
        totalResults: enrichedServices.length,
        services: enrichedServices,

        // Discovery metadata
        discoveryMetadata: {
          searchRadius: maxDistance,
          averageDistance: this.calculateAverageDistance(enrichedServices),
          qualityDistribution: this.analyzeQualityDistribution(enrichedServices),
          availabilityRate: this.calculateAvailabilityRate(enrichedServices)
        }
      };

    } catch (error) {
      console.error('Error in service discovery:', error);
      return {
        success: false,
        error: error.message,
        userLocation: userLocation,
        serviceRequest: serviceRequest
      };
    }
  }

  async implementDynamicServiceZones(serviceArea, demandData) {
    console.log('Implementing dynamic service zones based on demand patterns...');

    try {
      // Analyze demand patterns
      const demandAnalysis = await this.analyzeDemandPatterns(demandData);

      // Create dynamic zones based on demand density
      const dynamicZones = await this.createDemandBasedZones(
        serviceArea, 
        demandAnalysis
      );

      // Optimize service allocation across zones
      const serviceAllocation = await this.optimizeServiceAllocation(
        dynamicZones, 
        demandAnalysis
      );

      // Update service areas and routing
      const updateResults = await this.updateServiceAreas(serviceAllocation);

      return {
        success: true,
        serviceArea: serviceArea,
        dynamicZones: dynamicZones,
        serviceAllocation: serviceAllocation,
        updateResults: updateResults,

        // Performance metrics
        metrics: {
          totalZones: dynamicZones.length,
          averageResponseTime: serviceAllocation.averageResponseTime,
          coverageEfficiency: serviceAllocation.coverageEfficiency,
          demandSatisfaction: serviceAllocation.demandSatisfaction
        }
      };

    } catch (error) {
      console.error('Error implementing dynamic service zones:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }
}

SQL-Style Geospatial Queries with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB geospatial operations:

-- QueryLeaf advanced geospatial operations with SQL-familiar syntax for MongoDB

-- Proximity queries with distance calculations
SELECT 
    name,
    address,
    category,
    location,

    -- Calculate distance from center point
    ST_DISTANCE(
        location,
        ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326)  -- NYC coordinates
    ) as distance_meters,

    -- Categorize proximity
    CASE 
        WHEN ST_DISTANCE(location, ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326)) <= 500 THEN 'walking_distance'
        WHEN ST_DISTANCE(location, ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326)) <= 2000 THEN 'nearby'
        WHEN ST_DISTANCE(location, ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326)) <= 10000 THEN 'moderate_distance'
        ELSE 'far'
    END as proximity_category,

    -- Calculate bearing
    ST_AZIMUTH(
        ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326),
        location
    ) as bearing_degrees,

    -- Additional location context
    ratings.average_rating,
    availability.currently_available,
    hours.is_open_now

FROM locations
WHERE 
    -- Proximity filter using spatial index
    ST_DWITHIN(
        location, 
        ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326), 
        10000  -- 10km radius
    )

    -- Additional filters
    AND category = 'restaurant'
    AND active = true
    AND ratings.average_rating >= 4.0

    -- Availability constraints
    AND availability.currently_available = true
    AND availability.capacity > 0

ORDER BY 
    -- Primary sort by distance
    ST_DISTANCE(location, ST_GEOMFROMTEXT('POINT(-74.0060 40.7128)', 4326)),
    -- Secondary sort by rating
    ratings.average_rating DESC,
    -- Tertiary sort by availability
    availability.capacity DESC

LIMIT 50;

-- Advanced geofencing with violation detection
WITH user_movements AS (
    SELECT 
        user_id,
        location,
        timestamp,

        -- Previous location for movement analysis
        LAG(location) OVER (
            PARTITION BY user_id 
            ORDER BY timestamp
        ) as previous_location,

        LAG(timestamp) OVER (
            PARTITION BY user_id 
            ORDER BY timestamp
        ) as previous_timestamp

    FROM tracking_data
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
),

geofence_violations AS (
    SELECT 
        um.user_id,
        um.location,
        um.timestamp,
        gf.geofence_id,
        gf.name as geofence_name,
        gf.properties.type as geofence_type,

        -- Check if current location is within geofence
        ST_CONTAINS(gf.geometry, um.location) as currently_inside,

        -- Check if previous location was within geofence
        CASE 
            WHEN um.previous_location IS NOT NULL THEN
                ST_CONTAINS(gf.geometry, um.previous_location)
            ELSE false
        END as previously_inside,

        -- Determine violation type
        CASE 
            WHEN ST_CONTAINS(gf.geometry, um.location) 
                 AND NOT ST_CONTAINS(gf.geometry, um.previous_location) THEN 'entry'
            WHEN NOT ST_CONTAINS(gf.geometry, um.location) 
                 AND ST_CONTAINS(gf.geometry, um.previous_location) THEN 'exit'
            WHEN ST_CONTAINS(gf.geometry, um.location) 
                 AND ST_CONTAINS(gf.geometry, um.previous_location) THEN 'dwell'
            ELSE 'outside'
        END as violation_type,

        -- Calculate dwell time (seconds since the previous fix) while inside the geofence
        CASE 
            WHEN ST_CONTAINS(gf.geometry, um.location) THEN
                COALESCE(EXTRACT(EPOCH FROM (um.timestamp - um.previous_timestamp)), 0)
            ELSE 0
        END as dwell_seconds,

        -- Distance to geofence boundary
        ST_DISTANCE(
            um.location,
            ST_BOUNDARY(gf.geometry)
        ) as distance_to_boundary

    FROM user_movements um
    CROSS JOIN geofences gf
    WHERE 
        gf.properties.active = true

        -- Only check geofences that are relevant to current location
        AND ST_DWITHIN(
            um.location, 
            gf.geometry, 
            gf.properties.buffer_distance
        )
),

violation_summary AS (
    SELECT 
        user_id,
        geofence_id,
        geofence_name,
        geofence_type,
        violation_type,
        COUNT(*) as violation_count,
        MIN(timestamp) as first_violation,
        MAX(timestamp) as last_violation,
        AVG(dwell_seconds) as average_dwell_time,
        SUM(dwell_seconds) as total_dwell_time,

        -- Risk assessment
        CASE 
            WHEN geofence_type = 'restricted' AND violation_type = 'entry' THEN 'high'
            WHEN geofence_type = 'monitoring' AND violation_type = 'dwell' 
                 AND SUM(dwell_seconds) > 300 THEN 'medium'
            ELSE 'low'
        END as risk_level

    FROM geofence_violations
    WHERE violation_type IN ('entry', 'exit', 'dwell')
    GROUP BY user_id, geofence_id, geofence_name, geofence_type, violation_type
)

SELECT 
    user_id,
    geofence_name,
    geofence_type,
    violation_type,
    violation_count,
    first_violation,
    last_violation,

    -- Time-based analysis
    ROUND((EXTRACT(EPOCH FROM (last_violation - first_violation)) / 60.0)::NUMERIC, 1) as violation_duration_minutes,
    ROUND(average_dwell_time::NUMERIC, 2) as avg_dwell_seconds,
    ROUND((total_dwell_time / 60.0)::NUMERIC, 2) as total_dwell_minutes,

    -- Risk and priority
    risk_level,

    -- Priority score for alerts
    CASE risk_level
        WHEN 'high' THEN 100
        WHEN 'medium' THEN 50
        ELSE 10
    END as priority_score,

    -- Recommended actions
    CASE 
        WHEN risk_level = 'high' THEN 'immediate_notification'
        WHEN risk_level = 'medium' AND total_dwell_time > 600 THEN 'monitor_closely'
        ELSE 'log_only'
    END as recommended_action

FROM violation_summary
WHERE risk_level != 'low'
ORDER BY priority_score DESC, first_violation DESC;

-- Spatial aggregation and density analysis
WITH spatial_grid AS (
    -- Create grid cells for spatial analysis
    SELECT 
        ('cell_' || grid_x || '_' || grid_y) as grid_id,
        ST_MAKEENVELOPE(
            grid_x * 0.01 - 74.1,     -- Grid cell boundaries
            grid_y * 0.01 + 40.6,
            (grid_x + 1) * 0.01 - 74.1,
            (grid_y + 1) * 0.01 + 40.6,
            4326
        ) as grid_cell,

        -- Grid cell center point
        ST_CENTROID(
            ST_MAKEENVELOPE(
                grid_x * 0.01 - 74.1,
                grid_y * 0.01 + 40.6,
                (grid_x + 1) * 0.01 - 74.1,
                (grid_y + 1) * 0.01 + 40.6,
                4326
            )
        ) as grid_center

    FROM generate_series(0, 20) as grid_x
    CROSS JOIN generate_series(0, 20) as grid_y
),

location_density AS (
    SELECT 
        sg.grid_id,
        sg.grid_cell,
        sg.grid_center,

        -- Count locations in each grid cell
        COUNT(l.location_id) as location_count,

        -- Category breakdown
        COUNT(*) FILTER (WHERE l.category = 'restaurant') as restaurant_count,
        COUNT(*) FILTER (WHERE l.category = 'retail') as retail_count,
        COUNT(*) FILTER (WHERE l.category = 'service') as service_count,

        -- Rating analysis
        AVG(l.ratings.average_rating) as avg_rating,
        COUNT(*) FILTER (WHERE l.ratings.average_rating >= 4.5) as high_rated_count,

        -- Calculate density per square kilometer
        COUNT(l.location_id) / ST_AREA(
            ST_TRANSFORM(sg.grid_cell, 3857)  -- Transform to projected CRS for area calculation
        ) * 1000000 as density_per_sqkm,

        -- Grid cell area in square kilometers
        ST_AREA(ST_TRANSFORM(sg.grid_cell, 3857)) / 1000000 as cell_area_sqkm

    FROM spatial_grid sg
    LEFT JOIN locations l ON ST_CONTAINS(sg.grid_cell, l.location)
    GROUP BY sg.grid_id, sg.grid_cell, sg.grid_center
),

density_analysis AS (
    SELECT 
        *,

        -- Density classification
        CASE 
            WHEN density_per_sqkm >= 100 THEN 'very_dense'
            WHEN density_per_sqkm >= 50 THEN 'dense'
            WHEN density_per_sqkm >= 20 THEN 'moderate'
            WHEN density_per_sqkm >= 5 THEN 'sparse'
            ELSE 'very_sparse'
        END as density_class,

        -- Service diversity index
        CASE 
            WHEN restaurant_count + retail_count + service_count = 0 THEN 0
            ELSE (
                CASE WHEN restaurant_count > 0 THEN 1 ELSE 0 END +
                CASE WHEN retail_count > 0 THEN 1 ELSE 0 END +
                CASE WHEN service_count > 0 THEN 1 ELSE 0 END
            )
        END as service_diversity,

        -- Quality score
        CASE 
            WHEN location_count = 0 THEN 0
            ELSE (high_rated_count::DECIMAL / location_count) * 100
        END as quality_percentage

    FROM location_density
)

SELECT 
    grid_id,
    ST_X(grid_center) as center_longitude,
    ST_Y(grid_center) as center_latitude,
    location_count,

    -- Category distribution
    restaurant_count,
    retail_count,
    service_count,

    -- Density metrics
    ROUND(density_per_sqkm, 2) as density_per_sqkm,
    density_class,

    -- Quality metrics
    ROUND(avg_rating, 2) as avg_rating,
    high_rated_count,
    ROUND(quality_percentage, 1) as quality_percentage,

    -- Diversity and mixed-use analysis
    service_diversity,

    -- Area characteristics
    ROUND(cell_area_sqkm, 4) as cell_area_sqkm,

    -- Heat map values for visualization
    CASE density_class
        WHEN 'very_dense' THEN 1.0
        WHEN 'dense' THEN 0.8
        WHEN 'moderate' THEN 0.6
        WHEN 'sparse' THEN 0.3
        ELSE 0.1
    END as heat_intensity,

    -- Recommendations
    CASE 
        WHEN density_class = 'very_sparse' AND service_diversity = 0 THEN 'expansion_opportunity'
        WHEN density_class IN ('dense', 'very_dense') AND quality_percentage < 50 THEN 'quality_improvement_needed'
        WHEN service_diversity <= 1 AND location_count >= 5 THEN 'diversification_opportunity'
        ELSE 'well_served'
    END as area_recommendation

FROM density_analysis
WHERE location_count > 0  -- Only show areas with locations
ORDER BY density_per_sqkm DESC, quality_percentage DESC;

-- Route optimization with multiple waypoints
WITH route_waypoints AS (
    SELECT 
        waypoint_id,
        ST_GEOMFROMTEXT(waypoint_coordinates, 4326) as waypoint_location,
        waypoint_order,
        waypoint_type,
        service_time_minutes,
        priority_level
    FROM route_stops
    WHERE route_id = :route_id
),

distance_matrix AS (
    SELECT 
        w1.waypoint_id as from_waypoint,
        w2.waypoint_id as to_waypoint,

        -- Calculate distances between all waypoint pairs
        ST_DISTANCE(w1.waypoint_location, w2.waypoint_location) as distance_meters,

        -- Estimate travel time (simplified - would use routing service in production)
        (ST_DISTANCE(w1.waypoint_location, w2.waypoint_location) / 1000) / 50 * 60 as estimated_minutes,

        -- Calculate bearing for navigation
        ST_AZIMUTH(w1.waypoint_location, w2.waypoint_location) as bearing_radians

    FROM route_waypoints w1
    CROSS JOIN route_waypoints w2
    WHERE w1.waypoint_id != w2.waypoint_id
),

optimized_sequence AS (
    -- Simplified optimization (in production would use advanced algorithms)
    SELECT 
        rw.*,

        -- Calculate priority-weighted score for ordering
        (rw.priority_level * 0.4) + 
        ((10 - rw.waypoint_order) * 0.3) +  -- Original order preference
        (rw.service_time_minutes * 0.3) as optimization_score,

        -- Assign new optimized order
        ROW_NUMBER() OVER (ORDER BY 
            rw.priority_level DESC,
            rw.waypoint_order ASC
        ) as optimized_order

    FROM route_waypoints rw
),

route_segments AS (
    SELECT 
        os1.waypoint_id as from_waypoint,
        os1.waypoint_location as from_location,
        os1.waypoint_type as from_type,
        os1.service_time_minutes as from_service_time,

        os2.waypoint_id as to_waypoint,
        os2.waypoint_location as to_location,
        os2.waypoint_type as to_type,
        os2.service_time_minutes as to_service_time,

        -- Segment details from distance matrix
        dm.distance_meters as segment_distance,
        dm.estimated_minutes as travel_time,
        dm.bearing_radians,

        -- Convert bearing to compass direction
        CASE 
            WHEN dm.bearing_radians BETWEEN 0 AND PI()/8 OR dm.bearing_radians > 15*PI()/8 THEN 'North'
            WHEN dm.bearing_radians BETWEEN PI()/8 AND 3*PI()/8 THEN 'Northeast'
            WHEN dm.bearing_radians BETWEEN 3*PI()/8 AND 5*PI()/8 THEN 'East'
            WHEN dm.bearing_radians BETWEEN 5*PI()/8 AND 7*PI()/8 THEN 'Southeast'
            WHEN dm.bearing_radians BETWEEN 7*PI()/8 AND 9*PI()/8 THEN 'South'
            WHEN dm.bearing_radians BETWEEN 9*PI()/8 AND 11*PI()/8 THEN 'Southwest'
            WHEN dm.bearing_radians BETWEEN 11*PI()/8 AND 13*PI()/8 THEN 'West'
            ELSE 'Northwest'
        END as compass_direction,

        os1.optimized_order as segment_order

    FROM optimized_sequence os1
    JOIN optimized_sequence os2 ON os2.optimized_order = os1.optimized_order + 1
    JOIN distance_matrix dm ON dm.from_waypoint = os1.waypoint_id 
                              AND dm.to_waypoint = os2.waypoint_id
),

route_summary AS (
    SELECT 
        COUNT(*) as total_segments,
        SUM(segment_distance) as total_distance_meters,
        SUM(travel_time) as total_travel_minutes,
        SUM(to_service_time) as total_service_minutes,
        AVG(segment_distance) as avg_segment_distance,

        -- Route efficiency metrics
        SUM(segment_distance) / 1000 as total_distance_km,
        (SUM(travel_time) + SUM(to_service_time)) as total_route_time,

        -- Calculate route efficiency (distance/time ratio)
        CASE 
            WHEN SUM(travel_time) > 0 THEN 
                (SUM(segment_distance) / 1000) / (SUM(travel_time) / 60)
            ELSE 0
        END as avg_speed_kmh

    FROM route_segments
)

SELECT 
    -- Route segment details
    segment_order,
    from_waypoint,
    to_waypoint,
    from_type,
    to_type,

    -- Distance and time
    ROUND(segment_distance, 0) as distance_meters,
    ROUND(segment_distance / 1000.0, 2) as distance_km,
    ROUND(travel_time, 1) as travel_minutes,
    from_service_time as service_minutes,

    -- Navigation details
    compass_direction,
    ROUND(DEGREES(bearing_radians), 1) as bearing_degrees,

    -- Cumulative totals
    SUM(segment_distance) OVER (
        ORDER BY segment_order 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) / 1000.0 as cumulative_distance_km,

    SUM(travel_time + from_service_time) OVER (
        ORDER BY segment_order 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_time_minutes,

    -- Route coordinates for mapping
    ST_ASTEXT(from_location) as from_coordinates,
    ST_ASTEXT(to_location) as to_coordinates

FROM route_segments
ORDER BY segment_order;

-- Show route summary (in practice this SELECT is appended to the WITH chain above,
-- since CTEs such as route_summary are scoped to a single statement)
SELECT 
    total_segments,
    ROUND(total_distance_km, 2) as total_distance_km,
    ROUND(total_travel_minutes, 1) as travel_time_minutes,
    ROUND(total_service_minutes, 1) as service_time_minutes,
    ROUND(total_route_time, 1) as total_time_minutes,
    ROUND(avg_speed_kmh, 1) as average_speed_kmh,

    -- Time breakdown
    ROUND(total_travel_minutes / total_route_time * 100, 1) as travel_time_percentage,
    ROUND(total_service_minutes / total_route_time * 100, 1) as service_time_percentage,

    -- Efficiency assessment
    CASE 
        WHEN avg_speed_kmh >= 40 THEN 'efficient'
        WHEN avg_speed_kmh >= 25 THEN 'moderate'
        ELSE 'inefficient'
    END as route_efficiency,

    -- Estimated completion time
    CURRENT_TIMESTAMP + (total_route_time * INTERVAL '1 minute') as estimated_completion

FROM route_summary;

-- QueryLeaf provides comprehensive MongoDB geospatial capabilities:
-- 1. Native 2dsphere indexing for efficient spherical geometry
-- 2. Advanced spatial operators for proximity and containment queries
-- 3. Geofencing with real-time violation detection
-- 4. Spatial aggregation and density analysis
-- 5. Route optimization with multiple waypoints
-- 6. SQL-familiar syntax for complex geospatial operations
-- 7. Integration with coordinate reference systems
-- 8. High-performance location-based queries
-- 9. Advanced spatial analytics and reporting
-- 10. Seamless integration with mapping and routing services

Best Practices for Production Geospatial Applications

Performance Optimization and Indexing Strategy

Essential principles for effective MongoDB geospatial application deployment:

  1. Spatial Indexing: Create appropriate 2dsphere indexes for spherical geometry operations (see the driver sketch after this list)
  2. Query Optimization: Use bounding box filters before expensive spatial operations
  3. Data Modeling: Store coordinates in GeoJSON format for optimal performance
  4. Precision Management: Configure appropriate coordinate precision for use case requirements
  5. Caching Strategy: Implement spatial caching for frequently accessed location data
  6. Connection Pooling: Optimize database connections for geospatial query patterns
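
A minimal Node.js driver sketch of the first two principles above; the database, collection, and field names ("geo_demo", "locations", "location") are assumptions for illustration rather than part of any specific deployment:

// Minimal sketch: 2dsphere index creation plus a bounded proximity query
const { MongoClient } = require('mongodb');

async function nearbyRestaurants(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const locations = client.db('geo_demo').collection('locations');

    // 1. Spatial indexing: a 2dsphere index enables spherical geometry queries
    await locations.createIndex({ location: '2dsphere' });

    // 2. Query shaping: a selective filter plus $maxDistance keeps the search bounded
    return await locations.find({
      category: 'restaurant',
      location: {
        $near: {
          $geometry: { type: 'Point', coordinates: [-74.0, 40.7] }, // [lng, lat]
          $maxDistance: 2000 // metres
        }
      }
    }).limit(20).toArray();
  } finally {
    await client.close();
  }
}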

Scalability and Production Deployment

Optimize geospatial operations for enterprise-scale requirements:

  1. Sharding Strategy: Design shard keys that support geospatial query patterns
  2. Load Balancing: Distribute geospatial queries across replica set members
  3. Real-Time Processing: Implement efficient real-time location tracking and updates
  4. Data Archiving: Manage historical location data with appropriate retention policies (a TTL index sketch follows this list)
  5. Monitoring Integration: Track geospatial query performance and resource utilization
  6. Error Handling: Implement robust error handling for location service failures
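
For the data archiving point above, a TTL index is often the simplest retention mechanism; the collection name ("location_history") and the 30-day window are assumptions for illustration:

// Minimal sketch: TTL-based retention for historical location pings
const { MongoClient } = require('mongodb');

async function configureRetention(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const history = client.db('geo_demo').collection('location_history');

    // Documents expire roughly 30 days after their recordedAt timestamp
    await history.createIndex(
      { recordedAt: 1 },
      { expireAfterSeconds: 60 * 60 * 24 * 30 }
    );
  } finally {
    await client.close();
  }
}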

Conclusion

MongoDB geospatial queries provide comprehensive location data management capabilities that enable sophisticated location-based applications with advanced geographic indexing, proximity searches, geofencing, and spatial analysis features. The native geospatial support ensures that location-aware applications can scale efficiently while maintaining high query performance and accurate spatial calculations.

Key MongoDB Geospatial benefits include:

  • Advanced Spatial Indexing: Optimized 2dsphere indexes for efficient spherical geometry operations
  • Comprehensive Query Operators: Rich set of spatial operators for proximity, intersection, and containment queries
  • Real-Time Geofencing: Efficient geofence violation detection with customizable trigger conditions
  • Spatial Analytics: Built-in aggregation capabilities for density analysis and geographic reporting
  • Route Optimization: Advanced algorithms for multi-waypoint route planning and optimization
  • SQL Accessibility: Familiar SQL-style geospatial operations through QueryLeaf for accessible location data management

Whether you're building ride-sharing platforms, delivery applications, location-based social networks, or asset tracking systems, MongoDB geospatial capabilities with QueryLeaf's familiar SQL interface provide the foundation for sophisticated location-aware applications.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style spatial operations into MongoDB's native geospatial queries, making advanced location-based functionality accessible to SQL-oriented development teams. Complex spatial calculations, proximity searches, and geofencing operations are seamlessly handled through familiar SQL constructs, enabling sophisticated location-based applications without requiring deep MongoDB geospatial expertise.
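
For a sense of what such translations target, the native MongoDB form of a "within radius" search looks like the following; the pipeline shape and the collection and field names are illustrative assumptions, not QueryLeaf's actual output:

// Illustrative native MongoDB proximity pipeline for a radius-style predicate
const pipeline = [
  {
    $geoNear: {
      near: { type: 'Point', coordinates: [-73.99, 40.73] },
      distanceField: 'distance_meters',
      maxDistance: 1500, // metres
      spherical: true,
      query: { 'properties.active': true }
    }
  },
  { $project: { name: 1, category: 1, distance_meters: 1 } }
];

// db.collection('locations').aggregate(pipeline)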

The combination of MongoDB's robust geospatial capabilities with SQL-style spatial operations makes it an ideal platform for applications requiring both sophisticated location-based functionality and familiar database management patterns, ensuring your geospatial operations can scale efficiently while maintaining accuracy and performance as data volume and query complexity grow.

MongoDB Change Data Capture and Real-Time Data Synchronization: Advanced Event-Driven Architecture and Data Pipeline Management

Modern distributed applications require real-time data synchronization capabilities that enable immediate propagation of data changes across multiple systems, microservices, and external platforms without complex polling mechanisms or batch synchronization processes. Traditional change detection approaches rely on timestamp-based polling, database triggers, or application-level change tracking, leading to data inconsistencies, performance bottlenecks, and complex synchronization logic that fails to scale with growing data volumes.

MongoDB Change Data Capture provides comprehensive real-time change detection and streaming capabilities through change streams, enabling applications to react immediately to data modifications, maintain synchronized data across distributed systems, and build event-driven architectures that scale efficiently. Unlike traditional CDC approaches that require complex trigger systems or external change detection tools, MongoDB change streams offer native, scalable, and reliable change tracking with minimal performance impact.
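
Before walking through a full CDC pipeline, the core primitive is worth seeing in isolation. A change stream is opened directly on a collection (a replica set or sharded cluster is required); the database, collection, and filter below are illustrative assumptions:

// Minimal change stream sketch: watch a collection for inserts, deletes, and price updates
const { MongoClient } = require('mongodb');

async function watchProducts(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const products = client.db('shop').collection('products');

  const pipeline = [
    { $match: {
        $or: [
          { operationType: { $in: ['insert', 'delete'] } },
          { 'updateDescription.updatedFields.price': { $exists: true } }
        ]
    } }
  ];

  const stream = products.watch(pipeline, { fullDocument: 'updateLookup' });
  stream.on('change', change => {
    console.log(change.operationType, change.documentKey);
  });
  return stream; // caller closes the stream and client when done
}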

The Traditional Change Detection Challenge

Conventional approaches to change data capture and synchronization have significant limitations for modern distributed architectures:

-- Traditional PostgreSQL change detection - complex and resource-intensive approaches

-- Manual timestamp-based change tracking with performance limitations
CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_name VARCHAR(200) NOT NULL,
    category VARCHAR(100),
    price DECIMAL(10,2) NOT NULL,
    stock_quantity INTEGER NOT NULL DEFAULT 0,
    supplier_id UUID,

    -- Manual change tracking fields (limited granularity)
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    version_number INTEGER NOT NULL DEFAULT 1,
    last_modified_by VARCHAR(100),

    -- Change type tracking (application-managed)
    change_type VARCHAR(20) DEFAULT 'insert',
    is_deleted BOOLEAN DEFAULT FALSE,

    -- Synchronization status tracking
    sync_status VARCHAR(50) DEFAULT 'pending',
    last_sync_timestamp TIMESTAMP,
    sync_retry_count INTEGER DEFAULT 0,
    sync_error TEXT
);

-- Trigger-based change detection with complex maintenance requirements
CREATE TABLE product_changes (
    change_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id UUID NOT NULL,
    change_type VARCHAR(20) NOT NULL, -- 'INSERT', 'UPDATE', 'DELETE'
    change_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    -- Before and after data (limited effectiveness for complex documents)
    old_data JSONB,
    new_data JSONB,
    changed_fields TEXT[], -- Manual field tracking

    -- Change metadata
    changed_by VARCHAR(100),
    change_reason VARCHAR(200),
    transaction_id BIGINT,

    -- Synchronization tracking
    processed BOOLEAN DEFAULT FALSE,
    processed_at TIMESTAMP,
    processing_attempts INTEGER DEFAULT 0,
    error_message TEXT
);

-- Complex trigger system for change capture (high maintenance overhead)
CREATE OR REPLACE FUNCTION track_product_changes()
RETURNS TRIGGER AS $$
DECLARE
    change_type_value VARCHAR(20);
    old_json JSONB;
    new_json JSONB;
    changed_fields_array TEXT[] := '{}';
    field_name TEXT;
BEGIN
    -- Determine change type
    IF TG_OP = 'INSERT' THEN
        change_type_value := 'INSERT';
        new_json := to_jsonb(NEW);
        old_json := NULL;

    ELSIF TG_OP = 'UPDATE' THEN
        change_type_value := 'UPDATE';
        old_json := to_jsonb(OLD);
        new_json := to_jsonb(NEW);

        -- Manual field-by-field comparison (extremely limited)
        IF OLD.product_name IS DISTINCT FROM NEW.product_name THEN
            changed_fields_array := array_append(changed_fields_array, 'product_name');
        END IF;
        IF OLD.category IS DISTINCT FROM NEW.category THEN
            changed_fields_array := array_append(changed_fields_array, 'category');
        END IF;
        IF OLD.price IS DISTINCT FROM NEW.price THEN
            changed_fields_array := array_append(changed_fields_array, 'price');
        END IF;
        IF OLD.stock_quantity IS DISTINCT FROM NEW.stock_quantity THEN
            changed_fields_array := array_append(changed_fields_array, 'stock_quantity');
        END IF;
        -- Limited to predefined fields, no support for dynamic schema

    ELSIF TG_OP = 'DELETE' THEN
        change_type_value := 'DELETE';
        old_json := to_jsonb(OLD);
        new_json := NULL;

    END IF;

    -- Insert change record (potential performance bottleneck)
    INSERT INTO product_changes (
        product_id,
        change_type,
        old_data,
        new_data,
        changed_fields,
        changed_by,
        transaction_id
    ) VALUES (
        COALESCE(NEW.product_id, OLD.product_id),
        change_type_value,
        old_json,
        new_json,
        changed_fields_array,
        current_user,
        txid_current()
    );

    -- Update the main record's change tracking
    IF TG_OP != 'DELETE' THEN
        NEW.updated_at := CURRENT_TIMESTAMP;
        NEW.version_number := COALESCE(OLD.version_number, 0) + 1;
        NEW.sync_status := 'pending';
        RETURN NEW;
    ELSE
        RETURN OLD;
    END IF;

EXCEPTION 
    WHEN OTHERS THEN
        -- Log error but don't fail the main operation
        RAISE WARNING 'Change tracking failed: %', SQLERRM;
        RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Create triggers (must be maintained for every table)
CREATE TRIGGER products_change_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON products
    FOR EACH ROW EXECUTE FUNCTION track_product_changes();

-- Polling-based synchronization query (inefficient and resource-intensive)
WITH pending_changes AS (
    SELECT 
        pc.change_id,
        pc.product_id,
        pc.change_type,
        pc.change_timestamp,
        pc.new_data,
        pc.old_data,
        pc.changed_fields,

        -- Extract change details for synchronization
        CASE 
            WHEN pc.change_type = 'INSERT' THEN pc.new_data
            WHEN pc.change_type = 'UPDATE' THEN pc.new_data  
            WHEN pc.change_type = 'DELETE' THEN pc.old_data
        END as sync_data,

        -- Priority scoring (limited effectiveness)
        CASE 
            WHEN pc.change_type = 'DELETE' THEN 3
            WHEN pc.change_type = 'INSERT' THEN 2
            WHEN pc.change_type = 'UPDATE' AND 'price' = ANY(pc.changed_fields) THEN 2
            ELSE 1
        END as sync_priority,

        -- Calculate processing delay (for monitoring)
        EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - pc.change_timestamp) as delay_seconds

    FROM product_changes pc
    WHERE pc.processed = false
    AND pc.processing_attempts < 5
    AND (pc.change_timestamp + INTERVAL '1 minute' * POWER(2, pc.processing_attempts)) <= CURRENT_TIMESTAMP

    ORDER BY sync_priority DESC, change_timestamp ASC
    LIMIT 1000  -- Batch processing limitation
),

synchronization_targets AS (
    -- Define external systems to sync with (static configuration)
    SELECT unnest(ARRAY[
        'search_service',
        'analytics_warehouse', 
        'recommendation_engine',
        'external_api',
        'cache_invalidation'
    ]) as target_system
)

SELECT 
    pc.change_id,
    pc.product_id,
    pc.change_type,
    pc.sync_data,
    pc.delay_seconds,
    st.target_system,

    -- Generate synchronization payloads (limited transformation capabilities)
    JSON_BUILD_OBJECT(
        'change_id', pc.change_id,
        'entity_type', 'product',
        'entity_id', pc.product_id,
        'operation', LOWER(pc.change_type),
        'timestamp', pc.change_timestamp,
        'data', pc.sync_data,
        'changed_fields', pc.changed_fields,
        'target_system', st.target_system,
        'priority', pc.sync_priority
    ) as sync_payload,

    -- Endpoint configuration (manual maintenance)
    CASE st.target_system
        WHEN 'search_service' THEN 'http://search-api/products/sync'
        WHEN 'analytics_warehouse' THEN 'http://warehouse-api/data/ingest'
        WHEN 'recommendation_engine' THEN 'http://recommendations/products/update'
        WHEN 'external_api' THEN 'http://external-partner/webhook/products'
        WHEN 'cache_invalidation' THEN 'http://cache-service/invalidate'
        ELSE 'http://default-sync-service/webhook'
    END as target_endpoint

FROM pending_changes pc
CROSS JOIN synchronization_targets st;

-- Update processing status (requires external application logic)
-- Note: the pending_changes CTE above is scoped to its own statement,
-- so the pending filter has to be repeated here
UPDATE product_changes 
SET 
    processing_attempts = processing_attempts + 1,
    error_message = CASE 
        WHEN processing_attempts >= 4 THEN 'Max retry attempts exceeded'
        ELSE error_message
    END
WHERE processed = false
  AND processing_attempts < 5;

-- Problems with traditional CDC approaches:
-- 1. Complex trigger maintenance and performance impact on write operations
-- 2. Limited change detection granularity and field-level change tracking
-- 3. Manual synchronization logic with no built-in retry or error handling
-- 4. Polling-based detection causing delays and resource waste
-- 5. No support for transaction-level change grouping or ordering guarantees  
-- 6. Difficult schema evolution and maintenance of change tracking infrastructure
-- 7. No built-in filtering or transformation capabilities for change streams
-- 8. Complex error handling and dead letter queue management
-- 9. Limited scalability for high-volume change processing
-- 10. No native support for distributed system synchronization patterns

-- Manual batch synchronization attempt (resource-intensive and delayed)
WITH hourly_changes AS (
    SELECT 
        product_id,
        array_agg(
            JSON_BUILD_OBJECT(
                'change_type', change_type,
                'timestamp', change_timestamp,
                'data', COALESCE(new_data, old_data)
            ) 
            ORDER BY change_timestamp
        ) as change_history,
        MIN(change_timestamp) as first_change,
        MAX(change_timestamp) as last_change,
        COUNT(*) as change_count

    FROM product_changes
    WHERE change_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND processed = false
    GROUP BY product_id
),

batch_sync_data AS (
    SELECT 
        hc.product_id,
        hc.change_history,
        hc.change_count,

        -- Get current product state (may be inconsistent due to timing)
        p.product_name,
        p.category, 
        p.price,
        p.stock_quantity,

        -- Calculate sync requirements
        CASE 
            WHEN p.product_id IS NULL THEN 'product_deleted'
            WHEN hc.change_count > 10 THEN 'full_refresh'
            ELSE 'incremental_sync'
        END as sync_strategy

    FROM hourly_changes hc
    LEFT JOIN products p ON hc.product_id = p.product_id
)

SELECT 
    COUNT(*) as total_products_to_sync,
    COUNT(*) FILTER (WHERE sync_strategy = 'full_refresh') as full_refresh_count,
    COUNT(*) FILTER (WHERE sync_strategy = 'incremental_sync') as incremental_count,  
    COUNT(*) FILTER (WHERE sync_strategy = 'product_deleted') as deletion_count,
    SUM(change_count) as total_changes,
    AVG(change_count) as avg_changes_per_product,

    -- Estimate processing time (rough calculation)
    CEIL(SUM(change_count) / 100.0) as estimated_processing_minutes

FROM batch_sync_data;

-- Traditional limitations:
-- 1. No real-time change detection - relies on polling with delays
-- 2. Complex trigger and stored procedure maintenance overhead
-- 3. Performance impact on write operations due to change tracking triggers
-- 4. Limited transformation and filtering capabilities for change data
-- 5. Manual error handling and retry logic implementation required
-- 6. No built-in support for distributed synchronization patterns
-- 7. Difficult to scale change processing for high-volume systems
-- 8. Schema evolution breaks change tracking infrastructure
-- 9. No transaction-level change ordering or consistency guarantees
-- 10. Complex debugging and monitoring of change propagation failures

MongoDB provides sophisticated Change Data Capture capabilities with advanced streaming and synchronization:

// MongoDB Advanced Change Data Capture and Real-Time Synchronization System
const { MongoClient, ChangeStream } = require('mongodb');
const { EventEmitter } = require('events');

const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
const db = client.db('realtime_cdc_system');

// Comprehensive MongoDB Change Data Capture Manager
class AdvancedCDCManager extends EventEmitter {
  constructor(db, config = {}) {
    super();
    this.db = db;
    this.collections = {
      products: db.collection('products'),
      orders: db.collection('orders'),
      customers: db.collection('customers'),
      inventory: db.collection('inventory'),
      cdcConfiguration: db.collection('cdc_configuration'),
      changeLog: db.collection('change_log'),
      syncStatus: db.collection('sync_status'),
      errorLog: db.collection('error_log')
    };

    // Advanced CDC configuration
    this.config = {
      enableResumeTokens: config.enableResumeTokens !== false,
      batchSize: config.batchSize || 100,
      maxAwaitTimeMS: config.maxAwaitTimeMS || 1000,
      fullDocument: config.fullDocument || 'updateLookup',
      fullDocumentBeforeChange: config.fullDocumentBeforeChange || 'whenAvailable',

      // Change stream filtering
      enableFiltering: config.enableFiltering !== false,
      filterCriteria: config.filterCriteria || {},
      includeNamespaces: config.includeNamespaces || [],
      excludeNamespaces: config.excludeNamespaces || [],

      // Synchronization targets
      syncTargets: config.syncTargets || [
        { name: 'search_service', enabled: true, priority: 1 },
        { name: 'analytics_warehouse', enabled: true, priority: 2 },
        { name: 'recommendation_engine', enabled: true, priority: 3 },
        { name: 'cache_invalidation', enabled: true, priority: 1 }
      ],

      // Error handling and retry
      enableRetryLogic: config.enableRetryLogic !== false,
      maxRetries: config.maxRetries || 5,
      retryDelayBase: config.retryDelayBase || 1000,
      deadLetterQueue: config.enableDeadLetterQueue !== false,

      // Performance optimization
      enableBatchProcessing: config.enableBatchProcessing !== false,
      enableParallelSync: config.enableParallelSync !== false,
      maxConcurrentSyncs: config.maxConcurrentSyncs || 10,

      // Monitoring and metrics
      enableMetrics: config.enableMetrics !== false,
      metricsInterval: config.metricsInterval || 60000,
      enableHealthChecks: config.enableHealthChecks !== false
    };

    // CDC state management
    this.changeStreams = new Map();
    this.resumeTokens = new Map();
    this.syncQueues = new Map();
    this.processingStats = {
      totalChanges: 0,
      successfulSyncs: 0,
      failedSyncs: 0,
      avgProcessingTime: 0,
      lastProcessedTimestamp: null
    };

    // Initialize CDC system
    this.initializeCDCSystem();
  }

  async initializeCDCSystem() {
    console.log('Initializing comprehensive MongoDB Change Data Capture system...');

    try {
      // Setup change stream configurations
      await this.setupChangeStreamConfiguration();

      // Initialize synchronization targets
      await this.initializeSyncTargets();

      // Setup error handling and monitoring
      await this.setupErrorHandlingAndMonitoring();

      // Start change streams for configured collections
      await this.startChangeStreams();

      // Initialize metrics collection
      if (this.config.enableMetrics) {
        await this.startMetricsCollection();
      }

      // Setup health monitoring
      if (this.config.enableHealthChecks) {
        await this.setupHealthMonitoring();
      }

      console.log('Change Data Capture system initialized successfully');

    } catch (error) {
      console.error('Error initializing CDC system:', error);
      throw error;
    }
  }

  async setupChangeStreamConfiguration() {
    console.log('Setting up change stream configuration...');

    try {
      // Define collections to monitor with specific configurations
      const monitoringConfig = [
        {
          collection: 'products',
          pipeline: [
            {
              $match: {
                'operationType': { $in: ['insert', 'update', 'delete', 'replace'] },
                $or: [
                  { 'updateDescription.updatedFields.price': { $exists: true } },
                  { 'updateDescription.updatedFields.stock_quantity': { $exists: true } },
                  { 'updateDescription.updatedFields.status': { $exists: true } },
                  { 'operationType': { $in: ['insert', 'delete'] } }
                ]
              }
            }
          ],
          options: {
            fullDocument: 'updateLookup',
            fullDocumentBeforeChange: 'whenAvailable'
          },
          syncTargets: ['search_service', 'analytics_warehouse', 'cache_invalidation'],
          transformations: ['priceCalculation', 'stockValidation', 'searchIndexing']
        },
        {
          collection: 'orders',
          pipeline: [
            {
              $match: {
                'operationType': { $in: ['insert', 'update'] },
                $or: [
                  { 'operationType': 'insert' },
                  { 'updateDescription.updatedFields.status': { $exists: true } },
                  { 'updateDescription.updatedFields.payment_status': { $exists: true } }
                ]
              }
            }
          ],
          options: {
            fullDocument: 'updateLookup'
          },
          syncTargets: ['analytics_warehouse', 'recommendation_engine'],
          transformations: ['orderAnalytics', 'customerInsights', 'inventoryImpact']
        },
        {
          collection: 'customers',
          pipeline: [
            {
              $match: {
                'operationType': { $in: ['insert', 'update'] },
                $or: [
                  { 'operationType': 'insert' },
                  { 'updateDescription.updatedFields.preferences': { $exists: true } },
                  { 'updateDescription.updatedFields.profile': { $exists: true } }
                ]
              }
            }
          ],
          options: {
            fullDocument: 'updateLookup'
          },
          syncTargets: ['recommendation_engine', 'analytics_warehouse'],
          transformations: ['profileEnrichment', 'preferencesAnalysis', 'segmentation']
        }
      ];

      // Store configuration for runtime access
      await this.collections.cdcConfiguration.deleteMany({});
      await this.collections.cdcConfiguration.insertMany(
        monitoringConfig.map(config => ({
          ...config,
          enabled: true,
          createdAt: new Date(),
          lastResumeToken: null
        }))
      );

      this.monitoringConfig = monitoringConfig;

    } catch (error) {
      console.error('Error setting up change stream configuration:', error);
      throw error;
    }
  }

  async startChangeStreams() {
    console.log('Starting change streams for all configured collections...');

    try {
      for (const config of this.monitoringConfig) {
        await this.startCollectionChangeStream(config);
      }

      console.log(`Started ${this.changeStreams.size} change streams successfully`);

    } catch (error) {
      console.error('Error starting change streams:', error);
      throw error;
    }
  }

  async startCollectionChangeStream(config) {
    console.log(`Starting change stream for collection: ${config.collection}`);

    try {
      const collection = this.collections[config.collection];
      if (!collection) {
        throw new Error(`Collection ${config.collection} not found`);
      }

      // Retrieve resume token if available
      const savedConfig = await this.collections.cdcConfiguration.findOne({
        collection: config.collection
      });

      const changeStreamOptions = {
        ...config.options,
        batchSize: this.config.batchSize,
        maxAwaitTimeMS: this.config.maxAwaitTimeMS
      };

      // Add resume token if available and enabled
      if (this.config.enableResumeTokens && savedConfig?.lastResumeToken) {
        changeStreamOptions.resumeAfter = savedConfig.lastResumeToken;
        console.log(`Resuming change stream for ${config.collection} from saved token`);
      }

      // Create change stream with pipeline
      const changeStream = collection.watch(config.pipeline || [], changeStreamOptions);

      // Store change stream reference
      this.changeStreams.set(config.collection, {
        stream: changeStream,
        config: config,
        startTime: new Date(),
        processedCount: 0,
        errorCount: 0
      });

      // Setup change event handling
      changeStream.on('change', async (changeDoc) => {
        await this.handleChangeEvent(changeDoc, config);
      });

      changeStream.on('error', async (error) => {
        console.error(`Change stream error for ${config.collection}:`, error);
        await this.handleChangeStreamError(config.collection, error);
      });

      changeStream.on('close', () => {
        console.warn(`Change stream closed for ${config.collection}`);
        this.emit('changeStreamClosed', config.collection);
      });

      changeStream.on('end', () => {
        console.warn(`Change stream ended for ${config.collection}`);
        this.emit('changeStreamEnded', config.collection);
      });

      // Store resume token periodically
      if (this.config.enableResumeTokens) {
        setInterval(async () => {
          try {
            const resumeToken = changeStream.resumeToken;
            if (resumeToken) {
              await this.saveResumeToken(config.collection, resumeToken);
            }
          } catch (error) {
            console.warn(`Error saving resume token for ${config.collection}:`, error.message);
          }
        }, 30000); // Save every 30 seconds
      }

    } catch (error) {
      console.error(`Error starting change stream for ${config.collection}:`, error);
      throw error;
    }
  }

  async handleChangeEvent(changeDoc, config) {
    const startTime = Date.now();

    try {
      // Update processing statistics
      this.processingStats.totalChanges++;
      this.processingStats.lastProcessedTimestamp = new Date();

      // Update collection-specific statistics
      const streamInfo = this.changeStreams.get(config.collection);
      if (streamInfo) {
        streamInfo.processedCount++;
      }

      console.log(`Processing change event for ${config.collection}:`, {
        operationType: changeDoc.operationType,
        documentKey: changeDoc.documentKey,
        timestamp: changeDoc.clusterTime
      });

      // Apply transformations if configured
      const transformedChangeDoc = await this.applyTransformations(changeDoc, config);

      // Log change event for audit trail
      await this.logChangeEvent(transformedChangeDoc, config);

      // Process synchronization to configured targets
      if (this.config.enableParallelSync) {
        // Execute synchronizations in parallel
        const syncPromises = config.syncTargets.map(targetName =>
          this.synchronizeToTarget(transformedChangeDoc, config, targetName)
        );
        const syncResults = await Promise.allSettled(syncPromises);
        await this.processSyncResults(syncResults, transformedChangeDoc, config);
      } else {
        // Execute synchronizations sequentially, one target at a time
        for (const targetName of config.syncTargets) {
          try {
            await this.synchronizeToTarget(transformedChangeDoc, config, targetName);
            this.processingStats.successfulSyncs++;
          } catch (error) {
            this.processingStats.failedSyncs++;
            console.error('Sequential sync error:', error);
            await this.handleSyncError(error, transformedChangeDoc, config, targetName);
          }
        }
      }

      // Update processing time metrics
      const processingTime = Date.now() - startTime;
      this.updateProcessingMetrics(processingTime);

      // Emit processed event for external monitoring
      this.emit('changeProcessed', {
        collection: config.collection,
        operationType: changeDoc.operationType,
        documentKey: changeDoc.documentKey,
        processingTime: processingTime,
        syncTargets: config.syncTargets
      });

    } catch (error) {
      console.error('Error handling change event:', error);

      // Update error statistics
      const streamInfo = this.changeStreams.get(config.collection);
      if (streamInfo) {
        streamInfo.errorCount++;
      }

      // Log error for debugging
      await this.logError(error, changeDoc, config);

      // Emit error event
      this.emit('changeProcessingError', {
        error: error,
        changeDoc: changeDoc,
        config: config
      });
    }
  }

  async applyTransformations(changeDoc, config) {
    if (!config.transformations || config.transformations.length === 0) {
      return changeDoc;
    }

    console.log(`Applying ${config.transformations.length} transformations...`);

    let transformedDoc = { ...changeDoc };

    try {
      for (const transformationName of config.transformations) {
        transformedDoc = await this.applyTransformation(transformedDoc, transformationName, config);
      }

      return transformedDoc;

    } catch (error) {
      console.error('Error applying transformations:', error);
      // Return original document if transformation fails
      return changeDoc;
    }
  }

  async applyTransformation(changeDoc, transformationName, config) {
    switch (transformationName) {
      case 'priceCalculation':
        return await this.transformPriceCalculation(changeDoc);

      case 'stockValidation':
        return await this.transformStockValidation(changeDoc);

      case 'searchIndexing':
        return await this.transformSearchIndexing(changeDoc);

      case 'orderAnalytics':
        return await this.transformOrderAnalytics(changeDoc);

      case 'customerInsights':
        return await this.transformCustomerInsights(changeDoc);

      case 'inventoryImpact':
        return await this.transformInventoryImpact(changeDoc);

      case 'profileEnrichment':
        return await this.transformProfileEnrichment(changeDoc);

      case 'preferencesAnalysis':
        return await this.transformPreferencesAnalysis(changeDoc);

      case 'segmentation':
        return await this.transformSegmentation(changeDoc);

      default:
        console.warn(`Unknown transformation: ${transformationName}`);
        return changeDoc;
    }
  }

  async transformPriceCalculation(changeDoc) {
    if (changeDoc.operationType === 'update' && 
        changeDoc.updateDescription?.updatedFields?.price) {

      const newPrice = changeDoc.fullDocument?.price;
      const oldPrice = changeDoc.fullDocumentBeforeChange?.price;

      if (newPrice && oldPrice) {
        const priceChange = newPrice - oldPrice;
        const priceChangePercent = ((priceChange / oldPrice) * 100);

        changeDoc.enrichment = {
          ...changeDoc.enrichment,
          priceAnalysis: {
            oldPrice: oldPrice,
            newPrice: newPrice,
            priceChange: priceChange,
            priceChangePercent: Math.round(priceChangePercent * 100) / 100,
            priceDirection: priceChange > 0 ? 'increase' : 'decrease',
            significantChange: Math.abs(priceChangePercent) > 10
          }
        };
      }
    }

    return changeDoc;
  }

  async transformStockValidation(changeDoc) {
    if (changeDoc.fullDocument?.stock_quantity !== undefined) {
      const stockQuantity = changeDoc.fullDocument.stock_quantity;

      changeDoc.enrichment = {
        ...changeDoc.enrichment,
        stockAnalysis: {
          currentStock: stockQuantity,
          stockStatus: stockQuantity === 0 ? 'out_of_stock' : 
                      stockQuantity < 10 ? 'low_stock' : 'in_stock',
          restockNeeded: stockQuantity < 10,
          stockChangeAlert: changeDoc.operationType === 'update' && 
                           changeDoc.updateDescription?.updatedFields?.stock_quantity !== undefined
        }
      };
    }

    return changeDoc;
  }

  async transformSearchIndexing(changeDoc) {
    if (changeDoc.fullDocument) {
      const doc = changeDoc.fullDocument;

      // Generate search keywords and metadata
      const searchKeywords = [];
      if (doc.product_name) searchKeywords.push(...doc.product_name.toLowerCase().split(/\s+/));
      if (doc.category) searchKeywords.push(...doc.category.toLowerCase().split(/\s+/));
      if (doc.tags) searchKeywords.push(...doc.tags.map(tag => tag.toLowerCase()));

      changeDoc.enrichment = {
        ...changeDoc.enrichment,
        searchMetadata: {
          searchKeywords: [...new Set(searchKeywords)].filter(word => word.length > 2),
          searchableFields: ['product_name', 'category', 'description', 'tags'],
          indexPriority: doc.featured ? 'high' : 'normal',
          lastIndexUpdate: new Date()
        }
      };
    }

    return changeDoc;
  }

  async transformOrderAnalytics(changeDoc) {
    if (changeDoc.fullDocument && changeDoc.ns.coll === 'orders') {
      const order = changeDoc.fullDocument;

      // Calculate order metrics
      const orderValue = order.items?.reduce((sum, item) => sum + (item.price * item.quantity), 0) || 0;
      const itemCount = order.items?.reduce((sum, item) => sum + item.quantity, 0) || 0;

      changeDoc.enrichment = {
        ...changeDoc.enrichment,
        orderAnalytics: {
          orderValue: orderValue,
          itemCount: itemCount,
          averageItemValue: itemCount > 0 ? orderValue / itemCount : 0,
          customerSegment: orderValue > 500 ? 'high_value' : orderValue > 100 ? 'medium_value' : 'low_value',
          orderComplexity: itemCount > 5 ? 'complex' : 'simple'
        }
      };
    }

    return changeDoc;
  }

  async synchronizeToTarget(changeDoc, config, targetName) {
    console.log(`Synchronizing to target: ${targetName}`);

    try {
      // Find target configuration
      const targetConfig = this.config.syncTargets.find(t => t.name === targetName);
      if (!targetConfig || !targetConfig.enabled) {
        console.log(`Target ${targetName} is disabled, skipping sync`);
        return;
      }

      // Prepare synchronization payload
      const syncPayload = await this.prepareSyncPayload(changeDoc, config, targetName);

      // Execute synchronization based on target type
      const syncResult = await this.executeSynchronization(syncPayload, targetConfig);

      // Log successful synchronization
      await this.logSuccessfulSync(changeDoc, targetName, syncResult);

      return syncResult;

    } catch (error) {
      console.error(`Synchronization failed for target ${targetName}:`, error);

      // Handle sync error with retry logic
      await this.handleSyncError(error, changeDoc, config, targetName);
      throw error;
    }
  }

  async prepareSyncPayload(changeDoc, config, targetName) {
    const basePayload = {
      changeId: changeDoc._id?.toString() || `${Date.now()}-${Math.random()}`,
      timestamp: new Date(),
      source: {
        database: changeDoc.ns.db,
        collection: changeDoc.ns.coll,
        operationType: changeDoc.operationType
      },
      documentKey: changeDoc.documentKey,
      clusterTime: changeDoc.clusterTime,
      enrichment: changeDoc.enrichment || {}
    };

    // Add operation-specific data
    switch (changeDoc.operationType) {
      case 'insert':
        basePayload.document = changeDoc.fullDocument;
        break;

      case 'update':
        basePayload.document = changeDoc.fullDocument;
        basePayload.updateDescription = changeDoc.updateDescription;
        if (changeDoc.fullDocumentBeforeChange) {
          basePayload.documentBeforeChange = changeDoc.fullDocumentBeforeChange;
        }
        break;

      case 'replace':
        basePayload.document = changeDoc.fullDocument;
        if (changeDoc.fullDocumentBeforeChange) {
          basePayload.documentBeforeChange = changeDoc.fullDocumentBeforeChange;
        }
        break;

      case 'delete':
        if (changeDoc.fullDocumentBeforeChange) {
          basePayload.deletedDocument = changeDoc.fullDocumentBeforeChange;
        }
        break;
    }

    // Apply target-specific transformations
    return await this.applyTargetSpecificTransformations(basePayload, targetName);
  }

  async applyTargetSpecificTransformations(payload, targetName) {
    switch (targetName) {
      case 'search_service':
        return this.transformForSearchService(payload);

      case 'analytics_warehouse':
        return this.transformForAnalyticsWarehouse(payload);

      case 'recommendation_engine':
        return this.transformForRecommendationEngine(payload);

      case 'cache_invalidation':
        return this.transformForCacheInvalidation(payload);

      default:
        return payload;
    }
  }

  transformForSearchService(payload) {
    if (payload.source.collection === 'products') {
      return {
        ...payload,
        searchServiceData: {
          action: payload.source.operationType === 'delete' ? 'delete' : 'upsert',
          document: payload.document ? {
            id: payload.document._id,
            title: payload.document.product_name,
            content: payload.document.description,
            category: payload.document.category,
            price: payload.document.price,
            inStock: payload.document.stock_quantity > 0,
            keywords: payload.enrichment?.searchMetadata?.searchKeywords || [],
            lastUpdated: payload.timestamp
          } : null,
          priority: payload.enrichment?.searchMetadata?.indexPriority || 'normal'
        }
      };
    }

    return payload;
  }

  transformForAnalyticsWarehouse(payload) {
    return {
      ...payload,
      warehouseData: {
        entityType: payload.source.collection,
        eventType: `${payload.source.collection}_${payload.source.operationType}`,
        eventTimestamp: payload.timestamp,
        entityId: payload.documentKey._id,
        eventData: payload.document || payload.deletedDocument,
        changeMetadata: {
          updatedFields: payload.updateDescription?.updatedFields ? Object.keys(payload.updateDescription.updatedFields) : null,
          removedFields: payload.updateDescription?.removedFields || null
        },
        enrichment: payload.enrichment
      }
    };
  }

  transformForRecommendationEngine(payload) {
    if (payload.source.collection === 'orders') {
      return {
        ...payload,
        recommendationData: {
          userId: payload.document?.customer_id,
          items: payload.document?.items || [],
          orderValue: payload.enrichment?.orderAnalytics?.orderValue,
          customerSegment: payload.enrichment?.orderAnalytics?.customerSegment,
          eventType: 'purchase',
          timestamp: payload.timestamp
        }
      };
    }

    return payload;
  }

  transformForCacheInvalidation(payload) {
    const cacheKeys = [];

    // Generate cache keys based on the changed document
    if (payload.source.collection === 'products' && payload.documentKey._id) {
      cacheKeys.push(
        `product:${payload.documentKey._id}`,
        `products:category:${payload.document?.category}`,
        `products:search:*` // Wildcard for search result caches
      );
    }

    return {
      ...payload,
      cacheInvalidationData: {
        keys: cacheKeys,
        operation: 'invalidate',
        cascade: true,
        reason: `${payload.source.collection}_${payload.source.operationType}`
      }
    };
  }

  async executeSynchronization(syncPayload, targetConfig) {
    // Simulate different synchronization mechanisms
    switch (targetConfig.name) {
      case 'search_service':
        return await this.syncToSearchService(syncPayload);

      case 'analytics_warehouse':
        return await this.syncToAnalyticsWarehouse(syncPayload);

      case 'recommendation_engine':
        return await this.syncToRecommendationEngine(syncPayload);

      case 'cache_invalidation':
        return await this.syncToCacheService(syncPayload);

      default:
        throw new Error(`Unknown sync target: ${targetConfig.name}`);
    }
  }

  async syncToSearchService(payload) {
    // Simulate search service synchronization
    console.log('Syncing to search service:', payload.searchServiceData?.action);

    // Simulate API call delay
    await new Promise(resolve => setTimeout(resolve, 50));

    return {
      success: true,
      target: 'search_service',
      action: payload.searchServiceData?.action,
      processedAt: new Date(),
      responseTime: 50
    };
  }

  async syncToAnalyticsWarehouse(payload) {
    // Simulate analytics warehouse synchronization
    console.log('Syncing to analytics warehouse:', payload.warehouseData?.eventType);

    // Simulate processing delay
    await new Promise(resolve => setTimeout(resolve, 100));

    return {
      success: true,
      target: 'analytics_warehouse',
      eventType: payload.warehouseData?.eventType,
      processedAt: new Date(),
      responseTime: 100
    };
  }

  async syncToRecommendationEngine(payload) {
    // Simulate recommendation engine synchronization
    console.log('Syncing to recommendation engine');

    await new Promise(resolve => setTimeout(resolve, 75));

    return {
      success: true,
      target: 'recommendation_engine',
      processedAt: new Date(),
      responseTime: 75
    };
  }

  async syncToCacheService(payload) {
    // Simulate cache invalidation
    console.log('Invalidating cache keys:', payload.cacheInvalidationData?.keys);

    await new Promise(resolve => setTimeout(resolve, 25));

    return {
      success: true,
      target: 'cache_invalidation',
      keysInvalidated: payload.cacheInvalidationData?.keys?.length || 0,
      processedAt: new Date(),
      responseTime: 25
    };
  }

  async handleSyncError(error, changeDoc, config, targetName) {
    console.error(`Sync error for ${targetName}:`, error.message);

    // Log error for debugging
    await this.collections.errorLog.insertOne({
      errorType: 'sync_error',
      targetName: targetName,
      collection: config.collection,
      changeId: changeDoc._id?.toString(),
      documentKey: changeDoc.documentKey,
      error: {
        message: error.message,
        stack: error.stack,
        name: error.name
      },
      timestamp: new Date(),
      retryable: this.isRetryableError(error)
    });

    // Implement retry logic if enabled
    if (this.config.enableRetryLogic && this.isRetryableError(error)) {
      await this.scheduleRetry(changeDoc, config, targetName);
    } else if (this.config.deadLetterQueue) {
      await this.sendToDeadLetterQueue(changeDoc, config, targetName, error);
    }
  }

  isRetryableError(error) {
    // Define which errors should trigger retries
    const retryableErrorTypes = [
      'ECONNREFUSED',
      'ECONNRESET', 
      'ETIMEDOUT',
      'EAI_AGAIN',
      'ENOTFOUND'
    ];

    return retryableErrorTypes.includes(error.code) || 
           error.message?.includes('timeout') ||
           error.message?.includes('connection') ||
           (error.status >= 500 && error.status < 600);
  }

  async scheduleRetry(changeDoc, config, targetName) {
    // Implement exponential backoff retry
    const retryKey = `${config.collection}_${changeDoc.documentKey._id}_${targetName}`;

    // Get current retry count
    let retryCount = await this.getRetryCount(retryKey);

    if (retryCount < this.config.maxRetries) {
      const delayMs = this.config.retryDelayBase * Math.pow(2, retryCount);

      console.log(`Scheduling retry ${retryCount + 1} for ${targetName} in ${delayMs}ms`);

      setTimeout(async () => {
        try {
          await this.synchronizeToTarget(changeDoc, config, targetName);
          await this.clearRetryCount(retryKey);
        } catch (retryError) {
          await this.incrementRetryCount(retryKey);
          await this.handleSyncError(retryError, changeDoc, config, targetName);
        }
      }, delayMs);
    } else {
      console.error(`Max retries exceeded for ${targetName}, sending to dead letter queue`);
      await this.sendToDeadLetterQueue(changeDoc, config, targetName, new Error('Max retries exceeded'));
    }
  }

  async logChangeEvent(changeDoc, config) {
    try {
      await this.collections.changeLog.insertOne({
        changeId: changeDoc._id?.toString() || `${Date.now()}-${Math.random()}`,
        collection: config.collection,
        operationType: changeDoc.operationType,
        documentKey: changeDoc.documentKey,
        clusterTime: changeDoc.clusterTime,
        hasFullDocument: !!changeDoc.fullDocument,
        hasFullDocumentBeforeChange: !!changeDoc.fullDocumentBeforeChange,
        updateDescription: changeDoc.updateDescription,
        enrichment: changeDoc.enrichment,
        syncTargets: config.syncTargets,
        timestamp: new Date()
      });
    } catch (error) {
      console.warn('Error logging change event:', error.message);
    }
  }

  async saveResumeToken(collection, resumeToken) {
    try {
      await this.collections.cdcConfiguration.updateOne(
        { collection: collection },
        { 
          $set: { 
            lastResumeToken: resumeToken,
            lastResumeTokenUpdate: new Date()
          }
        }
      );

      this.resumeTokens.set(collection, resumeToken);
    } catch (error) {
      console.warn(`Error saving resume token for ${collection}:`, error.message);
    }
  }

  updateProcessingMetrics(processingTime) {
    const currentAvg = this.processingStats.avgProcessingTime;
    const totalProcessed = this.processingStats.totalChanges;

    this.processingStats.avgProcessingTime = 
      ((currentAvg * (totalProcessed - 1)) + processingTime) / totalProcessed;
  }

  async generateCDCHealthReport() {
    console.log('Generating CDC system health report...');

    try {
      const healthReport = {
        timestamp: new Date(),
        systemStatus: 'healthy',

        // Change stream status
        changeStreams: Array.from(this.changeStreams.entries()).map(([collection, info]) => ({
          collection: collection,
          status: info.stream.closed ? 'closed' : 'active',
          processedCount: info.processedCount,
          errorCount: info.errorCount,
          startTime: info.startTime,
          uptime: Date.now() - info.startTime.getTime()
        })),

        // Processing statistics
        processingStats: {
          ...this.processingStats,
          successRate: this.processingStats.totalChanges > 0 ? 
            (this.processingStats.successfulSyncs / this.processingStats.totalChanges * 100).toFixed(2) : 0,
          errorRate: this.processingStats.totalChanges > 0 ?
            (this.processingStats.failedSyncs / this.processingStats.totalChanges * 100).toFixed(2) : 0
        },

        // Sync target health
        syncTargetHealth: await this.checkSyncTargetHealth(),

        // Recent errors
        recentErrors: await this.collections.errorLog.find({
          timestamp: { $gte: new Date(Date.now() - 3600000) } // Last hour
        }).limit(10).toArray(),

        // Resume token status
        resumeTokenStatus: Array.from(this.resumeTokens.entries()).map(([collection, token]) => ({
          collection: collection,
          hasResumeToken: !!token,
          tokenAge: new Date() // Would calculate actual age in production
        }))
      };

      return healthReport;

    } catch (error) {
      console.error('Error generating health report:', error);
      return {
        timestamp: new Date(),
        systemStatus: 'error',
        error: error.message
      };
    }
  }

  async checkSyncTargetHealth() {
    const healthChecks = [];

    for (const target of this.config.syncTargets) {
      try {
        // Simulate health check for each target
        const healthCheck = {
          name: target.name,
          status: target.enabled ? 'healthy' : 'disabled',
          priority: target.priority,
          lastHealthCheck: new Date(),
          responseTime: Math.random() * 100 + 50 // Simulated response time
        };

        healthChecks.push(healthCheck);

      } catch (error) {
        healthChecks.push({
          name: target.name,
          status: 'unhealthy',
          error: error.message,
          lastHealthCheck: new Date()
        });
      }
    }

    return healthChecks;
  }

  // Utility methods for retry management
  async getRetryCount(retryKey) {
    const status = await this.collections.syncStatus.findOne({ retryKey: retryKey });
    return status ? status.retryCount : 0;
  }

  async incrementRetryCount(retryKey) {
    await this.collections.syncStatus.updateOne(
      { retryKey: retryKey },
      { 
        $inc: { retryCount: 1 },
        $set: { lastRetryAttempt: new Date() }
      },
      { upsert: true }
    );
  }

  async clearRetryCount(retryKey) {
    await this.collections.syncStatus.deleteOne({ retryKey: retryKey });
  }

  async sendToDeadLetterQueue(changeDoc, config, targetName, error) {
    console.log(`Sending to dead letter queue: ${config.collection} -> ${targetName}`);

    await this.collections.errorLog.insertOne({
      errorType: 'dead_letter_queue',
      collection: config.collection,
      targetName: targetName,
      changeDoc: changeDoc,
      config: config,
      error: {
        message: error.message,
        stack: error.stack
      },
      timestamp: new Date(),
      requiresManualIntervention: true
    });
  }
}

// Benefits of MongoDB Advanced Change Data Capture:
// - Real-time change detection with minimal latency and no polling overhead
// - Comprehensive change filtering and transformation capabilities
// - Built-in resume token support for fault-tolerant change stream processing
// - Advanced error handling with retry logic and dead letter queue management
// - Parallel synchronization to multiple targets with configurable priorities
// - Transaction-aware change ordering and consistency guarantees
// - Native MongoDB integration with minimal performance impact
// - Scalable architecture supporting high-volume change processing
// - Flexible transformation pipeline for data enrichment and formatting
// - SQL-compatible CDC operations through QueryLeaf integration

module.exports = {
  AdvancedCDCManager
};
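
As a quick orientation before digging into the architecture, the following usage sketch shows how the manager above might be wired up. The module path, database name, and sync target names are illustrative, and the configuration follows the syncTargets shape implied by checkSyncTargetHealth() (entries with name, enabled, and priority); treat it as a sketch rather than a production bootstrap.

// Usage sketch only -- module path, database name, and target names are illustrative
const { MongoClient } = require('mongodb');
const { AdvancedCDCManager } = require('./advanced-cdc-manager');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const cdcManager = new AdvancedCDCManager(client.db('ecommerce'), {
    syncTargets: [
      { name: 'search_service', enabled: true, priority: 1 },
      { name: 'analytics_warehouse', enabled: true, priority: 2 },
      { name: 'cache_invalidation', enabled: true, priority: 3 }
    ]
  });

  // Surface CDC health periodically for dashboards or alerting
  setInterval(async () => {
    const report = await cdcManager.generateCDCHealthReport();
    console.log(`CDC status: ${report.systemStatus}`, report.processingStats);
  }, 60000);
}

main().catch(console.error);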

Understanding MongoDB Change Data Capture Architecture

Advanced Change Stream and Synchronization Patterns

Implement sophisticated CDC patterns for production MongoDB deployments:

// Production-ready MongoDB CDC with enterprise-grade features
class EnterpriseCDCOrchestrator extends AdvancedCDCManager {
  constructor(db, enterpriseConfig) {
    super(db, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableDistributedProcessing: true,
      enableLoadBalancing: true,
      enableFailoverHandling: true,
      enableComplianceAuditing: true,
      enableDataLineage: true,
      enableSchemaEvolution: true
    };

    this.setupEnterpriseFeatures();
    this.initializeDistributedCDC();
    this.setupComplianceFramework();
  }

  async implementDistributedCDCProcessing() {
    console.log('Implementing distributed CDC processing across multiple nodes...');

    const distributedConfig = {
      // Multi-node change stream distribution
      nodeDistribution: {
        enableShardAwareness: true,
        balanceAcrossNodes: true,
        minimizeCrossShardOperations: true,
        optimizeForReplicaSetTopology: true
      },

      // Load balancing strategies
      loadBalancing: {
        dynamicWorkloadDistribution: true,
        nodeCapacityAware: true,
        latencyOptimized: true,
        failoverCapable: true
      },

      // Consistency guarantees
      consistencyManagement: {
        maintainChangeOrdering: true,
        transactionBoundaryRespect: true,
        causalConsistencyPreservation: true,
        replicationLagHandling: true
      }
    };

    return await this.deployDistributedCDC(distributedConfig);
  }

  async setupEnterpriseComplianceFramework() {
    console.log('Setting up enterprise compliance framework for CDC...');

    const complianceFramework = {
      // Data governance
      dataGovernance: {
        changeDataClassification: true,
        sensitiveDataDetection: true,
        accessControlEnforcement: true,
        dataRetentionPolicies: true
      },

      // Audit requirements
      auditCompliance: {
        comprehensiveChangeLogging: true,
        tamperEvidenceCapture: true,
        regulatoryReportingSupport: true,
        retentionPolicyEnforcement: true
      },

      // Security controls
      securityCompliance: {
        encryptionInTransit: true,
        encryptionAtRest: true,
        accessControlValidation: true,
        nonRepudiationSupport: true
      }
    };

    return await this.implementComplianceFramework(complianceFramework);
  }
}
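
The orchestrator above delegates to helpers such as deployDistributedCDC() and implementComplianceFramework() that are not shown here. As a rough illustration of the distribution step only, the sketch below partitions the monitored collections across worker processes and records the assignment in the cdcConfiguration collection used elsewhere by the manager; the monitoredCollections option and the CDC_WORKER_* environment variables are assumptions introduced for this sketch, not part of the original class.

// Illustrative sketch only -- not the original implementation.
// Assumes this.config.monitoredCollections (array of collection names) and
// CDC_WORKER_ID / CDC_WORKER_COUNT environment variables, none of which appear above.
class DistributedCDCOrchestratorSketch extends EnterpriseCDCOrchestrator {
  async deployDistributedCDC(distributedConfig) {
    const workerIndex = parseInt(process.env.CDC_WORKER_ID || '0', 10);
    const workerCount = Math.max(1, parseInt(process.env.CDC_WORKER_COUNT || '1', 10));

    // Deterministic round-robin assignment keeps every collection owned by exactly one worker
    const assignedCollections = (this.config.monitoredCollections || [])
      .filter((_, index) => index % workerCount === workerIndex);

    // Persist the topology so operators and other workers can inspect it
    await this.collections.cdcConfiguration.updateOne(
      { _id: `distribution:worker-${workerIndex}` },
      {
        $set: {
          workerIndex,
          assignedCollections,
          distributedConfig,
          updatedAt: new Date()
        }
      },
      { upsert: true }
    );

    return { workerIndex, workerCount, assignedCollections };
  }
}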

SQL-Style Change Data Capture with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Change Data Capture and real-time synchronization operations:

-- QueryLeaf advanced change data capture with SQL-familiar syntax for MongoDB

-- Configure comprehensive change data capture with advanced filtering and routing
CONFIGURE CHANGE_DATA_CAPTURE
SET enabled = true,
    resume_tokens = true,
    batch_size = 100,
    max_await_time_ms = 1000,
    full_document = 'updateLookup',
    full_document_before_change = 'whenAvailable';

-- Setup change stream monitoring with sophisticated filtering and transformation
CREATE CHANGE_STREAM product_changes_stream AS
WITH change_filtering AS (
  -- Advanced change detection filters
  SELECT 
    change_id,
    operation_type,
    collection_name,
    document_key,
    cluster_time,
    full_document,
    full_document_before_change,
    update_description,

    -- Intelligent change classification
    CASE 
      WHEN operation_type = 'insert' THEN 'new_product'
      WHEN operation_type = 'delete' THEN 'product_removal'
      WHEN operation_type = 'update' AND 
           JSON_EXTRACT(update_description, '$.updatedFields.price') IS NOT NULL THEN 'price_change'
      WHEN operation_type = 'update' AND
           JSON_EXTRACT(update_description, '$.updatedFields.stock_quantity') IS NOT NULL THEN 'inventory_change'
      WHEN operation_type = 'update' AND
           JSON_EXTRACT(update_description, '$.updatedFields.status') IS NOT NULL THEN 'status_change'
      ELSE 'general_update'
    END as change_category,

    -- Priority scoring for processing order
    CASE 
      WHEN operation_type = 'delete' THEN 5
      WHEN operation_type = 'insert' THEN 4
      WHEN JSON_EXTRACT(update_description, '$.updatedFields.price') IS NOT NULL THEN 4
      WHEN JSON_EXTRACT(update_description, '$.updatedFields.stock_quantity') IS NOT NULL THEN 3
      WHEN JSON_EXTRACT(update_description, '$.updatedFields.status') IS NOT NULL THEN 3
      ELSE 1
    END as processing_priority,

    -- Impact assessment
    CASE 
      WHEN operation_type IN ('insert', 'delete') THEN 'high'
      WHEN JSON_EXTRACT(update_description, '$.updatedFields.price') IS NOT NULL THEN
        CASE 
          WHEN ABS(
            CAST(JSON_EXTRACT(full_document, '$.price') AS DECIMAL(10,2)) - 
            CAST(JSON_EXTRACT(full_document_before_change, '$.price') AS DECIMAL(10,2))
          ) > 50 THEN 'high'
          WHEN ABS(
            CAST(JSON_EXTRACT(full_document, '$.price') AS DECIMAL(10,2)) - 
            CAST(JSON_EXTRACT(full_document_before_change, '$.price') AS DECIMAL(10,2))
          ) > 10 THEN 'medium'
          ELSE 'low'
        END
      WHEN JSON_EXTRACT(update_description, '$.updatedFields.stock_quantity') IS NOT NULL THEN
        CASE 
          WHEN CAST(JSON_EXTRACT(full_document, '$.stock_quantity') AS INTEGER) = 0 THEN 'high'
          WHEN CAST(JSON_EXTRACT(full_document, '$.stock_quantity') AS INTEGER) < 10 THEN 'medium'
          ELSE 'low'
        END
      ELSE 'low'
    END as business_impact,

    -- Synchronization target determination
    ARRAY[
      CASE WHEN change_category IN ('new_product', 'product_removal', 'price_change') 
           THEN 'search_service' ELSE NULL END,
      CASE WHEN change_category IN ('new_product', 'price_change', 'inventory_change')
           THEN 'analytics_warehouse' ELSE NULL END,
      CASE WHEN change_category IN ('new_product', 'price_change', 'inventory_change', 'status_change')
           THEN 'cache_invalidation' ELSE NULL END,
      CASE WHEN change_category IN ('new_product', 'price_change')
           THEN 'recommendation_engine' ELSE NULL END
    ]::TEXT[] as sync_targets,

    -- Enhanced change metadata
    JSON_OBJECT(
      'change_detected_at', CURRENT_TIMESTAMP,
      'source_replica_set', CONNECTION_INFO('replica_set'),
      'source_node', CONNECTION_INFO('host'),
      'change_stream_id', CHANGE_STREAM_INFO('stream_id'),
      'resume_token', CHANGE_STREAM_INFO('resume_token')
    ) as change_metadata,

    CURRENT_TIMESTAMP as processed_at

  FROM CHANGE_STREAM('products')
  WHERE 
    -- Filter criteria for relevant changes
    operation_type IN ('insert', 'update', 'delete')
    AND (
      operation_type IN ('insert', 'delete') OR
      update_description IS NOT NULL AND (
        JSON_EXTRACT(update_description, '$.updatedFields.price') IS NOT NULL OR
        JSON_EXTRACT(update_description, '$.updatedFields.stock_quantity') IS NOT NULL OR
        JSON_EXTRACT(update_description, '$.updatedFields.status') IS NOT NULL OR
        JSON_EXTRACT(update_description, '$.updatedFields.category') IS NOT NULL
      )
    )
),

change_enrichment AS (
  SELECT 
    cf.*,

    -- Price change analysis
    CASE 
      WHEN cf.change_category = 'price_change' THEN
        JSON_OBJECT(
          'old_price', CAST(JSON_EXTRACT(cf.full_document_before_change, '$.price') AS DECIMAL(10,2)),
          'new_price', CAST(JSON_EXTRACT(cf.full_document, '$.price') AS DECIMAL(10,2)),
          'price_change', 
            CAST(JSON_EXTRACT(cf.full_document, '$.price') AS DECIMAL(10,2)) - 
            CAST(JSON_EXTRACT(cf.full_document_before_change, '$.price') AS DECIMAL(10,2)),
          'price_change_percent',
            ROUND(
              ((CAST(JSON_EXTRACT(cf.full_document, '$.price') AS DECIMAL(10,2)) - 
                CAST(JSON_EXTRACT(cf.full_document_before_change, '$.price') AS DECIMAL(10,2))) /
               CAST(JSON_EXTRACT(cf.full_document_before_change, '$.price') AS DECIMAL(10,2))) * 100,
              2
            ),
          'price_direction',
            CASE 
              WHEN CAST(JSON_EXTRACT(cf.full_document, '$.price') AS DECIMAL(10,2)) > 
                   CAST(JSON_EXTRACT(cf.full_document_before_change, '$.price') AS DECIMAL(10,2))
              THEN 'increase' ELSE 'decrease'
            END
        )
      ELSE NULL
    END as price_analysis,

    -- Inventory change analysis
    CASE 
      WHEN cf.change_category = 'inventory_change' THEN
        JSON_OBJECT(
          'old_stock', CAST(JSON_EXTRACT(cf.full_document_before_change, '$.stock_quantity') AS INTEGER),
          'new_stock', CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER),
          'stock_change',
            CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER) -
            CAST(JSON_EXTRACT(cf.full_document_before_change, '$.stock_quantity') AS INTEGER),
          'stock_status',
            CASE 
              WHEN CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER) = 0 THEN 'out_of_stock'
              WHEN CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER) < 10 THEN 'low_stock'
              WHEN CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER) > 100 THEN 'high_stock'
              ELSE 'normal_stock'
            END,
          'restock_needed',
            CAST(JSON_EXTRACT(cf.full_document, '$.stock_quantity') AS INTEGER) < 10
        )
      ELSE NULL
    END as inventory_analysis,

    -- Generate search keywords for search service sync
    CASE 
      WHEN 'search_service' = ANY(cf.sync_targets) THEN
        ARRAY_CAT(
          STRING_TO_ARRAY(LOWER(JSON_EXTRACT_TEXT(cf.full_document, '$.product_name')), ' '),
          STRING_TO_ARRAY(LOWER(JSON_EXTRACT_TEXT(cf.full_document, '$.category')), ' ')
        )
      ELSE NULL
    END as search_keywords,

    -- Cache invalidation keys
    CASE 
      WHEN 'cache_invalidation' = ANY(cf.sync_targets) THEN
        ARRAY[
          'product:' || JSON_EXTRACT_TEXT(cf.document_key, '$._id'),
          'products:category:' || JSON_EXTRACT_TEXT(cf.full_document, '$.category'),
          'products:search:*'
        ]
      ELSE NULL
    END as cache_keys_to_invalidate

  FROM change_filtering cf
),

sync_routing AS (
  SELECT 
    ce.*,

    -- Generate target-specific sync payloads
    UNNEST(
      ARRAY_REMOVE(ce.sync_targets, NULL)
    ) as sync_target,

    -- Create sync payload based on target
    CASE UNNEST(ARRAY_REMOVE(ce.sync_targets, NULL))
      WHEN 'search_service' THEN
        JSON_OBJECT(
          'action', 
            CASE ce.operation_type 
              WHEN 'delete' THEN 'delete'
              ELSE 'upsert'
            END,
          'document', 
            CASE ce.operation_type
              WHEN 'delete' THEN NULL
              ELSE JSON_OBJECT(
                'id', JSON_EXTRACT_TEXT(ce.document_key, '$._id'),
                'title', JSON_EXTRACT_TEXT(ce.full_document, '$.product_name'),
                'content', JSON_EXTRACT_TEXT(ce.full_document, '$.description'),
                'category', JSON_EXTRACT_TEXT(ce.full_document, '$.category'),
                'price', CAST(JSON_EXTRACT(ce.full_document, '$.price') AS DECIMAL(10,2)),
                'in_stock', CAST(JSON_EXTRACT(ce.full_document, '$.stock_quantity') AS INTEGER) > 0,
                'keywords', ce.search_keywords,
                'last_updated', ce.processed_at
              )
            END,
          'priority', 
            CASE ce.business_impact
              WHEN 'high' THEN 'urgent'
              WHEN 'medium' THEN 'normal'
              ELSE 'low'
            END
        )

      WHEN 'analytics_warehouse' THEN
        JSON_OBJECT(
          'entity_type', 'product',
          'event_type', 'product_' || ce.operation_type,
          'event_timestamp', ce.processed_at,
          'entity_id', JSON_EXTRACT_TEXT(ce.document_key, '$._id'),
          'event_data', 
            CASE ce.operation_type
              WHEN 'delete' THEN ce.full_document_before_change
              ELSE ce.full_document
            END,
          'change_metadata', JSON_OBJECT(
            'updated_fields', 
              CASE WHEN ce.update_description IS NOT NULL THEN
                JSON_EXTRACT(ce.update_description, '$.updatedFields')
              ELSE NULL END,
            'removed_fields',
              CASE WHEN ce.update_description IS NOT NULL THEN
                JSON_EXTRACT(ce.update_description, '$.removedFields') 
              ELSE NULL END
          ),
          'enrichment', JSON_OBJECT(
            'price_analysis', ce.price_analysis,
            'inventory_analysis', ce.inventory_analysis,
            'business_impact', ce.business_impact,
            'change_category', ce.change_category
          )
        )

      WHEN 'recommendation_engine' THEN
        JSON_OBJECT(
          'entity_type', 'product',
          'entity_id', JSON_EXTRACT_TEXT(ce.document_key, '$._id'),
          'action', 
            CASE ce.operation_type
              WHEN 'delete' THEN 'remove'
              WHEN 'insert' THEN 'add'
              ELSE 'update'
            END,
          'product_data', 
            CASE ce.operation_type
              WHEN 'delete' THEN ce.full_document_before_change
              ELSE ce.full_document
            END,
          'recommendation_hints', JSON_OBJECT(
            'price_changed', ce.price_analysis IS NOT NULL,
            'new_product', ce.operation_type = 'insert',
            'business_impact', ce.business_impact,
            'category', JSON_EXTRACT_TEXT(ce.full_document, '$.category')
          )
        )

      WHEN 'cache_invalidation' THEN  
        JSON_OBJECT(
          'operation', 'invalidate',
          'keys', ce.cache_keys_to_invalidate,
          'cascade', true,
          'reason', ce.change_category,
          'priority', 
            CASE ce.business_impact
              WHEN 'high' THEN 1
              WHEN 'medium' THEN 2
              ELSE 3
            END
        )

      ELSE JSON_OBJECT('error', 'unknown_sync_target')
    END as sync_payload

  FROM change_enrichment ce
)

-- Execute synchronization with comprehensive tracking and monitoring
INSERT INTO sync_operations (
  change_id,
  operation_type,
  collection_name,
  document_key,
  change_category,
  business_impact,
  processing_priority,
  sync_target,
  sync_payload,
  sync_status,
  sync_attempt_count,
  created_at,
  scheduled_for
)
SELECT 
  sr.change_id,
  sr.operation_type,
  sr.collection_name,
  sr.document_key,
  sr.change_category,
  sr.business_impact,
  sr.processing_priority,
  sr.sync_target,
  sr.sync_payload,
  'pending' as sync_status,
  0 as sync_attempt_count,
  sr.processed_at as created_at,

  -- Schedule based on priority
  sr.processed_at + 
  CASE sr.processing_priority
    WHEN 5 THEN INTERVAL '0 seconds'     -- Immediate for deletes
    WHEN 4 THEN INTERVAL '5 seconds'     -- Near immediate for high priority
    WHEN 3 THEN INTERVAL '30 seconds'    -- Medium priority
    WHEN 2 THEN INTERVAL '2 minutes'     -- Lower priority
    ELSE INTERVAL '5 minutes'            -- Lowest priority
  END as scheduled_for

FROM sync_routing sr
ORDER BY sr.processing_priority DESC, sr.processed_at ASC;

-- Advanced change stream monitoring and analytics
WITH change_stream_analytics AS (
  SELECT 
    DATE_TRUNC('minute', processed_at) as time_bucket,
    change_category,
    business_impact,

    -- Volume metrics
    COUNT(*) as change_count,
    COUNT(DISTINCT document_key) as unique_documents_changed,

    -- Operation type distribution
    COUNT(*) FILTER (WHERE operation_type = 'insert') as inserts,
    COUNT(*) FILTER (WHERE operation_type = 'update') as updates, 
    COUNT(*) FILTER (WHERE operation_type = 'delete') as deletes,

    -- Business impact distribution
    COUNT(*) FILTER (WHERE business_impact = 'high') as high_impact_changes,
    COUNT(*) FILTER (WHERE business_impact = 'medium') as medium_impact_changes,
    COUNT(*) FILTER (WHERE business_impact = 'low') as low_impact_changes,

    -- Sync target requirements
    SUM(ARRAY_LENGTH(sync_targets, 1)) as total_sync_operations,
    AVG(ARRAY_LENGTH(sync_targets, 1)) as avg_sync_targets_per_change,

    -- Processing latency (from cluster time to processing)
    AVG(
      EXTRACT(MILLISECONDS FROM processed_at - cluster_time)
    ) as avg_processing_latency_ms,

    PERCENTILE_CONT(0.95) WITHIN GROUP (
      ORDER BY EXTRACT(MILLISECONDS FROM processed_at - cluster_time)
    ) as p95_processing_latency_ms

  FROM CHANGE_STREAM_LOG
  WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY DATE_TRUNC('minute', processed_at), change_category, business_impact
),

sync_performance_analysis AS (
  SELECT 
    DATE_TRUNC('minute', created_at) as time_bucket,
    sync_target,

    -- Sync success metrics
    COUNT(*) as total_sync_attempts,
    COUNT(*) FILTER (WHERE sync_status = 'completed') as successful_syncs,
    COUNT(*) FILTER (WHERE sync_status = 'failed') as failed_syncs,
    COUNT(*) FILTER (WHERE sync_status = 'pending') as pending_syncs,
    COUNT(*) FILTER (WHERE sync_status = 'retrying') as retrying_syncs,

    -- Performance metrics
    AVG(sync_duration_ms) as avg_sync_duration_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY sync_duration_ms) as p95_sync_duration_ms,
    AVG(sync_attempt_count) as avg_retry_count,

    -- Success rate calculation
    ROUND(
      (COUNT(*) FILTER (WHERE sync_status = 'completed')::FLOAT / 
       NULLIF(COUNT(*), 0)) * 100, 
      2
    ) as success_rate_percent,

    -- Queue depth analysis
    AVG(
      EXTRACT(MILLISECONDS FROM sync_started_at - scheduled_for)
    ) as avg_queue_wait_time_ms

  FROM sync_operations
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY DATE_TRUNC('minute', created_at), sync_target
)

SELECT 
  csa.time_bucket,
  csa.change_category,
  csa.business_impact,

  -- Change stream metrics
  csa.change_count,
  csa.unique_documents_changed,
  csa.inserts,
  csa.updates,
  csa.deletes,

  -- Impact distribution
  csa.high_impact_changes,
  csa.medium_impact_changes,
  csa.low_impact_changes,

  -- Processing performance
  ROUND(csa.avg_processing_latency_ms::NUMERIC, 2) as avg_processing_latency_ms,
  ROUND(csa.p95_processing_latency_ms::NUMERIC, 2) as p95_processing_latency_ms,

  -- Sync requirements
  csa.total_sync_operations,
  ROUND(csa.avg_sync_targets_per_change::NUMERIC, 2) as avg_sync_targets_per_change,

  -- Sync performance by target
  JSON_OBJECT_AGG(
    spa.sync_target,
    JSON_OBJECT(
      'success_rate', spa.success_rate_percent,
      'avg_duration_ms', ROUND(spa.avg_sync_duration_ms::NUMERIC, 2),
      'p95_duration_ms', ROUND(spa.p95_sync_duration_ms::NUMERIC, 2),
      'avg_queue_wait_ms', ROUND(spa.avg_queue_wait_time_ms::NUMERIC, 2),
      'pending_count', spa.pending_syncs,
      'retry_count', spa.retrying_syncs
    )
  ) as sync_performance_by_target,

  -- Health indicators
  CASE 
    WHEN AVG(spa.success_rate_percent) > 95 THEN 'healthy'
    WHEN AVG(spa.success_rate_percent) > 90 THEN 'warning'
    ELSE 'critical'
  END as overall_health_status,

  -- Alerts and recommendations
  ARRAY[
    CASE WHEN csa.avg_processing_latency_ms > 1000 
         THEN 'High processing latency detected' END,
    CASE WHEN AVG(spa.success_rate_percent) < 95 
         THEN 'Sync success rate below threshold' END,
    CASE WHEN AVG(spa.avg_queue_wait_time_ms) > 30000 
         THEN 'High sync queue wait times' END,
    CASE WHEN csa.high_impact_changes > 100 
         THEN 'Unusual volume of high-impact changes' END
  ]::TEXT[] as alert_conditions

FROM change_stream_analytics csa
LEFT JOIN sync_performance_analysis spa ON csa.time_bucket = spa.time_bucket
GROUP BY 
  csa.time_bucket, csa.change_category, csa.business_impact,
  csa.change_count, csa.unique_documents_changed, csa.inserts, csa.updates, csa.deletes,
  csa.high_impact_changes, csa.medium_impact_changes, csa.low_impact_changes,
  csa.avg_processing_latency_ms, csa.p95_processing_latency_ms,
  csa.total_sync_operations, csa.avg_sync_targets_per_change
ORDER BY csa.time_bucket DESC, csa.change_count DESC;

-- Real-time CDC health monitoring dashboard
CREATE VIEW cdc_health_dashboard AS
WITH real_time_metrics AS (
  SELECT 
    -- Current timestamp for real-time display
    CURRENT_TIMESTAMP as dashboard_time,

    -- Change stream status
    (SELECT COUNT(*) FROM ACTIVE_CHANGE_STREAMS) as active_streams,
    (SELECT COUNT(*) FROM CHANGE_STREAM_LOG 
     WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as changes_last_minute,
    (SELECT COUNT(*) FROM CHANGE_STREAM_LOG 
     WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as changes_last_5_minutes,

    -- Sync operation status
    (SELECT COUNT(*) FROM sync_operations WHERE sync_status = 'pending') as pending_syncs,
    (SELECT COUNT(*) FROM sync_operations WHERE sync_status = 'retrying') as retrying_syncs,
    (SELECT COUNT(*) FROM sync_operations WHERE sync_status = 'failed') as failed_syncs,

    -- Performance indicators
    (SELECT AVG(EXTRACT(MILLISECONDS FROM processed_at - cluster_time))
     FROM CHANGE_STREAM_LOG 
     WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as avg_latency_5m,

    (SELECT AVG(sync_duration_ms) FROM sync_operations 
     WHERE sync_completed_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
     AND sync_status = 'completed') as avg_sync_duration_5m,

    -- Error rates
    (SELECT COUNT(*) FROM sync_operations 
     WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
     AND sync_status = 'failed') as sync_failures_5m,

    (SELECT COUNT(*) FROM sync_operations
     WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '5 minutes') as total_syncs_5m
)

SELECT 
  dashboard_time,

  -- Stream health
  active_streams,
  changes_last_minute,
  changes_last_5_minutes,
  ROUND(changes_last_5_minutes / 5.0, 1) as avg_changes_per_minute,

  -- Queue status
  pending_syncs,
  retrying_syncs, 
  failed_syncs,

  -- Performance metrics
  ROUND(avg_latency_5m::NUMERIC, 2) as avg_processing_latency_ms,
  ROUND(avg_sync_duration_5m::NUMERIC, 2) as avg_sync_duration_ms,

  -- Health indicators
  CASE 
    WHEN pending_syncs + retrying_syncs > 1000 THEN 'critical'
    WHEN pending_syncs + retrying_syncs > 500 THEN 'warning'
    WHEN avg_latency_5m > 1000 THEN 'warning'
    ELSE 'healthy'
  END as system_health,

  -- Error rates
  sync_failures_5m,
  total_syncs_5m,
  CASE 
    WHEN total_syncs_5m > 0 THEN
      ROUND((sync_failures_5m::FLOAT / total_syncs_5m * 100)::NUMERIC, 2)
    ELSE 0
  END as error_rate_percent,

  -- Capacity indicators
  CASE 
    WHEN changes_last_minute > 1000 THEN 'high_volume'
    WHEN changes_last_minute > 100 THEN 'medium_volume'
    ELSE 'normal_volume'  
  END as current_load,

  -- Operational status
  ARRAY[
    CASE WHEN active_streams = 0 THEN 'No active change streams' END,
    CASE WHEN failed_syncs > 10 THEN 'High number of failed syncs' END,
    CASE WHEN pending_syncs > 500 THEN 'High sync queue depth' END,
    CASE WHEN avg_latency_5m > 1000 THEN 'High processing latency' END
  ]::TEXT[] as current_alerts

FROM real_time_metrics;

-- QueryLeaf provides comprehensive MongoDB CDC capabilities:
-- 1. SQL-familiar syntax for change stream configuration and monitoring
-- 2. Advanced change filtering and routing with business logic integration
-- 3. Intelligent sync target determination based on change characteristics
-- 4. Real-time change processing with priority-based queue management
-- 5. Comprehensive error handling and retry mechanisms with dead letter queues
-- 6. Advanced performance monitoring and analytics with health indicators
-- 7. Production-ready CDC operations with resume token management
-- 8. Integration with MongoDB's native change stream capabilities
-- 9. Sophisticated transformation and enrichment pipeline support
-- 10. Enterprise-grade compliance and audit trail capabilities

Best Practices for Production CDC Implementation

Change Data Capture Strategy Design

Essential principles for effective MongoDB CDC deployment and management:

  1. Stream Configuration: Configure change streams with appropriate filtering, batch sizing, and resume token management for reliability (see the driver-level sketch after this list)
  2. Transformation Pipeline: Design flexible transformation pipelines that can adapt to schema evolution and business requirement changes
  3. Error Handling: Implement comprehensive error handling with retry logic, dead letter queues, and alert mechanisms
  4. Performance Optimization: Optimize CDC performance through intelligent batching, parallel processing, and resource management
  5. Monitoring and Observability: Deploy comprehensive monitoring that tracks change stream health, sync performance, and business metrics
  6. Scalability Planning: Design CDC architecture that can scale with data volume growth and increasing synchronization requirements
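
For the stream configuration and error handling points above, the corresponding native driver calls are small. The sketch below assumes MongoDB 6.0 or later when fullDocumentBeforeChange is used (the collection must have pre- and post-images enabled), and it uses illustrative resume_tokens and cdc_dead_letter collections for token persistence and poison-message handling.

// Minimal change stream configuration sketch (Node.js driver).
// The resume_tokens and cdc_dead_letter collection names are illustrative.
async function watchProducts(db) {
  const savedToken = await db.collection('resume_tokens').findOne({ _id: 'products' });

  const changeStream = db.collection('products').watch(
    [
      // Filter to the operations the downstream consumers care about
      { $match: { operationType: { $in: ['insert', 'update', 'delete'] } } }
    ],
    {
      fullDocument: 'updateLookup',
      fullDocumentBeforeChange: 'whenAvailable',
      batchSize: 100,
      maxAwaitTimeMS: 1000,
      ...(savedToken ? { resumeAfter: savedToken.token } : {})
    }
  );

  for await (const change of changeStream) {
    try {
      // Hand off to downstream sync logic here
      console.log(change.operationType, change.documentKey);
    } catch (err) {
      // Route poison messages to a dead letter collection instead of crashing the stream
      await db.collection('cdc_dead_letter').insertOne({ change, error: err.message, at: new Date() });
    }

    // Persist the resume token after each processed event for fault-tolerant restarts
    await db.collection('resume_tokens').updateOne(
      { _id: 'products' },
      { $set: { token: change._id, updatedAt: new Date() } },
      { upsert: true }
    );
  }
}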

Enterprise CDC Deployment

Optimize CDC systems for production enterprise environments:

  1. Distributed Processing: Implement distributed CDC processing that can handle high-volume change streams across multiple nodes
  2. Compliance Integration: Ensure CDC operations meet regulatory requirements for data lineage, audit trails, and access controls
  3. Disaster Recovery: Design CDC systems with failover capabilities and data recovery procedures for business continuity (a resume-based restart sketch follows this list)
  4. Security Controls: Implement encryption, access controls, and security monitoring for CDC data flows
  5. Operational Integration: Integrate CDC with existing monitoring, alerting, and operational workflows for seamless management
  6. Cost Optimization: Monitor and optimize CDC resource usage and synchronization costs for efficient operations
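
For the disaster recovery point above, the essential building block is reopening a change stream from the last persisted resume token after a failure. A rough restart-loop sketch is shown below; loadToken, saveToken, and processChange are assumed callbacks supplied by the caller rather than parts of the classes shown earlier.

// Illustrative failover loop: reopen the change stream from the last persisted
// resume token whenever it errors. loadToken/saveToken/processChange are assumed helpers.
async function runResilientStream(db, collectionName, processChange, loadToken, saveToken) {
  let resumeToken = await loadToken(collectionName);

  // Keep the stream alive indefinitely; transient failures trigger a resumed restart
  for (;;) {
    const stream = db.collection(collectionName).watch([], {
      fullDocument: 'updateLookup',
      ...(resumeToken ? { resumeAfter: resumeToken } : {})
    });

    try {
      for await (const change of stream) {
        await processChange(change);
        resumeToken = change._id;
        await saveToken(collectionName, resumeToken);
      }
    } catch (err) {
      console.warn(`Change stream for ${collectionName} failed, resuming:`, err.message);
      await new Promise(resolve => setTimeout(resolve, 5000)); // simple backoff before reopening
    } finally {
      await stream.close();
    }
  }
}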

Conclusion

MongoDB Change Data Capture provides sophisticated real-time data synchronization capabilities that enable modern event-driven architectures, distributed system integration, and responsive data pipelines without the complexity and limitations of traditional CDC approaches. Native change streams offer reliable, scalable, and efficient change detection with minimal performance impact and comprehensive transformation capabilities.

Key MongoDB CDC benefits include:

  • Real-Time Synchronization: Immediate change detection and propagation without polling delays or batch processing limitations
  • Advanced Filtering: Sophisticated change stream filtering and routing based on business logic and data characteristics
  • Fault Tolerance: Built-in resume token support and error handling for reliable change stream processing
  • Scalable Architecture: Native MongoDB integration that scales efficiently with data volume and system complexity
  • Flexible Transformations: Comprehensive data transformation and enrichment capabilities for target-specific synchronization
  • SQL Accessibility: Familiar SQL-style CDC operations through QueryLeaf for accessible change data capture management

Whether you're building event-driven microservices, maintaining data warehouse synchronization, implementing search index updates, or orchestrating complex distributed system workflows, MongoDB CDC with QueryLeaf's familiar SQL interface provides the foundation for reliable, efficient, and scalable real-time data synchronization.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style CDC configurations into MongoDB's native change streams, making advanced real-time synchronization accessible to SQL-oriented development teams. Complex change filtering, transformation pipelines, and sync orchestration are seamlessly handled through familiar SQL constructs, enabling sophisticated event-driven architectures without requiring deep MongoDB change stream expertise.

The combination of MongoDB's robust change data capture capabilities with SQL-style synchronization operations makes it an ideal platform for applications requiring both real-time data propagation and familiar database management patterns, ensuring your distributed systems can maintain data consistency and responsiveness as they scale and evolve.

MongoDB Bulk Operations and Performance Optimization: Advanced Batch Processing and High-Throughput Data Management

High-performance data processing applications require sophisticated bulk operation strategies that can handle large volumes of data efficiently while maintaining consistency and performance under varying load conditions. Traditional row-by-row database operations become prohibitively slow when processing thousands or millions of records, leading to application bottlenecks, extended processing times, and resource exhaustion in production environments.

MongoDB provides comprehensive bulk operation capabilities that enable high-throughput batch processing for insertions, updates, and deletions through optimized write strategies and intelligent batching mechanisms. Unlike traditional databases that require complex stored procedures or application-level batching logic, MongoDB's bulk operations leverage server-side optimization, write concern management, and atomic operation guarantees to deliver superior performance for large-scale data processing scenarios.
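
Before looking at the traditional approach, it helps to see the shape of the MongoDB alternative: a single bulkWrite() call can mix inserts, updates, and deletes, run unordered for throughput, and carry its own write concern. The collection and field names in this sketch are illustrative.

// Basic mixed bulkWrite() sketch (Node.js driver); collection and fields are illustrative.
async function applyInventoryChanges(db) {
  const result = await db.collection('products').bulkWrite(
    [
      { insertOne: { document: { sku: 'SKU-1001', name: 'Widget', price: 19.99, stock: 250 } } },
      { updateOne: { filter: { sku: 'SKU-0042' }, update: { $inc: { stock: -5 } } } },
      { updateMany: { filter: { category: 'seasonal' }, update: { $set: { onSale: true } } } },
      { deleteOne: { filter: { sku: 'SKU-0007', stock: 0 } } }
    ],
    {
      ordered: false,                                    // allow independent operations to proceed past failures
      writeConcern: { w: 'majority', wtimeout: 30000 }
    }
  );

  console.log({
    inserted: result.insertedCount,
    matched: result.matchedCount,
    modified: result.modifiedCount,
    deleted: result.deletedCount
  });
}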

The Traditional Batch Processing Challenge

Conventional database batch processing approaches often struggle with performance and complexity:

-- Traditional PostgreSQL batch processing - limited throughput and complex error handling

-- Basic batch insert approach with poor performance characteristics
CREATE OR REPLACE FUNCTION batch_insert_products(
    product_data JSONB[]
) RETURNS TABLE(
    inserted_count INTEGER,
    failed_count INTEGER,
    processing_time_ms INTEGER,
    error_details JSONB
) AS $$
DECLARE
    product_record JSONB;
    insert_count INTEGER := 0;
    error_count INTEGER := 0;
    start_time TIMESTAMP := clock_timestamp();
    current_error TEXT;
    error_list JSONB := '[]'::JSONB;
BEGIN

    -- Individual row processing (extremely inefficient for large datasets)
    FOREACH product_record IN ARRAY product_data
    LOOP
        BEGIN
            INSERT INTO products (
                product_name,
                category,
                price,
                stock_quantity,
                supplier_id,
                created_at,
                updated_at,

                -- Basic validation during insertion
                sku,
                description,
                weight_kg,
                dimensions_cm,

                -- Limited metadata support
                tags,
                attributes
            )
            VALUES (
                product_record->>'product_name',
                product_record->>'category',
                (product_record->>'price')::DECIMAL(10,2),
                (product_record->>'stock_quantity')::INTEGER,
                (product_record->>'supplier_id')::UUID,
                CURRENT_TIMESTAMP,
                CURRENT_TIMESTAMP,

                -- Manual data extraction and validation
                product_record->>'sku',
                product_record->>'description',
                (product_record->>'weight_kg')::DECIMAL(8,3),
                product_record->>'dimensions_cm',

                -- Limited JSON processing capabilities
                string_to_array(product_record->>'tags', ','),
                product_record->'attributes'
            );

            insert_count := insert_count + 1;

        EXCEPTION 
            WHEN unique_violation THEN
                error_count := error_count + 1;
                error_list := error_list || jsonb_build_object(
                    'sku', product_record->>'sku',
                    'error', 'Duplicate SKU violation',
                    'error_code', 'UNIQUE_VIOLATION'
                );
            WHEN check_violation THEN
                error_count := error_count + 1;
                error_list := error_list || jsonb_build_object(
                    'sku', product_record->>'sku',
                    'error', 'Data validation failed',
                    'error_code', 'CHECK_VIOLATION'
                );
            WHEN OTHERS THEN
                error_count := error_count + 1;
                GET STACKED DIAGNOSTICS current_error = MESSAGE_TEXT;
                error_list := error_list || jsonb_build_object(
                    'sku', product_record->>'sku',
                    'error', current_error,
                    'error_code', 'GENERAL_ERROR'
                );
        END;
    END LOOP;

    RETURN QUERY SELECT 
        insert_count,
        error_count,
        EXTRACT(MILLISECONDS FROM clock_timestamp() - start_time)::INTEGER,
        error_list;
END;
$$ LANGUAGE plpgsql;

-- Batch update operation with limited optimization
CREATE OR REPLACE FUNCTION batch_update_inventory(
    updates JSONB[]
) RETURNS TABLE(
    updated_count INTEGER,
    not_found_count INTEGER,
    error_count INTEGER,
    processing_details JSONB
) AS $$
DECLARE
    update_record JSONB;
    updated_rows INTEGER := 0;
    not_found_rows INTEGER := 0;
    error_rows INTEGER := 0;
    temp_table_name TEXT := 'temp_inventory_updates_' || extract(epoch from now())::INTEGER;
    processing_stats JSONB := '{}'::JSONB;
BEGIN

    -- Create temporary table for batch processing (complex setup)
    EXECUTE format('
        CREATE TEMP TABLE %I (
            sku VARCHAR(100),
            stock_adjustment INTEGER,
            price_adjustment DECIMAL(10,2),
            update_reason VARCHAR(200),
            batch_id UUID DEFAULT gen_random_uuid()
        )', temp_table_name);

    -- Insert updates into temporary table
    FOREACH update_record IN ARRAY updates
    LOOP
        EXECUTE format('
            INSERT INTO %I (sku, stock_adjustment, price_adjustment, update_reason)
            VALUES ($1, $2, $3, $4)', 
            temp_table_name
        ) USING 
            update_record->>'sku',
            (update_record->>'stock_adjustment')::INTEGER,
            (update_record->>'price_adjustment')::DECIMAL(10,2),
            update_record->>'update_reason';
    END LOOP;

    -- Perform batch update with limited atomicity
    EXECUTE format('
        WITH update_results AS (
            UPDATE products p
            SET 
                stock_quantity = p.stock_quantity + t.stock_adjustment,
                price = CASE 
                    WHEN t.price_adjustment IS NOT NULL THEN p.price + t.price_adjustment
                    ELSE p.price
                END,
                updated_at = CURRENT_TIMESTAMP,
                last_update_reason = t.update_reason
            FROM %I t
            WHERE p.sku = t.sku
            RETURNING p.sku, p.stock_quantity, p.price
        ),
        stats AS (
            SELECT COUNT(*) as updated_count FROM update_results
        )
        SELECT updated_count FROM stats', 
        temp_table_name
    ) INTO updated_rows;

    -- Calculate not found items (complex logic)
    EXECUTE format('
        SELECT COUNT(*)
        FROM %I t
        WHERE NOT EXISTS (
            SELECT 1 FROM products p WHERE p.sku = t.sku
        )', temp_table_name
    ) INTO not_found_rows;

    -- Cleanup temporary table
    EXECUTE format('DROP TABLE %I', temp_table_name);

    processing_stats := jsonb_build_object(
        'total_processed', array_length(updates, 1),
        'success_rate', CASE 
            WHEN array_length(updates, 1) > 0 THEN 
                ROUND((updated_rows::DECIMAL / array_length(updates, 1)) * 100, 2)
            ELSE 0
        END
    );

    RETURN QUERY SELECT 
        updated_rows,
        not_found_rows,
        error_rows,
        processing_stats;
END;
$$ LANGUAGE plpgsql;

-- Complex batch delete with limited performance optimization
WITH batch_delete_products AS (
    -- Identify products to delete based on complex criteria
    SELECT 
        product_id,
        sku,
        category,
        last_sold_date,
        stock_quantity,

        -- Complex deletion logic
        CASE 
            WHEN stock_quantity = 0 AND last_sold_date < CURRENT_DATE - INTERVAL '365 days' THEN 'discontinued'
            WHEN category = 'seasonal' AND EXTRACT(MONTH FROM CURRENT_DATE) NOT BETWEEN 6 AND 8 THEN 'seasonal_cleanup'
            WHEN supplier_id IN (
                SELECT supplier_id FROM suppliers WHERE status = 'inactive'
            ) THEN 'supplier_inactive'
            ELSE 'no_delete'
        END as delete_reason

    FROM products
    WHERE 
        -- Multi-condition filtering
        (stock_quantity = 0 AND last_sold_date < CURRENT_DATE - INTERVAL '365 days')
        OR (category = 'seasonal' AND EXTRACT(MONTH FROM CURRENT_DATE) NOT BETWEEN 6 AND 8)
        OR supplier_id IN (
            SELECT supplier_id FROM suppliers WHERE status = 'inactive'
        )
),
deletion_validation AS (
    -- Validate deletion constraints (complex dependency checking)
    SELECT 
        bdp.*,
        CASE 
            WHEN EXISTS (
                SELECT 1 FROM order_items oi 
                WHERE oi.product_id = bdp.product_id 
                AND oi.order_date > CURRENT_DATE - INTERVAL '90 days'
            ) THEN 'recent_orders_exist'
            WHEN EXISTS (
                SELECT 1 FROM shopping_carts sc 
                WHERE sc.product_id = bdp.product_id
            ) THEN 'in_shopping_carts'
            WHEN EXISTS (
                SELECT 1 FROM wishlists w 
                WHERE w.product_id = bdp.product_id
            ) THEN 'in_wishlists'
            ELSE 'safe_to_delete'
        END as validation_status

    FROM batch_delete_products bdp
    WHERE bdp.delete_reason != 'no_delete'
),
safe_deletions AS (
    -- Only proceed with safe deletions
    SELECT product_id, sku, delete_reason
    FROM deletion_validation
    WHERE validation_status = 'safe_to_delete'
),
delete_execution AS (
    -- Perform the actual deletion (limited batch efficiency)
    DELETE FROM products
    WHERE product_id IN (
        SELECT product_id FROM safe_deletions
    )
    RETURNING product_id, sku
)
SELECT 
    COUNT(*) as deleted_count,

    -- Limited statistics and reporting
    json_agg(
        json_build_object(
            'sku', de.sku,
            'delete_reason', sd.delete_reason
        )
    ) as deleted_items,

    -- Processing summary
    (
        SELECT COUNT(*) 
        FROM batch_delete_products 
        WHERE delete_reason != 'no_delete'
    ) as candidates_identified,

    (
        SELECT COUNT(*) 
        FROM deletion_validation 
        WHERE validation_status != 'safe_to_delete'
    ) as unsafe_deletions_blocked

FROM delete_execution de
JOIN safe_deletions sd ON de.product_id = sd.product_id;

-- Problems with traditional batch processing approaches:
-- 1. Poor performance due to row-by-row processing instead of set-based operations
-- 2. Complex error handling that doesn't scale with data volume
-- 3. Limited transaction management and rollback capabilities for batch operations
-- 4. No built-in support for partial failures and retry mechanisms
-- 5. Difficulty in maintaining data consistency during large batch operations
-- 6. Complex temporary table management and cleanup requirements
-- 7. Limited monitoring and progress tracking capabilities
-- 8. No native support for ordered vs unordered bulk operations
-- 9. Inefficient memory usage and connection management for large batches
-- 10. Lack of automatic optimization based on operation types and data patterns

-- Attempt at optimized bulk insert (still limited)
INSERT INTO products (
    product_name, category, price, stock_quantity, 
    supplier_id, sku, description, created_at, updated_at
)
SELECT 
    batch_data.product_name,
    batch_data.category,
    batch_data.price::DECIMAL(10,2),
    batch_data.stock_quantity::INTEGER,
    batch_data.supplier_id::UUID,
    batch_data.sku,
    batch_data.description,
    CURRENT_TIMESTAMP,
    CURRENT_TIMESTAMP
FROM (
    VALUES 
        ('Product A', 'Electronics', '299.99', '100', '123e4567-e89b-12d3-a456-426614174000', 'SKU001', 'Description A'),
        ('Product B', 'Electronics', '199.99', '50', '123e4567-e89b-12d3-a456-426614174000', 'SKU002', 'Description B')
    -- Limited to small static datasets
) AS batch_data(product_name, category, price, stock_quantity, supplier_id, sku, description)
ON CONFLICT (sku) DO UPDATE SET
    stock_quantity = products.stock_quantity + EXCLUDED.stock_quantity,
    price = EXCLUDED.price,
    updated_at = CURRENT_TIMESTAMP;

-- Traditional approach limitations:
-- 1. No dynamic batch size optimization based on system resources
-- 2. Limited support for complex document structures and nested data
-- 3. Poor error reporting and partial failure handling
-- 4. No built-in retry logic for transient failures
-- 5. Complex application logic required for batch orchestration
-- 6. Limited write concern and consistency level management
-- 7. No automatic performance monitoring and optimization
-- 8. Difficulty in handling mixed operation types (insert, update, delete) efficiently
-- 9. No native support for bulk operations with custom validation logic
-- 10. Limited scalability for distributed database deployments

MongoDB provides comprehensive bulk operation capabilities with advanced optimization:

// MongoDB Advanced Bulk Operations - high-performance batch processing with intelligent optimization
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_bulk_operations');

// Comprehensive MongoDB Bulk Operations Manager
class AdvancedBulkOperationsManager {
  constructor(db, config = {}) {
    this.db = db;
    this.collections = {
      products: db.collection('products'),
      inventory: db.collection('inventory'),
      orders: db.collection('orders'),
      customers: db.collection('customers'),
      bulkOperationLogs: db.collection('bulk_operation_logs'),
      performanceMetrics: db.collection('performance_metrics')
    };

    // Advanced bulk operation configuration
    this.config = {
      defaultBatchSize: config.defaultBatchSize || 1000,
      maxBatchSize: config.maxBatchSize || 10000,
      maxRetries: config.maxRetries || 3,
      retryDelay: config.retryDelay || 1000,

      // Performance optimization settings
      enableOptimisticBatching: config.enableOptimisticBatching !== false,
      enableAdaptiveBatchSize: config.enableAdaptiveBatchSize !== false,
      enablePerformanceMonitoring: config.enablePerformanceMonitoring !== false,
      enableErrorAggregation: config.enableErrorAggregation !== false,

      // Write concern and consistency settings
      writeConcern: config.writeConcern || {
        w: 'majority',
        j: true,
        wtimeout: 30000
      },

      // Bulk operation strategies
      unorderedOperations: config.unorderedOperations !== false,
      enablePartialFailures: config.enablePartialFailures !== false,
      enableTransactionalBulk: config.enableTransactionalBulk || false,

      // Memory and resource management
      maxMemoryUsage: config.maxMemoryUsage || '1GB',
      enableGarbageCollection: config.enableGarbageCollection !== false,
      parallelOperations: config.parallelOperations || 4
    };

    // Performance tracking
    this.performanceMetrics = {
      operationsPerSecond: new Map(),
      averageBatchTime: new Map(),
      errorRates: new Map(),
      throughputHistory: []
    };

    this.initializeBulkOperations();
  }

  async initializeBulkOperations() {
    console.log('Initializing advanced bulk operations system...');

    try {
      // Create optimized indexes for bulk operations
      await this.setupOptimizedIndexes();

      // Initialize performance monitoring
      if (this.config.enablePerformanceMonitoring) {
        await this.setupPerformanceMonitoring();
      }

      // Setup bulk operation logging
      await this.setupBulkOperationLogging();

      console.log('Bulk operations system initialized successfully');

    } catch (error) {
      console.error('Error initializing bulk operations:', error);
      throw error;
    }
  }

  async setupOptimizedIndexes() {
    console.log('Setting up indexes optimized for bulk operations...');

    try {
      // Product collection indexes for efficient bulk operations
      await this.collections.products.createIndexes([
        { key: { sku: 1 }, unique: true, background: true },
        { key: { category: 1, createdAt: -1 }, background: true },
        { key: { supplier_id: 1, status: 1 }, background: true },
        { key: { 'pricing.lastUpdated': -1 }, background: true, sparse: true },
        { key: { tags: 1 }, background: true },
        { key: { 'inventory.lastStockUpdate': -1 }, background: true, sparse: true }
      ]);

      // Inventory collection indexes
      await this.collections.inventory.createIndexes([
        { key: { product_id: 1, warehouse_id: 1 }, unique: true, background: true },
        { key: { lastUpdated: -1 }, background: true },
        { key: { quantity: 1, status: 1 }, background: true }
      ]);

      console.log('Bulk operation indexes created successfully');

    } catch (error) {
      console.error('Error creating bulk operation indexes:', error);
      throw error;
    }
  }

  async performAdvancedBulkInsert(documents, options = {}) {
    console.log(`Performing advanced bulk insert for ${documents.length} documents...`);
    const startTime = Date.now();

    try {
      // Validate and prepare documents for bulk insertion
      const preparedDocuments = await this.prepareDocumentsForInsertion(documents, options);

      // Determine optimal batch configuration
      const batchConfig = this.calculateOptimalBatchConfiguration(preparedDocuments, 'insert');

      // Execute bulk insert with advanced error handling
      const insertResults = await this.executeBulkInsertBatches(preparedDocuments, batchConfig, options);

      // Process and aggregate results
      const aggregatedResults = await this.aggregateBulkResults(insertResults, 'insert');

      // Log operation performance
      await this.logBulkOperation('bulk_insert', {
        documentCount: documents.length,
        batchConfiguration: batchConfig,
        results: aggregatedResults,
        processingTime: Date.now() - startTime
      });

      return {
        operation: 'bulk_insert',
        totalDocuments: documents.length,
        successful: aggregatedResults.successfulInserts,
        failed: aggregatedResults.failedInserts,

        // Detailed results
        insertedIds: aggregatedResults.insertedIds,
        errors: aggregatedResults.errors,
        duplicates: aggregatedResults.duplicateErrors,

        // Performance metrics
        processingTime: Date.now() - startTime,
        documentsPerSecond: Math.round((aggregatedResults.successfulInserts / (Date.now() - startTime)) * 1000),
        batchesProcessed: insertResults.length,
        averageBatchTime: insertResults.reduce((sum, r) => sum + r.processingTime, 0) / insertResults.length,

        // Configuration used
        batchConfiguration: batchConfig,

        // Quality metrics
        successRate: (aggregatedResults.successfulInserts / documents.length) * 100,
        errorRate: (aggregatedResults.failedInserts / documents.length) * 100
      };

    } catch (error) {
      console.error('Bulk insert operation failed:', error);

      // Log failed operation
      await this.logBulkOperation('bulk_insert_failed', {
        documentCount: documents.length,
        error: error.message,
        processingTime: Date.now() - startTime
      });

      throw error;
    }
  }

  async prepareDocumentsForInsertion(documents, options = {}) {
    console.log('Preparing documents for bulk insertion with validation and enhancement...');

    const preparedDocuments = [];
    const validationErrors = [];

    for (let i = 0; i < documents.length; i++) {
      const document = documents[i];

      try {
        // Document validation and standardization
        const preparedDoc = {
          ...document,

          // Ensure consistent ObjectId handling
          _id: document._id || new ObjectId(),

          // Standardize timestamps
          createdAt: document.createdAt || new Date(),
          updatedAt: document.updatedAt || new Date(),

          // Add bulk operation metadata
          bulkOperationMetadata: {
            batchId: options.batchId || new ObjectId(),
            sourceOperation: 'bulk_insert',
            insertionIndex: i,
            processingTimestamp: new Date()
          }
        };

        // Enhanced document preparation for specific collections
        // (applied before any custom transform so the enriched fields are included
        // in the document that actually gets inserted)
        if (options.collection === 'products') {
          preparedDoc.searchKeywords = this.generateSearchKeywords(preparedDoc);
          preparedDoc.categoryHierarchy = this.buildCategoryHierarchy(preparedDoc.category);
          preparedDoc.pricingTiers = this.calculatePricingTiers(preparedDoc.price);
        }

        // Apply custom document transformations if provided
        if (options.documentTransform) {
          const transformedDoc = await options.documentTransform(preparedDoc, i);
          preparedDocuments.push(transformedDoc);
        } else {
          preparedDocuments.push(preparedDoc);
        }

      } catch (validationError) {
        validationErrors.push({
          index: i,
          document: document,
          error: validationError.message
        });
      }
    }

    if (validationErrors.length > 0 && !options.allowPartialFailures) {
      throw new Error(`Document validation failed for ${validationErrors.length} documents`);
    }

    return {
      documents: preparedDocuments,
      validationErrors: validationErrors
    };
  }

  calculateOptimalBatchConfiguration(preparedDocuments, operationType) {
    console.log(`Calculating optimal batch configuration for ${operationType}...`);

    const documentCount = preparedDocuments.documents ? preparedDocuments.documents.length : preparedDocuments.length;
    const avgDocumentSize = this.estimateAverageDocumentSize(preparedDocuments);

    // Adaptive batch sizing based on document characteristics
    let optimalBatchSize = this.config.defaultBatchSize;

    // Adjust based on document size
    if (avgDocumentSize > 100000) { // Large documents (>100KB)
      optimalBatchSize = Math.min(100, this.config.defaultBatchSize);
    } else if (avgDocumentSize > 10000) { // Medium documents (>10KB)
      optimalBatchSize = Math.min(500, this.config.defaultBatchSize);
    } else { // Small documents
      optimalBatchSize = Math.min(this.config.maxBatchSize, documentCount);
    }

    // Adjust based on operation type
    const operationMultiplier = {
      'insert': 1.0,
      'update': 0.8,
      'delete': 1.2,
      'upsert': 0.7
    };

    optimalBatchSize = Math.round(optimalBatchSize * (operationMultiplier[operationType] || 1.0));

    // Calculate number of batches
    const numberOfBatches = Math.ceil(documentCount / optimalBatchSize);

    return {
      batchSize: optimalBatchSize,
      numberOfBatches: numberOfBatches,
      estimatedDocumentSize: avgDocumentSize,
      operationType: operationType,

      // Advanced configuration
      unordered: this.config.unorderedOperations,
      writeConcern: this.config.writeConcern,
      maxTimeMS: 30000,

      // Parallel processing configuration
      parallelBatches: Math.min(this.config.parallelOperations, numberOfBatches)
    };
  }
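
  // NOTE: assumed helper -- calculateOptimalBatchConfiguration() above calls
  // estimateAverageDocumentSize(), which this excerpt does not define.
  // Estimate the average document size by JSON-serializing a small sample.
  estimateAverageDocumentSize(preparedDocuments) {
    const documents = preparedDocuments.documents || preparedDocuments;
    if (!documents.length) {
      return 0;
    }

    const sampleSize = Math.min(100, documents.length);
    let sampledBytes = 0;
    for (let i = 0; i < sampleSize; i++) {
      sampledBytes += Buffer.byteLength(JSON.stringify(documents[i]), 'utf8');
    }
    return Math.round(sampledBytes / sampleSize);
  }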

  async executeBulkInsertBatches(preparedDocuments, batchConfig, options = {}) {
    console.log(`Executing ${batchConfig.numberOfBatches} bulk insert batches...`);

    const documents = preparedDocuments.documents || preparedDocuments;
    const batchResults = [];
    const batches = this.createBatches(documents, batchConfig.batchSize);

    // Execute batches with parallel processing
    if (batchConfig.parallelBatches > 1) {
      const batchGroups = this.createBatchGroups(batches, batchConfig.parallelBatches);

      for (const batchGroup of batchGroups) {
        const groupResults = await Promise.all(
          batchGroup.map(batch => this.executeSingleInsertBatch(batch, batchConfig, options))
        );
        batchResults.push(...groupResults);
      }
    } else {
      // Sequential execution for ordered operations
      for (const batch of batches) {
        const result = await this.executeSingleInsertBatch(batch, batchConfig, options);
        batchResults.push(result);
      }
    }

    return batchResults;
  }
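
  // NOTE: assumed helper implementations -- executeBulkInsertBatches() above calls
  // createBatches() and createBatchGroups(), which this excerpt does not define.
  // Split an array of documents into fixed-size batches.
  createBatches(documents, batchSize) {
    const batches = [];
    for (let i = 0; i < documents.length; i += batchSize) {
      batches.push(documents.slice(i, i + batchSize));
    }
    return batches;
  }

  // Group batches so that at most `parallelism` batches are awaited concurrently
  // by the Promise.all() call in executeBulkInsertBatches().
  createBatchGroups(batches, parallelism) {
    const groups = [];
    for (let i = 0; i < batches.length; i += parallelism) {
      groups.push(batches.slice(i, i + parallelism));
    }
    return groups;
  }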

  async executeSingleInsertBatch(batchDocuments, batchConfig, options = {}) {
    const batchStartTime = Date.now();

    try {
      // Create collection reference
      const collection = options.collection ? this.db.collection(options.collection) : this.collections.products;

      // Configure bulk insert operation
      const insertOptions = {
        ordered: !batchConfig.unordered,
        writeConcern: batchConfig.writeConcern,
        maxTimeMS: batchConfig.maxTimeMS,
        bypassDocumentValidation: options.bypassValidation || false
      };

      // Execute bulk insert
      const insertResult = await collection.insertMany(batchDocuments, insertOptions);

      return {
        success: true,
        batchSize: batchDocuments.length,
        insertedCount: insertResult.insertedCount,
        insertedIds: insertResult.insertedIds,
        processingTime: Date.now() - batchStartTime,
        errors: [],

        // Performance metrics
        documentsPerSecond: Math.round((insertResult.insertedCount / (Date.now() - batchStartTime)) * 1000),
        avgDocumentProcessingTime: (Date.now() - batchStartTime) / batchDocuments.length
      };

    } catch (error) {
      console.error('Batch insert failed:', error);

      // Handle bulk write errors with detailed analysis
      if (error.name === 'BulkWriteError' || error.name === 'MongoBulkWriteError') {
        return this.processBulkWriteError(error, batchDocuments, batchStartTime);
      }

      return {
        success: false,
        batchSize: batchDocuments.length,
        insertedCount: 0,
        insertedIds: {},
        processingTime: Date.now() - batchStartTime,
        errors: [{
          error: error.message,
          errorCode: error.code,
          batchIndex: 0
        }]
      };
    }
  }

  processBulkWriteError(bulkError, batchDocuments, startTime) {
    console.log('Processing bulk write error with detailed analysis...');

    const processedResults = {
      success: false,
      batchSize: batchDocuments.length,
      insertedCount: bulkError.result?.insertedCount || 0,
      insertedIds: bulkError.result?.insertedIds || {},
      processingTime: Date.now() - startTime,
      errors: []
    };

    // Process individual write errors
    if (bulkError.writeErrors) {
      for (const writeError of bulkError.writeErrors) {
        processedResults.errors.push({
          index: writeError.index,
          error: writeError.errmsg,
          errorCode: writeError.code,
          document: batchDocuments[writeError.index]
        });
      }
    }

    // Process write concern errors
    if (bulkError.writeConcernErrors) {
      for (const wcError of bulkError.writeConcernErrors) {
        processedResults.errors.push({
          error: wcError.errmsg,
          errorCode: wcError.code,
          type: 'write_concern_error'
        });
      }
    }

    return processedResults;
  }

  async performAdvancedBulkUpdate(updates, options = {}) {
    console.log(`Performing advanced bulk update for ${updates.length} operations...`);
    const startTime = Date.now();

    try {
      // Prepare update operations
      const preparedUpdates = await this.prepareUpdateOperations(updates, options);

      // Calculate optimal batching strategy
      const batchConfig = this.calculateOptimalBatchConfiguration(preparedUpdates, 'update');

      // Execute bulk updates with error handling
      const updateResults = await this.executeBulkUpdateBatches(preparedUpdates, batchConfig, options);

      // Aggregate and analyze results
      const aggregatedResults = await this.aggregateBulkResults(updateResults, 'update');

      return {
        operation: 'bulk_update',
        totalOperations: updates.length,
        successful: aggregatedResults.successfulUpdates,
        failed: aggregatedResults.failedUpdates,
        modified: aggregatedResults.modifiedCount,
        matched: aggregatedResults.matchedCount,
        upserted: aggregatedResults.upsertedCount,

        // Detailed results
        errors: aggregatedResults.errors,
        upsertedIds: aggregatedResults.upsertedIds,

        // Performance metrics
        processingTime: Date.now() - startTime,
        operationsPerSecond: Math.round((aggregatedResults.successfulUpdates / (Date.now() - startTime)) * 1000),
        batchesProcessed: updateResults.length,

        // Update-specific metrics
        updateEfficiency: aggregatedResults.modifiedCount / Math.max(aggregatedResults.matchedCount, 1),

        batchConfiguration: batchConfig
      };

    } catch (error) {
      console.error('Bulk update operation failed:', error);
      throw error;
    }
  }

  async prepareUpdateOperations(updates, options = {}) {
    console.log('Preparing update operations with validation and optimization...');

    const preparedOperations = [];

    for (let i = 0; i < updates.length; i++) {
      const update = updates[i];

      // Standardize update operation structure
      const preparedOperation = {
        updateOne: {
          filter: update.filter || { _id: update._id },
          update: {
            $set: {
              ...update.$set,
              updatedAt: new Date(),
              'bulkOperationMetadata.lastBulkUpdate': new Date(),
              'bulkOperationMetadata.updateIndex': i
            },
            ...(update.$inc && { $inc: update.$inc }),
            ...(update.$unset && { $unset: update.$unset }),
            ...(update.$push && { $push: update.$push }),
            ...(update.$pull && { $pull: update.$pull })
          },
          upsert: update.upsert || options.upsert || false,
          arrayFilters: update.arrayFilters,
          hint: update.hint
        }
      };

      // Add conditional updates based on operation type
      if (options.operationType === 'inventory_update') {
        preparedOperation.updateOne.update.$set.lastStockUpdate = new Date();

        // Guard against negative inventory: $inc and $max cannot both target the
        // same field in a single update, so constrain the filter instead so that
        // a decrement only matches documents with sufficient stock on hand
        if (update.$inc && update.$inc.quantity < 0) {
          preparedOperation.updateOne.filter = {
            ...preparedOperation.updateOne.filter,
            quantity: { $gte: -update.$inc.quantity }
          };
        }
      }

      preparedOperations.push(preparedOperation);
    }

    return preparedOperations;
  }

  async executeBulkUpdateBatches(operations, batchConfig, options = {}) {
    console.log(`Executing ${batchConfig.numberOfBatches} bulk update batches...`);

    const collection = options.collection ? this.db.collection(options.collection) : this.collections.products;
    const batches = this.createBatches(operations, batchConfig.batchSize);
    const batchResults = [];

    for (const batch of batches) {
      const batchStartTime = Date.now();

      try {
        // Execute bulk write operations
        const bulkResult = await collection.bulkWrite(batch, {
          ordered: !batchConfig.unordered,
          writeConcern: batchConfig.writeConcern,
          maxTimeMS: batchConfig.maxTimeMS
        });

        batchResults.push({
          success: true,
          batchSize: batch.length,
          matchedCount: bulkResult.matchedCount,
          modifiedCount: bulkResult.modifiedCount,
          upsertedCount: bulkResult.upsertedCount,
          upsertedIds: bulkResult.upsertedIds,
          processingTime: Date.now() - batchStartTime,
          errors: []
        });

      } catch (error) {
        console.error('Bulk update batch failed:', error);

        if (error.name === 'BulkWriteError' || error.name === 'MongoBulkWriteError') {
          batchResults.push(this.processBulkWriteError(error, batch, batchStartTime));
        } else {
          batchResults.push({
            success: false,
            batchSize: batch.length,
            matchedCount: 0,
            modifiedCount: 0,
            processingTime: Date.now() - batchStartTime,
            errors: [{ error: error.message, errorCode: error.code }]
          });
        }
      }
    }

    return batchResults;
  }

  async performAdvancedBulkDelete(deletions, options = {}) {
    console.log(`Performing advanced bulk delete for ${deletions.length} operations...`);
    const startTime = Date.now();

    try {
      // Prepare deletion operations with safety checks
      const preparedDeletions = await this.prepareDeletionOperations(deletions, options);

      // Calculate optimal batching on the prepared operations array (not the wrapper object)
      const batchConfig = this.calculateOptimalBatchConfiguration(preparedDeletions.operations, 'delete');

      // Execute bulk deletions
      const deleteResults = await this.executeBulkDeleteBatches(preparedDeletions.operations, batchConfig, options);

      // Aggregate results
      const aggregatedResults = await this.aggregateBulkResults(deleteResults, 'delete');

      return {
        operation: 'bulk_delete',
        totalOperations: deletions.length,
        successful: aggregatedResults.successfulDeletes,
        failed: aggregatedResults.failedDeletes,
        deletedCount: aggregatedResults.deletedCount,

        // Safety and audit information
        safeguardsApplied: preparedDeletions.safeguards || [],
        blockedDeletions: preparedDeletions.blocked || [],

        // Performance metrics
        processingTime: Date.now() - startTime,
        operationsPerSecond: Math.round((aggregatedResults.successfulDeletes / (Date.now() - startTime)) * 1000),

        errors: aggregatedResults.errors,
        batchConfiguration: batchConfig
      };

    } catch (error) {
      console.error('Bulk delete operation failed:', error);
      throw error;
    }
  }

  async prepareDeletionOperations(deletions, options = {}) {
    console.log('Preparing deletion operations with safety validations...');

    const preparedOperations = [];
    const blockedDeletions = [];
    const appliedSafeguards = [];

    for (const deletion of deletions) {
      // Apply safety checks for deletion operations
      const safetyCheck = await this.validateDeletionSafety(deletion, options);

      if (safetyCheck.safe) {
        preparedOperations.push({
          deleteOne: {
            filter: deletion.filter || { _id: deletion._id },
            hint: deletion.hint,
            collation: deletion.collation
          }
        });
      } else {
        blockedDeletions.push({
          operation: deletion,
          reason: safetyCheck.reason,
          dependencies: safetyCheck.dependencies
        });
      }

      if (safetyCheck.safeguards) {
        appliedSafeguards.push(...safetyCheck.safeguards);
      }
    }

    return {
      operations: preparedOperations,
      blocked: blockedDeletions,
      safeguards: appliedSafeguards
    };
  }

  async validateDeletionSafety(deletion, options = {}) {
    // Implement comprehensive safety checks for deletion operations
    const safeguards = [];
    const dependencies = [];

    // Check for referential integrity
    if (options.checkReferences !== false) {
      const refCheck = await this.checkReferentialIntegrity(deletion.filter);
      if (refCheck.hasReferences) {
        dependencies.push(...refCheck.references);
      }
    }

    // Check for recent activity
    if (options.checkRecentActivity !== false) {
      const activityCheck = await this.checkRecentActivity(deletion.filter);
      if (activityCheck.hasRecentActivity) {
        safeguards.push('recent_activity_detected');
      }
    }

    // Determine if deletion is safe
    const safe = dependencies.length === 0 && (!options.requireConfirmation || deletion.confirmed);

    return {
      safe: safe,
      reason: safe ? null : `Dependencies found: ${dependencies.join(', ')}`,
      dependencies: dependencies,
      safeguards: safeguards
    };
  }

  // Utility methods for batch processing and optimization

  createBatches(items, batchSize) {
    const batches = [];
    for (let i = 0; i < items.length; i += batchSize) {
      batches.push(items.slice(i, i + batchSize));
    }
    return batches;
  }

  createBatchGroups(batches, groupSize) {
    const groups = [];
    for (let i = 0; i < batches.length; i += groupSize) {
      groups.push(batches.slice(i, i + groupSize));
    }
    return groups;
  }

  estimateAverageDocumentSize(documents) {
    if (!documents || documents.length === 0) return 1000; // Default estimate

    const sampleSize = Math.min(10, documents.length);
    const sample = documents.slice(0, sampleSize);
    const totalSize = sample.reduce((size, doc) => {
      return size + JSON.stringify(doc).length;
    }, 0);

    return Math.round(totalSize / sampleSize);
  }

  async aggregateBulkResults(batchResults, operationType) {
    console.log(`Aggregating results for ${batchResults.length} batches...`);

    const aggregated = {
      successfulOperations: 0,
      failedOperations: 0,
      errors: [],
      totalProcessingTime: 0
    };

    // Operation-specific aggregation
    switch (operationType) {
      case 'insert':
        aggregated.successfulInserts = 0;
        aggregated.failedInserts = 0;
        aggregated.insertedIds = {};
        aggregated.duplicateErrors = [];
        break;
      case 'update':
        aggregated.successfulUpdates = 0;
        aggregated.failedUpdates = 0;
        aggregated.matchedCount = 0;
        aggregated.modifiedCount = 0;
        aggregated.upsertedCount = 0;
        aggregated.upsertedIds = {};
        break;
      case 'delete':
        aggregated.successfulDeletes = 0;
        aggregated.failedDeletes = 0;
        aggregated.deletedCount = 0;
        break;
    }

    // Aggregate results from all batches
    for (const batchResult of batchResults) {
      aggregated.totalProcessingTime += batchResult.processingTime;

      if (batchResult.success) {
        switch (operationType) {
          case 'insert':
            aggregated.successfulInserts += batchResult.insertedCount;
            Object.assign(aggregated.insertedIds, batchResult.insertedIds);
            break;
          case 'update':
            aggregated.successfulUpdates += batchResult.batchSize;
            aggregated.matchedCount += batchResult.matchedCount;
            aggregated.modifiedCount += batchResult.modifiedCount;
            aggregated.upsertedCount += batchResult.upsertedCount || 0;
            Object.assign(aggregated.upsertedIds, batchResult.upsertedIds || {});
            break;
          case 'delete':
            aggregated.successfulDeletes += batchResult.batchSize;
            aggregated.deletedCount += batchResult.deletedCount ?? batchResult.batchSize;
            break;
        }
      } else {
        switch (operationType) {
          case 'insert':
            aggregated.failedInserts += batchResult.batchSize - (batchResult.insertedCount || 0);
            break;
          case 'update':
            aggregated.failedUpdates += batchResult.batchSize - (batchResult.matchedCount || 0);
            break;
          case 'delete':
            aggregated.failedDeletes += batchResult.batchSize;
            break;
        }
      }

      // Aggregate errors
      if (batchResult.errors && batchResult.errors.length > 0) {
        aggregated.errors.push(...batchResult.errors);
      }
    }

    return aggregated;
  }

  async logBulkOperation(operationType, operationData) {
    try {
      const logEntry = {
        operationType: operationType,
        timestamp: new Date(),
        ...operationData,

        // System context
        systemMetrics: {
          memoryUsage: process.memoryUsage(),
          nodeVersion: process.version
        }
      };

      await this.collections.bulkOperationLogs.insertOne(logEntry);

    } catch (error) {
      console.error('Error logging bulk operation:', error);
      // Don't throw - logging shouldn't break bulk operations
    }
  }

  // Additional utility methods for comprehensive bulk operations

  generateSearchKeywords(document) {
    const keywords = [];

    if (document.title) {
      keywords.push(...document.title.toLowerCase().split(/\s+/));
    }

    if (document.description) {
      keywords.push(...document.description.toLowerCase().split(/\s+/));
    }

    if (document.tags) {
      keywords.push(...document.tags.map(tag => tag.toLowerCase()));
    }

    // Remove duplicates and filter short words
    return [...new Set(keywords)].filter(word => word.length > 2);
  }

  buildCategoryHierarchy(category) {
    if (!category) return [];

    const hierarchy = category.split('/');
    const hierarchyPath = [];

    for (let i = 0; i < hierarchy.length; i++) {
      hierarchyPath.push(hierarchy.slice(0, i + 1).join('/'));
    }

    return hierarchyPath;
  }

  calculatePricingTiers(price) {
    if (!price) return {};

    return {
      tier: price < 50 ? 'budget' : price < 200 ? 'mid-range' : 'premium',
      priceRange: {
        min: Math.floor(price / 50) * 50,
        max: Math.ceil(price / 50) * 50
      }
    };
  }

  async checkReferentialIntegrity(filter) {
    // Simplified referential integrity check
    // In production, implement comprehensive relationship checking
    return {
      hasReferences: false,
      references: []
    };
  }

  async checkRecentActivity(filter) {
    // Simplified activity check
    // In production, check recent orders, updates, etc.
    return {
      hasRecentActivity: false,
      lastActivity: null
    };
  }
}

// Benefits of MongoDB Advanced Bulk Operations:
// - High-performance batch processing with intelligent batch size optimization
// - Comprehensive error handling and partial failure recovery
// - Advanced write concern and consistency management
// - Optimized memory usage and resource management
// - Built-in performance monitoring and metrics collection
// - Sophisticated validation and safety checks for data integrity
// - Parallel processing capabilities for maximum throughput
// - Transaction support for atomic multi-document operations
// - Automatic retry logic with exponential backoff
// - SQL-compatible bulk operations through QueryLeaf integration

module.exports = {
  AdvancedBulkOperationsManager
};

Understanding MongoDB Bulk Operations Architecture

Advanced Batch Processing and Performance Optimization Strategies

Implement sophisticated bulk operation patterns for production MongoDB deployments:

// Production-ready MongoDB bulk operations with advanced optimization and monitoring
class ProductionBulkProcessor extends AdvancedBulkOperationsManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableDistributedProcessing: true,
      enableLoadBalancing: true,
      enableFailoverHandling: true,
      enableCapacityPlanning: true,
      enableAutomaticOptimization: true,
      enableComplianceAuditing: true
    };

    this.setupProductionOptimizations();
    this.initializeDistributedProcessing();
    this.setupCapacityPlanning();
  }

  async implementDistributedBulkProcessing(operations, distributionStrategy) {
    console.log('Implementing distributed bulk processing across multiple nodes...');

    const distributedStrategy = {
      // Sharding-aware distribution
      shardAwareDistribution: {
        enableShardKeyOptimization: true,
        balanceAcrossShards: true,
        minimizeCrossShardOperations: true,
        optimizeForShardKey: distributionStrategy.shardKey
      },

      // Load balancing strategies
      loadBalancing: {
        dynamicBatchSizing: true,
        nodeCapacityAware: true,
        latencyOptimized: true,
        throughputMaximization: true
      },

      // Fault tolerance and recovery
      faultTolerance: {
        automaticFailover: true,
        retryFailedBatches: true,
        partialFailureRecovery: true,
        deadlockDetection: true
      }
    };

    return await this.executeDistributedBulkOperations(operations, distributedStrategy);
  }

  async setupAdvancedBulkOptimization() {
    console.log('Setting up advanced bulk operation optimization...');

    const optimizationStrategies = {
      // Write optimization patterns
      writeOptimization: {
        journalingSyncOptimization: true,
        writeBufferOptimization: true,
        concurrencyControlOptimization: true,
        lockMinimizationStrategies: true
      },

      // Memory management optimization
      memoryOptimization: {
        documentBatching: true,
        memoryPooling: true,
        garbageCollectionOptimization: true,
        cacheOptimization: true
      },

      // Network optimization
      networkOptimization: {
        compressionOptimization: true,
        connectionPoolingOptimization: true,
        batchTransmissionOptimization: true,
        networkLatencyMinimization: true
      }
    };

    return await this.deployOptimizationStrategies(optimizationStrategies);
  }

  async implementAdvancedErrorHandlingAndRecovery() {
    console.log('Implementing advanced error handling and recovery mechanisms...');

    const errorHandlingStrategy = {
      // Error classification and handling
      errorClassification: {
        transientErrors: ['NetworkTimeout', 'TemporaryUnavailable'],
        permanentErrors: ['ValidationError', 'DuplicateKey'],
        retriableErrors: ['WriteConflict', 'LockTimeout'],
        fatalErrors: ['OutOfMemory', 'DiskFull']
      },

      // Recovery strategies
      recoveryStrategies: {
        automaticRetry: {
          maxRetries: 5,
          exponentialBackoff: true,
          jitterRandomization: true
        },
        partialFailureHandling: {
          isolateFailedOperations: true,
          continueWithSuccessful: true,
          generateFailureReport: true
        },
        circuitBreaker: {
          failureThreshold: 10,
          recoveryTimeout: 60000,
          halfOpenRetryCount: 3
        }
      }
    };

    return await this.deployErrorHandlingStrategy(errorHandlingStrategy);
  }
}

SQL-Style Bulk Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB bulk operations and high-throughput data processing:

-- QueryLeaf advanced bulk operations with SQL-familiar syntax for MongoDB

-- Configure bulk operation settings
SET bulk_operation_batch_size = 1000;
SET bulk_operation_parallel_batches = 4;
SET bulk_operation_write_concern = 'majority';
SET bulk_operation_ordered = false;
SET bulk_operation_bypass_validation = false;

-- Advanced bulk insert with comprehensive error handling and performance optimization
WITH product_data_preparation AS (
  SELECT 
    -- Prepare product data with validation and enhancement
    product_id,
    product_name,
    category,
    CAST(price AS DECIMAL(10,2)) as validated_price,
    CAST(stock_quantity AS INTEGER) as validated_stock,
    supplier_id,

    -- Generate enhanced metadata for optimal MongoDB storage
    ARRAY[
      LOWER(product_name),
      LOWER(category),
      LOWER(supplier_name)
    ] as search_keywords,

    -- Build category hierarchy for efficient querying
    STRING_TO_ARRAY(category, '/') as category_hierarchy,

    -- Calculate pricing tiers for analytics
    CASE 
      WHEN price < 50 THEN 'budget'
      WHEN price < 200 THEN 'mid-range' 
      ELSE 'premium'
    END as pricing_tier,

    -- Add bulk operation metadata
    JSON_OBJECT(
      'batch_id', GENERATE_UUID(),
      'source_system', 'inventory_import',
      'import_timestamp', CURRENT_TIMESTAMP,
      'validation_status', 'passed'
    ) as bulk_metadata,

    -- Standard timestamps
    CURRENT_TIMESTAMP as created_at,
    CURRENT_TIMESTAMP as updated_at,

    -- Data quality scoring
    (
      CASE WHEN product_name IS NOT NULL AND LENGTH(TRIM(product_name)) > 0 THEN 1 ELSE 0 END +
      CASE WHEN category IS NOT NULL AND LENGTH(TRIM(category)) > 0 THEN 1 ELSE 0 END +
      CASE WHEN price > 0 THEN 1 ELSE 0 END +
      CASE WHEN stock_quantity >= 0 THEN 1 ELSE 0 END +
      CASE WHEN supplier_id IS NOT NULL THEN 1 ELSE 0 END
    ) / 5.0 as data_quality_score

  FROM staging_products sp
  JOIN suppliers s ON sp.supplier_id = s.supplier_id
  WHERE 
    -- Data validation filters
    sp.product_name IS NOT NULL 
    AND TRIM(sp.product_name) != ''
    AND sp.price > 0
    AND sp.stock_quantity >= 0
    AND s.status = 'active'
),

bulk_insert_configuration AS (
  SELECT 
    COUNT(*) as total_documents,

    -- Calculate optimal batch configuration
    CASE 
      WHEN AVG(LENGTH(product_name::TEXT) + LENGTH(COALESCE(description, '')::TEXT)) > 10000 THEN 500
      WHEN AVG(LENGTH(product_name::TEXT) + LENGTH(COALESCE(description, '')::TEXT)) > 1000 THEN 1000
      ELSE 2000
    END as optimal_batch_size,

    -- Parallel processing configuration
    LEAST(4, CEIL(COUNT(*) / 1000.0)) as parallel_batches,

    -- Performance prediction
    CASE 
      WHEN COUNT(*) < 1000 THEN 'fast'
      WHEN COUNT(*) < 10000 THEN 'moderate'
      ELSE 'extended'
    END as expected_processing_time

  FROM product_data_preparation
  WHERE data_quality_score >= 0.8
)

-- Execute advanced bulk insert operation
INSERT INTO products (
  product_id,
  product_name,
  category,
  category_hierarchy,
  price,
  pricing_tier,
  stock_quantity,
  supplier_id,
  search_keywords,
  bulk_operation_metadata,
  created_at,
  updated_at,
  data_quality_score
)
SELECT 
  pdp.product_id,
  pdp.product_name,
  pdp.category,
  pdp.category_hierarchy,
  pdp.validated_price,
  pdp.pricing_tier,
  pdp.validated_stock,
  pdp.supplier_id,
  pdp.search_keywords,
  pdp.bulk_metadata,
  pdp.created_at,
  pdp.updated_at,
  pdp.data_quality_score
FROM product_data_preparation pdp
CROSS JOIN bulk_insert_configuration bic
WHERE pdp.data_quality_score >= 0.8

-- Advanced bulk insert configuration
WITH (
  batch_size = (SELECT optimal_batch_size FROM bulk_insert_configuration),
  parallel_batches = (SELECT parallel_batches FROM bulk_insert_configuration),
  write_concern = 'majority',
  ordered_operations = false,

  -- Error handling configuration
  continue_on_error = true,
  duplicate_key_handling = 'skip',
  validation_bypass = false,

  -- Performance optimization
  enable_compression = true,
  connection_pooling = true,
  write_buffer_size = '64MB',

  -- Monitoring and logging
  enable_performance_monitoring = true,
  log_detailed_errors = true,
  track_operation_metrics = true
);

-- Advanced bulk update with intelligent batching and conflict resolution
WITH inventory_updates AS (
  SELECT 
    product_id,
    warehouse_id,
    quantity_adjustment,
    price_adjustment,
    update_reason,
    source_system,

    -- Calculate update priority
    CASE 
      WHEN ABS(quantity_adjustment) > 1000 THEN 'high'
      WHEN ABS(quantity_adjustment) > 100 THEN 'medium'  
      ELSE 'low'
    END as update_priority,

    -- Validate adjustments
    CASE 
      WHEN quantity_adjustment < 0 THEN 
        -- Ensure we don't create negative inventory
        GREATEST(quantity_adjustment, -current_stock_quantity)
      ELSE quantity_adjustment
    END as safe_quantity_adjustment,

    -- Add update metadata
    JSON_OBJECT(
      'update_batch_id', GENERATE_UUID(),
      'update_timestamp', CURRENT_TIMESTAMP,
      'update_source', source_system,
      'validation_status', 'approved'
    ) as update_metadata

  FROM staging_inventory_updates siu
  JOIN current_inventory ci ON siu.product_id = ci.product_id 
    AND siu.warehouse_id = ci.warehouse_id
  WHERE 
    -- Update validation
    ABS(siu.quantity_adjustment) <= 10000  -- Prevent massive adjustments
    AND (siu.price_adjustment IS NULL OR ABS(siu.price_adjustment) <= siu.current_price * 0.5)  -- Max 50% price change
),

conflict_resolution AS (
  -- Handle potential update conflicts
  SELECT 
    iu.*,

    -- Detect conflicting updates
    CASE 
      WHEN EXISTS (
        SELECT 1 FROM recent_inventory_updates riu 
        WHERE riu.product_id = iu.product_id 
        AND riu.warehouse_id = iu.warehouse_id
        AND riu.update_timestamp > CURRENT_TIMESTAMP - INTERVAL '5 minutes'
      ) THEN 'potential_conflict'
      ELSE 'safe_to_update'
    END as conflict_status,

    -- Calculate final values
    ci.stock_quantity + iu.safe_quantity_adjustment as final_stock_quantity,
    COALESCE(ci.price + iu.price_adjustment, ci.price) as final_price

  FROM inventory_updates iu
  JOIN current_inventory ci ON iu.product_id = ci.product_id 
    AND iu.warehouse_id = ci.warehouse_id
)

-- Execute bulk update with advanced error handling
UPDATE products 
SET 
  -- Core field updates
  stock_quantity = cr.final_stock_quantity,
  price = cr.final_price,
  updated_at = CURRENT_TIMESTAMP,

  -- Audit trail updates
  last_inventory_update = CURRENT_TIMESTAMP,
  inventory_update_reason = cr.update_reason,
  inventory_update_source = cr.source_system,

  -- Metadata updates
  bulk_operation_metadata = JSON_SET(
    COALESCE(bulk_operation_metadata, '{}'),
    '$.last_bulk_update', CURRENT_TIMESTAMP,
    '$.update_batch_info', cr.update_metadata
  ),

  -- Analytics updates
  total_adjustments = COALESCE(total_adjustments, 0) + 1,
  cumulative_quantity_adjustments = COALESCE(cumulative_quantity_adjustments, 0) + cr.safe_quantity_adjustment

FROM conflict_resolution cr
WHERE products.product_id = cr.product_id
  AND cr.conflict_status = 'safe_to_update'
  AND cr.final_stock_quantity >= 0  -- Additional safety check

-- Bulk update configuration
WITH (
  batch_size = 1500,
  parallel_batches = 3,
  write_concern = 'majority',
  max_time_ms = 30000,

  -- Conflict handling
  retry_on_conflict = true,
  max_retries = 3,
  backoff_strategy = 'exponential',

  -- Validation and safety
  enable_pre_update_validation = true,
  enable_post_update_validation = true,
  rollback_on_validation_failure = true,

  -- Performance optimization
  hint_index = 'product_warehouse_compound',
  bypass_document_validation = false
);

-- Advanced bulk upsert operation combining insert and update logic
WITH product_sync_data AS (
  SELECT 
    external_product_id,
    product_name,
    category,
    price,
    stock_quantity,
    supplier_code,
    last_modified_external,

    -- Determine if this should be insert or update
    CASE 
      WHEN EXISTS (
        SELECT 1 FROM products p 
        WHERE p.external_product_id = spd.external_product_id
      ) THEN 'update'
      ELSE 'insert'
    END as operation_type,

    -- Calculate data freshness
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - last_modified_external) as days_since_modified,

    -- Prepare upsert metadata
    JSON_OBJECT(
      'sync_batch_id', GENERATE_UUID(),
      'sync_timestamp', CURRENT_TIMESTAMP,
      'source_system', 'external_catalog',
      'operation_type', 'upsert',
      'data_freshness_days', EXTRACT(DAYS FROM CURRENT_TIMESTAMP - last_modified_external)
    ) as upsert_metadata

  FROM staging_product_data spd
  WHERE spd.last_modified_external > CURRENT_TIMESTAMP - INTERVAL '7 days'  -- Only sync recent changes
),

upsert_validation AS (
  SELECT 
    psd.*,

    -- Validate data quality for upsert
    (
      CASE WHEN product_name IS NOT NULL AND LENGTH(TRIM(product_name)) > 0 THEN 1 ELSE 0 END +
      CASE WHEN category IS NOT NULL THEN 1 ELSE 0 END +
      CASE WHEN price > 0 THEN 1 ELSE 0 END +
      CASE WHEN supplier_code IS NOT NULL THEN 1 ELSE 0 END
    ) / 4.0 as validation_score,

    -- Check for significant changes (for updates)
    CASE 
      WHEN psd.operation_type = 'update' THEN
        COALESCE(
          (SELECT 
            CASE 
              WHEN ABS(p.price - psd.price) > p.price * 0.1 OR  -- 10% price change
                   ABS(p.stock_quantity - psd.stock_quantity) > 10 OR  -- Stock change > 10
                   p.product_name != psd.product_name  -- Name change
              THEN 'significant_changes'
              ELSE 'minor_changes'
            END
           FROM products p 
           WHERE p.external_product_id = psd.external_product_id), 
          'new_record'
        )
      ELSE 'new_record'
    END as change_significance

  FROM product_sync_data psd
)

-- Execute bulk upsert operation
INSERT INTO products (
  external_product_id,
  product_name,
  category,
  price,
  stock_quantity,
  supplier_code,
  bulk_operation_metadata,
  created_at,
  updated_at,
  data_validation_score,
  sync_status
)
SELECT 
  uv.external_product_id,
  uv.product_name,
  uv.category,
  uv.price,
  uv.stock_quantity,
  uv.supplier_code,
  uv.upsert_metadata,
  CASE WHEN uv.operation_type = 'insert' THEN CURRENT_TIMESTAMP ELSE NULL END,
  CURRENT_TIMESTAMP,
  uv.validation_score,
  'synchronized'
FROM upsert_validation uv
WHERE uv.validation_score >= 0.75

-- Handle conflicts with upsert logic
ON CONFLICT (external_product_id) 
DO UPDATE SET
  product_name = CASE 
    WHEN EXCLUDED.change_significance = 'significant_changes' THEN EXCLUDED.product_name
    ELSE products.product_name
  END,

  category = EXCLUDED.category,

  price = CASE 
    WHEN ABS(EXCLUDED.price - products.price) > products.price * 0.05  -- 5% threshold
    THEN EXCLUDED.price
    ELSE products.price
  END,

  stock_quantity = EXCLUDED.stock_quantity,

  updated_at = CURRENT_TIMESTAMP,
  last_sync_timestamp = CURRENT_TIMESTAMP,
  sync_status = 'synchronized',

  -- Update metadata with merge information
  bulk_operation_metadata = JSON_SET(
    COALESCE(products.bulk_operation_metadata, '{}'),
    '$.last_upsert_operation', EXCLUDED.upsert_metadata,
    '$.upsert_history', JSON_ARRAY_APPEND(
      COALESCE(JSON_EXTRACT(products.bulk_operation_metadata, '$.upsert_history'), '[]'),
      '$', JSON_OBJECT(
        'timestamp', CURRENT_TIMESTAMP,
        'changes_applied', EXCLUDED.change_significance
      )
    )
  )

-- Upsert operation configuration  
WITH (
  batch_size = 800,  -- Smaller batches for upsert complexity
  parallel_batches = 2,
  write_concern = 'majority',

  -- Upsert-specific configuration
  conflict_resolution = 'merge_strategy',
  enable_change_detection = true,
  preserve_existing_metadata = true,

  -- Performance optimization for upsert
  enable_index_hints = true,
  optimize_for_update_heavy = true
);

-- Advanced bulk delete with comprehensive safety checks and audit trail
WITH deletion_candidates AS (
  SELECT 
    product_id,
    product_name,
    category,
    created_at,
    last_sold_date,
    stock_quantity,

    -- Determine deletion reason and safety
    CASE 
      WHEN stock_quantity = 0 AND last_sold_date < CURRENT_DATE - INTERVAL '2 years' THEN 'discontinued_product'
      WHEN category IN ('seasonal', 'limited_edition') AND created_at < CURRENT_DATE - INTERVAL '1 year' THEN 'seasonal_cleanup'
      WHEN supplier_id IN (SELECT supplier_id FROM suppliers WHERE status = 'inactive') THEN 'inactive_supplier'
      ELSE 'no_deletion'
    END as deletion_reason,

    -- Safety checks
    NOT EXISTS (
      SELECT 1 FROM order_items oi 
      WHERE oi.product_id = p.product_id 
      AND oi.order_date > CURRENT_DATE - INTERVAL '6 months'
    ) as no_recent_orders,

    NOT EXISTS (
      SELECT 1 FROM shopping_carts sc 
      WHERE sc.product_id = p.product_id
    ) as not_in_carts,

    NOT EXISTS (
      SELECT 1 FROM pending_shipments ps 
      WHERE ps.product_id = p.product_id
    ) as no_pending_shipments

  FROM products p
  WHERE p.status IN ('discontinued', 'inactive', 'marked_for_deletion')
),

safe_deletions AS (
  SELECT 
    dc.*,

    -- Overall safety assessment
    (dc.no_recent_orders AND dc.not_in_carts AND dc.no_pending_shipments) as safe_to_delete,

    -- Create audit record
    JSON_OBJECT(
      'deletion_batch_id', GENERATE_UUID(),
      'deletion_timestamp', CURRENT_TIMESTAMP,
      'deletion_reason', dc.deletion_reason,
      'safety_checks_passed', (dc.no_recent_orders AND dc.not_in_carts AND dc.no_pending_shipments),
      'product_snapshot', JSON_OBJECT(
        'product_id', dc.product_id,
        'product_name', dc.product_name,
        'category', dc.category,
        'last_sold_date', dc.last_sold_date,
        'stock_quantity', dc.stock_quantity
      )
    ) as audit_record

  FROM deletion_candidates dc
  WHERE dc.deletion_reason != 'no_deletion'
)

-- Create audit trail before deletion
INSERT INTO product_deletion_audit (
  product_id,
  deletion_reason,
  audit_record,
  deleted_at
)
SELECT 
  sd.product_id,
  sd.deletion_reason,
  sd.audit_record,
  CURRENT_TIMESTAMP
FROM safe_deletions sd
WHERE sd.safe_to_delete = true;

-- Execute bulk delete operation
DELETE FROM products 
WHERE product_id IN (
  SELECT sd.product_id 
  FROM safe_deletions sd 
  WHERE sd.safe_to_delete = true
)

-- Bulk delete configuration
WITH (
  batch_size = 500,  -- Conservative batch size for deletes
  parallel_batches = 2,
  write_concern = 'majority',

  -- Safety configuration
  enable_referential_integrity_check = true,
  enable_audit_trail = true,
  require_confirmation = true,

  -- Performance and safety balance
  max_deletions_per_batch = 500,
  enable_soft_delete = false,  -- True deletion for cleanup
  create_backup_before_delete = true
);

-- Comprehensive bulk operation monitoring and analytics
WITH bulk_operation_performance AS (
  SELECT 
    operation_type,
    DATE_TRUNC('hour', operation_timestamp) as hour_bucket,

    -- Volume metrics
    COUNT(*) as total_operations,
    SUM(documents_processed) as total_documents_processed,
    SUM(successful_operations) as total_successful,
    SUM(failed_operations) as total_failed,

    -- Performance metrics
    AVG(processing_time_ms) as avg_processing_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms) as p95_processing_time,
    AVG(documents_per_second) as avg_throughput,
    MAX(documents_per_second) as peak_throughput,

    -- Error analysis
    AVG(CASE WHEN failed_operations > 0 THEN (failed_operations * 100.0 / NULLIF(documents_processed, 0)) ELSE 0 END) as avg_error_rate,

    -- Resource utilization
    AVG(batch_size_used) as avg_batch_size,
    AVG(parallel_batches_used) as avg_parallel_batches,
    AVG(memory_usage_mb) as avg_memory_usage,

    -- Configuration analysis  
    MODE() WITHIN GROUP (ORDER BY write_concern) as most_common_write_concern,
    AVG(CASE WHEN ordered_operations THEN 1 ELSE 0 END) as ordered_operations_ratio

  FROM bulk_operation_logs
  WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY operation_type, DATE_TRUNC('hour', operation_timestamp)
),

performance_trends AS (
  SELECT 
    bop.*,

    -- Trend analysis
    LAG(avg_throughput) OVER (
      PARTITION BY operation_type 
      ORDER BY hour_bucket
    ) as prev_hour_throughput,

    LAG(avg_error_rate) OVER (
      PARTITION BY operation_type 
      ORDER BY hour_bucket
    ) as prev_hour_error_rate,

    -- Performance classification
    CASE 
      WHEN avg_throughput > 1000 THEN 'high_performance'
      WHEN avg_throughput > 500 THEN 'good_performance'  
      WHEN avg_throughput > 100 THEN 'adequate_performance'
      ELSE 'low_performance'
    END as performance_classification,

    -- Optimization recommendations
    CASE 
      WHEN avg_error_rate > 5 THEN 'investigate_error_patterns'
      WHEN p95_processing_time > avg_processing_time * 2 THEN 'optimize_batch_sizing'
      WHEN avg_memory_usage > 500 THEN 'optimize_memory_usage'
      WHEN avg_throughput < 100 THEN 'review_indexing_strategy'
      ELSE 'performance_optimal'
    END as optimization_recommendation

  FROM bulk_operation_performance bop
)

SELECT 
  operation_type,
  hour_bucket,

  -- Volume summary
  total_operations,
  total_documents_processed,
  ROUND((total_successful * 100.0 / NULLIF(total_documents_processed, 0)), 2) as success_rate_percent,

  -- Performance summary
  ROUND(avg_processing_time, 1) as avg_processing_time_ms,
  ROUND(p95_processing_time, 1) as p95_processing_time_ms,
  ROUND(avg_throughput, 0) as avg_documents_per_second,
  ROUND(peak_throughput, 0) as peak_documents_per_second,

  -- Trend indicators
  CASE 
    WHEN prev_hour_throughput IS NOT NULL THEN
      ROUND(((avg_throughput - prev_hour_throughput) / prev_hour_throughput * 100), 1)
    ELSE NULL
  END as throughput_change_percent,

  CASE 
    WHEN prev_hour_error_rate IS NOT NULL THEN
      ROUND((avg_error_rate - prev_hour_error_rate), 2)
    ELSE NULL
  END as error_rate_change,

  -- Configuration insights
  ROUND(avg_batch_size, 0) as optimal_batch_size,
  ROUND(avg_parallel_batches, 1) as avg_parallelization,
  most_common_write_concern,

  -- Performance assessment
  performance_classification,
  optimization_recommendation,

  -- Detailed recommendations
  CASE optimization_recommendation
    WHEN 'investigate_error_patterns' THEN 'Review error logs and implement better validation'
    WHEN 'optimize_batch_sizing' THEN 'Reduce batch size or increase timeout thresholds'  
    WHEN 'optimize_memory_usage' THEN 'Implement memory pooling and document streaming'
    WHEN 'review_indexing_strategy' THEN 'Add missing indexes for bulk operation filters'
    ELSE 'Continue current configuration - performance is optimal'
  END as detailed_recommendation

FROM performance_trends
WHERE total_operations > 0
ORDER BY operation_type, hour_bucket DESC;

-- QueryLeaf provides comprehensive bulk operation capabilities:
-- 1. Advanced batch processing with intelligent sizing and parallelization
-- 2. Sophisticated error handling and partial failure recovery  
-- 3. Comprehensive data validation and quality scoring
-- 4. Built-in audit trails and compliance tracking
-- 5. Performance monitoring and optimization recommendations
-- 6. Advanced conflict resolution and upsert strategies
-- 7. Safety checks and referential integrity validation
-- 8. Production-ready bulk operations with monitoring and alerting
-- 9. SQL-familiar syntax for complex bulk operation workflows
-- 10. Integration with MongoDB's native bulk operation optimizations

Best Practices for Production Bulk Operations

Performance Optimization and Batch Strategy

Essential principles for effective MongoDB bulk operation deployment (a minimal driver-level sketch follows the list):

  1. Batch Size Optimization: Calculate optimal batch sizes based on document size, operation type, and system resources
  2. Write Concern Management: Configure appropriate write concerns balancing performance with durability requirements
  3. Error Handling Strategy: Implement comprehensive error classification and recovery mechanisms for production resilience
  4. Validation and Safety: Design robust validation pipelines to ensure data quality and prevent harmful operations
  5. Performance Monitoring: Track operation metrics, throughput, and resource utilization for continuous optimization
  6. Resource Management: Monitor memory usage, connection pooling, and system resources during bulk operations
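
The sketch below illustrates principles 1 through 3 with the Node.js driver: an explicit batch size, a majority write concern, and exponential-backoff retry for transient errors. The function name, collection name, and default values are illustrative assumptions rather than part of any particular API.

// Minimal sketch: batched insertMany with an explicit write concern and
// exponential-backoff retry for transient errors (assumes a connected `db`
// handle from the MongoDB Node.js driver; names and defaults are illustrative)
async function insertWithRetry(db, docs, { batchSize = 1000, maxRetries = 3 } = {}) {
  const collection = db.collection('products'); // illustrative collection name

  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);

    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        await collection.insertMany(batch, {
          ordered: false,                           // continue past individual document failures
          writeConcern: { w: 'majority', j: true }  // durability vs. throughput trade-off
        });
        break; // batch succeeded, move on to the next batch
      } catch (error) {
        const transient = typeof error.hasErrorLabel === 'function' &&
                          error.hasErrorLabel('RetryableWriteError');
        if (!transient || attempt === maxRetries) throw error;
        // Exponential backoff with jitter before retrying the same batch
        const delayMs = Math.min(2 ** attempt * 100 + Math.random() * 100, 5000);
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
}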

Scalability and Production Deployment

Optimize bulk operations for enterprise-scale requirements (a shard-aware batching sketch follows the list):

  1. Distributed Processing: Implement shard-aware batch distribution for optimal performance across MongoDB clusters
  2. Load Balancing: Design intelligent load balancing strategies that consider node capacity and network latency
  3. Fault Tolerance: Implement automatic failover and retry mechanisms for resilient bulk operation processing
  4. Capacity Planning: Monitor historical patterns and predict resource requirements for bulk operation scaling
  5. Compliance Integration: Ensure bulk operations meet audit, security, and compliance requirements
  6. Operational Integration: Integrate bulk operations with existing monitoring, alerting, and operational workflows
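
As a rough illustration of shard-aware distribution, the sketch below groups update operations by an assumed shard key before issuing bulkWrite calls, so each batch stays within a narrower key range. The `region` shard key and `products` collection are assumptions for illustration, not a prescribed schema.

// Minimal sketch: group bulk update operations by an assumed shard key ('region')
// so each bulkWrite call touches a narrower key range on a sharded cluster
function groupOpsByShardKey(operations, shardKeyField = 'region') {
  const groups = new Map();
  for (const op of operations) {
    const key = op.updateOne?.filter?.[shardKeyField] ?? 'unscoped';
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(op);
  }
  return groups;
}

async function executeShardAwareBulk(db, operations) {
  const collection = db.collection('products'); // illustrative collection name
  const results = [];
  for (const [shardKeyValue, ops] of groupOpsByShardKey(operations)) {
    // Unordered writes let mongos and the shards parallelize within each group
    const result = await collection.bulkWrite(ops, { ordered: false });
    results.push({ shardKeyValue, modifiedCount: result.modifiedCount });
  }
  return results;
}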

Conclusion

MongoDB bulk operations provide comprehensive high-performance batch processing capabilities that enable efficient handling of large-scale data operations through intelligent batching, advanced error handling, and sophisticated optimization strategies. The native bulk operation support ensures that batch processing benefits from MongoDB's write optimization, consistency guarantees, and scalability features.

Key MongoDB Bulk Operations benefits include:

  • High-Performance Processing: Optimized batch processing with intelligent sizing and parallel execution capabilities
  • Advanced Error Management: Comprehensive error handling with partial failure recovery and retry mechanisms
  • Data Quality Assurance: Built-in validation and safety checks to ensure data integrity during bulk operations
  • Resource Optimization: Intelligent memory management and resource utilization for optimal system performance
  • Production Readiness: Enterprise-ready bulk operations with monitoring, auditing, and compliance features
  • SQL Accessibility: Familiar SQL-style bulk operations through QueryLeaf for accessible high-throughput data management

Whether you're handling data imports, batch updates, inventory synchronization, or large-scale data cleanup operations, MongoDB bulk operations with QueryLeaf's familiar SQL interface provide the foundation for efficient, reliable, and scalable batch processing.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB bulk operations while providing SQL-familiar syntax for batch processing, error handling, and performance monitoring. Advanced bulk operation patterns, validation strategies, and optimization techniques are seamlessly handled through familiar SQL constructs, making high-performance batch processing accessible to SQL-oriented development teams.

The combination of MongoDB's robust bulk operation capabilities with SQL-style batch processing operations makes it an ideal platform for applications requiring both high-throughput data processing and familiar database management patterns, ensuring your bulk operations can scale efficiently while maintaining data quality and operational reliability.

MongoDB Schema Validation and Data Integrity: Advanced Document Validation for Robust Database Design

Modern applications require robust data validation mechanisms to ensure data quality, maintain business rules, and prevent data corruption in production databases. Traditional NoSQL databases often sacrifice data validation for flexibility, leading to inconsistent data structures and difficult-to-debug application issues. MongoDB's document validation capabilities provide comprehensive schema enforcement while preserving the flexibility that makes document databases powerful for evolving applications.

MongoDB Schema Validation offers sophisticated document validation rules that can enforce field types, value constraints, required fields, and complex business logic at the database level. Unlike application-level validation that can be bypassed or inconsistently applied, database-level validation ensures data integrity regardless of how data enters the system, providing a critical safety net for production applications.
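
For orientation, here is a minimal sketch of what database-level validation looks like with the Node.js driver: a $jsonSchema validator attached at collection creation, running inside an async function with a connected `db` handle. The collection name, fields, and constraints are illustrative assumptions.

// Minimal sketch: enforce required fields and value constraints at the database level
// (assumes a connected `db` handle; collection and field names are illustrative)
await db.createCollection('orders', {
  validator: {
    $jsonSchema: {
      bsonType: 'object',
      required: ['customer_id', 'total', 'status'],
      properties: {
        customer_id: { bsonType: 'objectId' },
        total: { bsonType: 'number', minimum: 0, description: 'must be a non-negative number' },
        status: { enum: ['pending', 'paid', 'shipped', 'cancelled'] }
      }
    }
  },
  validationLevel: 'strict',   // validate every insert and update
  validationAction: 'error'    // reject documents that fail validation
});

Because the validator runs in the database itself, writes that violate the schema are rejected no matter which application, script, or import path issued them.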

The Traditional Data Validation Challenge

Conventional approaches to data validation in both SQL and NoSQL systems have significant limitations:

-- Traditional relational database constraints - rigid but limited flexibility

-- PostgreSQL table with basic constraints
CREATE TABLE user_profiles (
    user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(320) NOT NULL,
    username VARCHAR(50) NOT NULL,
    full_name VARCHAR(200),
    age INTEGER,
    account_status VARCHAR(20) DEFAULT 'active',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Basic constraints
    CONSTRAINT ck_email_format CHECK (email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
    CONSTRAINT ck_age_valid CHECK (age >= 0 AND age <= 150),
    CONSTRAINT ck_status_valid CHECK (account_status IN ('active', 'inactive', 'suspended', 'pending')),
    CONSTRAINT ck_username_length CHECK (char_length(username) >= 3),

    -- Unique constraints
    UNIQUE(email),
    UNIQUE(username)
);

-- User preferences table with limited JSON validation
CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY REFERENCES user_profiles(user_id) ON DELETE CASCADE,
    preferences JSONB NOT NULL DEFAULT '{}',
    notification_settings JSONB,
    privacy_settings JSONB,

    -- Basic JSON structure validation (limited)
    CONSTRAINT ck_preferences_not_empty CHECK (jsonb_typeof(preferences) = 'object'),
    CONSTRAINT ck_notifications_structure CHECK (
        notification_settings IS NULL OR 
        (jsonb_typeof(notification_settings) = 'object' AND 
         notification_settings ? 'email' AND 
         notification_settings ? 'push')
    )
);

-- Product catalog with rigid structure
CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(500) NOT NULL,
    description TEXT,
    category VARCHAR(100) NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    currency VARCHAR(3) NOT NULL DEFAULT 'USD',
    availability_status VARCHAR(20) NOT NULL DEFAULT 'available',

    -- Product specifications (limited flexibility)
    specifications JSONB,
    dimensions JSONB,
    weight_grams INTEGER,

    -- Basic validation constraints
    CONSTRAINT ck_price_positive CHECK (price > 0),
    CONSTRAINT ck_currency_code CHECK (currency ~ '^[A-Z]{3}$'),
    CONSTRAINT ck_availability CHECK (availability_status IN ('available', 'out_of_stock', 'discontinued')),
    CONSTRAINT ck_weight_positive CHECK (weight_grams > 0),

    -- Limited JSON validation
    CONSTRAINT ck_specifications_object CHECK (
        specifications IS NULL OR jsonb_typeof(specifications) = 'object'
    ),

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Attempting complex validation with triggers (maintenance overhead)
CREATE OR REPLACE FUNCTION validate_user_preferences()
RETURNS TRIGGER AS $$
BEGIN
    -- Manual JSON validation logic
    IF NEW.notification_settings IS NOT NULL THEN
        IF NOT (NEW.notification_settings ? 'email' AND 
                NEW.notification_settings ? 'push' AND
                NEW.notification_settings ? 'sms') THEN
            RAISE EXCEPTION 'notification_settings must contain email, push, and sms keys';
        END IF;

        -- Validate nested structure
        IF NOT (jsonb_typeof(NEW.notification_settings->'email') = 'object' AND
                NEW.notification_settings->'email' ? 'enabled' AND
                jsonb_typeof(NEW.notification_settings->'email'->'enabled') = 'boolean') THEN
            RAISE EXCEPTION 'notification_settings.email must have enabled boolean field';
        END IF;
    END IF;

    -- Privacy settings validation
    IF NEW.privacy_settings IS NOT NULL THEN
        IF NOT (NEW.privacy_settings ? 'profile_visibility' AND
                NEW.privacy_settings->>'profile_visibility' IN ('public', 'private', 'friends')) THEN
            RAISE EXCEPTION 'privacy_settings.profile_visibility must be public, private, or friends';
        END IF;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_preferences_trigger
    BEFORE INSERT OR UPDATE ON user_preferences
    FOR EACH ROW EXECUTE FUNCTION validate_user_preferences();

-- Complex business rule validation (difficult to maintain)
CREATE OR REPLACE FUNCTION validate_product_business_rules()
RETURNS TRIGGER AS $$
BEGIN
    -- Price validation based on category
    IF NEW.category = 'electronics' AND NEW.price < 10.00 THEN
        RAISE EXCEPTION 'Electronics products must have minimum price of $10.00';
    END IF;

    IF NEW.category = 'luxury' AND NEW.price < 100.00 THEN
        RAISE EXCEPTION 'Luxury products must have minimum price of $100.00';
    END IF;

    -- Specifications validation by category
    IF NEW.category = 'electronics' THEN
        IF NEW.specifications IS NULL OR 
           NOT (NEW.specifications ? 'brand' AND NEW.specifications ? 'model') THEN
            RAISE EXCEPTION 'Electronics products must specify brand and model in specifications';
        END IF;
    END IF;

    -- Weight requirements
    IF NEW.category IN ('furniture', 'appliances') AND NEW.weight_grams IS NULL THEN
        RAISE EXCEPTION 'Furniture and appliances must specify weight';
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_product_rules_trigger
    BEFORE INSERT OR UPDATE ON products
    FOR EACH ROW EXECUTE FUNCTION validate_product_business_rules();

-- Attempt to query with validation checks (complex and inefficient)
WITH validation_summary AS (
    -- Pad each branch with NULL columns so the UNION yields a consistent column set
    SELECT 
        'user_profiles' as table_name,
        COUNT(*) as total_records,
        COUNT(*) FILTER (WHERE email !~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$') as invalid_emails,
        COUNT(*) FILTER (WHERE age < 0 OR age > 150) as invalid_ages,
        COUNT(*) FILTER (WHERE account_status NOT IN ('active', 'inactive', 'suspended', 'pending')) as invalid_statuses,
        NULL::bigint as invalid_prices,
        NULL::bigint as invalid_currencies,
        NULL::bigint as invalid_specs
    FROM user_profiles

    UNION ALL

    SELECT 
        'products' as table_name,
        COUNT(*) as total_records,
        NULL::bigint as invalid_emails,
        NULL::bigint as invalid_ages,
        NULL::bigint as invalid_statuses,
        COUNT(*) FILTER (WHERE price <= 0) as invalid_prices,
        COUNT(*) FILTER (WHERE currency !~ '^[A-Z]{3}$') as invalid_currencies,
        COUNT(*) FILTER (WHERE specifications IS NOT NULL AND jsonb_typeof(specifications) != 'object') as invalid_specs
    FROM products
)
SELECT 
    table_name,
    total_records,
    invalid_emails,
    invalid_ages,
    invalid_statuses,
    invalid_prices,
    invalid_currencies,
    invalid_specs,

    -- Overall data quality score
    CASE 
        WHEN table_name = 'user_profiles' THEN
            (total_records - COALESCE(invalid_emails, 0) - COALESCE(invalid_ages, 0) - COALESCE(invalid_statuses, 0))::float / NULLIF(total_records, 0) * 100
        ELSE 
            (total_records - COALESCE(invalid_prices, 0) - COALESCE(invalid_currencies, 0) - COALESCE(invalid_specs, 0))::float / NULLIF(total_records, 0) * 100
    END as data_quality_percent

FROM validation_summary;

-- Problems with traditional validation approaches:
-- 1. Limited flexibility for evolving schemas and nested structures
-- 2. Complex trigger logic that's difficult to maintain and debug
-- 3. Performance overhead from extensive validation triggers
-- 4. Limited support for conditional validation based on document context
-- 5. No built-in support for array validation and nested object constraints
-- 6. Difficulty enforcing business rules that span multiple fields
-- 7. Poor integration with application development workflows
-- 8. Limited error messaging and validation feedback
-- 9. Complex migration procedures when validation rules change
-- 10. No support for schema versioning and gradual migration strategies

MongoDB Schema Validation provides comprehensive document validation capabilities:

// MongoDB Advanced Schema Validation - comprehensive document validation system
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_validation_platform');

// Advanced MongoDB Schema Validation System
class MongoDBSchemaValidator {
  constructor(db, options = {}) {
    this.db = db;
    this.options = {
      validationLevel: options.validationLevel || 'strict', // strict, moderate
      validationAction: options.validationAction || 'error', // error, warn
      enableVersioning: options.enableVersioning !== false, // default true; '|| true' would ignore an explicit false
      enableMetrics: options.enableMetrics !== false,
      customValidators: options.customValidators || new Map(),
      ...options
    };

    this.validationSchemas = new Map();
    this.validationMetrics = {
      validationsPassed: 0,
      validationsFailed: 0,
      validationErrors: [],
      lastUpdated: new Date()
    };

    this.setupValidationCollections();
  }

  async setupValidationCollections() {
    console.log('Setting up advanced schema validation system...');

    try {
      // User profiles with comprehensive validation
      await this.createValidatedCollection('user_profiles', {
        $jsonSchema: {
          bsonType: 'object',
          title: 'User Profile Validation Schema',
          required: ['email', 'username', 'profile_type', 'created_at'],
          additionalProperties: false,

          properties: {
            _id: {
              bsonType: 'objectId',
              description: 'Unique identifier for user profile'
            },

            // Basic user information with comprehensive validation
            email: {
              bsonType: 'string',
              pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$',
              maxLength: 320,
              description: 'Valid email address following RFC 5322 standard'
            },

            username: {
              bsonType: 'string',
              pattern: '^[a-zA-Z0-9_-]{3,30}$',
              description: 'Username: 3-30 characters, alphanumeric, underscore, or dash only'
            },

            full_name: {
              bsonType: 'string',
              minLength: 2,
              maxLength: 200,
              description: 'Full name: 2-200 characters'
            },

            profile_type: {
              enum: ['individual', 'business', 'organization', 'developer'],
              description: 'Type of user profile'
            },

            // Age validation with business rules
            age: {
              bsonType: 'int',
              minimum: 13, // Minimum age for account creation
              maximum: 150,
              description: 'User age: must be between 13 and 150'
            },

            // Account status with workflow validation
            account_status: {
              bsonType: 'object',
              required: ['status', 'last_updated'],
              additionalProperties: false,
              properties: {
                status: {
                  enum: ['active', 'inactive', 'suspended', 'pending_verification', 'closed'],
                  description: 'Current account status'
                },
                last_updated: {
                  bsonType: 'date',
                  description: 'When status was last updated'
                },
                reason: {
                  bsonType: 'string',
                  maxLength: 500,
                  description: 'Reason for status change (optional)'
                },
                updated_by: {
                  bsonType: 'objectId',
                  description: 'ID of user/admin who updated status'
                }
              }
            },

            // Contact information with regional validation
            contact_info: {
              bsonType: 'object',
              additionalProperties: false,
              properties: {
                phone: {
                  bsonType: 'object',
                  properties: {
                    country_code: {
                      bsonType: 'string',
                      pattern: '^\\+[1-9][0-9]{0,3}$',
                      description: 'Country code with + prefix'
                    },
                    number: {
                      bsonType: 'string',
                      pattern: '^[0-9]{7,15}$',
                      description: 'Phone number: 7-15 digits'
                    },
                    verified: {
                      bsonType: 'bool',
                      description: 'Whether phone number is verified'
                    },
                    verified_at: {
                      bsonType: 'date',
                      description: 'When phone was verified'
                    }
                  },
                  required: ['country_code', 'number', 'verified']
                },

                address: {
                  bsonType: 'object',
                  properties: {
                    street: { bsonType: 'string', maxLength: 200 },
                    city: { bsonType: 'string', maxLength: 100 },
                    state_province: { bsonType: 'string', maxLength: 100 },
                    postal_code: { bsonType: 'string', maxLength: 20 },
                    country: {
                      bsonType: 'string',
                      pattern: '^[A-Z]{2}$', // ISO 3166-1 alpha-2 country codes
                      description: 'Two-letter country code (ISO 3166-1)'
                    }
                  },
                  required: ['city', 'country']
                }
              }
            },

            // Nested preferences with conditional validation
            preferences: {
              bsonType: 'object',
              additionalProperties: false,
              properties: {
                notifications: {
                  bsonType: 'object',
                  required: ['email', 'push', 'sms'],
                  additionalProperties: false,
                  properties: {
                    email: {
                      bsonType: 'object',
                      required: ['enabled'],
                      properties: {
                        enabled: { bsonType: 'bool' },
                        frequency: {
                          enum: ['immediate', 'daily', 'weekly', 'never'],
                          description: 'Email notification frequency'
                        },
                        categories: {
                          bsonType: 'array',
                          items: {
                            enum: ['security', 'marketing', 'product_updates', 'billing']
                          },
                          uniqueItems: true,
                          description: 'Notification categories to receive'
                        }
                      }
                    },
                    push: {
                      bsonType: 'object',
                      required: ['enabled'],
                      properties: {
                        enabled: { bsonType: 'bool' },
                        quiet_hours: {
                          bsonType: 'object',
                          properties: {
                            enabled: { bsonType: 'bool' },
                            start_time: {
                              bsonType: 'string',
                              pattern: '^([01]?[0-9]|2[0-3]):[0-5][0-9]$',
                              description: 'Start time in HH:MM format'
                            },
                            end_time: {
                              bsonType: 'string',
                              pattern: '^([01]?[0-9]|2[0-3]):[0-5][0-9]$',
                              description: 'End time in HH:MM format'
                            }
                          },
                          required: ['enabled']
                        }
                      }
                    },
                    sms: {
                      bsonType: 'object',
                      required: ['enabled'],
                      properties: {
                        enabled: { bsonType: 'bool' },
                        emergency_only: {
                          bsonType: 'bool',
                          description: 'Only send SMS for emergency notifications'
                        }
                      }
                    }
                  }
                },

                privacy: {
                  bsonType: 'object',
                  required: ['profile_visibility', 'data_processing_consent'],
                  properties: {
                    profile_visibility: {
                      enum: ['public', 'friends_only', 'private'],
                      description: 'Who can view this profile'
                    },
                    search_visibility: {
                      bsonType: 'bool',
                      description: 'Whether profile appears in search results'
                    },
                    data_processing_consent: {
                      bsonType: 'object',
                      required: ['analytics', 'marketing', 'given_at'],
                      properties: {
                        analytics: { bsonType: 'bool' },
                        marketing: { bsonType: 'bool' },
                        third_party_sharing: { bsonType: 'bool' },
                        given_at: { bsonType: 'date' },
                        ip_address: { bsonType: 'string' },
                        user_agent: { bsonType: 'string' }
                      }
                    }
                  }
                },

                // User interface preferences
                ui_preferences: {
                  bsonType: 'object',
                  properties: {
                    theme: {
                      enum: ['light', 'dark', 'auto'],
                      description: 'User interface theme preference'
                    },
                    language: {
                      bsonType: 'string',
                      pattern: '^[a-z]{2}(-[A-Z]{2})?$',
                      description: 'Language code (ISO 639-1 with optional country)'
                    },
                    timezone: {
                      bsonType: 'string',
                      description: 'IANA timezone identifier'
                    },
                    date_format: {
                      enum: ['MM/DD/YYYY', 'DD/MM/YYYY', 'YYYY-MM-DD'],
                      description: 'Preferred date display format'
                    }
                  }
                }
              }
            },

            // Security settings with validation
            security: {
              bsonType: 'object',
              properties: {
                two_factor_enabled: { bsonType: 'bool' },
                backup_codes: {
                  bsonType: 'array',
                  maxItems: 10,
                  items: {
                    bsonType: 'string',
                    pattern: '^[A-Z0-9]{8}$',
                    description: '8-character backup codes'
                  },
                  uniqueItems: true
                },
                security_questions: {
                  bsonType: 'array',
                  maxItems: 5,
                  items: {
                    bsonType: 'object',
                    required: ['question', 'answer_hash'],
                    properties: {
                      question: {
                        bsonType: 'string',
                        maxLength: 200
                      },
                      answer_hash: {
                        bsonType: 'string',
                        description: 'Hashed security question answer'
                      },
                      created_at: { bsonType: 'date' }
                    }
                  }
                },
                login_restrictions: {
                  bsonType: 'object',
                  properties: {
                    allowed_countries: {
                      bsonType: 'array',
                      items: {
                        bsonType: 'string',
                        pattern: '^[A-Z]{2}$'
                      },
                      description: 'ISO country codes where login is allowed'
                    },
                    require_device_verification: { bsonType: 'bool' }
                  }
                }
              }
            },

            // Audit trail information
            created_at: {
              bsonType: 'date',
              description: 'Account creation timestamp'
            },

            updated_at: {
              bsonType: 'date',
              description: 'Last profile update timestamp'
            },

            created_by: {
              bsonType: 'objectId',
              description: 'ID of user/system that created this profile'
            },

            // Schema versioning
            schema_version: {
              bsonType: 'string',
              pattern: '^\\d+\\.\\d+\\.\\d+$',
              description: 'Schema version (semantic versioning)'
            }
          }
        }
      }, {
        validationLevel: 'strict',
        validationAction: 'error'
      });

      // Products collection with complex business rule validation
      await this.createValidatedCollection('products', {
        $jsonSchema: {
          bsonType: 'object',
          title: 'Product Validation Schema',
          required: ['name', 'category', 'pricing', 'availability', 'created_at'],
          additionalProperties: false,

          properties: {
            _id: { bsonType: 'objectId' },

            // Basic product information
            name: {
              bsonType: 'string',
              minLength: 2,
              maxLength: 500,
              description: 'Product name: 2-500 characters'
            },

            description: {
              bsonType: 'string',
              maxLength: 5000,
              description: 'Product description: max 5000 characters'
            },

            sku: {
              bsonType: 'string',
              pattern: '^[A-Z0-9]{3,20}$',
              description: 'Stock Keeping Unit: 3-20 uppercase alphanumeric characters'
            },

            category: {
              bsonType: 'object',
              required: ['primary', 'path'],
              properties: {
                primary: {
                  enum: ['electronics', 'clothing', 'home_garden', 'books', 'sports', 'automotive', 'health', 'toys'],
                  description: 'Primary product category'
                },
                secondary: {
                  bsonType: 'string',
                  maxLength: 100,
                  description: 'Secondary category classification'
                },
                path: {
                  bsonType: 'array',
                  items: { bsonType: 'string' },
                  minItems: 1,
                  maxItems: 5,
                  description: 'Category hierarchy path'
                },
                tags: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'string',
                    pattern: '^[a-z0-9_-]+$',
                    maxLength: 50
                  },
                  maxItems: 20,
                  uniqueItems: true,
                  description: 'Product tags for search and filtering'
                }
              }
            },

            // Complex pricing structure with conditional validation
            pricing: {
              bsonType: 'object',
              required: ['base_price', 'currency', 'pricing_model'],
              additionalProperties: false,
              properties: {
                base_price: {
                  bsonType: 'decimal',
                  minimum: 0.01,
                  description: 'Base price must be positive'
                },
                currency: {
                  bsonType: 'string',
                  pattern: '^[A-Z]{3}$',
                  description: 'ISO 4217 currency code'
                },
                pricing_model: {
                  enum: ['fixed', 'tiered', 'subscription', 'auction', 'negotiable'],
                  description: 'Product pricing model'
                },

                // Conditional pricing based on model
                tier_pricing: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['min_quantity', 'price_per_unit'],
                    properties: {
                      min_quantity: {
                        bsonType: 'int',
                        minimum: 1
                      },
                      price_per_unit: {
                        bsonType: 'decimal',
                        minimum: 0.01
                      },
                      description: { bsonType: 'string', maxLength: 200 }
                    }
                  },
                  description: 'Tiered pricing structure (required if pricing_model is tiered)'
                },

                subscription_options: {
                  bsonType: 'object',
                  properties: {
                    billing_cycles: {
                      bsonType: 'array',
                      items: {
                        enum: ['monthly', 'quarterly', 'annually', 'biennial']
                      },
                      minItems: 1
                    },
                    trial_period_days: {
                      bsonType: 'int',
                      minimum: 0,
                      maximum: 365
                    }
                  },
                  required: ['billing_cycles'],
                  description: 'Subscription details (required if pricing_model is subscription)'
                },

                discounts: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['type', 'value', 'valid_from', 'valid_until'],
                    properties: {
                      type: {
                        enum: ['percentage', 'fixed_amount', 'buy_x_get_y'],
                        description: 'Type of discount'
                      },
                      value: {
                        bsonType: 'decimal',
                        minimum: 0,
                        description: 'Discount value (percentage or amount)'
                      },
                      min_purchase_amount: {
                        bsonType: 'decimal',
                        minimum: 0
                      },
                      valid_from: { bsonType: 'date' },
                      valid_until: { bsonType: 'date' },
                      max_uses: {
                        bsonType: 'int',
                        minimum: 1
                      },
                      code: {
                        bsonType: 'string',
                        pattern: '^[A-Z0-9]{4,20}$'
                      }
                    }
                  },
                  maxItems: 10
                }
              }
            },

            // Availability and inventory
            availability: {
              bsonType: 'object',
              required: ['status', 'stock_tracking'],
              properties: {
                status: {
                  enum: ['available', 'out_of_stock', 'discontinued', 'coming_soon', 'back_order'],
                  description: 'Product availability status'
                },
                stock_tracking: {
                  bsonType: 'object',
                  required: ['enabled'],
                  properties: {
                    enabled: { bsonType: 'bool' },
                    current_stock: {
                      bsonType: 'int',
                      minimum: 0,
                      description: 'Current stock quantity (required if tracking enabled)'
                    },
                    reserved_stock: {
                      bsonType: 'int',
                      minimum: 0,
                      description: 'Stock reserved for pending orders'
                    },
                    low_stock_threshold: {
                      bsonType: 'int',
                      minimum: 0,
                      description: 'Threshold for low stock alerts'
                    },
                    max_order_quantity: {
                      bsonType: 'int',
                      minimum: 1,
                      description: 'Maximum quantity per order'
                    }
                  }
                },
                estimated_delivery: {
                  bsonType: 'object',
                  properties: {
                    min_days: { bsonType: 'int', minimum: 0 },
                    max_days: { bsonType: 'int', minimum: 0 },
                    shipping_regions: {
                      bsonType: 'array',
                      items: {
                        bsonType: 'string',
                        pattern: '^[A-Z]{2}$'
                      }
                    }
                  }
                }
              }
            },

            // Product specifications with category-specific validation
            specifications: {
              bsonType: 'object',
              properties: {
                dimensions: {
                  bsonType: 'object',
                  required: ['unit'],
                  properties: {
                    length: { bsonType: 'decimal', minimum: 0 },
                    width: { bsonType: 'decimal', minimum: 0 },
                    height: { bsonType: 'decimal', minimum: 0 },
                    weight: { bsonType: 'decimal', minimum: 0 },
                    unit: {
                      enum: ['metric', 'imperial'],
                      description: 'Measurement unit system'
                    }
                  }
                },

                materials: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['name', 'percentage'],
                    properties: {
                      name: { bsonType: 'string', maxLength: 100 },
                      percentage: {
                        bsonType: 'decimal',
                        minimum: 0,
                        maximum: 100
                      },
                      certified: { bsonType: 'bool' },
                      certification: { bsonType: 'string', maxLength: 200 }
                    }
                  }
                },

                care_instructions: {
                  bsonType: 'array',
                  items: { bsonType: 'string', maxLength: 200 },
                  maxItems: 10
                },

                warranty: {
                  bsonType: 'object',
                  properties: {
                    duration_months: {
                      bsonType: 'int',
                      minimum: 0,
                      maximum: 600 // 50 years max
                    },
                    type: {
                      enum: ['manufacturer', 'store', 'extended', 'none']
                    },
                    coverage: {
                      bsonType: 'array',
                      items: {
                        enum: ['defects', 'wear_and_tear', 'accidental_damage', 'theft']
                      }
                    }
                  }
                },

                // Category-specific specifications (conditional validation)
                electronics: {
                  bsonType: 'object',
                  properties: {
                    brand: {
                      bsonType: 'string',
                      minLength: 2,
                      maxLength: 100,
                      description: 'Electronics must have a brand'
                    },
                    model: {
                      bsonType: 'string',
                      minLength: 1,
                      maxLength: 100,
                      description: 'Electronics must have a model'
                    },
                    power_requirements: {
                      bsonType: 'object',
                      properties: {
                        voltage: { bsonType: 'int', minimum: 1 },
                        wattage: { bsonType: 'int', minimum: 1 },
                        frequency: { bsonType: 'int', minimum: 50, maximum: 60 }
                      }
                    },
                    connectivity: {
                      bsonType: 'array',
                      items: {
                        enum: ['wifi', 'bluetooth', 'ethernet', 'usb', 'hdmi', 'aux', 'nfc']
                      }
                    }
                  }
                }
              }
            },

            // Quality and compliance
            quality_control: {
              bsonType: 'object',
              properties: {
                certifications: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['name', 'issuing_body', 'valid_until'],
                    properties: {
                      name: { bsonType: 'string', maxLength: 200 },
                      issuing_body: { bsonType: 'string', maxLength: 200 },
                      certificate_number: { bsonType: 'string', maxLength: 100 },
                      valid_until: { bsonType: 'date' },
                      document_url: { bsonType: 'string', maxLength: 500 }
                    }
                  }
                },
                safety_warnings: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['type', 'description'],
                    properties: {
                      type: {
                        enum: ['choking_hazard', 'electrical', 'chemical', 'fire', 'sharp_edges', 'other']
                      },
                      description: { bsonType: 'string', maxLength: 500 },
                      age_restriction: { bsonType: 'int', minimum: 0, maximum: 21 }
                    }
                  }
                }
              }
            },

            // Audit and metadata
            created_at: { bsonType: 'date' },
            updated_at: { bsonType: 'date' },
            created_by: { bsonType: 'objectId' },
            last_modified_by: { bsonType: 'objectId' },
            schema_version: {
              bsonType: 'string',
              pattern: '^\\d+\\.\\d+\\.\\d+$'
            }
          }
        }
      }, {
        validationLevel: 'strict',
        validationAction: 'error'
      });

      // Order validation with complex business rules
      await this.createValidatedCollection('orders', {
        $jsonSchema: {
          bsonType: 'object',
          title: 'Order Validation Schema',
          required: ['customer_id', 'items', 'totals', 'status', 'created_at'],
          additionalProperties: false,

          properties: {
            _id: { bsonType: 'objectId' },

            order_number: {
              bsonType: 'string',
              pattern: '^ORD-[0-9]{8}-[A-Z]{3}$',
              description: 'Order number format: ORD-12345678-ABC'
            },

            customer_id: {
              bsonType: 'objectId',
              description: 'Reference to customer profile'
            },

            // Order items with validation
            items: {
              bsonType: 'array',
              minItems: 1,
              maxItems: 100,
              items: {
                bsonType: 'object',
                required: ['product_id', 'quantity', 'unit_price', 'total_price'],
                additionalProperties: false,
                properties: {
                  product_id: { bsonType: 'objectId' },
                  product_name: { bsonType: 'string', maxLength: 500 },
                  sku: { bsonType: 'string' },
                  quantity: {
                    bsonType: 'int',
                    minimum: 1,
                    maximum: 1000
                  },
                  unit_price: {
                    bsonType: 'decimal',
                    minimum: 0
                  },
                  total_price: {
                    bsonType: 'decimal',
                    minimum: 0
                  },
                  discounts_applied: {
                    bsonType: 'array',
                    items: {
                      bsonType: 'object',
                      required: ['type', 'amount'],
                      properties: {
                        type: { bsonType: 'string' },
                        amount: { bsonType: 'decimal' },
                        code: { bsonType: 'string' }
                      }
                    }
                  },
                  customizations: {
                    bsonType: 'object',
                    description: 'Product customization options'
                  }
                }
              },
              description: 'Order must contain 1-100 items'
            },

            // Order totals with validation
            totals: {
              bsonType: 'object',
              required: ['subtotal', 'tax_amount', 'shipping_cost', 'total_amount', 'currency'],
              additionalProperties: false,
              properties: {
                subtotal: {
                  bsonType: 'decimal',
                  minimum: 0,
                  description: 'Subtotal before taxes and shipping'
                },
                tax_amount: {
                  bsonType: 'decimal',
                  minimum: 0,
                  description: 'Total tax amount'
                },
                tax_breakdown: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['type', 'rate', 'amount'],
                    properties: {
                      type: { bsonType: 'string', maxLength: 50 },
                      rate: { bsonType: 'decimal', minimum: 0, maximum: 1 },
                      amount: { bsonType: 'decimal', minimum: 0 }
                    }
                  }
                },
                shipping_cost: {
                  bsonType: 'decimal',
                  minimum: 0,
                  description: 'Shipping and handling cost'
                },
                discount_amount: {
                  bsonType: 'decimal',
                  minimum: 0,
                  description: 'Total discount amount'
                },
                total_amount: {
                  bsonType: 'decimal',
                  minimum: 0.01,
                  description: 'Final order total'
                },
                currency: {
                  bsonType: 'string',
                  pattern: '^[A-Z]{3}$',
                  description: 'ISO 4217 currency code'
                }
              }
            },

            // Order status workflow
            status: {
              bsonType: 'object',
              required: ['current', 'history'],
              additionalProperties: false,
              properties: {
                current: {
                  enum: ['pending', 'confirmed', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded'],
                  description: 'Current order status'
                },
                history: {
                  bsonType: 'array',
                  minItems: 1,
                  items: {
                    bsonType: 'object',
                    required: ['status', 'timestamp'],
                    properties: {
                      status: {
                        enum: ['pending', 'confirmed', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded']
                      },
                      timestamp: { bsonType: 'date' },
                      notes: { bsonType: 'string', maxLength: 1000 },
                      updated_by: { bsonType: 'objectId' }
                    }
                  }
                }
              }
            },

            // Shipping information
            shipping: {
              bsonType: 'object',
              required: ['method', 'address'],
              properties: {
                method: {
                  bsonType: 'object',
                  required: ['carrier', 'service_type', 'estimated_delivery'],
                  properties: {
                    carrier: { bsonType: 'string', maxLength: 100 },
                    service_type: { bsonType: 'string', maxLength: 100 },
                    tracking_number: { bsonType: 'string', maxLength: 100 },
                    estimated_delivery: { bsonType: 'date' },
                    actual_delivery: { bsonType: 'date' }
                  }
                },
                address: {
                  bsonType: 'object',
                  required: ['recipient_name', 'street_address', 'city', 'country'],
                  properties: {
                    recipient_name: { bsonType: 'string', maxLength: 200 },
                    street_address: { bsonType: 'string', maxLength: 500 },
                    city: { bsonType: 'string', maxLength: 100 },
                    state_province: { bsonType: 'string', maxLength: 100 },
                    postal_code: { bsonType: 'string', maxLength: 20 },
                    country: {
                      bsonType: 'string',
                      pattern: '^[A-Z]{2}$',
                      description: 'ISO 3166-1 alpha-2 country code'
                    },
                    special_instructions: { bsonType: 'string', maxLength: 500 }
                  }
                }
              }
            },

            // Payment information
            payment: {
              bsonType: 'object',
              required: ['method', 'status'],
              properties: {
                method: {
                  enum: ['credit_card', 'debit_card', 'paypal', 'bank_transfer', 'digital_wallet', 'cryptocurrency', 'cash_on_delivery'],
                  description: 'Payment method used'
                },
                status: {
                  enum: ['pending', 'authorized', 'captured', 'failed', 'refunded', 'partially_refunded'],
                  description: 'Payment processing status'
                },
                transaction_id: { bsonType: 'string', maxLength: 200 },
                authorization_code: { bsonType: 'string', maxLength: 100 },
                payment_processor: { bsonType: 'string', maxLength: 100 },
                processed_at: { bsonType: 'date' },
                failure_reason: { bsonType: 'string', maxLength: 500 },
                refund_details: {
                  bsonType: 'array',
                  items: {
                    bsonType: 'object',
                    required: ['amount', 'reason', 'processed_at'],
                    properties: {
                      amount: { bsonType: 'decimal', minimum: 0 },
                      reason: { bsonType: 'string', maxLength: 500 },
                      processed_at: { bsonType: 'date' },
                      refund_id: { bsonType: 'string' }
                    }
                  }
                }
              }
            },

            // Audit trail
            created_at: { bsonType: 'date' },
            updated_at: { bsonType: 'date' },
            schema_version: {
              bsonType: 'string',
              pattern: '^\\d+\\.\\d+\\.\\d+$'
            }
          }
        }
      }, {
        validationLevel: 'strict',
        validationAction: 'error'
      });

      console.log('Advanced validation schemas created successfully');
      return true;

    } catch (error) {
      console.error('Error setting up validation collections:', error);
      throw error;
    }
  }

  async createValidatedCollection(collectionName, validationSchema, options = {}) {
    console.log(`Creating validated collection: ${collectionName}`);

    try {
      // Check if collection already exists
      const collections = await this.db.listCollections({ name: collectionName }).toArray();

      if (collections.length > 0) {
        console.log(`Collection ${collectionName} already exists, updating validation`);

        // Update existing collection validation
        await this.db.command({
          collMod: collectionName,
          validator: validationSchema,
          validationLevel: options.validationLevel || this.options.validationLevel,
          validationAction: options.validationAction || this.options.validationAction
        });
      } else {
        // Create new collection with validation
        await this.db.createCollection(collectionName, {
          validator: validationSchema,
          validationLevel: options.validationLevel || this.options.validationLevel,
          validationAction: options.validationAction || this.options.validationAction
        });
      }

      // Store schema for versioning
      this.validationSchemas.set(collectionName, {
        schema: validationSchema,
        version: options.version || '1.0.0',
        createdAt: new Date(),
        ...options
      });

      console.log(`Validation schema applied to collection: ${collectionName}`);
      return true;

    } catch (error) {
      console.error(`Error creating validated collection ${collectionName}:`, error);
      throw error;
    }
  }

  async validateDocument(collectionName, document) {
    console.log(`Validating document for collection: ${collectionName}`);

    try {
      const schema = this.validationSchemas.get(collectionName);
      if (!schema) {
        throw new Error(`No validation schema found for collection: ${collectionName}`);
      }

      // Perform pre-validation checks
      const preValidationResult = await this.performPreValidation(collectionName, document);
      if (!preValidationResult.valid) {
        this.validationMetrics.validationsFailed++;
        return {
          valid: false,
          errors: preValidationResult.errors,
          warnings: preValidationResult.warnings || []
        };
      }

      // Test document against schema by attempting insertion with validation
      const testCollection = this.db.collection(collectionName);

      try {
        // Probe the document inside a transaction that is always aborted, so the test
        // insert is never persisted. Assumes a MongoClient reference is reachable via
        // this.db.client and that the deployment supports transactions (replica set
        // or sharded cluster).
        const session = this.db.client.startSession();

        try {
          session.startTransaction();
          await testCollection.insertOne(document, { session });
        } finally {
          // Abort (ignoring "no such transaction" if the server already aborted it)
          // and always release the session, even when the insert fails validation.
          await session.abortTransaction().catch(() => {});
          await session.endSession();
        }

        this.validationMetrics.validationsPassed++;
        return {
          valid: true,
          errors: [],
          warnings: preValidationResult.warnings || []
        };

      } catch (validationError) {
        this.validationMetrics.validationsFailed++;
        this.validationMetrics.validationErrors.push({
          collection: collectionName,
          error: validationError.message,
          document: document,
          timestamp: new Date()
        });

        return {
          valid: false,
          errors: [this.parseValidationError(validationError)],
          warnings: preValidationResult.warnings || []
        };
      }

    } catch (error) {
      console.error(`Error validating document:`, error);
      return {
        valid: false,
        errors: [`Validation system error: ${error.message}`],
        warnings: []
      };
    }
  }

  async performPreValidation(collectionName, document) {
    // Custom pre-validation logic for business rules
    const warnings = [];
    const errors = [];

    if (collectionName === 'products') {
      // Category-specific validation
      if (document.category?.primary === 'electronics' && !document.specifications?.electronics) {
        errors.push('Electronics products must include electronics specifications');
      }

      // Pricing model validation
      if (document.pricing?.pricing_model === 'tiered' && !document.pricing?.tier_pricing) {
        errors.push('Tiered pricing model requires tier_pricing configuration');
      }

      if (document.pricing?.pricing_model === 'subscription' && !document.pricing?.subscription_options) {
        errors.push('Subscription pricing model requires subscription_options configuration');
      }

      // Stock validation
      if (document.availability?.stock_tracking?.enabled && 
          document.availability?.stock_tracking?.current_stock === undefined) {
        errors.push('Stock tracking enabled but current_stock not provided');
      }

      // Price validation by category
      if (document.category?.primary === 'electronics' && document.pricing?.base_price < 1.00) {
        warnings.push('Electronics products with price below $1.00 are unusual');
      }

      // Warranty validation
      if (document.specifications?.warranty?.duration_months > 120) {
        warnings.push('Warranty period over 10 years is unusual');
      }
    }

    if (collectionName === 'orders') {
      // Order total validation
      const itemsTotal = document.items?.reduce((sum, item) => sum + parseFloat(item.total_price), 0) || 0;
      const calculatedTotal = itemsTotal + parseFloat(document.totals?.tax_amount || 0) + 
                             parseFloat(document.totals?.shipping_cost || 0) - 
                             parseFloat(document.totals?.discount_amount || 0);

      if (Math.abs(calculatedTotal - parseFloat(document.totals?.total_amount || 0)) > 0.01) {
        errors.push('Order total calculation does not match sum of items, tax, and shipping');
      }

      // Status workflow validation
      if (document.status?.current === 'delivered' && !document.shipping?.method?.actual_delivery) {
        warnings.push('Order marked as delivered but no actual delivery date provided');
      }

      if (document.payment?.status === 'failed' && document.status?.current !== 'cancelled') {
        errors.push('Order with failed payment must be cancelled');
      }
    }

    if (collectionName === 'user_profiles') {
      // Age and contact validation
      if (document.age && document.age < 18 && document.contact_info?.phone) {
        warnings.push('Phone contact for users under 18 may require parental consent');
      }

      // Privacy compliance validation
      if (document.preferences?.privacy?.data_processing_consent?.marketing && 
          !document.preferences?.privacy?.data_processing_consent?.given_at) {
        errors.push('Marketing consent requires timestamp when consent was given');
      }

      // Security settings validation
      if (document.security?.two_factor_enabled && !document.security?.backup_codes) {
        warnings.push('Two-factor authentication enabled but no backup codes provided');
      }
    }

    return {
      valid: errors.length === 0,
      errors: errors,
      warnings: warnings
    };
  }

  parseValidationError(error) {
    // Parse MongoDB validation error messages into user-friendly format
    let message = error.message;

    // Extract specific field errors from MongoDB validation messages
    const fieldMatch = message.match(/Document failed validation.*properties\.(\w+)/);
    if (fieldMatch) {
      const field = fieldMatch[1];
      return `Validation failed for field '${field}': ${message}`;
    }

    // Extract type errors
    const typeMatch = message.match(/Expected type (\w+) but found (\w+)/);
    if (typeMatch) {
      return `Type mismatch: Expected ${typeMatch[1]} but received ${typeMatch[2]}`;
    }

    // Extract pattern errors
    const patternMatch = message.match(/String does not match regex pattern/);
    if (patternMatch) {
      return 'Value does not match required format pattern';
    }
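
    // Note: on MongoDB 5.0+, failed validations also return a structured `errInfo`
    // payload describing which schema rules were not satisfied; inspecting
    // error.errInfo (when present) gives more precise feedback than message parsing.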

    return message;
  }

  async getValidationMetrics(collectionName = null) {
    const metrics = {
      ...this.validationMetrics,
      collectionsWithValidation: this.validationSchemas.size,
      schemas: {}
    };

    // Add schema-specific metrics
    for (const [name, schema] of this.validationSchemas.entries()) {
      if (!collectionName || name === collectionName) {
        metrics.schemas[name] = {
          version: schema.version,
          createdAt: schema.createdAt,
          validationLevel: schema.validationLevel,
          validationAction: schema.validationAction
        };
      }
    }

    // Add recent validation errors
    if (collectionName) {
      metrics.recentErrors = this.validationMetrics.validationErrors
        .filter(error => error.collection === collectionName)
        .slice(-10);
    } else {
      metrics.recentErrors = this.validationMetrics.validationErrors.slice(-20);
    }

    return metrics;
  }

  async updateValidationSchema(collectionName, newSchema, version) {
    console.log(`Updating validation schema for collection: ${collectionName}`);

    try {
      // Backup current schema
      const currentSchema = this.validationSchemas.get(collectionName);
      if (currentSchema) {
        await this.backupSchema(collectionName, currentSchema);
      }

      // Update collection validation
      await this.db.command({
        collMod: collectionName,
        validator: newSchema,
        validationLevel: this.options.validationLevel,
        validationAction: this.options.validationAction
      });

      // Update stored schema
      this.validationSchemas.set(collectionName, {
        schema: newSchema,
        version: version,
        createdAt: new Date(),
        previousVersion: currentSchema?.version
      });

      console.log(`Schema updated for collection: ${collectionName} to version: ${version}`);
      return true;

    } catch (error) {
      console.error(`Error updating schema for ${collectionName}:`, error);
      throw error;
    }
  }

  async backupSchema(collectionName, schema) {
    // Store schema backup for rollback purposes
    const backupCollection = this.db.collection('_schema_backups');

    await backupCollection.insertOne({
      collectionName: collectionName,
      schema: schema,
      backedUpAt: new Date()
    });

    console.log(`Schema backed up for collection: ${collectionName}`);
  }

  async generateValidationReport() {
    console.log('Generating comprehensive validation report...');

    const report = {
      reportId: require('crypto').randomUUID(),
      generatedAt: new Date(),

      // Overall metrics
      overview: {
        totalCollectionsWithValidation: this.validationSchemas.size,
        totalValidationsPassed: this.validationMetrics.validationsPassed,
        totalValidationsFailed: this.validationMetrics.validationsFailed,
        successRate: (this.validationMetrics.validationsPassed + this.validationMetrics.validationsFailed) > 0
          ? (this.validationMetrics.validationsPassed /
              (this.validationMetrics.validationsPassed + this.validationMetrics.validationsFailed) * 100).toFixed(2)
          : '0.00',
        lastUpdated: this.validationMetrics.lastUpdated
      },

      // Collection-specific details
      collections: {},

      // Error analysis
      errorAnalysis: {
        totalErrors: this.validationMetrics.validationErrors.length,
        errorsByCollection: {},
        commonErrors: {},
        recentErrors: this.validationMetrics.validationErrors.slice(-10)
      },

      // Recommendations
      recommendations: []
    };

    // Analyze each collection
    for (const [collectionName, schema] of this.validationSchemas.entries()) {
      const collectionErrors = this.validationMetrics.validationErrors
        .filter(error => error.collection === collectionName);

      report.collections[collectionName] = {
        schemaVersion: schema.version,
        validationLevel: schema.validationLevel,
        validationAction: schema.validationAction,
        errorCount: collectionErrors.length,
        lastError: collectionErrors.length > 0 ? collectionErrors[collectionErrors.length - 1] : null
      };

      report.errorAnalysis.errorsByCollection[collectionName] = collectionErrors.length;

      // Generate recommendations
      if (collectionErrors.length > 10) {
        report.recommendations.push({
          type: 'high_error_rate',
          collection: collectionName,
          message: `Collection ${collectionName} has ${collectionErrors.length} validation errors. Consider reviewing schema requirements.`
        });
      }

      if (schema.validationLevel === 'moderate') {
        report.recommendations.push({
          type: 'validation_level',
          collection: collectionName,
          message: `Collection ${collectionName} uses moderate validation. Consider upgrading to strict for better data integrity.`
        });
      }
    }

    // Analyze common error patterns
    const errorMessages = this.validationMetrics.validationErrors.map(error => error.error);
    const errorCounts = {};
    errorMessages.forEach(msg => {
      const key = msg.substring(0, 50) + '...';
      errorCounts[key] = (errorCounts[key] || 0) + 1;
    });

    report.errorAnalysis.commonErrors = Object.entries(errorCounts)
      .sort(([,a], [,b]) => b - a)
      .slice(0, 10)
      .reduce((obj, [key, count]) => ({ ...obj, [key]: count }), {});

    return report;
  }
}

// Example usage and testing (assumes `db` is a connected MongoDB Db handle)
const validationSystem = new MongoDBSchemaValidator(db, {
  validationLevel: 'strict',
  validationAction: 'error',
  enableVersioning: true,
  enableMetrics: true
});
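
// A minimal usage sketch (illustrative): assumes `db` points at a deployment that
// supports transactions (validateDocument probes documents inside an aborted
// transaction) and that setupValidationCollections() has finished registering the
// schemas before validation is attempted.
async function runValidationExample() {
  const candidate = {
    email: 'jane.doe@example.com',
    username: 'jane_doe',
    profile_type: 'individual',
    created_at: new Date()
  };

  const result = await validationSystem.validateDocument('user_profiles', candidate);
  if (!result.valid) {
    console.error('Validation failed:', result.errors);
  }

  // Aggregate pass/fail counters and recent errors for the collection
  console.log(await validationSystem.getValidationMetrics('user_profiles'));
}

// runValidationExample().catch(console.error);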

// Benefits of MongoDB Schema Validation:
// - Database-level data integrity enforcement
// - Flexible validation rules with conditional logic
// - Support for complex nested document validation
// - Real-time validation with detailed error reporting
// - Schema versioning and migration capabilities
// - Business rule enforcement at the database level
// - Integration with application development workflows
// - Comprehensive validation metrics and reporting
// - Support for gradual migration and validation levels
// - Advanced error handling and user-friendly feedback

module.exports = {
  MongoDBSchemaValidator
};

Understanding MongoDB Schema Validation Architecture

Advanced Validation Patterns and Business Rule Enforcement

Implement sophisticated validation strategies for production MongoDB deployments:

// Production-ready MongoDB Schema Validation with advanced business rules
class ProductionSchemaValidationManager extends MongoDBSchemaValidator {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableConditionalValidation: true,
      enableCrossCollectionValidation: true,
      enableDataMigration: true,
      enableComplianceValidation: true,
      enablePerformanceOptimization: true
    };

    this.setupProductionValidationFeatures();
    this.initializeComplianceFrameworks();
    this.setupValidationMiddleware();
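    // Note: these three setup helpers, along with deployConditionalValidationRules(),
    // implementComplianceFrameworks(), and deployMigrationStrategies() referenced by
    // the methods below, are assumed to be implemented elsewhere in the production
    // codebase; they are left as extension points in this sketch.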
  }

  async implementAdvancedValidationPatterns() {
    console.log('Implementing advanced validation patterns...');

    // Conditional validation based on document context
    const conditionalValidationRules = {
      // User profile validation based on account type
      userProfileConditional: {
        $or: [
          {
            profile_type: 'individual',
            $and: [
              { age: { $gte: 13 } },
              { full_name: { $exists: true } }
            ]
          },
          {
            profile_type: 'business',
            $and: [
              { business_info: { $exists: true } },
              { 'business_info.registration_number': { $exists: true } },
              { 'business_info.tax_id': { $exists: true } }
            ]
          },
          {
            profile_type: 'organization',
            $and: [
              { organization_info: { $exists: true } },
              { 'organization_info.type': { $in: ['nonprofit', 'government', 'educational'] } }
            ]
          }
        ]
      },

      // Product validation based on category
      productCategoryConditional: {
        $or: [
          {
            'category.primary': 'electronics',
            $and: [
              { 'specifications.electronics.brand': { $exists: true } },
              { 'specifications.electronics.model': { $exists: true } },
              { 'specifications.warranty.duration_months': { $gte: 12 } }
            ]
          },
          {
            'category.primary': 'clothing',
            $and: [
              { 'specifications.materials': { $exists: true } },
              { 'specifications.care_instructions': { $exists: true } }
            ]
          },
          {
            'category.primary': { $in: ['food', 'supplements'] },
            $and: [
              { 'specifications.nutrition_facts': { $exists: true } },
              { 'specifications.allergen_info': { $exists: true } },
              { expiration_date: { $exists: true } }
            ]
          }
        ]
      }
    };

    return await this.deployConditionalValidationRules(conditionalValidationRules);
  }

  async setupComplianceValidationFrameworks() {
    console.log('Setting up compliance validation frameworks...');

    const complianceFrameworks = {
      // GDPR compliance validation
      gdprCompliance: {
        userDataProcessing: {
          $and: [
            { 'preferences.privacy.data_processing_consent.given_at': { $exists: true } },
            { 'preferences.privacy.data_processing_consent.ip_address': { $exists: true } },
            { 'preferences.privacy.data_processing_consent.analytics': { $type: 'bool' } },
            { 'preferences.privacy.data_processing_consent.marketing': { $type: 'bool' } }
          ]
        },
        dataRetention: {
          $or: [
            { account_status: 'active' },
            { 
              $and: [
                { account_status: 'closed' },
                { data_retention_expiry: { $gte: new Date() } }
              ]
            }
          ]
        }
      },

      // PCI DSS compliance for payment data
      pciCompliance: {
        paymentDataHandling: {
          $and: [
            { 'payment.card_number': { $exists: false } }, // No plain text card numbers
            { 'payment.cvv': { $exists: false } }, // No CVV storage
            { 'payment.transaction_id': { $exists: true } },
            { 'payment.payment_processor': { $exists: true } }
          ]
        }
      },

      // SOX compliance for financial records
      soxCompliance: {
        financialRecordIntegrity: {
          $and: [
            { audit_trail: { $exists: true } },
            { 'audit_trail.created_by': { $exists: true } },
            { 'audit_trail.last_modified_by': { $exists: true } },
            { 'audit_trail.approval_chain': { $exists: true } }
          ]
        }
      }
    };

    return await this.implementComplianceFrameworks(complianceFrameworks);
  }

  async performCrossCollectionValidation(collectionName, document) {
    console.log(`Performing cross-collection validation for: ${collectionName}`);

    const crossValidationRules = [];

    if (collectionName === 'orders') {
      // Validate customer exists
      const customer = await this.db.collection('user_profiles')
        .findOne({ _id: document.customer_id });

      if (!customer) {
        crossValidationRules.push({
          field: 'customer_id',
          error: 'Customer does not exist'
        });
      } else if (customer.account_status?.status !== 'active') {
        crossValidationRules.push({
          field: 'customer_id',
          error: 'Customer account is not active'
        });
      }

      // Validate products exist and are available
      for (const item of document.items || []) {
        const product = await this.db.collection('products')
          .findOne({ _id: item.product_id });

        if (!product) {
          crossValidationRules.push({
            field: `items.product_id`,
            error: `Product ${item.product_id} does not exist`
          });
        } else {
          // Check product availability
          if (product.availability?.status !== 'available') {
            crossValidationRules.push({
              field: `items.product_id`,
              error: `Product ${product.name} is not available`
            });
          }

          // Check stock if tracking is enabled
          if (product.availability?.stock_tracking?.enabled) {
            const availableStock = product.availability.stock_tracking.current_stock - 
                                  (product.availability.stock_tracking.reserved_stock || 0);

            if (item.quantity > availableStock) {
              crossValidationRules.push({
                field: `items.quantity`,
                error: `Insufficient stock for ${product.name}. Available: ${availableStock}, Requested: ${item.quantity}`
              });
            }
          }

          // Validate pricing consistency
          if (Math.abs(parseFloat(item.unit_price) - parseFloat(product.pricing.base_price)) > 0.01) {
            crossValidationRules.push({
              field: `items.unit_price`,
              warning: `Unit price for ${product.name} may be outdated`
            });
          }
        }
      }
    }

    if (collectionName === 'user_profiles') {
      // Check for duplicate email addresses
      const existingUser = await this.db.collection('user_profiles')
        .findOne({ 
          email: document.email,
          _id: { $ne: document._id }
        });

      if (existingUser) {
        crossValidationRules.push({
          field: 'email',
          error: 'Email address is already registered'
        });
      }

      // Check for duplicate usernames
      const existingUsername = await this.db.collection('user_profiles')
        .findOne({ 
          username: document.username,
          _id: { $ne: document._id }
        });

      if (existingUsername) {
        crossValidationRules.push({
          field: 'username',
          error: 'Username is already taken'
        });
      }
    }

    return {
      valid: crossValidationRules.filter(rule => rule.error).length === 0,
      errors: crossValidationRules.filter(rule => rule.error),
      warnings: crossValidationRules.filter(rule => rule.warning)
    };
  }

  async implementDataMigrationValidation() {
    console.log('Implementing data migration validation strategies...');

    const migrationStrategies = {
      // Gradual validation rollout
      gradualValidation: {
        phase1: { validationLevel: 'off' }, // No validation
        phase2: { validationLevel: 'moderate', validationAction: 'warn' }, // Warnings only
        phase3: { validationLevel: 'moderate', validationAction: 'error' }, // Moderate validation
        phase4: { validationLevel: 'strict', validationAction: 'error' } // Full validation
      },

      // Schema version migration
      schemaVersioning: {
        v1_to_v2: {
          transformationRules: {
            'old_field': 'new_field',
            'deprecated_structure': 'new_structure'
          },
          validationOverrides: {
            allowMissingFields: ['optional_new_field'],
            temporaryRules: {
              'legacy_format': { $exists: true }
            }
          }
        }
      },

      // Data quality improvement
      dataQualityEnforcement: {
        cleanupRules: [
          { field: 'email', action: 'trim_and_lowercase' },
          { field: 'phone', action: 'normalize_format' },
          { field: 'tags', action: 'remove_duplicates' }
        ],
        enrichmentRules: [
          { field: 'created_at', action: 'set_if_missing', value: new Date() },
          { field: 'schema_version', action: 'set_current_version' }
        ]
      }
    };

    return await this.deployMigrationStrategies(migrationStrategies);
  }
}
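
The cross-collection checks above only add value if they run before an order is persisted. The sketch below shows one way an application layer might chain schema validation and cross-collection validation ahead of the insert; the createOrder helper, the db handle, and the assumption that the constructor's setup helpers are implemented are illustrative rather than part of the class above.

// Hypothetical pre-insert guard combining both validation layers (sketch only;
// assumes the ProductionSchemaValidationManager setup helpers exist)
async function createOrder(db, orderDocument) {
  const manager = new ProductionSchemaValidationManager(db, {
    validationLevel: 'strict',
    validationAction: 'error'
  });

  // Structural validation against the orders $jsonSchema
  const schemaResult = await manager.validateDocument('orders', orderDocument);
  if (!schemaResult.valid) {
    throw new Error(`Schema validation failed: ${schemaResult.errors.join('; ')}`);
  }

  // Referential checks: customer active, products exist and in stock, prices current
  const crossResult = await manager.performCrossCollectionValidation('orders', orderDocument);
  if (!crossResult.valid) {
    throw new Error(`Cross-collection validation failed: ${JSON.stringify(crossResult.errors)}`);
  }

  // Surface non-blocking issues (e.g. stale unit prices) without rejecting the order
  crossResult.warnings.forEach(w => console.warn('Order warning:', w.warning));

  return db.collection('orders').insertOne(orderDocument);
}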

SQL-Style Schema Validation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB schema validation and data integrity operations:

-- QueryLeaf advanced schema validation with SQL-familiar syntax

-- Create collections with comprehensive validation rules
CREATE COLLECTION user_profiles
WITH VALIDATION (
  -- Basic field requirements and types
  email VARCHAR(320) NOT NULL UNIQUE 
    PATTERN '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
  username VARCHAR(30) NOT NULL UNIQUE 
    PATTERN '^[a-zA-Z0-9_-]{3,30}$',
  full_name VARCHAR(200) NOT NULL,
  profile_type ENUM('individual', 'business', 'organization', 'developer') NOT NULL,
  age INT CHECK (age >= 13 AND age <= 150),

  -- Complex nested object validation
  account_status OBJECT (
    status ENUM('active', 'inactive', 'suspended', 'pending_verification', 'closed') NOT NULL,
    last_updated DATETIME NOT NULL,
    reason VARCHAR(500),
    updated_by OBJECTID
  ) NOT NULL,

  -- Contact information with regional validation
  contact_info OBJECT (
    phone OBJECT (
      country_code VARCHAR(5) PATTERN '^\+[1-9][0-9]{0,3}$' NOT NULL,
      number VARCHAR(15) PATTERN '^[0-9]{7,15}$' NOT NULL,
      verified BOOLEAN NOT NULL,
      verified_at DATETIME
    ),
    address OBJECT (
      street VARCHAR(200),
      city VARCHAR(100) NOT NULL,
      state_province VARCHAR(100),
      postal_code VARCHAR(20),
      country CHAR(2) PATTERN '^[A-Z]{2}$' NOT NULL
    )
  ),

  -- Nested preferences with conditional validation
  preferences OBJECT (
    notifications OBJECT (
      email OBJECT (
        enabled BOOLEAN NOT NULL,
        frequency ENUM('immediate', 'daily', 'weekly', 'never'),
        categories ARRAY OF ENUM('security', 'marketing', 'product_updates', 'billing') UNIQUE
      ) NOT NULL,
      push OBJECT (
        enabled BOOLEAN NOT NULL,
        quiet_hours OBJECT (
          enabled BOOLEAN NOT NULL,
          start_time TIME PATTERN '^([01]?[0-9]|2[0-3]):[0-5][0-9]$',
          end_time TIME PATTERN '^([01]?[0-9]|2[0-3]):[0-5][0-9]$'
        )
      ) NOT NULL,
      sms OBJECT (
        enabled BOOLEAN NOT NULL,
        emergency_only BOOLEAN
      ) NOT NULL
    ),
    privacy OBJECT (
      profile_visibility ENUM('public', 'friends_only', 'private') NOT NULL,
      search_visibility BOOLEAN,
      data_processing_consent OBJECT (
        analytics BOOLEAN NOT NULL,
        marketing BOOLEAN NOT NULL,
        third_party_sharing BOOLEAN,
        given_at DATETIME NOT NULL,
        ip_address VARCHAR(45),
        user_agent TEXT
      ) NOT NULL
    ) NOT NULL
  ),

  -- Security settings
  security OBJECT (
    two_factor_enabled BOOLEAN,
    backup_codes ARRAY OF VARCHAR(8) PATTERN '^[A-Z0-9]{8}$' MAX_SIZE 10 UNIQUE,
    security_questions ARRAY OF OBJECT (
      question VARCHAR(200) NOT NULL,
      answer_hash VARCHAR(255) NOT NULL,
      created_at DATETIME
    ) MAX_SIZE 5
  ),

  -- Audit fields
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE,
  created_by OBJECTID,
  schema_version VARCHAR(10) PATTERN '^\d+\.\d+\.\d+$'

) WITH (
  validation_level = 'strict',
  validation_action = 'error',
  additional_properties = false
);

-- Product collection with category-specific conditional validation
CREATE COLLECTION products
WITH VALIDATION (
  name VARCHAR(500) NOT NULL,
  description TEXT MAX_LENGTH 5000,
  sku VARCHAR(20) PATTERN '^[A-Z0-9]{3,20}$' UNIQUE,

  -- Category with hierarchical structure
  category OBJECT (
    primary ENUM('electronics', 'clothing', 'home_garden', 'books', 'sports', 'automotive', 'health', 'toys') NOT NULL,
    secondary VARCHAR(100),
    path ARRAY OF VARCHAR(100) MIN_SIZE 1 MAX_SIZE 5 NOT NULL,
    tags ARRAY OF VARCHAR(50) PATTERN '^[a-z0-9_-]+$' MAX_SIZE 20 UNIQUE
  ) NOT NULL,

  -- Complex pricing structure
  pricing OBJECT (
    base_price DECIMAL(10,2) CHECK (base_price > 0) NOT NULL,
    currency CHAR(3) PATTERN '^[A-Z]{3}$' NOT NULL,
    pricing_model ENUM('fixed', 'tiered', 'subscription', 'auction', 'negotiable') NOT NULL,

    -- Conditional validation based on pricing model
    tier_pricing ARRAY OF OBJECT (
      min_quantity INT CHECK (min_quantity >= 1) NOT NULL,
      price_per_unit DECIMAL(10,2) CHECK (price_per_unit > 0) NOT NULL,
      description VARCHAR(200)
    ) -- Required when pricing_model = 'tiered'
    CHECK (
      (pricing_model != 'tiered') OR 
      (pricing_model = 'tiered' AND tier_pricing IS NOT NULL AND ARRAY_LENGTH(tier_pricing) > 0)
    ),

    subscription_options OBJECT (
      billing_cycles ARRAY OF ENUM('monthly', 'quarterly', 'annually', 'biennial') MIN_SIZE 1 NOT NULL,
      trial_period_days INT CHECK (trial_period_days >= 0 AND trial_period_days <= 365)
    ) -- Required when pricing_model = 'subscription'
    CHECK (
      (pricing_model != 'subscription') OR 
      (pricing_model = 'subscription' AND subscription_options IS NOT NULL)
    ),

    discounts ARRAY OF OBJECT (
      type ENUM('percentage', 'fixed_amount', 'buy_x_get_y') NOT NULL,
      value DECIMAL(8,2) CHECK (value >= 0) NOT NULL,
      min_purchase_amount DECIMAL(10,2) CHECK (min_purchase_amount >= 0),
      valid_from DATETIME NOT NULL,
      valid_until DATETIME NOT NULL,
      max_uses INT CHECK (max_uses >= 1),
      code VARCHAR(20) PATTERN '^[A-Z0-9]{4,20}$',
      CHECK (valid_until > valid_from)
    ) MAX_SIZE 10
  ) NOT NULL,

  -- Availability and inventory
  availability OBJECT (
    status ENUM('available', 'out_of_stock', 'discontinued', 'coming_soon', 'back_order') NOT NULL,
    stock_tracking OBJECT (
      enabled BOOLEAN NOT NULL,
      current_stock INT CHECK (current_stock >= 0), -- Required when enabled = true
      reserved_stock INT CHECK (reserved_stock >= 0),
      low_stock_threshold INT CHECK (low_stock_threshold >= 0),
      max_order_quantity INT CHECK (max_order_quantity >= 1),
      CHECK (
        (enabled = false) OR 
        (enabled = true AND current_stock IS NOT NULL)
      )
    ) NOT NULL
  ) NOT NULL,

  -- Category-specific specifications with conditional validation
  specifications OBJECT (
    -- Electronics-specific fields (required when category.primary = 'electronics')
    electronics OBJECT (
      brand VARCHAR(100) NOT NULL,
      model VARCHAR(100) NOT NULL,
      power_requirements OBJECT (
        voltage INT CHECK (voltage > 0),
        wattage INT CHECK (wattage > 0),
        frequency INT CHECK (frequency IN (50, 60))
      ),
      connectivity ARRAY OF ENUM('wifi', 'bluetooth', 'ethernet', 'usb', 'hdmi', 'aux', 'nfc')
    ) CHECK (
      (category.primary != 'electronics') OR 
      (category.primary = 'electronics' AND electronics IS NOT NULL)
    ),

    -- Clothing-specific fields (required when category.primary = 'clothing')
    clothing OBJECT (
      sizes ARRAY OF VARCHAR(10) MIN_SIZE 1 NOT NULL,
      colors ARRAY OF VARCHAR(50) MIN_SIZE 1 NOT NULL,
      materials ARRAY OF OBJECT (
        name VARCHAR(100) NOT NULL,
        percentage DECIMAL(5,2) CHECK (percentage > 0 AND percentage <= 100) NOT NULL
      ) NOT NULL,
      care_instructions ARRAY OF VARCHAR(200) MAX_SIZE 10
    ) CHECK (
      (category.primary != 'clothing') OR 
      (category.primary = 'clothing' AND clothing IS NOT NULL)
    ),

    -- Common specifications for all products
    dimensions OBJECT (
      length DECIMAL(8,2) CHECK (length > 0),
      width DECIMAL(8,2) CHECK (width > 0),
      height DECIMAL(8,2) CHECK (height > 0),
      weight DECIMAL(8,2) CHECK (weight > 0),
      unit ENUM('metric', 'imperial') NOT NULL
    ),

    warranty OBJECT (
      duration_months INT CHECK (duration_months >= 0 AND duration_months <= 600),
      type ENUM('manufacturer', 'store', 'extended', 'none'),
      coverage ARRAY OF ENUM('defects', 'wear_and_tear', 'accidental_damage', 'theft')
    )
  ),

  -- Quality and compliance
  quality_control OBJECT (
    certifications ARRAY OF OBJECT (
      name VARCHAR(200) NOT NULL,
      issuing_body VARCHAR(200) NOT NULL,
      certificate_number VARCHAR(100),
      valid_until DATETIME NOT NULL,
      document_url TEXT
    ),
    safety_warnings ARRAY OF OBJECT (
      type ENUM('choking_hazard', 'electrical', 'chemical', 'fire', 'sharp_edges', 'other') NOT NULL,
      description VARCHAR(500) NOT NULL,
      age_restriction INT CHECK (age_restriction >= 0 AND age_restriction <= 21)
    )
  ),

  -- Audit trail
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE,
  created_by OBJECTID,
  last_modified_by OBJECTID,
  schema_version VARCHAR(10) PATTERN '^\d+\.\d+\.\d+$'

) WITH (
  validation_level = 'strict',
  validation_action = 'error'
);

-- Order collection with complex business rule validation
CREATE COLLECTION orders
WITH VALIDATION (
  order_number VARCHAR(20) PATTERN '^ORD-[0-9]{8}-[A-Z]{3}$' UNIQUE,
  customer_id OBJECTID NOT NULL REFERENCES user_profiles(_id),

  -- Order items with item-level validation
  items ARRAY OF OBJECT (
    product_id OBJECTID NOT NULL REFERENCES products(_id),
    product_name VARCHAR(500),
    sku VARCHAR(20),
    quantity INT CHECK (quantity >= 1 AND quantity <= 1000) NOT NULL,
    unit_price DECIMAL(10,2) CHECK (unit_price >= 0) NOT NULL,
    total_price DECIMAL(10,2) CHECK (total_price >= 0) NOT NULL,

    -- Validate that total_price = quantity * unit_price
    CHECK (ABS(total_price - (quantity * unit_price)) < 0.01),

    discounts_applied ARRAY OF OBJECT (
      type VARCHAR(50) NOT NULL,
      amount DECIMAL(8,2) NOT NULL,
      code VARCHAR(20)
    ),
    customizations OBJECT
  ) MIN_SIZE 1 MAX_SIZE 100 NOT NULL,

  -- Order totals with cross-field validation
  totals OBJECT (
    subtotal DECIMAL(10,2) CHECK (subtotal >= 0) NOT NULL,
    tax_amount DECIMAL(10,2) CHECK (tax_amount >= 0) NOT NULL,
    shipping_cost DECIMAL(10,2) CHECK (shipping_cost >= 0) NOT NULL,
    discount_amount DECIMAL(10,2) CHECK (discount_amount >= 0),
    total_amount DECIMAL(10,2) CHECK (total_amount >= 0.01) NOT NULL,
    currency CHAR(3) PATTERN '^[A-Z]{3}$' NOT NULL,

    tax_breakdown ARRAY OF OBJECT (
      type VARCHAR(50) NOT NULL,
      rate DECIMAL(6,4) CHECK (rate >= 0 AND rate <= 1) NOT NULL,
      amount DECIMAL(10,2) CHECK (amount >= 0) NOT NULL
    ),

    -- Validate total calculation
    CHECK (
      ABS(total_amount - (subtotal + tax_amount + shipping_cost - COALESCE(discount_amount, 0))) < 0.01
    )
  ) NOT NULL,

  -- Order status with workflow validation
  status OBJECT (
    current ENUM('pending', 'confirmed', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded') NOT NULL,
    history ARRAY OF OBJECT (
      status ENUM('pending', 'confirmed', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded') NOT NULL,
      timestamp DATETIME NOT NULL,
      notes TEXT,
      updated_by OBJECTID
    ) MIN_SIZE 1 NOT NULL
  ) NOT NULL,

  -- Shipping information
  shipping OBJECT (
    method OBJECT (
      carrier VARCHAR(100) NOT NULL,
      service_type VARCHAR(100) NOT NULL,
      tracking_number VARCHAR(100),
      estimated_delivery DATETIME NOT NULL,
      actual_delivery DATETIME,
      CHECK (actual_delivery IS NULL OR actual_delivery >= estimated_delivery)
    ) NOT NULL,
    address OBJECT (
      recipient_name VARCHAR(200) NOT NULL,
      street_address VARCHAR(500) NOT NULL,
      city VARCHAR(100) NOT NULL,
      state_province VARCHAR(100),
      postal_code VARCHAR(20),
      country CHAR(2) PATTERN '^[A-Z]{2}$' NOT NULL,
      special_instructions TEXT
    ) NOT NULL
  ),

  -- Payment information with validation
  payment OBJECT (
    method ENUM('credit_card', 'debit_card', 'paypal', 'bank_transfer', 'digital_wallet', 'cryptocurrency', 'cash_on_delivery') NOT NULL,
    status ENUM('pending', 'authorized', 'captured', 'failed', 'refunded', 'partially_refunded') NOT NULL,
    transaction_id VARCHAR(200),
    authorization_code VARCHAR(100),
    payment_processor VARCHAR(100),
    processed_at DATETIME,
    failure_reason TEXT,

    refund_details ARRAY OF OBJECT (
      amount DECIMAL(10,2) CHECK (amount > 0) NOT NULL,
      reason TEXT NOT NULL,
      processed_at DATETIME NOT NULL,
      refund_id VARCHAR(100)
    ),

    -- Business rule: a failed payment must leave the order in the cancelled state
    -- (payment.status is this object's field; status.current is the top-level order status)
    CHECK (
      (payment.status != 'failed') OR
      (payment.status = 'failed' AND status.current = 'cancelled')
    )
  ),

  -- Audit fields
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE,
  schema_version VARCHAR(10) PATTERN '^\d+\.\d+\.\d+$'

) WITH (
  validation_level = 'strict',
  validation_action = 'error'
);

-- Data validation analysis and reporting queries

-- Comprehensive validation status report
WITH validation_metrics AS (
  SELECT 
    collection_name,
    validation_level,
    validation_action,
    schema_version,

    -- Document count and validation statistics
    COUNT(*) as total_documents,
    COUNT(*) FILTER (WHERE validation_passed = true) as valid_documents,
    COUNT(*) FILTER (WHERE validation_passed = false) as invalid_documents,

    -- Calculate data quality score
    (COUNT(*) FILTER (WHERE validation_passed = true)::numeric / COUNT(*)) * 100 as data_quality_percent,

    -- Validation error analysis
    COUNT(DISTINCT validation_error_type) as unique_error_types,
    MODE() WITHIN GROUP (ORDER BY validation_error_type) as most_common_error,

    -- Recent validation trends
    COUNT(*) FILTER (WHERE validated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours') as validations_last_24h,
    COUNT(*) FILTER (WHERE validation_passed = false AND validated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours') as errors_last_24h

  FROM VALIDATION_RESULTS()
  GROUP BY collection_name, validation_level, validation_action, schema_version
),

validation_error_details AS (
  SELECT 
    collection_name,
    validation_error_type,
    validation_error_field,
    COUNT(*) as error_frequency,
    AVG(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - first_occurred))) as avg_age_seconds,
    array_agg(
      json_build_object(
        'document_id', document_id,
        'error_message', validation_error_message,
        'occurred_at', occurred_at
      ) ORDER BY occurred_at DESC
    )[1:5] as recent_examples

  FROM VALIDATION_ERRORS()
  WHERE occurred_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY collection_name, validation_error_type, validation_error_field
),

collection_health_assessment AS (
  SELECT 
    vm.collection_name,
    vm.total_documents,
    vm.data_quality_percent,
    vm.validations_last_24h,
    vm.errors_last_24h,

    -- Health status determination
    CASE 
      WHEN vm.data_quality_percent >= 99.5 THEN 'EXCELLENT'
      WHEN vm.data_quality_percent >= 95.0 THEN 'GOOD'
      WHEN vm.data_quality_percent >= 90.0 THEN 'FAIR'
      WHEN vm.data_quality_percent >= 80.0 THEN 'POOR'
      ELSE 'CRITICAL'
    END as health_status,

    -- Trending analysis
    CASE 
      WHEN vm.errors_last_24h = 0 THEN 'STABLE'
      WHEN vm.errors_last_24h <= vm.total_documents * 0.01 THEN 'MINOR_ISSUES'
      WHEN vm.errors_last_24h <= vm.total_documents * 0.05 THEN 'MODERATE_ISSUES'
      ELSE 'SIGNIFICANT_ISSUES'
    END as trend_status,

    -- Top error types
    array_agg(
      json_build_object(
        'error_type', ved.validation_error_type,
        'field', ved.validation_error_field,
        'frequency', ved.error_frequency,
        'avg_age_hours', ROUND(ved.avg_age_seconds / 3600.0, 1)
      ) ORDER BY ved.error_frequency DESC
    )[1:3] as top_errors

  FROM validation_metrics vm
  LEFT JOIN validation_error_details ved ON vm.collection_name = ved.collection_name
  GROUP BY vm.collection_name, vm.total_documents, vm.data_quality_percent, 
           vm.validations_last_24h, vm.errors_last_24h
)

SELECT 
  collection_name,
  total_documents,
  ROUND(data_quality_percent, 2) as data_quality_pct,
  health_status,
  trend_status,
  validations_last_24h,
  errors_last_24h,
  top_errors,

  -- Recommendations based on health status
  CASE health_status
    WHEN 'CRITICAL' THEN 'URGENT: Review validation rules and fix data quality issues immediately'
    WHEN 'POOR' THEN 'Review validation errors and implement data cleanup procedures'
    WHEN 'FAIR' THEN 'Monitor validation trends and address recurring error patterns'
    WHEN 'GOOD' THEN 'Continue monitoring and maintain current validation standards'
    ELSE 'Data quality is excellent - consider sharing best practices'
  END as recommendation,

  -- Priority level for remediation
  CASE 
    WHEN health_status IN ('CRITICAL', 'POOR') AND trend_status = 'SIGNIFICANT_ISSUES' THEN 'P0_CRITICAL'
    WHEN health_status = 'POOR' OR trend_status = 'SIGNIFICANT_ISSUES' THEN 'P1_HIGH'
    WHEN health_status = 'FAIR' AND trend_status = 'MODERATE_ISSUES' THEN 'P2_MEDIUM'
    WHEN trend_status = 'MINOR_ISSUES' THEN 'P3_LOW'
    ELSE 'P4_MONITORING'
  END as priority_level,

  CURRENT_TIMESTAMP as report_generated_at

FROM collection_health_assessment
ORDER BY 
  CASE health_status
    WHEN 'CRITICAL' THEN 1
    WHEN 'POOR' THEN 2 
    WHEN 'FAIR' THEN 3
    WHEN 'GOOD' THEN 4
    ELSE 5
  END,
  errors_last_24h DESC;

-- Advanced validation rule analysis
WITH validation_rule_effectiveness AS (
  SELECT 
    vr.collection_name,
    vr.rule_name,
    vr.rule_type,
    vr.field_path,

    -- Rule utilization metrics
    COUNT(DISTINCT ve.document_id) as documents_validated,
    COUNT(*) FILTER (WHERE ve.validation_passed = false) as violations_caught,
    COUNT(*) FILTER (WHERE ve.validation_passed = true) as validations_passed,

    -- Effectiveness calculation
    CASE 
      WHEN COUNT(*) > 0 THEN
        (COUNT(*) FILTER (WHERE ve.validation_passed = false)::numeric / COUNT(*)) * 100
      ELSE 0
    END as violation_rate_percent,

    -- Performance impact
    AVG(ve.validation_duration_ms) as avg_validation_time_ms,
    MAX(ve.validation_duration_ms) as max_validation_time_ms,

    -- Rule complexity assessment
    CASE vr.rule_type
      WHEN 'simple_type_check' THEN 1
      WHEN 'pattern_match' THEN 2
      WHEN 'range_check' THEN 2
      WHEN 'conditional_logic' THEN 4
      WHEN 'cross_field_validation' THEN 5
      WHEN 'cross_collection_validation' THEN 8
      ELSE 3
    END as complexity_score

  FROM VALIDATION_RULES() vr
  LEFT JOIN VALIDATION_EVENTS() ve ON (
    vr.collection_name = ve.collection_name AND 
    vr.rule_name = ve.rule_triggered AND
    -- Keep the time filter in the ON clause so rules with no recent events
    -- are not silently dropped by the LEFT JOIN
    ve.validated_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  )
  GROUP BY vr.collection_name, vr.rule_name, vr.rule_type, vr.field_path
),

rule_optimization_analysis AS (
  SELECT 
    vre.*,

    -- Performance classification
    CASE 
      WHEN avg_validation_time_ms > 1000 THEN 'SLOW'
      WHEN avg_validation_time_ms > 100 THEN 'MODERATE'
      ELSE 'FAST'
    END as performance_class,

    -- Effectiveness classification  
    CASE 
      WHEN violation_rate_percent > 10 THEN 'HIGH_VIOLATION'
      WHEN violation_rate_percent > 5 THEN 'MODERATE_VIOLATION'
      WHEN violation_rate_percent > 0 THEN 'LOW_VIOLATION'
      ELSE 'NO_VIOLATIONS'
    END as effectiveness_class,

    -- Optimization recommendations
    CASE 
      WHEN avg_validation_time_ms > 1000 AND violation_rate_percent = 0 THEN 'Consider removing or simplifying unused rule'
      WHEN avg_validation_time_ms > 500 AND violation_rate_percent < 1 THEN 'Rule may be too strict or complex'
      WHEN violation_rate_percent > 15 THEN 'High violation rate indicates data quality issues'
      WHEN complexity_score > 6 AND avg_validation_time_ms > 100 THEN 'Complex rule impacting performance'
      ELSE 'Rule is operating within normal parameters'
    END as optimization_recommendation

  FROM validation_rule_effectiveness vre
)

SELECT 
  collection_name,
  rule_name,
  rule_type,
  field_path,
  documents_validated,
  violations_caught,
  ROUND(violation_rate_percent, 2) as violation_rate_pct,
  ROUND(avg_validation_time_ms, 2) as avg_validation_ms,
  complexity_score,
  performance_class,
  effectiveness_class,
  optimization_recommendation,

  -- Priority for optimization
  CASE 
    WHEN performance_class = 'SLOW' AND effectiveness_class = 'NO_VIOLATIONS' THEN 'HIGH_PRIORITY'
    WHEN performance_class = 'SLOW' AND violation_rate_percent < 1 THEN 'MEDIUM_PRIORITY'
    WHEN effectiveness_class = 'HIGH_VIOLATION' THEN 'DATA_QUALITY_ISSUE'
    ELSE 'LOW_PRIORITY'
  END as optimization_priority

FROM rule_optimization_analysis
WHERE documents_validated > 0
ORDER BY 
  CASE optimization_priority
    WHEN 'HIGH_PRIORITY' THEN 1
    WHEN 'DATA_QUALITY_ISSUE' THEN 2
    WHEN 'MEDIUM_PRIORITY' THEN 3
    ELSE 4
  END,
  avg_validation_time_ms DESC;

-- QueryLeaf provides comprehensive MongoDB schema validation capabilities:
-- 1. SQL-familiar validation syntax with complex nested object support
-- 2. Conditional validation rules based on document context and business logic
-- 3. Cross-field and cross-collection validation for referential integrity
-- 4. Advanced pattern matching and constraint enforcement
-- 5. Comprehensive validation reporting and error analysis
-- 6. Performance monitoring and rule optimization recommendations
-- 7. Schema versioning and migration support with gradual enforcement
-- 8. Compliance framework integration for regulatory requirements
-- 9. Real-time validation metrics and health monitoring
-- 10. Production-ready validation management with automated optimization

Best Practices for Production Schema Validation

Validation Strategy Design

Essential principles for effective MongoDB schema validation implementation:

  1. Incremental Implementation: Start with moderate validation levels and gradually increase strictness
  2. Business Rule Alignment: Ensure validation rules reflect actual business requirements and constraints
  3. Performance Consideration: Balance comprehensive validation with acceptable performance overhead
  4. Error Handling: Implement user-friendly error messages and validation feedback systems (see the sketch after this list)
  5. Schema Evolution: Plan for schema changes and maintain backwards compatibility during transitions
  6. Monitoring and Alerting: Continuously monitor validation effectiveness and data quality metrics
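
For point 4, MongoDB 5.0+ attaches a structured explanation to document validation failures (error code 121) in the write error's errInfo field, which can be turned into user-friendly feedback. A minimal sketch of that pattern, assuming the user_profiles collection defined earlier (the exact shape of errInfo.details varies by rule type and driver version):

// Hedged sketch: translating a schema validation failure into a friendlier error
async function insertProfileWithFriendlyErrors(db, profileDoc) {
  try {
    return await db.collection('user_profiles').insertOne(profileDoc);
  } catch (err) {
    // Error code 121 = DocumentValidationFailure; errInfo.details (MongoDB 5.0+)
    // describes which $jsonSchema rules the document violated
    if (err.code === 121 && err.errInfo?.details) {
      const failedRules = (err.errInfo.details.schemaRulesNotSatisfied || [])
        .map(rule => rule.operatorName)
        .join(', ');
      throw new Error(`Profile rejected by schema validation (rules: ${failedRules || 'see errInfo'})`);
    }
    throw err; // not a validation failure - rethrow unchanged
  }
}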

Compliance and Data Integrity

Implement validation frameworks for regulatory and business compliance:

  1. Regulatory Compliance: Integrate validation rules for GDPR, PCI DSS, SOX, and industry-specific requirements
  2. Data Quality Enforcement: Establish validation rules that maintain high data quality standards
  3. Audit Trail Maintenance: Ensure all validation events and changes are properly logged and tracked (see the sketch after this list)
  4. Cross-System Validation: Implement validation that works across multiple applications and data sources
  5. Documentation Standards: Maintain comprehensive documentation of validation rules and business logic
  6. Testing Procedures: Establish thorough testing procedures for validation rule changes and updates
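
Point 3 can be approached by snapshotting a collection's current validator before changing it and writing both versions to a dedicated audit collection. A minimal sketch under those assumptions (the schema_validation_audit collection name is illustrative):

// Hedged sketch: auditing validator changes alongside the collMod call
async function updateValidatorWithAudit(db, collectionName, newValidator, changedBy) {
  // Snapshot the validator currently attached to the collection
  const [collInfo] = await db.listCollections({ name: collectionName }).toArray();
  const previousValidator = collInfo?.options?.validator || null;

  // Apply the new validation rules
  const result = await db.command({ collMod: collectionName, validator: newValidator });

  // Record who changed what, and when, in an audit collection
  await db.collection('schema_validation_audit').insertOne({
    collection: collectionName,
    previousValidator,
    newValidator,
    changedBy,
    changedAt: new Date(),
    commandOk: result.ok === 1
  });

  return result;
}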

Conclusion

MongoDB Schema Validation provides comprehensive document validation capabilities that ensure data integrity, enforce business rules, and maintain data quality at the database level. Unlike application-level validation that can be bypassed or inconsistently applied, MongoDB's validation system provides a reliable foundation for data governance and compliance in production environments.

Key MongoDB Schema Validation benefits include:

  • Database-Level Integrity: Enforcement of data validation rules regardless of application or data source
  • Flexible Rule Definition: Support for complex nested validation, conditional logic, and business rule enforcement
  • Real-Time Validation: Immediate validation feedback with detailed error reporting and user guidance
  • Schema Evolution: Support for gradual migration strategies and schema versioning for evolving applications
  • Performance Optimization: Efficient validation processing with minimal impact on application performance
  • Compliance Support: Built-in frameworks for regulatory compliance and data governance requirements

Whether you're building new applications with strict data requirements, migrating existing systems to enforce better data quality, or implementing compliance frameworks, MongoDB Schema Validation with QueryLeaf's familiar SQL interface provides the foundation for robust data integrity management.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style validation rules into MongoDB's native JSON Schema validation, making advanced document validation accessible through familiar SQL constraint syntax. Complex nested object validation, conditional business rules, and cross-collection integrity checks are seamlessly handled through familiar SQL patterns, enabling sophisticated data validation without requiring deep MongoDB expertise.
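
As a rough illustration (not QueryLeaf's actual generated output), the email and username rules from the CREATE COLLECTION example above correspond to a native $jsonSchema validator plus unique indexes along these lines:

// Hedged illustration of the native validator such SQL-style rules translate to
async function createUserProfilesWithNativeValidator(db) {
  await db.createCollection('user_profiles', {
    validator: {
      $jsonSchema: {
        bsonType: 'object',
        required: ['email', 'username'],
        properties: {
          email: {
            bsonType: 'string',
            maxLength: 320,
            pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
          },
          username: {
            bsonType: 'string',
            pattern: '^[a-zA-Z0-9_-]{3,30}$'
          }
        }
      }
    },
    validationLevel: 'strict',
    validationAction: 'error'
  });

  // UNIQUE constraints have no $jsonSchema equivalent; they map to unique indexes
  await db.collection('user_profiles').createIndex({ email: 1 }, { unique: true });
  await db.collection('user_profiles').createIndex({ username: 1 }, { unique: true });
}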

The combination of MongoDB's powerful validation capabilities with SQL-style rule definition makes it an ideal platform for applications requiring both flexible document storage and rigorous data integrity enforcement, ensuring your data remains consistent and reliable as your application scales and evolves.

MongoDB Performance Monitoring and Diagnostics: Advanced Optimization Techniques for Production Database Management

Production MongoDB deployments require comprehensive performance monitoring and optimization strategies to maintain optimal query response times, efficient resource utilization, and predictable application performance under varying workload conditions. Traditional database monitoring approaches often struggle with MongoDB's document-oriented structure, dynamic schema capabilities, and distributed architecture patterns, making specialized monitoring tools and techniques essential for effective performance management.

MongoDB provides sophisticated built-in performance monitoring capabilities including query profiling, execution statistics, index utilization analysis, and comprehensive metrics collection that enable deep insights into database performance characteristics. Unlike relational databases that rely primarily on table-level statistics, MongoDB's monitoring encompasses collection-level metrics, document-level analysis, aggregation pipeline performance, and shard-level resource utilization patterns.
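
These built-in facilities are exposed as ordinary database commands. A quick sketch of the three most commonly used entry points, using the Node.js driver against an assumed local deployment (database and collection names are illustrative):

// Quick sketch of MongoDB's built-in performance instrumentation
const { MongoClient } = require('mongodb');

async function quickPerformanceSnapshot() {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    const db = client.db('mydb'); // illustrative database name

    // 1. Enable the query profiler for operations slower than 100ms
    await db.command({ profile: 1, slowms: 100 });

    // 2. Server-wide counters: operations, connections, memory, cache
    const serverStatus = await db.admin().command({ serverStatus: 1 });
    console.log('opcounters:', serverStatus.opcounters);

    // 3. Per-index usage statistics for a collection
    const indexUsage = await db.collection('orders')
      .aggregate([{ $indexStats: {} }])
      .toArray();
    console.log('index usage:', indexUsage.map(i => ({ name: i.name, ops: i.accesses.ops })));
  } finally {
    await client.close();
  }
}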

The Traditional Database Monitoring Challenge

Conventional database monitoring approaches often lack the granularity and flexibility needed for MongoDB environments:

-- Traditional PostgreSQL performance monitoring - limited insight into document-level operations

-- Basic query performance analysis with limited MongoDB-style insights
SELECT 
  schemaname,
  tablename,
  attname,
  n_distinct,
  correlation,
  most_common_vals,
  most_common_freqs,

  -- Basic statistics available in PostgreSQL
  pg_stat_get_live_tuples(c.oid) as live_tuples,
  pg_stat_get_dead_tuples(c.oid) as dead_tuples,
  pg_stat_get_tuples_inserted(c.oid) as tuples_inserted,
  pg_stat_get_tuples_updated(c.oid) as tuples_updated,
  pg_stat_get_tuples_deleted(c.oid) as tuples_deleted,

  -- Table scan statistics
  pg_stat_get_numscans(c.oid) as table_scans,
  pg_stat_get_tuples_returned(c.oid) as tuples_returned,
  pg_stat_get_tuples_fetched(c.oid) as tuples_fetched,

  -- Index usage statistics (limited compared to MongoDB index insights)
  pg_stat_get_blocks_fetched(c.oid) as blocks_fetched,
  pg_stat_get_blocks_hit(c.oid) as blocks_hit

FROM pg_stats ps
JOIN pg_class c ON ps.tablename = c.relname
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE ps.schemaname = 'public'
ORDER BY pg_stat_get_live_tuples(c.oid) DESC;

-- Query performance analysis with limited flexibility for document operations
WITH slow_queries AS (
  SELECT 
    query,
    calls,
    total_time,
    mean_time,
    stddev_time,
    min_time,
    max_time,
    rows,

    -- Limited insight into query complexity and document operations
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent,

    -- Basic classification limited to SQL operations
    CASE 
      WHEN query LIKE 'SELECT%' THEN 'read'
      WHEN query LIKE 'INSERT%' THEN 'write'
      WHEN query LIKE 'UPDATE%' THEN 'update'
      WHEN query LIKE 'DELETE%' THEN 'delete'
      ELSE 'other'
    END as query_type

  FROM pg_stat_statements
  WHERE calls > 100  -- Focus on frequently executed queries
)
SELECT 
  query_type,
  COUNT(*) as query_count,
  SUM(calls) as total_calls,
  AVG(mean_time) as avg_response_time,
  SUM(total_time) as total_execution_time,
  AVG(hit_percent) as avg_cache_hit_rate,

  -- Limited aggregation capabilities compared to MongoDB aggregation insights
  percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_time) as p95_response_time,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_time) as p99_response_time

FROM slow_queries
GROUP BY query_type
ORDER BY total_execution_time DESC;

-- Problems with traditional monitoring approaches:
-- 1. Limited understanding of document-level operations and nested field access
-- 2. No insight into aggregation pipeline performance and optimization
-- 3. Lack of collection-level and field-level usage statistics
-- 4. No support for analyzing dynamic schema evolution and performance impact
-- 5. Limited index utilization analysis for compound and sparse indexes
-- 6. No understanding of MongoDB-specific operations like upserts and bulk operations
-- 7. Inability to analyze shard key distribution and query routing efficiency
-- 8. No support for analyzing replica set read preference impact on performance
-- 9. Limited insight into connection pooling and driver-level optimization opportunities
-- 10. No understanding of MongoDB-specific caching behavior and working set analysis

-- Manual index analysis with limited insights into MongoDB index strategies
SELECT 
  schemaname,
  tablename,
  indexname,
  idx_tup_read,
  idx_tup_fetch,
  idx_blks_read,
  idx_blks_hit,

  -- Basic index efficiency calculation (limited compared to MongoDB index metrics)
  CASE 
    WHEN idx_tup_read > 0 THEN 
      ROUND(100.0 * idx_tup_fetch / idx_tup_read, 2)
    ELSE 0 
  END as index_efficiency_percent,

  -- Cache hit ratio (basic compared to MongoDB's comprehensive cache analysis)
  CASE 
    WHEN (idx_blks_read + idx_blks_hit) > 0 THEN
      ROUND(100.0 * idx_blks_hit / (idx_blks_read + idx_blks_hit), 2)
    ELSE 0
  END as cache_hit_percent

FROM pg_stat_user_indexes
ORDER BY idx_tup_read DESC;

-- Limitations of traditional approaches:
-- 1. No understanding of MongoDB's document structure impact on performance
-- 2. Limited aggregation pipeline analysis and optimization insights  
-- 3. No collection-level sharding and distribution analysis
-- 4. Lack of real-time profiling capabilities for individual operations
-- 5. No support for analyzing GridFS performance and large document handling
-- 6. Limited understanding of MongoDB's memory management and working set optimization
-- 7. No insight into oplog performance and replica set optimization
-- 8. Inability to analyze change streams and real-time operation performance
-- 9. Limited connection and driver optimization analysis
-- 10. No support for analyzing MongoDB Atlas-specific performance metrics

MongoDB provides comprehensive performance monitoring and optimization capabilities:

// MongoDB Advanced Performance Monitoring and Optimization System
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('production_performance_monitoring');

// Comprehensive MongoDB Performance Monitoring and Diagnostics Manager
class AdvancedMongoPerformanceMonitor {
  constructor(db, config = {}) {
    this.db = db;
    this.adminDb = db.admin();
    this.collections = {
      performanceMetrics: db.collection('performance_metrics'),
      slowQueries: db.collection('slow_queries'),
      indexAnalysis: db.collection('index_analysis'),
      collectionStats: db.collection('collection_stats'),
      profilingData: db.collection('profiling_data'),
      optimizationRecommendations: db.collection('optimization_recommendations')
    };

    // Advanced monitoring configuration
    this.config = {
      profilingLevel: config.profilingLevel || 2, // Profile all operations
      slowOperationThreshold: config.slowOperationThreshold || 100, // 100ms
      samplingRate: config.samplingRate || 1.0, // Sample all operations
      metricsCollectionInterval: config.metricsCollectionInterval || 60000, // 1 minute
      indexAnalysisInterval: config.indexAnalysisInterval || 300000, // 5 minutes
      performanceReportInterval: config.performanceReportInterval || 900000, // 15 minutes

      // Advanced monitoring features
      enableOperationProfiling: config.enableOperationProfiling !== false,
      enableIndexAnalysis: config.enableIndexAnalysis !== false,
      enableCollectionStats: config.enableCollectionStats !== false,
      enableQueryOptimization: config.enableQueryOptimization !== false,
      enableRealTimeAlerts: config.enableRealTimeAlerts !== false,
      enablePerformanceBaseline: config.enablePerformanceBaseline !== false,

      // Alerting thresholds
      alertThresholds: {
        avgResponseTime: config.alertThresholds?.avgResponseTime || 500, // 500ms
        connectionCount: config.alertThresholds?.connectionCount || 1000,
        indexHitRatio: config.alertThresholds?.indexHitRatio || 0.95,
        replicationLag: config.alertThresholds?.replicationLag || 5000, // 5 seconds
        diskUtilization: config.alertThresholds?.diskUtilization || 0.8, // 80%
        memoryUtilization: config.alertThresholds?.memoryUtilization || 0.85 // 85%
      },

      // Optimization settings
      optimizationRules: {
        enableAutoIndexSuggestions: true,
        enableQueryRewriting: false,
        enableCollectionCompaction: false,
        enableShardKeyAnalysis: true
      }
    };

    // Performance metrics storage
    this.metrics = {
      operationCounts: new Map(),
      responseTimes: new Map(),
      indexUsage: new Map(),
      collectionMetrics: new Map()
    };

    // Initialize monitoring systems
    this.initializePerformanceMonitoring();
    this.setupRealTimeProfiler();
    this.startPerformanceCollection();
  }

  async initializePerformanceMonitoring() {
    console.log('Initializing comprehensive MongoDB performance monitoring...');

    try {
      // Enable database profiling with advanced configuration
      await this.enableAdvancedProfiling();

      // Setup performance metrics collection
      await this.setupMetricsCollection();

      // Initialize index analysis
      await this.initializeIndexAnalysis();

      // Setup collection statistics monitoring
      await this.setupCollectionStatsMonitoring();

      // Initialize performance baseline
      if (this.config.enablePerformanceBaseline) {
        await this.initializePerformanceBaseline();
      }

      console.log('Performance monitoring system initialized successfully');

    } catch (error) {
      console.error('Error initializing performance monitoring:', error);
      throw error;
    }
  }

  async enableAdvancedProfiling() {
    console.log('Enabling advanced database profiling...');

    try {
      // Enable profiling for all operations with detailed analysis
      const profilingResult = await this.db.command({
        profile: this.config.profilingLevel,
        slowms: this.config.slowOperationThreshold,
        sampleRate: this.config.samplingRate,

        // Advanced profiling options
        filter: {
          // Profile operations based on specific criteria
          $or: [
            { ts: { $gte: new Date(Date.now() - 3600000) } }, // Last hour
            { millis: { $gte: this.config.slowOperationThreshold } }, // Slow operations
            { planSummary: { $regex: 'COLLSCAN' } }, // Collection scans
            { 'locks.Global.acquireCount.r': { $exists: true } } // Lock-intensive operations
          ]
        }
      });

      console.log('Database profiling enabled:', profilingResult);

      // Configure profiler collection size for optimal performance
      await this.configureProfilerCollection();

    } catch (error) {
      console.error('Error enabling profiling:', error);
      throw error;
    }
  }

  async configureProfilerCollection() {
    try {
      // Ensure profiler collection is appropriately sized
      const profilerCollStats = await this.db.collection('system.profile').stats();

      if (profilerCollStats.capped && profilerCollStats.maxSize < 100 * 1024 * 1024) {
        console.log('Recreating profiler collection with larger size...');

        // Profiling must be turned off before system.profile can be dropped
        await this.db.command({ profile: 0 });

        // Drop and recreate with optimal size
        await this.db.collection('system.profile').drop();
        await this.db.createCollection('system.profile', {
          capped: true,
          size: 100 * 1024 * 1024, // 100MB
          max: 1000000 // 1M documents
        });

        // Restore the configured profiling level
        await this.db.command({
          profile: this.config.profilingLevel,
          slowms: this.config.slowOperationThreshold
        });
      }

    } catch (error) {
      console.warn('Could not configure profiler collection:', error.message);
    }
  }

  async collectComprehensivePerformanceMetrics() {
    console.log('Collecting comprehensive performance metrics...');

    try {
      const startTime = Date.now();

      // Collect server status metrics
      const serverStatus = await this.adminDb.command({ serverStatus: 1 });

      // Collect database statistics
      const dbStats = await this.db.stats();

      // Collect profiling data
      const profilingData = await this.analyzeProfilingData();

      // Collect index usage statistics
      const indexStats = await this.analyzeIndexUsage();

      // Collect collection-level metrics
      const collectionMetrics = await this.collectCollectionMetrics();

      // Collect operation metrics
      const operationMetrics = await this.analyzeOperationMetrics();

      // Collect connection metrics
      const connectionMetrics = this.extractConnectionMetrics(serverStatus);

      // Collect memory and resource metrics
      const resourceMetrics = this.extractResourceMetrics(serverStatus);

      // Collect replication metrics (if applicable)
      const replicationMetrics = await this.collectReplicationMetrics();

      // Collect sharding metrics (if applicable)  
      const shardingMetrics = await this.collectShardingMetrics();

      // Assemble comprehensive performance report
      const performanceReport = {
        timestamp: new Date(),
        collectionTime: Date.now() - startTime,

        // Core performance metrics
        serverStatus: {
          uptime: serverStatus.uptime,
          version: serverStatus.version,
          process: serverStatus.process,
          pid: serverStatus.pid,
          host: serverStatus.host
        },

        // Database-level metrics
        database: {
          collections: dbStats.collections,
          objects: dbStats.objects,
          avgObjSize: dbStats.avgObjSize,
          dataSize: dbStats.dataSize,
          storageSize: dbStats.storageSize,
          indexes: dbStats.indexes,
          indexSize: dbStats.indexSize,

          // Efficiency metrics
          dataToIndexRatio: dbStats.indexSize > 0 ? dbStats.dataSize / dbStats.indexSize : 0,
          storageEfficiency: dbStats.dataSize / dbStats.storageSize,
          avgDocumentSize: dbStats.avgObjSize
        },

        // Operation performance metrics
        operations: operationMetrics,

        // Query performance analysis
        queryPerformance: profilingData,

        // Index performance analysis
        indexPerformance: indexStats,

        // Collection-level metrics
        collections: collectionMetrics,

        // Connection and concurrency metrics
        connections: connectionMetrics,

        // Resource utilization metrics
        resources: resourceMetrics,

        // Replication metrics
        replication: replicationMetrics,

        // Sharding metrics (if applicable)
        sharding: shardingMetrics,

        // Performance analysis
        analysis: await this.generatePerformanceAnalysis({
          serverStatus,
          dbStats,
          profilingData,
          indexStats,
          collectionMetrics,
          operationMetrics,
          connectionMetrics,
          resourceMetrics
        }),

        // Optimization recommendations
        recommendations: await this.generateOptimizationRecommendations({
          profilingData,
          indexStats,
          collectionMetrics,
          operationMetrics
        })
      };

      // Store performance metrics
      await this.collections.performanceMetrics.insertOne(performanceReport);

      // Update real-time metrics
      this.updateRealTimeMetrics(performanceReport);

      // Check for performance alerts
      await this.checkPerformanceAlerts(performanceReport);

      return performanceReport;

    } catch (error) {
      console.error('Error collecting performance metrics:', error);
      throw error;
    }
  }

  async analyzeProfilingData(timeWindow = 300000) {
    console.log('Analyzing profiling data for query performance insights...');

    try {
      const cutoffTime = new Date(Date.now() - timeWindow);

      // Aggregate profiling data with comprehensive analysis
      const profilingAnalysis = await this.db.collection('system.profile').aggregate([
        {
          $match: {
            ts: { $gte: cutoffTime },
            ns: { $regex: `^${this.db.databaseName}\\.` } // Current database only (escaped dot matches literally)
          }
        },
        {
          $addFields: {
            // Categorize operations
            operationType: {
              $switch: {
                branches: [
                  { case: { $ne: ['$command.find', null] }, then: 'find' },
                  { case: { $ne: ['$command.aggregate', null] }, then: 'aggregate' },
                  { case: { $ne: ['$command.insert', null] }, then: 'insert' },
                  { case: { $ne: ['$command.update', null] }, then: 'update' },
                  { case: { $ne: ['$command.delete', null] }, then: 'delete' },
                  { case: { $ne: ['$command.count', null] }, then: 'count' },
                  { case: { $ne: ['$command.distinct', null] }, then: 'distinct' }
                ],
                default: 'other'
              }
            },

            // Analyze execution efficiency
            executionEfficiency: {
              $cond: {
                if: { $and: [{ $gt: ['$docsExamined', 0] }, { $gt: ['$nreturned', 0] }] },
                then: { $divide: ['$nreturned', '$docsExamined'] },
                else: 0
              }
            },

            // Categorize response times
            responseTimeCategory: {
              $switch: {
                branches: [
                  { case: { $lt: ['$millis', 10] }, then: 'very_fast' },
                  { case: { $lt: ['$millis', 100] }, then: 'fast' },
                  { case: { $lt: ['$millis', 500] }, then: 'moderate' },
                  { case: { $lt: ['$millis', 2000] }, then: 'slow' }
                ],
                default: 'very_slow'
              }
            },

            // Index usage analysis
            indexUsageType: {
              $cond: {
                if: { $regexMatch: { input: { $ifNull: ['$planSummary', ''] }, regex: 'IXSCAN' } },
                then: 'index_scan',
                else: {
                  $cond: {
                    if: { $regexMatch: { input: { $ifNull: ['$planSummary', ''] }, regex: 'COLLSCAN' } },
                    then: 'collection_scan',
                    else: 'other'
                  }
                }
              }
            }
          }
        },
        {
          $group: {
            _id: {
              collection: { $arrayElemAt: [{ $split: ['$ns', '.'] }, -1] },
              operationType: '$operationType',
              indexUsageType: '$indexUsageType'
            },

            // Performance statistics
            totalOperations: { $sum: 1 },
            avgResponseTime: { $avg: '$millis' },
            minResponseTime: { $min: '$millis' },
            maxResponseTime: { $max: '$millis' },
            p95ResponseTime: { $percentile: { input: '$millis', p: [0.95], method: 'approximate' } },
            p99ResponseTime: { $percentile: { input: '$millis', p: [0.99], method: 'approximate' } },

            // Document examination efficiency
            totalDocsExamined: { $sum: { $ifNull: ['$docsExamined', 0] } },
            totalDocsReturned: { $sum: { $ifNull: ['$nreturned', 0] } },
            avgExecutionEfficiency: { $avg: '$executionEfficiency' },

            // Response time distribution
            veryFastOps: { $sum: { $cond: [{ $eq: ['$responseTimeCategory', 'very_fast'] }, 1, 0] } },
            fastOps: { $sum: { $cond: [{ $eq: ['$responseTimeCategory', 'fast'] }, 1, 0] } },
            moderateOps: { $sum: { $cond: [{ $eq: ['$responseTimeCategory', 'moderate'] }, 1, 0] } },
            slowOps: { $sum: { $cond: [{ $eq: ['$responseTimeCategory', 'slow'] }, 1, 0] } },
            verySlowOps: { $sum: { $cond: [{ $eq: ['$responseTimeCategory', 'very_slow'] }, 1, 0] } },

            // Sample queries for analysis
            sampleQueries: { $push: { 
              command: '$command',
              millis: '$millis',
              planSummary: '$planSummary',
              ts: '$ts'
            } }
          }
        },
        {
          $addFields: {
            // Calculate efficiency metrics
            overallEfficiency: {
              $cond: {
                if: { $gt: ['$totalDocsExamined', 0] },
                then: { $divide: ['$totalDocsReturned', '$totalDocsExamined'] },
                else: 1
              }
            },

            // Calculate performance score
            performanceScore: {
              $multiply: [
                // Response time component (lower is better)
                { $subtract: [1, { $min: [{ $divide: ['$avgResponseTime', 2000] }, 1] }] },
                // Efficiency component (higher is better)
                { $multiply: ['$avgExecutionEfficiency', 100] }
              ]
            }
          }
        },
        {
          $addFields: {
            // Performance classification runs in a separate stage so that
            // performanceScore (added in the previous $addFields) can be referenced
            performanceClass: {
              $switch: {
                branches: [
                  { case: { $gte: ['$performanceScore', 80] }, then: 'excellent' },
                  { case: { $gte: ['$performanceScore', 60] }, then: 'good' },
                  { case: { $gte: ['$performanceScore', 40] }, then: 'fair' },
                  { case: { $gte: ['$performanceScore', 20] }, then: 'poor' }
                ],
                default: 'critical'
              }
            }
          }
        },
        {
          $project: {
            collection: '$_id.collection',
            operationType: '$_id.operationType',
            indexUsageType: '$_id.indexUsageType',

            // Core metrics
            totalOperations: 1,
            avgResponseTime: { $round: ['$avgResponseTime', 2] },
            minResponseTime: 1,
            maxResponseTime: 1,
            p95ResponseTime: { $round: [{ $arrayElemAt: ['$p95ResponseTime', 0] }, 2] },
            p99ResponseTime: { $round: [{ $arrayElemAt: ['$p99ResponseTime', 0] }, 2] },

            // Efficiency metrics
            totalDocsExamined: 1,
            totalDocsReturned: 1,
            overallEfficiency: { $round: ['$overallEfficiency', 4] },
            avgExecutionEfficiency: { $round: ['$avgExecutionEfficiency', 4] },

            // Performance distribution
            responseTimeDistribution: {
              veryFast: '$veryFastOps',
              fast: '$fastOps',
              moderate: '$moderateOps',
              slow: '$slowOps',
              verySlow: '$verySlowOps'
            },

            // Performance scoring
            performanceScore: { $round: ['$performanceScore', 2] },
            performanceClass: 1,

            // Sample queries (limit to 3 most recent)
            sampleQueries: { $slice: [{ $sortArray: { input: '$sampleQueries', sortBy: { ts: -1 } } }, 3] }
          }
        },
        { $sort: { avgResponseTime: -1 } }
      ]).toArray();

      return {
        analysisTimeWindow: timeWindow,
        totalProfiledOperations: profilingAnalysis.reduce((sum, item) => sum + item.totalOperations, 0),
        collections: profilingAnalysis,

        // Summary statistics
        summary: {
          avgResponseTimeOverall: profilingAnalysis.reduce((sum, item) => sum + (item.avgResponseTime * item.totalOperations), 0) / 
                                 Math.max(profilingAnalysis.reduce((sum, item) => sum + item.totalOperations, 0), 1),

          slowOperationsCount: profilingAnalysis.reduce((sum, item) => sum + item.responseTimeDistribution.slow + item.responseTimeDistribution.verySlow, 0),

          collectionScansCount: profilingAnalysis.filter(item => item.indexUsageType === 'collection_scan')
                                               .reduce((sum, item) => sum + item.totalOperations, 0),

          inefficientOperationsCount: profilingAnalysis.filter(item => item.overallEfficiency < 0.1)
                                                      .reduce((sum, item) => sum + item.totalOperations, 0)
        }
      };

    } catch (error) {
      console.error('Error analyzing profiling data:', error);
      return { error: error.message, collections: [] };
    }
  }

  async analyzeIndexUsage() {
    console.log('Analyzing index usage and efficiency...');

    try {
      const collections = await this.db.listCollections().toArray();
      const indexAnalysis = [];

      for (const collInfo of collections) {
        const collection = this.db.collection(collInfo.name);

        try {
          // Get index statistics
          const indexStats = await collection.aggregate([
            { $indexStats: {} }
          ]).toArray();

          // Get collection statistics for context
          const collStats = await collection.stats();

          // Analyze each index
          for (const index of indexStats) {
            const indexAnalysisItem = {
              collection: collInfo.name,
              indexName: index.name,
              indexSpec: index.spec,

              // Usage statistics
              accesses: {
                ops: index.accesses?.ops || 0,
                since: index.accesses?.since || new Date()
              },

              // Index characteristics
              // $indexStats exposes the index definition under "spec";
              // per-index sizes come from the collection stats' indexSizes map
              indexSize: collStats.indexSizes?.[index.name] || 0,
              isUnique: !!index.spec?.unique,
              isSparse: !!index.spec?.sparse,
              isPartial: !!index.spec?.partialFilterExpression,
              isCompound: Object.keys(index.spec?.key || {}).length > 1,

              // Calculate index efficiency metrics
              collectionDocuments: collStats.count,
              collectionSize: collStats.size,
              indexToCollectionRatio: collStats.size > 0 ? (collStats.indexSizes?.[index.name] || 0) / collStats.size : 0,

              // Usage analysis
              usageCategory: this.categorizeIndexUsage(index.accesses?.ops || 0, collStats.count),

              // Performance metrics
              avgDocumentSize: collStats.avgObjSize || 0,
              indexSelectivity: this.estimateIndexSelectivity(index.spec, collStats.count)
            };

            indexAnalysis.push(indexAnalysisItem);
          }

        } catch (collError) {
          console.warn(`Error analyzing indexes for collection ${collInfo.name}:`, collError.message);
        }
      }

      // Generate index usage report
      return {
        totalIndexes: indexAnalysis.length,
        indexes: indexAnalysis,

        // Index usage summary
        usageSummary: {
          highUsage: indexAnalysis.filter(idx => idx.usageCategory === 'high').length,
          mediumUsage: indexAnalysis.filter(idx => idx.usageCategory === 'medium').length,
          lowUsage: indexAnalysis.filter(idx => idx.usageCategory === 'low').length,
          unused: indexAnalysis.filter(idx => idx.usageCategory === 'unused').length
        },

        // Index type distribution
        typeDistribution: {
          simple: indexAnalysis.filter(idx => !idx.isCompound).length,
          compound: indexAnalysis.filter(idx => idx.isCompound).length,
          unique: indexAnalysis.filter(idx => idx.isUnique).length,
          sparse: indexAnalysis.filter(idx => idx.isSparse).length,
          partial: indexAnalysis.filter(idx => idx.isPartial).length
        },

        // Performance insights
        performanceInsights: {
          totalIndexSize: indexAnalysis.reduce((sum, idx) => sum + idx.indexSize, 0),
          avgIndexToCollectionRatio: indexAnalysis.reduce((sum, idx) => sum + idx.indexToCollectionRatio, 0) / indexAnalysis.length,
          potentiallyRedundantIndexes: indexAnalysis.filter(idx => idx.usageCategory === 'unused' && idx.indexName !== '_id_'),
          oversizedIndexes: indexAnalysis.filter(idx => idx.indexToCollectionRatio > 0.5)
        }
      };

    } catch (error) {
      console.error('Error analyzing index usage:', error);
      return { error: error.message, indexes: [] };
    }
  }

  categorizeIndexUsage(accessCount, collectionDocuments) {
    if (accessCount === 0) return 'unused';
    if (accessCount < collectionDocuments * 0.01) return 'low';
    if (accessCount < collectionDocuments * 0.1) return 'medium';
    return 'high';
  }

  estimateIndexSelectivity(indexSpec, collectionDocuments) {
    // Simple estimation - in practice, would need sampling
    if (!indexSpec || collectionDocuments === 0) return 1;

    // Compound indexes generally more selective
    if (Object.keys(indexSpec).length > 1) return 0.1;

    // Simple heuristic based on field types
    return 0.5; // Default moderate selectivity
  }

  async collectCollectionMetrics() {
    console.log('Collecting detailed collection-level metrics...');

    try {
      const collections = await this.db.listCollections().toArray();
      const collectionMetrics = [];

      for (const collInfo of collections) {
        try {
          const collection = this.db.collection(collInfo.name);
          const stats = await collection.stats();

          // Calculate additional metrics
          const avgDocSize = stats.avgObjSize || 0;
          const storageEfficiency = stats.size > 0 ? stats.size / stats.storageSize : 0;
          const indexOverhead = stats.size > 0 ? stats.totalIndexSize / stats.size : 0;

          const collectionMetric = {
            name: collInfo.name,
            type: collInfo.type,

            // Core statistics
            documentCount: stats.count,
            dataSize: stats.size,
            storageSize: stats.storageSize,
            avgDocumentSize: avgDocSize,

            // Index statistics
            indexCount: stats.nindexes,
            totalIndexSize: stats.totalIndexSize,

            // Efficiency metrics
            storageEfficiency: storageEfficiency,
            indexOverhead: indexOverhead,
            fragmentationRatio: stats.storageSize > 0 ? 1 - (stats.size / stats.storageSize) : 0,

            // Performance characteristics
            performanceCategory: this.categorizeCollectionPerformance({
              documentCount: stats.count,
              avgDocumentSize: avgDocSize,
              indexOverhead: indexOverhead,
              storageEfficiency: storageEfficiency
            }),

            // Optimization opportunities
            optimizationFlags: {
              highFragmentation: (1 - storageEfficiency) > 0.3,
              excessiveIndexing: indexOverhead > 1.0,
              largeDocs: avgDocSize > 16384, // 16KB
              noIndexes: stats.nindexes <= 1 // Only _id index
            },

            timestamp: new Date()
          };

          collectionMetrics.push(collectionMetric);

        } catch (collError) {
          console.warn(`Error collecting stats for collection ${collInfo.name}:`, collError.message);
        }
      }

      return {
        collections: collectionMetrics,
        summary: {
          totalCollections: collectionMetrics.length,
          totalDocuments: collectionMetrics.reduce((sum, c) => sum + c.documentCount, 0),
          totalDataSize: collectionMetrics.reduce((sum, c) => sum + c.dataSize, 0),
          totalStorageSize: collectionMetrics.reduce((sum, c) => sum + c.storageSize, 0),
          totalIndexSize: collectionMetrics.reduce((sum, c) => sum + c.totalIndexSize, 0),
          avgStorageEfficiency: collectionMetrics.reduce((sum, c) => sum + c.storageEfficiency, 0) / collectionMetrics.length
        }
      };

    } catch (error) {
      console.error('Error collecting collection metrics:', error);
      return { error: error.message, collections: [] };
    }
  }

  categorizeCollectionPerformance({ documentCount, avgDocumentSize, indexOverhead, storageEfficiency }) {
    let score = 0;

    // Document count efficiency
    if (documentCount < 10000) score += 10;
    else if (documentCount < 1000000) score += 5;

    // Document size efficiency
    if (avgDocumentSize < 1024) score += 10; // < 1KB
    else if (avgDocumentSize < 16384) score += 5; // < 16KB

    // Index efficiency
    if (indexOverhead < 0.2) score += 10;
    else if (indexOverhead < 0.5) score += 5;

    // Storage efficiency
    if (storageEfficiency > 0.8) score += 10;
    else if (storageEfficiency > 0.6) score += 5;

    if (score >= 30) return 'excellent';
    if (score >= 20) return 'good';
    if (score >= 10) return 'fair';
    return 'poor';
  }

  async generateOptimizationRecommendations(performanceData) {
    console.log('Generating performance optimization recommendations...');

    const recommendations = [];

    try {
      // Analyze profiling data for query optimization
      if (performanceData.profilingData?.collections) {
        for (const collection of performanceData.profilingData.collections) {
          // Recommend indexes for collection scans
          if (collection.indexUsageType === 'collection_scan' && collection.totalOperations > 100) {
            recommendations.push({
              type: 'index_recommendation',
              priority: 'high',
              collection: collection.collection,
              title: 'Add index to eliminate collection scans',
              description: `Collection "${collection.collection}" has ${collection.totalOperations} collection scans with average response time of ${collection.avgResponseTime}ms`,
              recommendation: `Consider adding an index on frequently queried fields for ${collection.operationType} operations`,
              impact: 'high',
              effort: 'medium',
              estimatedImprovement: '60-90% response time reduction'
            });
          }

          // Recommend query optimization for slow operations
          if (collection.avgResponseTime > 1000) {
            recommendations.push({
              type: 'query_optimization',
              priority: 'high',
              collection: collection.collection,
              title: 'Optimize slow queries',
              description: `Queries on "${collection.collection}" average ${collection.avgResponseTime}ms response time`,
              recommendation: 'Review query patterns and consider compound indexes or query restructuring',
              impact: 'high',
              effort: 'medium',
              estimatedImprovement: '40-70% response time reduction'
            });
          }

          // Recommend efficiency improvements
          if (collection.overallEfficiency < 0.1) {
            recommendations.push({
              type: 'efficiency_improvement',
              priority: 'medium',
              collection: collection.collection,
              title: 'Improve query efficiency',
              description: `Queries examine ${collection.totalDocsExamined} documents but return only ${collection.totalDocsReturned} (${Math.round(collection.overallEfficiency * 100)}% efficiency)`,
              recommendation: 'Add more selective indexes or modify query patterns to reduce document examination',
              impact: 'medium',
              effort: 'medium',
              estimatedImprovement: '30-50% efficiency improvement'
            });
          }
        }
      }

      // Analyze index usage for recommendations
      if (performanceData.indexStats?.indexes) {
        for (const index of performanceData.indexStats.indexes) {
          // Recommend removing unused indexes
          if (index.usageCategory === 'unused' && index.indexName !== '_id_') {
            recommendations.push({
              type: 'index_removal',
              priority: 'low',
              collection: index.collection,
              title: 'Remove unused index',
              description: `Index "${index.indexName}" on collection "${index.collection}" is unused`,
              recommendation: 'Consider removing this index to reduce storage overhead and improve write performance',
              impact: 'low',
              effort: 'low',
              estimatedImprovement: 'Reduced storage usage and faster writes'
            });
          }

          // Recommend index optimization for oversized indexes
          if (index.indexToCollectionRatio > 0.5) {
            recommendations.push({
              type: 'index_optimization',
              priority: 'medium',
              collection: index.collection,
              title: 'Optimize oversized index',
              description: `Index "${index.indexName}" size is ${Math.round(index.indexToCollectionRatio * 100)}% of collection size`,
              recommendation: 'Review index design and consider using sparse or partial indexes',
              impact: 'medium',
              effort: 'medium',
              estimatedImprovement: '20-40% storage reduction'
            });
          }
        }
      }

      // Analyze collection metrics for recommendations
      if (performanceData.collectionMetrics?.collections) {
        for (const collection of performanceData.collectionMetrics.collections) {
          // Recommend addressing fragmentation
          if (collection.optimizationFlags.highFragmentation) {
            recommendations.push({
              type: 'storage_optimization',
              priority: 'medium',
              collection: collection.name,
              title: 'Address storage fragmentation',
              description: `Collection "${collection.name}" has ${Math.round(collection.fragmentationRatio * 100)}% fragmentation`,
              recommendation: 'Consider running compact command or rebuilding indexes during maintenance window',
              impact: 'medium',
              effort: 'high',
              estimatedImprovement: '15-30% storage efficiency improvement'
            });
          }

          // Recommend index strategy for collections with no custom indexes
          if (collection.optimizationFlags.noIndexes && collection.documentCount > 1000) {
            recommendations.push({
              type: 'index_strategy',
              priority: 'medium',
              collection: collection.name,
              title: 'Implement indexing strategy',
              description: `Collection "${collection.name}" has ${collection.documentCount} documents but no custom indexes`,
              recommendation: 'Analyze query patterns and add appropriate indexes for common queries',
              impact: 'high',
              effort: 'medium',
              estimatedImprovement: '50-80% query performance improvement'
            });
          }
        }
      }

      // Sort recommendations by priority and impact
      recommendations.sort((a, b) => {
        const priorityOrder = { high: 3, medium: 2, low: 1 };
        const impactOrder = { high: 3, medium: 2, low: 1 };

        const priorityDiff = priorityOrder[b.priority] - priorityOrder[a.priority];
        if (priorityDiff !== 0) return priorityDiff;

        return impactOrder[b.impact] - impactOrder[a.impact];
      });

      return {
        totalRecommendations: recommendations.length,
        recommendations: recommendations,

        // Summary by type
        summaryByType: {
          indexRecommendations: recommendations.filter(r => r.type.includes('index')).length,
          queryOptimizations: recommendations.filter(r => r.type === 'query_optimization').length,
          storageOptimizations: recommendations.filter(r => r.type === 'storage_optimization').length,
          efficiencyImprovements: recommendations.filter(r => r.type === 'efficiency_improvement').length
        },

        // Priority distribution
        priorityDistribution: {
          high: recommendations.filter(r => r.priority === 'high').length,
          medium: recommendations.filter(r => r.priority === 'medium').length,
          low: recommendations.filter(r => r.priority === 'low').length
        },

        generatedAt: new Date()
      };

    } catch (error) {
      console.error('Error generating optimization recommendations:', error);
      return { error: error.message, recommendations: [] };
    }
  }

  async generatePerformanceReport() {
    console.log('Generating comprehensive performance report...');

    try {
      // Collect all performance metrics
      const performanceData = await this.collectComprehensivePerformanceMetrics();

      // Generate executive summary
      const executiveSummary = this.generateExecutiveSummary(performanceData);

      // Create comprehensive report
      const performanceReport = {
        reportId: require('crypto').randomUUID(),
        generatedAt: new Date(),
        reportPeriod: {
          start: new Date(Date.now() - 3600000), // Last hour
          end: new Date()
        },

        // Executive summary
        executiveSummary: executiveSummary,

        // Detailed performance data
        performanceData: performanceData,

        // Key performance indicators
        kpis: {
          avgResponseTime: performanceData.queryPerformance?.summary?.avgResponseTimeOverall || 0,
          slowQueriesCount: performanceData.queryPerformance?.summary?.slowOperationsCount || 0,
          collectionScansCount: performanceData.queryPerformance?.summary?.collectionScansCount || 0,
          indexEfficiency: this.calculateOverallIndexEfficiency(performanceData.indexPerformance),
          storageEfficiency: performanceData.collections?.summary?.avgStorageEfficiency || 0,
          connectionUtilization: performanceData.connections?.utilizationPercent || 0
        },

        // Performance trends (if baseline available)
        trends: await this.calculatePerformanceTrends(),

        // Optimization recommendations
        recommendations: performanceData.recommendations,

        // Action items
        actionItems: this.generateActionItems(performanceData.recommendations),

        // Health score
        overallHealthScore: this.calculateOverallHealthScore(performanceData)
      };

      // Store report
      await this.collections.performanceMetrics.insertOne(performanceReport);

      return performanceReport;

    } catch (error) {
      console.error('Error generating performance report:', error);
      throw error;
    }
  }

  generateExecutiveSummary(performanceData) {
    const issues = [];
    const highlights = [];

    // Identify key issues
    if (performanceData.queryPerformance?.summary?.avgResponseTimeOverall > 500) {
      issues.push(`Average query response time is ${Math.round(performanceData.queryPerformance.summary.avgResponseTimeOverall)}ms (target: <100ms)`);
    }

    if (performanceData.queryPerformance?.summary?.collectionScansCount > 0) {
      issues.push(`${performanceData.queryPerformance.summary.collectionScansCount} queries are performing collection scans`);
    }

    if (performanceData.collections?.summary?.avgStorageEfficiency < 0.7) {
      issues.push(`Storage efficiency is ${Math.round(performanceData.collections.summary.avgStorageEfficiency * 100)}% (target: >80%)`);
    }

    // Identify highlights
    if (performanceData.queryPerformance?.summary?.avgResponseTimeOverall < 100) {
      highlights.push('Query performance is excellent with average response time under 100ms');
    }

    if (performanceData.indexPerformance?.usageSummary?.unused < 2) {
      highlights.push('Index usage is well optimized with minimal unused indexes');
    }

    return {
      status: issues.length === 0 ? 'healthy' : issues.length < 3 ? 'warning' : 'critical',
      keyIssues: issues,
      highlights: highlights,
      recommendationsCount: performanceData.recommendations?.totalRecommendations || 0,
      criticalRecommendations: performanceData.recommendations?.priorityDistribution?.high || 0
    };
  }

  calculateOverallIndexEfficiency(indexPerformance) {
    if (!indexPerformance?.indexes || indexPerformance.indexes.length === 0) return 0;

    const usedIndexes = indexPerformance.indexes.filter(idx => idx.usageCategory !== 'unused').length;
    return usedIndexes / indexPerformance.indexes.length;
  }

  generateActionItems(recommendations) {
    if (!recommendations?.recommendations) return [];

    return recommendations.recommendations
      .filter(rec => rec.priority === 'high')
      .slice(0, 5) // Top 5 high-priority items
      .map(rec => ({
        title: rec.title,
        collection: rec.collection,
        action: rec.recommendation,
        estimatedEffort: rec.effort,
        expectedImpact: rec.estimatedImprovement
      }));
  }

  calculateOverallHealthScore(performanceData) {
    let score = 100;

    // Query performance impact
    const avgResponseTime = performanceData.queryPerformance?.summary?.avgResponseTimeOverall || 0;
    if (avgResponseTime > 1000) score -= 30;
    else if (avgResponseTime > 500) score -= 20;
    else if (avgResponseTime > 100) score -= 10;

    // Collection scans impact
    const collectionScans = performanceData.queryPerformance?.summary?.collectionScansCount || 0;
    if (collectionScans > 100) score -= 25;
    else if (collectionScans > 10) score -= 15;
    else if (collectionScans > 0) score -= 5;

    // Storage efficiency impact
    const storageEfficiency = performanceData.collections?.summary?.avgStorageEfficiency || 1;
    if (storageEfficiency < 0.5) score -= 20;
    else if (storageEfficiency < 0.7) score -= 10;

    // Index efficiency impact
    const indexEfficiency = this.calculateOverallIndexEfficiency(performanceData.indexPerformance);
    if (indexEfficiency < 0.7) score -= 15;
    else if (indexEfficiency < 0.9) score -= 5;

    return Math.max(0, score);
  }

  // Additional helper methods for comprehensive monitoring

  extractConnectionMetrics(serverStatus) {
    const connections = serverStatus.connections || {};
    const network = serverStatus.network || {};

    return {
      current: connections.current || 0,
      available: connections.available || 0,
      totalCreated: connections.totalCreated || 0,
      utilizationPercent: connections.available > 0 ? 
        (connections.current / (connections.current + connections.available)) * 100 : 0,

      // Network metrics
      bytesIn: network.bytesIn || 0,
      bytesOut: network.bytesOut || 0,
      numRequests: network.numRequests || 0
    };
  }

  extractResourceMetrics(serverStatus) {
    const mem = serverStatus.mem || {};
    const extra_info = serverStatus.extra_info || {};

    return {
      // Memory usage
      residentMemoryMB: mem.resident || 0,
      virtualMemoryMB: mem.virtual || 0,
      mappedMemoryMB: mem.mapped || 0,

      // System metrics
      pageFaults: extra_info.page_faults || 0,
      heapUsageMB: mem.heap_usage_bytes ? mem.heap_usage_bytes / (1024 * 1024) : 0,

      // CPU and system load would require additional system commands
      cpuUsagePercent: 0, // Would need external monitoring
      diskIOPS: 0 // Would need external monitoring
    };
  }

  async collectReplicationMetrics() {
    try {
      const replSetStatus = await this.adminDb.command({ replSetGetStatus: 1 });

      if (!replSetStatus.ok) {
        return { replicated: false };
      }

      const primary = replSetStatus.members.find(m => m.state === 1);
      const secondaries = replSetStatus.members.filter(m => m.state === 2);

      return {
        replicated: true,
        setName: replSetStatus.set,
        primary: primary ? {
          name: primary.name,
          health: primary.health,
          uptime: primary.uptime
        } : null,
        secondaries: secondaries.map(s => ({
          name: s.name,
          health: s.health,
          lag: primary && s.optimeDate ? primary.optimeDate - s.optimeDate : 0,
          uptime: s.uptime
        })),
        totalMembers: replSetStatus.members.length
      };
    } catch (error) {
      return { replicated: false, error: error.message };
    }
  }

  async collectShardingMetrics() {
    try {
      const shardingStatus = await this.adminDb.command({ isdbgrid: 1 });

      if (!shardingStatus.isdbgrid) {
        return { sharded: false };
      }

      const configDB = this.client.db('config');
      const shards = await configDB.collection('shards').find().toArray();
      const chunks = await configDB.collection('chunks').find().toArray();

      return {
        sharded: true,
        shardCount: shards.length,
        totalChunks: chunks.length,
        shards: shards.map(s => ({
          id: s._id,
          host: s.host,
          state: s.state
        }))
      };
    } catch (error) {
      return { sharded: false, error: error.message };
    }
  }

  async startPerformanceCollection() {
    console.log('Starting continuous performance metrics collection...');

    // Collect metrics at regular intervals
    setInterval(async () => {
      try {
        await this.collectComprehensivePerformanceMetrics();
      } catch (error) {
        console.error('Error in scheduled performance collection:', error);
      }
    }, this.config.metricsCollectionInterval);

    // Generate reports at longer intervals
    setInterval(async () => {
      try {
        await this.generatePerformanceReport();
      } catch (error) {
        console.error('Error in scheduled report generation:', error);
      }
    }, this.config.performanceReportInterval);
  }

  updateRealTimeMetrics(performanceData) {
    // Update in-memory metrics for real-time dashboard
    this.metrics.operationCounts.set('current', performanceData.operations);
    this.metrics.responseTimes.set('current', performanceData.queryPerformance);
    this.metrics.indexUsage.set('current', performanceData.indexPerformance);
    this.metrics.collectionMetrics.set('current', performanceData.collections);
  }

  async checkPerformanceAlerts(performanceData) {
    const alerts = [];

    // Check response time thresholds
    const avgResponseTime = performanceData.queryPerformance?.summary?.avgResponseTimeOverall || 0;
    if (avgResponseTime > this.config.alertThresholds.avgResponseTime) {
      alerts.push({
        type: 'high_response_time',
        severity: 'warning',
        message: `Average response time ${avgResponseTime}ms exceeds threshold ${this.config.alertThresholds.avgResponseTime}ms`
      });
    }

    // Check collection scans
    const collectionScans = performanceData.queryPerformance?.summary?.collectionScansCount || 0;
    if (collectionScans > 0) {
      alerts.push({
        type: 'collection_scans',
        severity: 'warning',
        message: `${collectionScans} queries performing collection scans`
      });
    }

    // Process alerts if any
    if (alerts.length > 0 && this.config.enableRealTimeAlerts) {
      await this.processPerformanceAlerts(alerts);
    }
  }

  async processPerformanceAlerts(alerts) {
    for (const alert of alerts) {
      console.warn(`⚠️ Performance Alert [${alert.severity}]: ${alert.message}`);

      // Store alert for historical tracking
      await this.collections.performanceMetrics.insertOne({
        type: 'alert',
        alert: alert,
        timestamp: new Date()
      });

      // Trigger external alerting systems here
      // (email, Slack, PagerDuty, etc.)
    }
  }
}

// Benefits of MongoDB Advanced Performance Monitoring:
// - Comprehensive query profiling with detailed execution analysis
// - Advanced index usage analysis and optimization recommendations
// - Collection-level performance metrics and storage efficiency tracking
// - Real-time performance monitoring with automated alerting
// - Intelligent optimization recommendations based on actual usage patterns
// - Integration with MongoDB's native profiling and statistics capabilities
// - Production-ready monitoring suitable for large-scale deployments
// - Historical performance trend analysis and baseline establishment
// - Automated performance report generation with executive summaries
// - SQL-compatible monitoring operations through QueryLeaf integration

module.exports = {
  AdvancedMongoPerformanceMonitor
};

Understanding MongoDB Performance Monitoring Architecture

Advanced Profiling and Optimization Strategies

Implement sophisticated monitoring patterns for production MongoDB deployments:

// Production-ready MongoDB performance monitoring with advanced optimization patterns
class ProductionPerformanceOptimizer extends AdvancedMongoPerformanceMonitor {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enablePredictiveAnalytics: true,
      enableAutomaticOptimization: false, // Require manual approval
      enableCapacityPlanning: true,
      enablePerformanceBaseline: true,
      enableAnomalyDetection: true,
      enableCostOptimization: true
    };

    this.setupProductionOptimizations();
    this.initializePredictiveAnalytics();
    this.setupCapacityPlanningModels();
  }

  async implementAdvancedQueryOptimization(optimizationConfig) {
    console.log('Implementing advanced query optimization strategies...');

    const optimizationStrategies = {
      // Intelligent index recommendations
      indexOptimization: {
        compoundIndexAnalysis: true,
        partialIndexOptimization: true,
        sparseIndexRecommendations: true,
        indexIntersectionAnalysis: true
      },

      // Query pattern analysis
      queryOptimization: {
        aggregationPipelineOptimization: true,
        queryShapeAnalysis: true,
        executionPlanOptimization: true,
        sortOptimization: true
      },

      // Schema optimization
      schemaOptimization: {
        documentStructureAnalysis: true,
        fieldUsageAnalysis: true,
        embeddingVsReferencingAnalysis: true,
        denormalizationRecommendations: true
      },

      // Resource optimization
      resourceOptimization: {
        connectionPoolOptimization: true,
        memoryUsageOptimization: true,
        diskIOOptimization: true,
        networkOptimization: true
      }
    };

    return await this.executeOptimizationStrategies(optimizationStrategies);
  }

  async setupCapacityPlanningModels(planningRequirements) {
    console.log('Setting up capacity planning and growth prediction models...');

    const planningModels = {
      // Growth prediction models
      growthPrediction: {
        documentGrowthRate: await this.analyzeDocumentGrowthRate(),
        storageGrowthProjection: await this.projectStorageGrowth(),
        queryVolumeProjection: await this.projectQueryVolumeGrowth(),
        indexGrowthAnalysis: await this.analyzeIndexGrowthPatterns()
      },

      // Resource requirement models
      resourcePlanning: {
        cpuRequirements: await this.calculateCPURequirements(),
        memoryRequirements: await this.calculateMemoryRequirements(),
        storageRequirements: await this.calculateStorageRequirements(),
        networkRequirements: await this.calculateNetworkRequirements()
      },

      // Scaling recommendations
      scalingStrategy: {
        verticalScaling: await this.analyzeVerticalScalingNeeds(),
        horizontalScaling: await this.analyzeHorizontalScalingNeeds(),
        shardingRecommendations: await this.analyzeShardingRequirements(),
        replicaSetOptimization: await this.analyzeReplicaSetOptimization()
      }
    };

    return await this.implementCapacityPlanningModels(planningModels);
  }

  async enableAnomalyDetection(detectionConfig) {
    console.log('Enabling performance anomaly detection system...');

    const anomalyDetectionSystem = {
      // Statistical anomaly detection
      statisticalDetection: {
        responseTimeAnomalies: true,
        queryVolumeAnomalies: true,
        indexUsageAnomalies: true,
        resourceUsageAnomalies: true
      },

      // Machine learning based detection
      mlDetection: {
        queryPatternAnomalies: true,
        performanceDegradationPrediction: true,
        capacityThresholdPrediction: true,
        failurePatternRecognition: true
      },

      // Business logic anomalies
      businessLogicDetection: {
        unexpectedDataPatterns: true,
        unusualApplicationBehavior: true,
        securityAnomalies: true,
        complianceViolations: true
      }
    };

    return await this.implementAnomalyDetectionSystem(anomalyDetectionSystem);
  }
}

SQL-Style Performance Monitoring with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB performance monitoring and optimization operations:

-- QueryLeaf advanced performance monitoring and optimization with SQL-familiar syntax

-- Enable comprehensive database profiling with advanced configuration
CONFIGURE PROFILING 
SET profiling_level = 2,
    slow_operation_threshold = 100,
    sample_rate = 1.0,
    filter_criteria = {
      include_slow_ops: true,
      include_collection_scans: true,
      include_lock_operations: true,
      include_index_analysis: true
    },
    collection_size = '100MB',
    max_documents = 1000000;

-- Comprehensive performance metrics analysis with detailed insights
WITH performance_analysis AS (
  SELECT 
    -- Operation characteristics
    operation_type,
    collection_name,
    execution_time_ms,
    documents_examined,
    documents_returned,
    index_keys_examined,
    execution_plan,

    -- Efficiency calculations
    CASE 
      WHEN documents_examined > 0 THEN 
        CAST(documents_returned AS FLOAT) / documents_examined
      ELSE 1.0
    END as query_efficiency,

    -- Performance categorization
    CASE 
      WHEN execution_time_ms < 10 THEN 'very_fast'
      WHEN execution_time_ms < 100 THEN 'fast'
      WHEN execution_time_ms < 500 THEN 'moderate'
      WHEN execution_time_ms < 2000 THEN 'slow'
      ELSE 'very_slow'
    END as performance_category,

    -- Index usage analysis
    CASE 
      WHEN execution_plan LIKE '%IXSCAN%' THEN 'index_scan'
      WHEN execution_plan LIKE '%COLLSCAN%' THEN 'collection_scan'
      ELSE 'other'
    END as index_usage_type,

    -- Lock analysis
    locks_acquired,
    lock_wait_time_ms,

    -- Resource usage
    cpu_time_ms,
    memory_usage_bytes,

    -- Timestamp for trend analysis
    DATE_TRUNC('minute', operation_timestamp) as time_bucket

  FROM PROFILE_DATA
  WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND database_name = CURRENT_DATABASE()
),

aggregated_metrics AS (
  SELECT 
    collection_name,
    operation_type,
    index_usage_type,
    time_bucket,

    -- Operation volume metrics
    COUNT(*) as operation_count,

    -- Performance metrics
    AVG(execution_time_ms) as avg_response_time,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY execution_time_ms) as median_response_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_response_time,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY execution_time_ms) as p99_response_time,
    MIN(execution_time_ms) as min_response_time,
    MAX(execution_time_ms) as max_response_time,

    -- Efficiency metrics
    AVG(query_efficiency) as avg_efficiency,
    SUM(documents_examined) as total_docs_examined,
    SUM(documents_returned) as total_docs_returned,
    SUM(index_keys_examined) as total_index_keys_examined,

    -- Performance distribution
    COUNT(*) FILTER (WHERE performance_category = 'very_fast') as very_fast_ops,
    COUNT(*) FILTER (WHERE performance_category = 'fast') as fast_ops,
    COUNT(*) FILTER (WHERE performance_category = 'moderate') as moderate_ops,
    COUNT(*) FILTER (WHERE performance_category = 'slow') as slow_ops,
    COUNT(*) FILTER (WHERE performance_category = 'very_slow') as very_slow_ops,

    -- Resource utilization
    AVG(cpu_time_ms) as avg_cpu_time,
    AVG(memory_usage_bytes) as avg_memory_usage,
    SUM(lock_wait_time_ms) as total_lock_wait_time,

    -- Index efficiency
    COUNT(*) FILTER (WHERE index_usage_type = 'collection_scan') as collection_scan_count,
    COUNT(*) FILTER (WHERE index_usage_type = 'index_scan') as index_scan_count,

    -- Calculate performance score
    (
      -- Response time component (lower is better)
      (1000 - LEAST(AVG(execution_time_ms), 1000)) / 1000 * 40 +

      -- Efficiency component (higher is better)  
      AVG(query_efficiency) * 30 +

      -- Index usage component (index scans preferred)
      CASE 
        WHEN COUNT(*) FILTER (WHERE index_usage_type = 'index_scan') > 
             COUNT(*) FILTER (WHERE index_usage_type = 'collection_scan') THEN 20
        ELSE 0
      END +

      -- Volume stability component
      LEAST(COUNT(*) / 100.0, 1.0) * 10

    ) as performance_score

  FROM performance_analysis
  GROUP BY collection_name, operation_type, index_usage_type, time_bucket
),

performance_trends AS (
  SELECT 
    am.*,

    -- Trend analysis with window functions
    LAG(avg_response_time) OVER (
      PARTITION BY collection_name, operation_type, index_usage_type
      ORDER BY time_bucket
    ) as prev_response_time,

    LAG(operation_count) OVER (
      PARTITION BY collection_name, operation_type, index_usage_type  
      ORDER BY time_bucket
    ) as prev_operation_count,

    -- Moving averages for smoothing
    AVG(avg_response_time) OVER (
      PARTITION BY collection_name, operation_type, index_usage_type
      ORDER BY time_bucket
      ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) as moving_avg_response_time,

    AVG(performance_score) OVER (
      PARTITION BY collection_name, operation_type, index_usage_type
      ORDER BY time_bucket  
      ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) as moving_avg_performance_score

  FROM aggregated_metrics am
)

SELECT 
  collection_name,
  operation_type,
  index_usage_type,
  time_bucket,

  -- Core performance metrics
  operation_count,
  ROUND(avg_response_time::NUMERIC, 2) as avg_response_time_ms,
  ROUND(median_response_time::NUMERIC, 2) as median_response_time_ms,
  ROUND(p95_response_time::NUMERIC, 2) as p95_response_time_ms,
  ROUND(p99_response_time::NUMERIC, 2) as p99_response_time_ms,

  -- Efficiency metrics
  ROUND((avg_efficiency * 100)::NUMERIC, 2) as efficiency_percentage,
  total_docs_examined,
  total_docs_returned,

  -- Performance distribution
  JSON_OBJECT(
    'very_fast', very_fast_ops,
    'fast', fast_ops, 
    'moderate', moderate_ops,
    'slow', slow_ops,
    'very_slow', very_slow_ops
  ) as performance_distribution,

  -- Index usage analysis
  collection_scan_count,
  index_scan_count,
  ROUND(
    (index_scan_count::FLOAT / NULLIF(collection_scan_count + index_scan_count, 0) * 100)::NUMERIC, 
    2
  ) as index_usage_percentage,

  -- Performance scoring
  ROUND(performance_score::NUMERIC, 2) as performance_score,
  CASE 
    WHEN performance_score >= 90 THEN 'excellent'
    WHEN performance_score >= 75 THEN 'good'
    WHEN performance_score >= 60 THEN 'fair'
    WHEN performance_score >= 40 THEN 'poor'
    ELSE 'critical'
  END as performance_grade,

  -- Trend analysis
  CASE 
    WHEN prev_response_time IS NOT NULL THEN
      ROUND(((avg_response_time - prev_response_time) / prev_response_time * 100)::NUMERIC, 2)
    ELSE NULL
  END as response_time_change_percent,

  CASE 
    WHEN prev_operation_count IS NOT NULL THEN
      ROUND(((operation_count - prev_operation_count)::FLOAT / prev_operation_count * 100)::NUMERIC, 2)
    ELSE NULL
  END as volume_change_percent,

  -- Moving averages for trend smoothing
  ROUND(moving_avg_response_time::NUMERIC, 2) as trend_response_time,
  ROUND(moving_avg_performance_score::NUMERIC, 2) as trend_performance_score,

  -- Resource utilization
  ROUND(avg_cpu_time::NUMERIC, 2) as avg_cpu_time_ms,
  ROUND((avg_memory_usage / 1024.0 / 1024)::NUMERIC, 2) as avg_memory_usage_mb,
  total_lock_wait_time as total_lock_wait_ms,

  -- Alert indicators
  CASE 
    WHEN avg_response_time > 1000 THEN 'high_response_time'
    WHEN collection_scan_count > index_scan_count THEN 'excessive_collection_scans'
    WHEN avg_efficiency < 0.1 THEN 'low_efficiency'
    WHEN total_lock_wait_time > 1000 THEN 'lock_contention'
    ELSE 'normal'
  END as alert_status,

  CURRENT_TIMESTAMP as analysis_timestamp

FROM performance_trends
WHERE operation_count > 0  -- Filter out empty buckets
ORDER BY 
  performance_score ASC,  -- Show problematic areas first
  avg_response_time DESC,
  collection_name,
  operation_type;

-- Advanced index analysis and optimization recommendations
WITH index_statistics AS (
  SELECT 
    collection_name,
    index_name,
    index_spec,
    index_size_bytes,

    -- Usage statistics
    access_count,
    last_access_time,

    -- Index characteristics
    is_unique,
    is_sparse, 
    is_partial,
    is_compound,

    -- Calculate metrics
    EXTRACT(DAY FROM CURRENT_TIMESTAMP - last_access_time) as days_since_access,

    -- Index type classification
    CASE 
      WHEN access_count = 0 THEN 'unused'
      WHEN access_count < 100 THEN 'low_usage'
      WHEN access_count < 10000 THEN 'medium_usage'
      ELSE 'high_usage'
    END as usage_category,

    -- Get collection statistics for context
    (SELECT document_count FROM COLLECTION_STATS cs WHERE cs.collection_name = idx.collection_name) as collection_doc_count,
    (SELECT total_size_bytes FROM COLLECTION_STATS cs WHERE cs.collection_name = idx.collection_name) as collection_size_bytes

  FROM INDEX_STATS idx
  WHERE database_name = CURRENT_DATABASE()
),

index_analysis AS (
  SELECT 
    *,

    -- Calculate index efficiency metrics
    CASE 
      WHEN collection_size_bytes > 0 THEN 
        CAST(index_size_bytes AS FLOAT) / collection_size_bytes
      ELSE 0
    END as size_ratio,

    -- Usage intensity
    CASE 
      WHEN collection_doc_count > 0 THEN
        CAST(access_count AS FLOAT) / collection_doc_count
      ELSE 0
    END as usage_intensity,

    -- ROI calculation (simplified)
    CASE 
      WHEN index_size_bytes > 0 THEN
        CAST(access_count AS FLOAT) / (index_size_bytes / 1024 / 1024)  -- accesses per MB
      ELSE 0
    END as access_per_mb,

    -- Optimization opportunity scoring
    CASE 
      WHEN access_count = 0 AND index_name != '_id_' THEN 100  -- Remove unused
      WHEN access_count < 10 AND days_since_access > 30 THEN 80  -- Consider removal
      WHEN collection_size_bytes > 0
           AND CAST(index_size_bytes AS FLOAT) / collection_size_bytes > 0.5 THEN 60  -- Oversized index
      WHEN is_compound = false
           AND collection_doc_count > 0
           AND CAST(access_count AS FLOAT) / collection_doc_count < 0.01 THEN 40  -- Underutilized single field
      ELSE 0
    END as optimization_priority

  FROM index_statistics
),

optimization_recommendations AS (
  SELECT 
    collection_name,
    index_name,
    usage_category,

    -- Current metrics
    access_count,
    ROUND((index_size_bytes / 1024.0 / 1024)::NUMERIC, 2) as index_size_mb,
    ROUND((size_ratio * 100)::NUMERIC, 2) as size_ratio_percent,
    days_since_access,

    -- Optimization recommendations
    CASE 
      WHEN optimization_priority >= 100 THEN 
        JSON_OBJECT(
          'action', 'remove_index',
          'reason', 'Index is unused and consuming storage',
          'impact', 'Reduced storage usage and faster writes',
          'priority', 'high'
        )
      WHEN optimization_priority >= 80 THEN
        JSON_OBJECT(
          'action', 'consider_removal',
          'reason', 'Index has very low usage and is stale',
          'impact', 'Potential storage savings with minimal risk',
          'priority', 'medium'
        )
      WHEN optimization_priority >= 60 THEN
        JSON_OBJECT(
          'action', 'optimize_index',
          'reason', 'Index size is disproportionately large',
          'impact', 'Consider sparse or partial index options',
          'priority', 'medium'
        )
      WHEN optimization_priority >= 40 THEN
        JSON_OBJECT(
          'action', 'review_usage',
          'reason', 'Single field index with low utilization',
          'impact', 'Evaluate if compound index would be more effective',
          'priority', 'low'
        )
      ELSE
        JSON_OBJECT(
          'action', 'maintain',
          'reason', 'Index appears to be well utilized',
          'impact', 'No immediate action required',
          'priority', 'none'
        )
    END as recommendation,

    -- Performance impact estimation
    CASE 
      WHEN optimization_priority >= 80 THEN
        JSON_OBJECT(
          'storage_savings_mb', ROUND((index_size_bytes / 1024.0 / 1024)::NUMERIC, 2),
          'write_performance_improvement', '5-15%',
          'query_performance_impact', 'minimal'
        )
      WHEN optimization_priority >= 40 THEN
        JSON_OBJECT(
          'storage_savings_mb', ROUND((index_size_bytes / 1024.0 / 1024 * 0.3)::NUMERIC, 2),
          'write_performance_improvement', '2-8%', 
          'query_performance_impact', 'requires_analysis'
        )
      ELSE
        JSON_OBJECT(
          'storage_savings_mb', 0,
          'write_performance_improvement', '0%',
          'query_performance_impact', 'none'
        )
    END as impact_estimate,

    optimization_priority

  FROM index_analysis
  WHERE optimization_priority > 0
)

SELECT 
  collection_name,
  index_name,
  usage_category,
  access_count,
  index_size_mb,
  size_ratio_percent,
  days_since_access,

  -- Recommendation details
  JSON_EXTRACT(recommendation, '$.action') as recommended_action,
  JSON_EXTRACT(recommendation, '$.reason') as recommendation_reason,
  JSON_EXTRACT(recommendation, '$.impact') as expected_impact,
  JSON_EXTRACT(recommendation, '$.priority') as priority_level,

  -- Impact estimation
  CAST(JSON_EXTRACT(impact_estimate, '$.storage_savings_mb') AS DECIMAL(10,2)) as potential_storage_savings_mb,
  JSON_EXTRACT(impact_estimate, '$.write_performance_improvement') as write_performance_gain,
  JSON_EXTRACT(impact_estimate, '$.query_performance_impact') as query_impact_assessment,

  -- Implementation guidance
  CASE 
    WHEN JSON_EXTRACT(recommendation, '$.action') = 'remove_index' THEN
      'DROP INDEX ' || index_name || ' ON ' || collection_name
    WHEN JSON_EXTRACT(recommendation, '$.action') = 'optimize_index' THEN
      'Review index definition and consider sparse/partial options'
    ELSE 'Monitor usage patterns before taking action'
  END as implementation_command,

  optimization_priority,
  CURRENT_TIMESTAMP as analysis_date

FROM optimization_recommendations
ORDER BY optimization_priority DESC, index_size_mb DESC;

-- Real-time performance monitoring dashboard query
CREATE VIEW real_time_performance_dashboard AS
WITH current_metrics AS (
  SELECT 
    -- Time-based grouping for real-time updates
    DATE_TRUNC('minute', CURRENT_TIMESTAMP) as current_minute,

    -- Operation volume in last minute
    (SELECT COUNT(*) FROM PROFILE_DATA 
     WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as ops_per_minute,

    -- Average response time in last minute  
    (SELECT AVG(execution_time_ms) FROM PROFILE_DATA
     WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as avg_response_time_1m,

    -- Collection scans in last minute
    (SELECT COUNT(*) FROM PROFILE_DATA
     WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 minute'
     AND execution_plan LIKE '%COLLSCAN%') as collection_scans_1m,

    -- Slow queries in last minute (>500ms)
    (SELECT COUNT(*) FROM PROFILE_DATA  
     WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 minute'
     AND execution_time_ms > 500) as slow_queries_1m,

    -- Connection statistics
    (SELECT current_connections FROM CONNECTION_STATS) as current_connections,
    (SELECT max_connections FROM CONNECTION_STATS) as max_connections,

    -- Memory usage
    (SELECT resident_memory_mb FROM MEMORY_STATS) as memory_usage_mb,
    (SELECT cache_hit_ratio FROM MEMORY_STATS) as cache_hit_ratio,

    -- Storage metrics
    (SELECT SUM(data_size_bytes) FROM COLLECTION_STATS) as total_data_size_bytes,
    (SELECT SUM(storage_size_bytes) FROM COLLECTION_STATS) as total_storage_size_bytes,
    (SELECT SUM(index_size_bytes) FROM COLLECTION_STATS) as total_index_size_bytes
),

health_indicators AS (
  SELECT 
    cm.*,

    -- Calculate health scores
    CASE 
      WHEN avg_response_time_1m > 1000 THEN 'critical'
      WHEN avg_response_time_1m > 500 THEN 'warning' 
      WHEN avg_response_time_1m > 100 THEN 'ok'
      ELSE 'excellent'
    END as response_time_health,

    CASE 
      WHEN collection_scans_1m > 10 THEN 'critical'
      WHEN collection_scans_1m > 5 THEN 'warning'
      WHEN collection_scans_1m > 0 THEN 'ok'
      ELSE 'excellent'  
    END as index_usage_health,

    CASE 
      WHEN current_connections::FLOAT / NULLIF(max_connections, 0) > 0.9 THEN 'critical'
      WHEN current_connections::FLOAT / NULLIF(max_connections, 0) > 0.8 THEN 'warning'
      WHEN current_connections::FLOAT / NULLIF(max_connections, 0) > 0.7 THEN 'ok'
      ELSE 'excellent'
    END as connection_health,

    CASE 
      WHEN cache_hit_ratio < 0.8 THEN 'critical'
      WHEN cache_hit_ratio < 0.9 THEN 'warning'
      WHEN cache_hit_ratio < 0.95 THEN 'ok'
      ELSE 'excellent'
    END as memory_health

  FROM current_metrics cm
)

SELECT 
  current_minute,

  -- Real-time performance metrics
  ops_per_minute,
  ROUND(avg_response_time_1m::NUMERIC, 2) as avg_response_time_ms,
  collection_scans_1m,
  slow_queries_1m,

  -- Health indicators
  response_time_health,
  index_usage_health, 
  connection_health,
  memory_health,

  -- Overall health score
  CASE 
    WHEN response_time_health = 'critical' OR index_usage_health = 'critical' OR 
         connection_health = 'critical' OR memory_health = 'critical' THEN 'critical'
    WHEN response_time_health = 'warning' OR index_usage_health = 'warning' OR
         connection_health = 'warning' OR memory_health = 'warning' THEN 'warning'  
    WHEN response_time_health = 'ok' OR index_usage_health = 'ok' OR
         connection_health = 'ok' OR memory_health = 'ok' THEN 'ok'
    ELSE 'excellent'
  END as overall_health,

  -- Resource utilization
  current_connections,
  max_connections,
  ROUND((current_connections::FLOAT / NULLIF(max_connections, 0) * 100)::NUMERIC, 2) as connection_usage_percent,

  memory_usage_mb,
  ROUND((cache_hit_ratio * 100)::NUMERIC, 2) as cache_hit_percent,

  -- Storage information
  ROUND((total_data_size_bytes / 1024.0 / 1024 / 1024)::NUMERIC, 2) as total_data_gb,
  ROUND((total_storage_size_bytes / 1024.0 / 1024 / 1024)::NUMERIC, 2) as total_storage_gb,
  ROUND((total_index_size_bytes / 1024.0 / 1024 / 1024)::NUMERIC, 2) as total_index_gb,

  -- Efficiency metrics
  ROUND((total_data_size_bytes::FLOAT / NULLIF(total_storage_size_bytes, 0))::NUMERIC, 4) as storage_efficiency,
  ROUND((total_index_size_bytes::FLOAT / NULLIF(total_data_size_bytes, 0))::NUMERIC, 4) as index_to_data_ratio,

  -- Alert conditions
  CASE 
    WHEN ops_per_minute = 0 THEN 'no_activity'
    WHEN slow_queries_1m > ops_per_minute * 0.1 THEN 'high_slow_query_ratio'
    WHEN collection_scans_1m > ops_per_minute * 0.05 THEN 'excessive_collection_scans'
    ELSE 'normal'
  END as alert_condition,

  -- Recommendations
  ARRAY[
    CASE WHEN response_time_health IN ('critical', 'warning') THEN 'Review slow queries and indexing strategy' END,
    CASE WHEN index_usage_health IN ('critical', 'warning') THEN 'Add indexes to eliminate collection scans' END, 
    CASE WHEN connection_health IN ('critical', 'warning') THEN 'Monitor connection pooling and usage patterns' END,
    CASE WHEN memory_health IN ('critical', 'warning') THEN 'Review memory allocation and cache settings' END
  ]::TEXT[] as immediate_recommendations

FROM health_indicators;

-- QueryLeaf provides comprehensive MongoDB performance monitoring capabilities:
-- 1. SQL-familiar syntax for MongoDB profiling configuration and analysis
-- 2. Advanced performance metrics collection with detailed execution insights  
-- 3. Real-time index usage analysis and optimization recommendations
-- 4. Comprehensive query performance analysis with efficiency scoring
-- 5. Production-ready monitoring dashboards with health indicators
-- 6. Automated optimization recommendations based on actual usage patterns
-- 7. Trend analysis and performance baseline establishment
-- 8. Integration with MongoDB's native profiling and statistics systems
-- 9. Advanced alerting and anomaly detection capabilities
-- 10. Capacity planning and resource optimization insights

Best Practices for Production MongoDB Performance Monitoring

Monitoring Strategy Implementation

Essential principles for effective MongoDB performance monitoring and optimization:

  1. Profiling Configuration: Configure appropriate profiling levels and sampling rates to balance insight with performance impact (see the configuration sketch after this list)
  2. Metrics Collection: Implement comprehensive metrics collection covering queries, indexes, resources, and business operations
  3. Baseline Establishment: Establish performance baselines to enable meaningful trend analysis and anomaly detection
  4. Alert Strategy: Design intelligent alerting that focuses on actionable issues rather than metric noise
  5. Optimization Workflow: Implement systematic optimization workflows with testing and validation procedures
  6. Capacity Planning: Utilize historical data and growth patterns for proactive capacity planning and scaling decisions

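The sketch below illustrates the first principle above (profiling configuration) in the document's Node.js style. It runs the profile database command through the official driver; the connection URI, database name, and thresholds are placeholder assumptions, not values taken from this article, and any profiling level should be validated in a staging environment before production rollout.

// Minimal profiling-configuration sketch (assumed database name and thresholds)
const { MongoClient } = require('mongodb');

async function configureProfiling(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const db = client.db('app_database'); // hypothetical database name

    // Level 1 captures only operations slower than `slowms`;
    // `sampleRate` limits overhead by profiling a fraction of eligible operations.
    const result = await db.command({
      profile: 1,
      slowms: 100,
      sampleRate: 0.5
    });

    console.log('Previous profiling level:', result.was);
  } finally {
    await client.close();
  }
}
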
Production Deployment Optimization

Optimize MongoDB monitoring deployments for enterprise environments:

  1. Automated Analysis: Implement automated performance analysis and recommendation generation to reduce manual overhead
  2. Integration Ecosystem: Integrate monitoring with existing observability platforms and operational workflows (a minimal forwarding sketch follows this list)
  3. Cost Optimization: Balance monitoring comprehensiveness with resource costs and performance impact
  4. Scalability Design: Design monitoring systems that scale effectively with database growth and complexity
  5. Security Integration: Ensure monitoring systems comply with security requirements and access control policies
  6. Documentation Standards: Maintain comprehensive documentation of monitoring configurations, thresholds, and procedures

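As a small illustration of the second principle above, the following sketch forwards a performance alert to a generic JSON webhook so it can reach an existing observability or incident-management platform. The endpoint, payload shape, and environment variable are assumptions rather than a specific vendor API, and it relies on the global fetch available in Node.js 18+ instead of an SDK.

// Hedged alert-forwarding sketch (assumed ALERT_WEBHOOK_URL and payload shape)
async function forwardPerformanceAlert(alert) {
  const webhookUrl = process.env.ALERT_WEBHOOK_URL; // e.g. an incoming-webhook endpoint (assumed)
  if (!webhookUrl) return;

  const response = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: 'mongodb-performance-monitor',
      severity: alert.severity,
      message: alert.message,
      timestamp: new Date().toISOString()
    })
  });

  if (!response.ok) {
    console.error(`Alert forwarding failed with HTTP ${response.status}`);
  }
}
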
Conclusion

MongoDB performance monitoring and optimization requires sophisticated tooling and methodologies that understand the unique characteristics of document databases, distributed architectures, and dynamic schema patterns. Advanced monitoring capabilities including query profiling, index analysis, resource tracking, and automated optimization recommendations enable proactive performance management that prevents issues before they impact application users.

Key MongoDB Performance Monitoring benefits include:

  • Comprehensive Profiling: Deep insights into query execution, index usage, and resource utilization patterns
  • Intelligent Optimization: Automated analysis and recommendations based on actual usage patterns and performance data
  • Real-time Monitoring: Continuous performance tracking with proactive alerting and anomaly detection
  • Capacity Planning: Data-driven insights for scaling decisions and resource optimization
  • Production Integration: Enterprise-ready monitoring that integrates with existing operational workflows
  • SQL Accessibility: Familiar SQL-style monitoring operations through QueryLeaf for accessible performance management

Whether you're managing development environments, production deployments, or large-scale distributed MongoDB systems, comprehensive performance monitoring with QueryLeaf's familiar SQL interface provides the foundation for optimal database performance and reliability.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style monitoring queries into MongoDB's native profiling and statistics operations, making advanced performance analysis accessible to SQL-oriented teams. Complex profiling configurations, index analysis, and optimization recommendations are seamlessly handled through familiar SQL constructs, enabling sophisticated performance management without requiring deep MongoDB expertise.

The combination of MongoDB's robust performance monitoring capabilities with SQL-style analysis operations makes it an ideal platform for applications requiring both advanced performance optimization and familiar database management patterns, ensuring your MongoDB deployments maintain optimal performance as they scale and evolve.

MongoDB Vector Search and AI Applications: Building Semantic Search and Similarity Systems for Modern AI-Powered Applications

Modern artificial intelligence applications require sophisticated search capabilities that understand semantic meaning beyond traditional keyword matching, enabling natural language queries, content recommendation systems, and intelligent document retrieval based on conceptual similarity rather than exact text matches. Traditional full-text search approaches struggle with understanding context, synonyms, and conceptual relationships, limiting their effectiveness for AI-powered applications that need to comprehend user intent and content meaning.

MongoDB Vector Search provides comprehensive vector similarity capabilities that enable semantic search, recommendation engines, and AI-powered content discovery through high-dimensional vector embeddings and advanced similarity algorithms. Unlike traditional search systems that rely on exact keyword matching, MongoDB Vector Search leverages machine learning embeddings to understand content semantics, enabling applications to find conceptually similar documents, perform natural language search, and power intelligent recommendation systems.

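Before the traditional comparison below, the sketch that follows shows the general shape of an Atlas Vector Search query in the document's Node.js style. It assumes a collection whose documents store an embedding array field and an Atlas Vector Search index named vector_index over that field; the database, collection, index, and field names are illustrative assumptions rather than details from this article.

// Hedged $vectorSearch sketch (assumed index, field, and collection names)
const { MongoClient } = require('mongodb');

async function semanticSearch(uri, queryEmbedding) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const documents = client.db('content_platform').collection('documents'); // hypothetical names

    return await documents.aggregate([
      {
        $vectorSearch: {
          index: 'vector_index',        // assumed Atlas Vector Search index name
          path: 'embedding',            // assumed field holding the embedding vector
          queryVector: queryEmbedding,  // vector produced by an embedding model
          numCandidates: 200,           // candidates considered before final ranking
          limit: 10
        }
      },
      {
        $project: {
          title: 1,
          category: 1,
          score: { $meta: 'vectorSearchScore' }
        }
      }
    ]).toArray();
  } finally {
    await client.close();
  }
}
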
The Traditional Search Limitation Challenge

Conventional text-based search approaches have significant limitations for modern AI applications:

-- Traditional PostgreSQL full-text search - limited semantic understanding and context awareness

-- Basic full-text search setup with limited semantic capabilities
CREATE TABLE documents (
    document_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    category VARCHAR(100),
    author VARCHAR(200),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Traditional metadata
    tags TEXT[],
    keywords VARCHAR(1000),
    summary TEXT,
    document_type VARCHAR(50),
    language VARCHAR(10) DEFAULT 'en',

    -- Basic search vectors (limited functionality)
    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(content, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(summary, '')), 'C') ||
        setweight(to_tsvector('english', array_to_string(coalesce(tags, '{}'), ' ')), 'D')
    ) STORED
);

-- Additional tables for recommendation attempts
CREATE TABLE user_interactions (
    interaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,
    document_id UUID NOT NULL REFERENCES documents(document_id),
    interaction_type VARCHAR(50) NOT NULL, -- 'view', 'like', 'share', 'download'
    interaction_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    duration_seconds INTEGER,
    rating INTEGER CHECK (rating BETWEEN 1 AND 5)
);

CREATE TABLE document_similarity (
    document_id_1 UUID NOT NULL REFERENCES documents(document_id),
    document_id_2 UUID NOT NULL REFERENCES documents(document_id),
    similarity_score DECIMAL(5,4) NOT NULL CHECK (similarity_score BETWEEN 0 AND 1),
    similarity_type VARCHAR(50) NOT NULL, -- 'keyword', 'category', 'manual'
    calculated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (document_id_1, document_id_2)
);

-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
    SELECT 
        'machine learning artificial intelligence neural networks deep learning' as query_text,
        to_tsquery('english', 
            'machine & learning & artificial & intelligence & neural & networks & deep & learning'
        ) as search_tsquery
),
basic_search AS (
    SELECT 
        d.document_id,
        d.title,
        d.content,
        d.category,
        d.author,
        d.created_at,
        d.tags,

        -- Basic relevance scoring (limited effectiveness)
        ts_rank(d.search_vector, sq.search_tsquery) as basic_relevance,
        ts_rank_cd(d.search_vector, sq.search_tsquery) as weighted_relevance,

        -- Simple keyword matching
        (
            SELECT COUNT(DISTINCT qt)
            FROM unnest(string_to_array(lower(sq.query_text), ' ')) AS qt
            WHERE lower(d.title || ' ' || d.content) LIKE '%' || qt || '%'
        ) as keyword_matches,

        -- Category-based scoring
        CASE d.category 
            WHEN 'AI/ML' THEN 2.0
            WHEN 'Technology' THEN 1.5 
            WHEN 'Science' THEN 1.2
            ELSE 1.0 
        END as category_boost,

        -- Recency scoring
        CASE 
            WHEN d.created_at > CURRENT_DATE - INTERVAL '30 days' THEN 1.5
            WHEN d.created_at > CURRENT_DATE - INTERVAL '90 days' THEN 1.2  
            WHEN d.created_at > CURRENT_DATE - INTERVAL '365 days' THEN 1.1
            ELSE 1.0
        END as recency_boost

    FROM documents d
    CROSS JOIN search_query sq
    WHERE d.search_vector @@ sq.search_tsquery
),
popularity_metrics AS (
    SELECT 
        ui.document_id,
        COUNT(*) as interaction_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'view') as view_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'like') as like_count,
        COUNT(*) FILTER (WHERE ui.interaction_type = 'share') as share_count,
        AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,
        AVG(ui.duration_seconds) FILTER (WHERE ui.duration_seconds IS NOT NULL) as avg_duration
    FROM user_interactions ui
    WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '90 days'
    GROUP BY ui.document_id
),
similarity_expansion AS (
    -- Manual similarity relationships (limited and maintenance-heavy)
    SELECT DISTINCT
        ds.document_id_2 as document_id,
        MAX(ds.similarity_score) as max_similarity,
        COUNT(*) as similar_document_count
    FROM basic_search bs
    JOIN document_similarity ds ON bs.document_id = ds.document_id_1
    WHERE ds.similarity_score > 0.3
    GROUP BY ds.document_id_2
),
final_search_results AS (
    SELECT 
        bs.document_id,
        bs.title,
        SUBSTRING(bs.content, 1, 300) || '...' as content_preview,
        bs.category,
        bs.author,
        bs.created_at,
        bs.tags,

        -- Complex relevance calculation (limited effectiveness)
        (
            (bs.basic_relevance * 10) +
            (bs.weighted_relevance * 15) + 
            (COALESCE(bs.keyword_matches, 0) * 5) +
            (bs.category_boost * 3) +
            (bs.recency_boost * 2) +
            (COALESCE(pm.like_count, 0) * 0.1) +
            (COALESCE(pm.avg_rating, 0) * 2) +
            (COALESCE(se.max_similarity, 0) * 8)
        ) as final_relevance_score,

        -- Metrics for debugging
        bs.basic_relevance,
        bs.keyword_matches,
        pm.interaction_count,
        pm.like_count,
        pm.avg_rating,
        se.similar_document_count,

        -- Formatted information
        CASE 
            WHEN pm.interaction_count > 100 THEN 'Popular'
            WHEN pm.interaction_count > 50 THEN 'Moderately Popular'
            WHEN pm.interaction_count > 10 THEN 'Some Interest'
            ELSE 'New/Limited Interest'
        END as popularity_status

    FROM basic_search bs
    LEFT JOIN popularity_metrics pm ON bs.document_id = pm.document_id
    LEFT JOIN similarity_expansion se ON bs.document_id = se.document_id
)
SELECT 
    document_id,
    title,
    content_preview,
    category,
    author,
    created_at,
    tags,
    ROUND(final_relevance_score::numeric, 2) as relevance_score,
    popularity_status,

    -- Limited recommendation capability
    CASE 
        WHEN final_relevance_score > 25 THEN 'Highly Relevant'
        WHEN final_relevance_score > 15 THEN 'Relevant' 
        WHEN final_relevance_score > 8 THEN 'Potentially Relevant'
        ELSE 'Low Relevance'
    END as relevance_category

FROM final_search_results
WHERE final_relevance_score > 5  -- Filter low-relevance results
ORDER BY final_relevance_score DESC
LIMIT 20;

-- Problems with traditional search approaches:
-- 1. No semantic understanding - "ML" vs "machine learning" treated as completely different
-- 2. Limited context awareness - cannot understand conceptual relationships
-- 3. Poor synonym handling - requires manual synonym dictionaries
-- 4. No natural language query support - requires exact keyword matching
-- 5. Complex manual similarity calculations that don't scale
-- 6. No understanding of document embeddings or vector representations
-- 7. Limited recommendation capabilities based on simple collaborative filtering
-- 8. Poor handling of multilingual content and cross-language search
-- 9. No support for image, audio, or multi-modal content search
-- 10. Maintenance-heavy similarity relationships that become stale

-- Attempt at content-based recommendation (ineffective)
CREATE OR REPLACE FUNCTION calculate_basic_similarity(doc1_id UUID, doc2_id UUID)
RETURNS DECIMAL AS $$
DECLARE
    doc1_vector tsvector;
    doc2_vector tsvector;
    similarity_score DECIMAL;
BEGIN
    SELECT search_vector INTO doc1_vector FROM documents WHERE document_id = doc1_id;
    SELECT search_vector INTO doc2_vector FROM documents WHERE document_id = doc2_id;

    -- Extremely limited similarity calculation
    SELECT ts_rank(doc1_vector, plainto_tsquery('english', 
        array_to_string(
            string_to_array(
                regexp_replace(doc2_vector::text, '[^a-zA-Z0-9\s]', ' ', 'g'), 
                ' '
            ), 
            ' '
        )
    )) INTO similarity_score;

    RETURN COALESCE(similarity_score, 0);
END;
$$ LANGUAGE plpgsql;

-- Manual batch similarity calculation (expensive and inaccurate)
INSERT INTO document_similarity (document_id_1, document_id_2, similarity_score, similarity_type)
SELECT 
    d1.document_id,
    d2.document_id,
    calculate_basic_similarity(d1.document_id, d2.document_id),
    'keyword'
FROM documents d1
CROSS JOIN documents d2
WHERE d1.document_id != d2.document_id
  AND d1.category = d2.category  -- Only calculate within same category
  AND NOT EXISTS (
    SELECT 1 FROM document_similarity ds 
    WHERE ds.document_id_1 = d1.document_id 
    AND ds.document_id_2 = d2.document_id
  )
LIMIT 10000; -- Batch processing required due to computational cost

-- Traditional approach limitations:
-- 1. No understanding of semantic meaning or context
-- 2. Poor performance with large document collections
-- 3. Manual maintenance of similarity relationships
-- 4. Limited multilingual and cross-domain search capabilities  
-- 5. No support for natural language queries or conversational search
-- 6. Inability to handle synonyms and conceptual relationships
-- 7. No integration with modern AI/ML embedding models
-- 8. Poor recommendation quality based on simple keyword overlap
-- 9. No support for multi-modal content (images, videos, audio)
-- 10. Scalability issues with growing content collections

MongoDB Vector Search addresses these limitations with AI-powered semantic search built on vector embeddings:

// MongoDB Vector Search - advanced AI-powered semantic search with comprehensive embedding management
const { MongoClient } = require('mongodb');
const { OpenAI } = require('openai');
const tf = require('@tensorflow/tfjs-node');

const client = new MongoClient('mongodb+srv://username:password@cluster.mongodb.net/');
const db = client.db('advanced_ai_search_platform');

// Advanced AI-powered search and recommendation engine
class AdvancedVectorSearchEngine {
  constructor(db, aiConfig = {}) {
    this.db = db;
    this.collections = {
      documents: db.collection('documents'),
      embeddings: db.collection('document_embeddings'),
      userProfiles: db.collection('user_profiles'), 
      searchLogs: db.collection('search_logs'),
      recommendations: db.collection('recommendations'),
      modelMetadata: db.collection('model_metadata')
    };

    // AI model configuration
    this.aiConfig = {
      embeddingModel: aiConfig.embeddingModel || 'text-embedding-3-large',
      embeddingDimensions: aiConfig.embeddingDimensions || 3072,
      maxTokens: aiConfig.maxTokens || 8191,
      batchSize: aiConfig.batchSize || 50,
      similarityThreshold: aiConfig.similarityThreshold || 0.7,

      // Advanced AI configurations
      useMultimodalEmbeddings: aiConfig.useMultimodalEmbeddings || false,
      enableSemanticCaching: aiConfig.enableSemanticCaching ?? true,
      enableQueryExpansion: aiConfig.enableQueryExpansion ?? true,
      enablePersonalization: aiConfig.enablePersonalization ?? true,

      // Model providers
      openaiApiKey: aiConfig.openaiApiKey || process.env.OPENAI_API_KEY,
      huggingFaceApiKey: aiConfig.huggingFaceApiKey || process.env.HUGGINGFACE_API_KEY,
      cohereApiKey: aiConfig.cohereApiKey || process.env.COHERE_API_KEY
    };

    // Initialize AI clients
    this.openai = new OpenAI({ apiKey: this.aiConfig.openaiApiKey });
    this.embeddingCache = new Map();
    this.searchCache = new Map();

    this.setupVectorSearchIndexes();
    this.initializeEmbeddingModels();
  }

  async setupVectorSearchIndexes() {
    console.log('Setting up MongoDB Vector Search indexes...');

    try {
      // Primary document embedding index
      await this.collections.documents.createSearchIndex({
        name: 'document_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'embedding',
              numDimensions: this.aiConfig.embeddingDimensions,
              similarity: 'cosine'
            },
            {
              type: 'filter',
              path: 'category'
            },
            {
              type: 'filter', 
              path: 'language'
            },
            {
              type: 'filter',
              path: 'contentType'
            },
            {
              type: 'filter',
              path: 'accessLevel'
            },
            {
              type: 'filter',
              path: 'createdAt'
            }
          ]
        }
      });

      // Multi-modal content index for images and multimedia
      await this.collections.documents.createSearchIndex({
        name: 'multimodal_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'multimodalEmbedding',
              numDimensions: 1536, // Different dimension for multi-modal models
              similarity: 'cosine'
            },
            {
              type: 'filter',
              path: 'mediaType'
            }
          ]
        }
      });

      // User profile vector index for personalization
      await this.collections.userProfiles.createSearchIndex({
        name: 'user_profile_vector_index',
        type: 'vectorSearch',
        definition: {
          fields: [
            {
              type: 'vector',
              path: 'interestEmbedding',
              numDimensions: this.aiConfig.embeddingDimensions,
              similarity: 'cosine'
            }
          ]
        }
      });

      console.log('Vector Search indexes created successfully');
    } catch (error) {
      console.error('Error setting up Vector Search indexes:', error);
      throw error;
    }
  }

  async generateDocumentEmbedding(document, options = {}) {
    console.log(`Generating embeddings for document: ${document.title}`);

    try {
      // Prepare content for embedding generation
      const embeddingContent = this.prepareContentForEmbedding(document, options);

      // Check cache first
      const cacheKey = this.generateCacheKey(embeddingContent);
      if (this.embeddingCache.has(cacheKey) && this.aiConfig.enableSemanticCaching) {
        console.log('Using cached embedding');
        return this.embeddingCache.get(cacheKey);
      }

      // Generate embedding using OpenAI
      const embeddingResponse = await this.openai.embeddings.create({
        model: this.aiConfig.embeddingModel,
        input: embeddingContent,
        dimensions: this.aiConfig.embeddingDimensions
      });

      const embedding = embeddingResponse.data[0].embedding;

      // Cache the embedding
      if (this.aiConfig.enableSemanticCaching) {
        this.embeddingCache.set(cacheKey, embedding);
      }

      // Store embedding with comprehensive metadata
      const embeddingDocument = {
        documentId: document._id,
        embedding: embedding,

        // Embedding metadata
        model: this.aiConfig.embeddingModel,
        dimensions: this.aiConfig.embeddingDimensions,
        contentLength: embeddingContent.length,
        tokensUsed: embeddingResponse.usage?.total_tokens || 0,

        // Content characteristics
        contentType: document.contentType || 'text',
        language: document.language || 'en',
        category: document.category,

        // Processing metadata
        generatedAt: new Date(),
        modelVersion: embeddingResponse.model,
        processingTime: Date.now() - (options.startTime || Date.now()),

        // Quality metrics
        contentQuality: this.assessContentQuality(document),
        embeddingNorm: this.calculateVectorNorm(embedding),

        // Optimization metadata
        batchProcessed: options.batchProcessed || false,
        cacheHit: false
      };

      // Store in embedding collection for tracking
      await this.collections.embeddings.insertOne(embeddingDocument);

      // Update main document with embedding
      await this.collections.documents.updateOne(
        { _id: document._id },
        {
          $set: {
            embedding: embedding,
            embeddingMetadata: {
              model: this.aiConfig.embeddingModel,
              generatedAt: new Date(),
              dimensions: this.aiConfig.embeddingDimensions,
              contentHash: this.generateContentHash(embeddingContent)
            }
          }
        }
      );

      return embedding;

    } catch (error) {
      console.error(`Error generating embedding for document ${document._id}:`, error);
      throw error;
    }
  }

  prepareContentForEmbedding(document, options = {}) {
    // Intelligent content preparation for optimal embedding generation
    let content = '';

    // Title with higher weight
    if (document.title) {
      content += `Title: ${document.title}\n\n`;
    }

    // Summary if available
    if (document.summary) {
      content += `Summary: ${document.summary}\n\n`;
    }

    // Main content with intelligent truncation
    if (document.content) {
      // maxTokens is a token budget; approximate ~4 characters per token and reserve ~30% for title/metadata
      const maxContentLength = this.aiConfig.maxTokens * 4 * 0.7;
      let mainContent = document.content;

      if (mainContent.length > maxContentLength) {
        // Intelligent content truncation - keep beginning and key sections
        const beginningChunk = mainContent.substring(0, maxContentLength * 0.6);
        const endingChunk = mainContent.substring(mainContent.length - maxContentLength * 0.2);

        mainContent = beginningChunk + '\n...\n' + endingChunk;
      }

      content += `Content: ${mainContent}\n\n`;
    }

    // Metadata context
    if (document.category) {
      content += `Category: ${document.category}\n`;
    }

    if (document.tags && document.tags.length > 0) {
      content += `Tags: ${document.tags.join(', ')}\n`;
    }

    if (document.keywords) {
      content += `Keywords: ${document.keywords}\n`;
    }

    return content.trim();
  }

  async performSemanticSearch(query, options = {}) {
    console.log(`Performing semantic search for: "${query}"`);
    const startTime = Date.now();

    try {
      // Generate query embedding
      const queryEmbedding = await this.generateQueryEmbedding(query, options);

      // Build comprehensive search pipeline
      const searchPipeline = await this.buildSemanticSearchPipeline(queryEmbedding, query, options);

      // Execute vector search with MongoDB Atlas Vector Search
      const searchResults = await this.collections.documents.aggregate(searchPipeline).toArray();

      // Post-process and enhance results
      const enhancedResults = await this.enhanceSearchResults(searchResults, query, options);

      // Log search for analytics and improvement
      await this.logSearchQuery(query, queryEmbedding, enhancedResults, options);

      // Generate personalized recommendations if user context available
      let personalizedRecommendations = [];
      if (options.userId && this.aiConfig.enablePersonalization) {
        personalizedRecommendations = await this.generatePersonalizedRecommendations(
          options.userId, 
          enhancedResults.slice(0, 5),
          options
        );
      }

      return {
        query: query,
        results: enhancedResults,
        personalizedRecommendations: personalizedRecommendations,

        // Search metadata
        metadata: {
          totalResults: enhancedResults.length,
          searchTime: Date.now() - startTime,
          queryEmbeddingDimensions: queryEmbedding.length,
          embeddingModel: this.aiConfig.embeddingModel,
          similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
          filtersApplied: this.extractAppliedFilters(options),
          personalizationEnabled: this.aiConfig.enablePersonalization && !!options.userId
        },

        // Query insights
        insights: {
          queryComplexity: this.assessQueryComplexity(query),
          semanticCategories: this.identifySemanticCategories(enhancedResults),
          resultDiversity: this.calculateResultDiversity(enhancedResults),
          averageSimilarity: this.calculateAverageSimilarity(enhancedResults)
        },

        // Related queries and suggestions
        relatedQueries: await this.generateRelatedQueries(query, enhancedResults),
        searchSuggestions: await this.generateSearchSuggestions(query, options)
      };

    } catch (error) {
      console.error(`Semantic search error for query "${query}":`, error);
      throw error;
    }
  }

  async buildSemanticSearchPipeline(queryEmbedding, query, options = {}) {
    const pipeline = [];

    // Stage 1: Vector similarity search
    pipeline.push({
      $vectorSearch: {
        index: options.multimodal ? 'multimodal_vector_index' : 'document_vector_index',
        path: options.multimodal ? 'multimodalEmbedding' : 'embedding',
        queryVector: queryEmbedding,
        numCandidates: options.numCandidates || 1000,
        limit: options.vectorSearchLimit || 100,

        // Apply filters for performance and relevance
        filter: this.buildSearchFilters(options)
      }
    });

    // Stage 2: Add similarity score and metadata
    pipeline.push({
      $addFields: {
        vectorSimilarityScore: { $meta: 'vectorSearchScore' },
        searchMetadata: {
          searchTime: new Date(),
          searchQuery: query,
          searchModel: this.aiConfig.embeddingModel
        }
      }
    });

    // Stage 3: Hybrid scoring combining vector similarity with text relevance
    if (options.enableHybridSearch !== false) {
      pipeline.push({
        $addFields: {
          // Text match scoring for hybrid approach
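          // NOTE: in production, escape regex metacharacters in `query` before using it in $regexMatch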
          textMatchScore: {
            $cond: {
              if: { $regexMatch: { input: '$title', regex: query, options: 'i' } },
              then: 0.3,
              else: {
                $cond: {
                  if: { $regexMatch: { input: '$content', regex: query, options: 'i' } },
                  then: 0.2,
                  else: 0
                }
              }
            }
          },

          // Recency scoring
          recencyScore: {
            $switch: {
              branches: [
                {
                  case: { $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] },
                  then: 0.1
                },
                {
                  case: { $gte: ['$createdAt', new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)] },
                  then: 0.05
                }
              ],
              default: 0
            }
          },

          // Popularity scoring based on user interactions
          popularityScore: {
            $multiply: [
              { $log10: { $add: [{ $ifNull: ['$metrics.viewCount', 0] }, 1] } },
              0.05
            ]
          },

          // Content quality scoring
          qualityScore: {
            $multiply: [
              { $divide: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 10000] },
              0.02
            ]
          }
        }
      });

      // Combined hybrid score
      pipeline.push({
        $addFields: {
          hybridScore: {
            $add: [
              { $multiply: ['$vectorSimilarityScore', 0.7] }, // Vector similarity weight
              '$textMatchScore',
              '$recencyScore', 
              '$popularityScore',
              '$qualityScore'
            ]
          }
        }
      });
    }

    // Stage 4: Apply similarity threshold filtering
    pipeline.push({
      $match: {
        vectorSimilarityScore: { 
          $gte: options.similarityThreshold || this.aiConfig.similarityThreshold 
        }
      }
    });

    // Stage 5: Lookup related collections for rich context
    pipeline.push({
      $lookup: {
        from: 'users',
        localField: 'createdBy',
        foreignField: '_id',
        as: 'authorInfo',
        pipeline: [
          { $project: { name: 1, avatar: 1, expertise: 1, reputation: 1 } }
        ]
      }
    });

    // Stage 6: Add computed fields for result enhancement
    pipeline.push({
      $addFields: {
        // Content preview generation
        contentPreview: {
          $cond: {
            if: { $gt: [{ $strLenCP: { $ifNull: ['$content', ''] } }, 300] },
            then: { $concat: [{ $substr: ['$content', 0, 300] }, '...'] },
            else: '$content'
          }
        },

        // Relevance category
        relevanceCategory: {
          $switch: {
            branches: [
              { case: { $gte: ['$vectorSimilarityScore', 0.9] }, then: 'Highly Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.8] }, then: 'Very Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.7] }, then: 'Relevant' },
              { case: { $gte: ['$vectorSimilarityScore', 0.6] }, then: 'Moderately Relevant' }
            ],
            default: 'Potentially Relevant'
          }
        },

        // Author information
        authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
        authorExpertise: { $arrayElemAt: ['$authorInfo.expertise', 0] },

        // Formatted metadata
        formattedCreatedAt: {
          $dateToString: {
            format: '%Y-%m-%d',
            date: '$createdAt'
          }
        }
      }
    });

    // Stage 7: Final projection for clean output
    pipeline.push({
      $project: {
        _id: 1,
        title: 1,
        contentPreview: 1,
        category: 1,
        tags: 1,
        language: 1,
        contentType: 1,
        createdAt: 1,
        formattedCreatedAt: 1,

        // Scoring information
        vectorSimilarityScore: { $round: ['$vectorSimilarityScore', 4] },
        hybridScore: { $round: [{ $ifNull: ['$hybridScore', '$vectorSimilarityScore'] }, 4] },
        relevanceCategory: 1,

        // Author information
        authorName: 1,
        authorExpertise: 1,

        // Access and metadata
        accessLevel: 1,
        downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] },

        // Analytics metadata
        metrics: {
          viewCount: { $ifNull: ['$metrics.viewCount', 0] },
          likeCount: { $ifNull: ['$metrics.likeCount', 0] },
          shareCount: { $ifNull: ['$metrics.shareCount', 0] }
        },

        searchMetadata: 1
      }
    });

    // Stage 8: Sort by hybrid score or vector similarity
    const sortField = options.enableHybridSearch !== false ? 'hybridScore' : 'vectorSimilarityScore';
    pipeline.push({ $sort: { [sortField]: -1 } });

    // Stage 9: Apply final limit
    pipeline.push({ $limit: options.limit || 20 });

    return pipeline;
  }

  buildSearchFilters(options) {
    const filters = {};

    // Category filtering
    if (options.category) {
      filters.category = { $eq: options.category };
    }

    // Language filtering
    if (options.language) {
      filters.language = { $eq: options.language };
    }

    // Content type filtering
    if (options.contentType) {
      filters.contentType = { $eq: options.contentType };
    }

    // Access level filtering
    if (options.accessLevel) {
      filters.accessLevel = { $eq: options.accessLevel };
    }

    // Date range filtering
    if (options.dateFrom || options.dateTo) {
      filters.createdAt = {};
      if (options.dateFrom) filters.createdAt.$gte = new Date(options.dateFrom);
      if (options.dateTo) filters.createdAt.$lte = new Date(options.dateTo);
    }

    // Author filtering
    if (options.authorId) {
      filters.createdBy = { $eq: options.authorId };
    }

    // Tags filtering
    if (options.tags && options.tags.length > 0) {
      filters.tags = { $in: options.tags };
    }

    return filters;
  }

  async generateQueryEmbedding(query, options = {}) {
    console.log(`Generating query embedding for: "${query}"`);

    try {
      // Enhance query with expansion if enabled
      let enhancedQuery = query;

      if (this.aiConfig.enableQueryExpansion && options.expandQuery !== false) {
        enhancedQuery = await this.expandQuery(query, options);
      }

      // Generate embedding
      const embeddingResponse = await this.openai.embeddings.create({
        model: this.aiConfig.embeddingModel,
        input: enhancedQuery,
        dimensions: this.aiConfig.embeddingDimensions
      });

      return embeddingResponse.data[0].embedding;

    } catch (error) {
      console.error(`Error generating query embedding for "${query}":`, error);
      throw error;
    }
  }

  async expandQuery(query, options = {}) {
    console.log(`Expanding query: "${query}"`);

    try {
      // Use GPT to expand the query with related terms and concepts
      const expansionPrompt = `
        Given the search query: "${query}"

        Generate an expanded version that includes:
        1. Synonyms and related terms
        2. Alternative phrasings
        3. Conceptually related topics
        4. Common variations and abbreviations

        Keep the expansion focused and relevant. Return only the expanded query text.

        Original query: ${query}
        Expanded query:`;

      const completion = await this.openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{ role: 'user', content: expansionPrompt }],
        max_tokens: 150,
        temperature: 0.3
      });

      const expandedQuery = completion.choices[0].message.content.trim();
      console.log(`Query expanded to: "${expandedQuery}"`);

      return expandedQuery;

    } catch (error) {
      console.error(`Error expanding query "${query}":`, error);
      return query; // Fall back to original query
    }
  }

  async generatePersonalizedRecommendations(userId, searchResults, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    try {
      // Get user profile and interaction history
      const userProfile = await this.collections.userProfiles.findOne({ userId: userId });
      if (!userProfile) {
        console.log('No user profile found, returning general recommendations');
        return this.generateGeneralRecommendations(searchResults, options);
      }

      // Generate personalized recommendations based on user interests
      const recommendationPipeline = [
        {
          $vectorSearch: {
            index: 'document_vector_index',
            path: 'embedding', 
            queryVector: userProfile.interestEmbedding,
            numCandidates: 500,
            limit: 50,
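            // NOTE: fields referenced in this filter (including _id and accessLevel) must be indexed as 'filter' fields in the vector index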
            filter: {
              _id: { $nin: searchResults.map(r => r._id) }, // Exclude current results
              accessLevel: { $in: ['public', 'user'] }
            }
          }
        },
        {
          $addFields: {
            personalizedScore: { $meta: 'vectorSearchScore' },
            recommendationReason: 'Based on your interests and reading history'
          }
        },
        {
          $lookup: {
            from: 'users',
            localField: 'createdBy',
            foreignField: '_id',
            as: 'authorInfo',
            pipeline: [{ $project: { name: 1, expertise: 1 } }]
          }
        },
        {
          $project: {
            _id: 1,
            title: 1,
            category: 1,
            tags: 1,
            createdAt: 1,
            personalizedScore: { $round: ['$personalizedScore', 4] },
            recommendationReason: 1,
            authorName: { $arrayElemAt: ['$authorInfo.name', 0] },
            downloadUrl: { $concat: ['/api/documents/', { $toString: '$_id' }] }
          }
        },
        { $sort: { personalizedScore: -1 } },
        { $limit: options.recommendationLimit || 10 }
      ];

      const recommendations = await this.collections.documents.aggregate(recommendationPipeline).toArray();

      return recommendations;

    } catch (error) {
      console.error(`Error generating personalized recommendations for user ${userId}:`, error);
      return [];
    }
  }

  async enhanceSearchResults(results, query, options = {}) {
    console.log(`Enhancing ${results.length} search results`);

    try {
      // Add result enhancements
      const enhancedResults = await Promise.all(results.map(async (result, index) => {
        // Calculate additional metadata
        const enhancedResult = {
          ...result,

          // Result ranking
          rank: index + 1,

          // Enhanced content preview with query highlighting
          highlightedPreview: this.highlightQueryInText(result.contentPreview || '', query),

          // Semantic category classification
          semanticCategory: await this.classifyContentSemantics(result),

          // Reading time estimation
          estimatedReadingTime: this.estimateReadingTime(result.content || result.contentPreview || ''),

          // Related concepts extraction
          extractedConcepts: this.extractKeyConcepts(result.title + ' ' + (result.contentPreview || '')),

          // Confidence scoring
          confidenceScore: this.calculateConfidenceScore(result),

          // Access recommendations
          accessRecommendation: this.generateAccessRecommendation(result, options)
        };

        return enhancedResult;
      }));

      return enhancedResults;

    } catch (error) {
      console.error('Error enhancing search results:', error);
      return results; // Return original results if enhancement fails
    }
  }

  highlightQueryInText(text, query) {
    if (!text || !query) return text;

    // Simple highlighting - in production, use more sophisticated highlighting
    const queryWords = query.toLowerCase().split(/\s+/);
    let highlightedText = text;

    queryWords.forEach(word => {
      if (word.length > 2) { // Only highlight words longer than 2 characters
        const regex = new RegExp(`\\b${word}\\b`, 'gi');
        highlightedText = highlightedText.replace(regex, `**${word}**`);
      }
    });

    return highlightedText;
  }

  estimateReadingTime(text) {
    const wordsPerMinute = 250; // Average reading speed
    const wordCount = text.split(/\s+/).length;
    const readingTime = Math.ceil(wordCount / wordsPerMinute);

    return {
      minutes: readingTime,
      wordCount: wordCount,
      formattedTime: readingTime === 1 ? '1 minute' : `${readingTime} minutes`
    };
  }

  extractKeyConcepts(text) {
    // Simple concept extraction - in production, use NLP libraries
    const concepts = [];
    const words = text.toLowerCase().split(/\s+/);

    // Technical terms and concepts (simplified approach)
    const technicalTerms = [
      'artificial intelligence', 'machine learning', 'deep learning', 'neural networks',
      'data science', 'analytics', 'algorithm', 'optimization', 'automation',
      'cloud computing', 'blockchain', 'cybersecurity', 'api', 'database'
    ];

    technicalTerms.forEach(term => {
      if (text.toLowerCase().includes(term)) {
        concepts.push(term);
      }
    });

    return concepts.slice(0, 5); // Return top 5 concepts
  }

  calculateConfidenceScore(result) {
    // Multi-factor confidence calculation
    let confidence = result.vectorSimilarityScore * 0.6; // Base similarity

    // Content length factor
    const contentLength = (result.content || result.contentPreview || '').length;
    if (contentLength > 1000) confidence += 0.1;
    if (contentLength > 3000) confidence += 0.1;

    // Metadata completeness factor
    if (result.category) confidence += 0.05;
    if (result.tags && result.tags.length > 0) confidence += 0.05;
    if (result.authorName) confidence += 0.05;

    // Popularity factor
    if (result.metrics.viewCount > 100) confidence += 0.05;
    if (result.metrics.likeCount > 10) confidence += 0.05;

    return Math.min(confidence, 1.0); // Cap at 1.0
  }

  generateAccessRecommendation(result, options) {
    // Generate recommendations for how to use/access the content
    const recommendations = [];

    if (result.vectorSimilarityScore > 0.9) {
      recommendations.push('Highly recommended - very relevant to your search');
    }

    if (result.metrics.viewCount > 1000) {
      recommendations.push('Popular content - frequently viewed by users');
    }

    const readingTime = result.estimatedReadingTime ||
      this.estimateReadingTime(result.content || result.contentPreview || '');
    if (readingTime.minutes <= 5) {
      recommendations.push('Quick read - can be completed in a few minutes');
    }

    if (result.category === 'tutorial') {
      recommendations.push('Step-by-step guidance available');
    }

    return recommendations;
  }

  async logSearchQuery(query, queryEmbedding, results, options) {
    try {
      const searchLog = {
        query: query,
        queryEmbedding: queryEmbedding,
        userId: options.userId || null,
        sessionId: options.sessionId || null,

        // Search configuration
        searchConfig: {
          model: this.aiConfig.embeddingModel,
          similarityThreshold: options.similarityThreshold || this.aiConfig.similarityThreshold,
          limit: options.limit || 20,
          enableHybridSearch: options.enableHybridSearch !== false,
          enablePersonalization: this.aiConfig.enablePersonalization && !!options.userId
        },

        // Results metadata
        resultsMetadata: {
          totalResults: results.length,
          averageSimilarity: results.length > 0 ? 
            results.reduce((sum, r) => sum + r.vectorSimilarityScore, 0) / results.length : 0,
          topCategories: this.extractTopCategories(results),
          searchTime: Date.now() - (options.startTime || Date.now())
        },

        // User context
        userContext: {
          ipAddress: options.ipAddress,
          userAgent: options.userAgent,
          referrer: options.referrer
        },

        timestamp: new Date()
      };

      await this.collections.searchLogs.insertOne(searchLog);

    } catch (error) {
      console.error('Error logging search query:', error);
      // Don't throw - logging shouldn't break search
    }
  }

  extractTopCategories(results) {
    const categoryCount = {};
    results.forEach(result => {
      if (result.category) {
        categoryCount[result.category] = (categoryCount[result.category] || 0) + 1;
      }
    });

    return Object.entries(categoryCount)
      .sort(([,a], [,b]) => b - a)
      .slice(0, 5)
      .map(([category, count]) => ({ category, count }));
  }

  // Additional utility methods for comprehensive vector search functionality
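  // (Helper methods referenced earlier -- e.g. initializeEmbeddingModels, classifyContentSemantics,
  //  assessQueryComplexity, generateRelatedQueries -- are omitted here for brevity.)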

  generateCacheKey(content) {
    const crypto = require('crypto');
    return crypto.createHash('sha256').update(content).digest('hex');
  }

  generateContentHash(content) {
    const crypto = require('crypto');
    return crypto.createHash('md5').update(content).digest('hex');
  }

  calculateVectorNorm(vector) {
    return Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
  }

  assessContentQuality(document) {
    let qualityScore = 0;

    // Length factor
    const contentLength = (document.content || '').length;
    if (contentLength > 1000) qualityScore += 0.3;
    if (contentLength > 5000) qualityScore += 0.2;

    // Metadata completeness
    if (document.title) qualityScore += 0.1;
    if (document.summary) qualityScore += 0.1;
    if (document.tags && document.tags.length > 0) qualityScore += 0.1;
    if (document.category) qualityScore += 0.1;

    // Structure indicators
    if (document.content && document.content.includes('\n\n')) qualityScore += 0.1; // Paragraphs

    return Math.min(qualityScore, 1.0);
  }
}

// Benefits of MongoDB Vector Search for AI Applications:
// - Native vector similarity search with cosine similarity
// - Seamless integration with embedding models (OpenAI, Hugging Face, etc.)
// - High-performance vector indexing and retrieval at scale
// - Advanced filtering and hybrid search capabilities
// - Built-in support for multi-modal content (text, images, audio)
// - Personalization through user profile vector matching
// - Real-time search with low-latency vector operations
// - Comprehensive search analytics and query optimization
// - Integration with MongoDB's document model for rich metadata
// - Production-ready scalability with sharding and replication

module.exports = {
  AdvancedVectorSearchEngine
};
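
A minimal usage sketch for the engine defined above might look like the following. The query text, user ID, and option values are illustrative placeholders, and it assumes the helper methods the class references (for example initializeEmbeddingModels and classifyContentSemantics) have been implemented:

// Hypothetical usage of AdvancedVectorSearchEngine (values are placeholders, error handling trimmed)
async function runExampleSearch() {
  await client.connect();

  const searchEngine = new AdvancedVectorSearchEngine(db, {
    embeddingModel: 'text-embedding-3-large',
    embeddingDimensions: 3072,
    similarityThreshold: 0.7
  });

  const response = await searchEngine.performSemanticSearch(
    'machine learning applications in healthcare diagnostics',
    { limit: 10, category: 'research', userId: 'user-123' }
  );

  console.log(`Found ${response.metadata.totalResults} results in ${response.metadata.searchTime}ms`);
  for (const result of response.results) {
    console.log(`${result.rank}. ${result.title} (${result.relevanceCategory}, score ${result.vectorSimilarityScore})`);
  }

  await client.close();
}

runExampleSearch().catch(console.error);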

Understanding MongoDB Vector Search Architecture

Advanced AI Integration Patterns and Semantic Search Optimization

Implement sophisticated vector search strategies for production AI applications:

// Production-ready MongoDB Vector Search with advanced AI integration and optimization patterns
class ProductionVectorSearchPlatform extends AdvancedVectorSearchEngine {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      multiModelSupport: true,
      realtimeIndexing: true,
      distributedEmbedding: true,
      autoOptimization: true,
      advancedAnalytics: true,
      contentModeration: true
    };

    this.setupProductionOptimizations();
    this.initializeAdvancedFeatures();
    this.setupMonitoringAndAlerts();
  }

  async implementAdvancedSemanticCapabilities() {
    console.log('Implementing advanced semantic capabilities...');

    // Multi-model embedding strategy
    const embeddingStrategy = {
      textEmbeddings: {
        primary: 'text-embedding-3-large',
        fallback: 'text-embedding-ada-002',
        specialized: {
          code: 'code-search-babbage-code-001',
          legal: 'text-similarity-curie-001',
          medical: 'text-search-curie-doc-001'
        }
      },

      multimodalEmbeddings: {
        imageText: 'clip-vit-base-patch32',
        audioText: 'wav2vec2-base-960h', 
        videoText: 'video-text-retrieval'
      },

      domainSpecific: {
        scientific: 'scibert-scivocab-uncased',
        financial: 'finbert-base-uncased',
        biomedical: 'biobert-base-cased'
      }
    };

    return await this.deployEmbeddingStrategy(embeddingStrategy);
  }

  async setupRealtimeSemanticIndexing() {
    console.log('Setting up real-time semantic indexing...');

    const indexingPipeline = {
      // Change stream monitoring for real-time updates
      changeStreams: [
        {
          collection: 'documents',
          pipeline: [
            { $match: { 'operationType': { $in: ['insert', 'update'] } } }
          ],
          handler: this.processDocumentChange.bind(this)
        }
      ],

      // Batch processing for bulk operations
      batchProcessor: {
        batchSize: 100,
        maxWaitTime: 30000, // 30 seconds
        retryLogic: true,
        errorHandling: 'resilient'
      },

      // Quality assurance pipeline
      qualityChecks: [
        'contentValidation',
        'languageDetection', 
        'duplicateDetection',
        'contentModeration'
      ]
    };

    return await this.deployIndexingPipeline(indexingPipeline);
  }

  async implementAdvancedRecommendationEngine() {
    console.log('Implementing advanced recommendation engine...');

    const recommendationStrategies = {
      // Collaborative filtering with vector embeddings
      collaborative: {
        userSimilarity: 'cosine',
        itemSimilarity: 'cosine',
        hybridWeighting: {
          contentBased: 0.6,
          collaborative: 0.4
        }
      },

      // Content-based recommendations
      contentBased: {
        semanticSimilarity: true,
        categoryWeighting: true,
        temporalDecay: true,
        diversityOptimization: true
      },

      // Deep learning recommendations
      deepLearning: {
        neuralCollaborativeFiltering: true,
        sequentialRecommendations: true,
        multiTaskLearning: true
      }
    };

    return await this.deployRecommendationStrategies(recommendationStrategies);
  }

  async optimizeVectorSearchPerformance() {
    console.log('Optimizing vector search performance...');

    const optimizations = {
      // Index optimization strategies
      indexOptimization: {
        approximateNearestNeighbor: true,
        hierarchicalNavigableSmallWorld: true,
        productQuantization: true,
        localitySensitiveHashing: true
      },

      // Query optimization
      queryOptimization: {
        queryExpansion: true,
        queryRewriting: true,
        candidatePrefiltering: true,
        adaptiveSimilarityThresholds: true
      },

      // Caching strategies
      cachingStrategy: {
        embeddingCache: '10GB',
        resultCache: '5GB',
        queryCache: '2GB',
        indexCache: '20GB'
      }
    };

    return await this.implementOptimizations(optimizations);
  }
}
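
The real-time indexing pipeline above is driven by MongoDB change streams. A minimal sketch of that underlying pattern, assuming a replica set or Atlas deployment (change streams require one) and reusing the generateDocumentEmbedding method from the engine above, could look like this:

// Sketch: change-stream watcher that refreshes embeddings when documents are inserted or updated
async function watchDocumentsForReindexing(db, searchEngine) {
  const changeStream = db.collection('documents').watch(
    [{ $match: { operationType: { $in: ['insert', 'update', 'replace'] } } }],
    { fullDocument: 'updateLookup' } // deliver the full document for update events
  );

  for await (const change of changeStream) {
    const doc = change.fullDocument;
    if (!doc) continue;

    try {
      // Regenerate the embedding so the vector index stays in sync with content changes
      await searchEngine.generateDocumentEmbedding(doc, { batchProcessed: false });
    } catch (error) {
      console.error(`Re-embedding failed for document ${doc._id}:`, error);
    }
  }
}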

SQL-Style Vector Search Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Vector Search operations and AI-powered semantic queries:

-- QueryLeaf advanced vector search and AI operations with SQL-familiar syntax

-- Create vector search indexes for different content types and embedding models
CREATE VECTOR INDEX document_semantic_index 
ON documents (
  embedding VECTOR(3072) USING COSINE_SIMILARITY,
  category,
  language,
  contentType,
  accessLevel,
  createdAt
)
WITH (
  model = 'text-embedding-3-large',
  auto_update = true,
  optimization_level = 'performance',

  -- Advanced index configuration
  approximate_nn = true,
  candidate_multiplier = 10,
  ef_construction = 200,
  m_connections = 16
);

CREATE VECTOR INDEX multimodal_content_index
ON documents (
  multimodalEmbedding VECTOR(1536) USING COSINE_SIMILARITY,
  mediaType,
  contentFormat
)
WITH (
  model = 'clip-vit-base-patch32',
  multimodal = true
);

-- Advanced semantic search with vector similarity and hybrid scoring
WITH semantic_search AS (
  SELECT 
    d.*,
    -- Vector similarity search using embeddings
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_EMBEDDING(
        'Find research papers about machine learning applications in healthcare diagnostics',
        'text-embedding-3-large'
      ),
      'COSINE'
    ) as vector_similarity,

    -- Hybrid scoring combining vector and traditional text search
    (
      VECTOR_SEARCH(
        d.embedding,
        GENERATE_EMBEDDING(
          'machine learning healthcare diagnostics medical AI',
          'text-embedding-3-large'
        ),
        'COSINE'
      ) * 0.7 +

      MATCH_SCORE(d.title || ' ' || d.content, 'machine learning healthcare diagnostics') * 0.2 +

      -- Recency boost
      CASE 
        WHEN d.createdAt > CURRENT_DATE - INTERVAL '30 days' THEN 0.1
        WHEN d.createdAt > CURRENT_DATE - INTERVAL '90 days' THEN 0.05
        ELSE 0
      END +

      -- Quality and popularity boost
      (LOG(d.metrics.citationCount + 1) * 0.02) +
      (d.metrics.averageRating / 5.0 * 0.03)

    ) as hybrid_score

  FROM documents d
  WHERE 
    -- Vector similarity threshold
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_EMBEDDING(
        'machine learning healthcare diagnostics',
        'text-embedding-3-large'
      ),
      'COSINE'
    ) > 0.75

    -- Additional filters for precision
    AND d.category IN ('research', 'academic', 'medical')
    AND d.language = 'en'
    AND d.accessLevel IN ('public', 'academic')
    AND d.contentType = 'research_paper'
),

-- Enhanced search with semantic category classification and concept extraction
enriched_results AS (
  SELECT 
    ss.*,

    -- Semantic category classification using AI
    AI_CLASSIFY_CATEGORY(
      ss.title || ' ' || SUBSTRING(ss.content, 1, 1000),
      ['machine_learning', 'healthcare', 'diagnostics', 'medical_imaging', 'clinical_ai']
    ) as semantic_categories,

    -- Key concept extraction
    AI_EXTRACT_CONCEPTS(
      ss.title || ' ' || ss.abstract,
      10 -- top 10 concepts
    ) as key_concepts,

    -- Content summary generation
    AI_SUMMARIZE(
      ss.content,
      max_length => 200,
      style => 'academic'
    ) as ai_summary,

    -- Reading difficulty assessment
    AI_ASSESS_DIFFICULTY(
      ss.content,
      domain => 'medical'
    ) as reading_difficulty,

    -- Related research identification
    FIND_SIMILAR_DOCUMENTS(
      ss.embedding,
      limit => 5,
      exclude_ids => ARRAY[ss.document_id],
      similarity_threshold => 0.8
    ) as related_research,

    -- Citation and reference analysis
    ANALYZE_CITATIONS(ss.content) as citation_analysis,

    -- Author expertise scoring
    u.expertise_score,
    u.h_index,
    u.research_domains,

    -- Impact metrics
    CALCULATE_IMPACT_SCORE(
      ss.metrics.citationCount,
      ss.metrics.downloadCount,
      ss.metrics.viewCount,
      ss.createdAt
    ) as impact_score

  FROM semantic_search ss
  JOIN users u ON ss.createdBy = u.user_id
  WHERE ss.vector_similarity > 0.7
),

-- Personalized recommendations based on user research interests
personalized_recommendations AS (
  SELECT 
    er.*,

    -- User interest alignment scoring
    VECTOR_SIMILARITY(
      er.embedding,
      (SELECT interest_embedding FROM user_profiles WHERE user_id = CURRENT_USER_ID()),
      'COSINE'
    ) as interest_alignment,

    -- Reading history similarity
    CALCULATE_READING_HISTORY_SIMILARITY(
      CURRENT_USER_ID(),
      er.document_id,
      window_days => 180
    ) as reading_history_similarity,

    -- Collaborative filtering score
    COLLABORATIVE_FILTERING_SCORE(
      CURRENT_USER_ID(),
      er.document_id,
      algorithm => 'neural_collaborative_filtering'
    ) as collaborative_score,

    -- Personalized relevance scoring
    (
      er.hybrid_score * 0.5 +
      interest_alignment * 0.3 +
      reading_history_similarity * 0.1 +
      collaborative_score * 0.1
    ) as personalized_relevance

  FROM enriched_results er
  WHERE interest_alignment > 0.6
),

-- Advanced analytics and search insights
search_analytics AS (
  SELECT 
    COUNT(*) as total_results,
    AVG(pr.vector_similarity) as avg_similarity,
    AVG(pr.hybrid_score) as avg_hybrid_score,
    AVG(pr.personalized_relevance) as avg_personalized_relevance,

    -- Category distribution analysis
    JSON_OBJECT_AGG(
      pr.category,
      COUNT(*)
    ) as category_distribution,

    -- Semantic category insights
    FLATTEN_ARRAY(
      ARRAY_AGG(pr.semantic_categories)
    ) as all_semantic_categories,

    -- Concept frequency analysis
    AI_ANALYZE_CONCEPT_TRENDS(
      ARRAY_AGG(pr.key_concepts),
      time_window => '30 days'
    ) as concept_trends,

    -- Research domain coverage
    CALCULATE_DOMAIN_COVERAGE(
      ARRAY_AGG(pr.research_domains)
    ) as domain_coverage,

    -- Quality distribution
    JSON_OBJECT(
      'high_impact', COUNT(*) FILTER (WHERE pr.impact_score > 80),
      'medium_impact', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 50 AND 80),
      'emerging', COUNT(*) FILTER (WHERE pr.impact_score BETWEEN 20 AND 50),
      'new_research', COUNT(*) FILTER (WHERE pr.impact_score < 20)
    ) as quality_distribution

  FROM personalized_recommendations pr
)

-- Final comprehensive search results with analytics and recommendations
SELECT 
  -- Document information
  pr.document_id,
  pr.title,
  pr.ai_summary,
  pr.category,
  pr.semantic_categories,
  pr.key_concepts,
  pr.reading_difficulty,
  pr.createdAt,

  -- Author information
  JSON_OBJECT(
    'name', u.name,
    'expertise_score', pr.expertise_score,
    'h_index', pr.h_index,
    'research_domains', pr.research_domains
  ) as author_info,

  -- Relevance scoring
  ROUND(pr.vector_similarity, 4) as semantic_similarity,
  ROUND(pr.hybrid_score, 4) as hybrid_relevance,
  ROUND(pr.personalized_relevance, 4) as personalized_score,
  ROUND(pr.interest_alignment, 4) as interest_match,

  -- Content characteristics
  pr.reading_difficulty,
  pr.impact_score,
  pr.citation_analysis,

  -- Related content
  pr.related_research,

  -- Access information
  CASE pr.accessLevel
    WHEN 'public' THEN 'Open Access'
    WHEN 'academic' THEN 'Academic Access Required'
    WHEN 'subscription' THEN 'Subscription Required'
    ELSE 'Restricted Access'
  END as access_type,

  -- Download and interaction URLs
  CONCAT('/api/documents/', pr.document_id, '/download') as download_url,
  CONCAT('/api/documents/', pr.document_id, '/cite') as citation_url,
  CONCAT('/api/documents/', pr.document_id, '/related') as related_url,

  -- Recommendation metadata
  JSON_OBJECT(
    'recommendation_reason', CASE 
      WHEN pr.interest_alignment > 0.9 THEN 'Highly aligned with your research interests'
      WHEN pr.collaborative_score > 0.8 THEN 'Recommended by researchers with similar interests'
      WHEN pr.reading_history_similarity > 0.7 THEN 'Similar to your recent reading patterns'
      ELSE 'Semantically relevant to your search'
    END,
    'confidence_level', CASE
      WHEN pr.personalized_relevance > 0.9 THEN 'Very High'
      WHEN pr.personalized_relevance > 0.8 THEN 'High'
      WHEN pr.personalized_relevance > 0.7 THEN 'Medium'
      ELSE 'Low'
    END
  ) as recommendation_metadata,

  -- Search analytics (same for all results)
  (SELECT ROW_TO_JSON(sa.*) FROM search_analytics sa) as search_insights

FROM personalized_recommendations pr
JOIN users u ON pr.createdBy = u.user_id
WHERE pr.personalized_relevance > 0.6
ORDER BY pr.personalized_relevance DESC
LIMIT 20;

-- Advanced vector operations for content discovery and analysis

-- Find conceptually similar documents across different languages
WITH multilingual_search AS (
  SELECT 
    d.document_id,
    d.title,
    d.language,
    d.category,

    -- Cross-language semantic similarity
    VECTOR_SEARCH(
      d.embedding,
      GENERATE_MULTILINGUAL_EMBEDDING(
        'intelligence artificielle apprentissage automatique', -- French query
        source_language => 'fr',
        target_embedding_language => 'en'
      ),
      'COSINE'
    ) as cross_language_similarity

  FROM documents d
  WHERE d.language IN ('en', 'fr', 'de', 'es', 'zh')
    AND VECTOR_SEARCH(
      d.embedding,
      GENERATE_MULTILINGUAL_EMBEDDING(
        'intelligence artificielle apprentissage automatique',
        source_language => 'fr',
        target_embedding_language => 'en'
      ),
      'COSINE'
    ) > 0.8
)
SELECT * FROM multilingual_search
ORDER BY cross_language_similarity DESC;

-- Content recommendation based on user behavior patterns
CREATE VIEW personalized_content_feed AS
WITH user_interaction_embedding AS (
  SELECT 
    ui.user_id,

    -- Generate user interest embedding from interaction history
    AGGREGATE_EMBEDDINGS(
      ARRAY_AGG(d.embedding),
      weights => ARRAY_AGG(
        CASE ui.interaction_type
          WHEN 'download' THEN 1.0
          WHEN 'like' THEN 0.8
          WHEN 'share' THEN 0.9
          WHEN 'view' THEN 0.3
          ELSE 0.1
        END * 
        -- Temporal decay
        GREATEST(0.1, 1.0 - EXTRACT(DAYS FROM CURRENT_DATE - ui.interaction_timestamp) / 365.0)
      ),
      aggregation_method => 'weighted_average'
    ) as interest_embedding

  FROM user_interactions ui
  JOIN documents d ON ui.document_id = d.document_id
  WHERE ui.interaction_timestamp > CURRENT_DATE - INTERVAL '1 year'
  GROUP BY ui.user_id
),
content_recommendations AS (
  SELECT 
    uie.user_id,
    d.document_id,
    d.title,
    d.category,
    d.createdAt,

    -- Interest-based similarity
    VECTOR_SIMILARITY(
      d.embedding,
      uie.interest_embedding,
      'COSINE'
    ) as interest_similarity,

    -- Trending factor
    CALCULATE_TRENDING_SCORE(
      d.document_id,
      time_window => '7 days'
    ) as trending_score,

    -- Novelty factor (encourages discovery)
    CALCULATE_NOVELTY_SCORE(
      uie.user_id,
      d.document_id,
      d.category
    ) as novelty_score,

    -- Combined recommendation score
    (
      VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') * 0.6 +
      CALCULATE_TRENDING_SCORE(d.document_id, time_window => '7 days') * 0.2 +
      CALCULATE_NOVELTY_SCORE(uie.user_id, d.document_id, d.category) * 0.2
    ) as recommendation_score

  FROM user_interaction_embedding uie
  CROSS JOIN documents d
  WHERE NOT EXISTS (
    -- Exclude already interacted content
    SELECT 1 FROM user_interactions ui2 
    WHERE ui2.user_id = uie.user_id 
    AND ui2.document_id = d.document_id
  )
  AND VECTOR_SIMILARITY(d.embedding, uie.interest_embedding, 'COSINE') > 0.7
)
SELECT 
  user_id,
  document_id,
  title,
  category,
  ROUND(interest_similarity, 4) as interest_match,
  ROUND(trending_score, 4) as trending_score,
  ROUND(novelty_score, 4) as discovery_potential,
  ROUND(recommendation_score, 4) as overall_score,

  -- Recommendation explanation
  CASE 
    WHEN interest_similarity > 0.9 THEN 'Perfect match for your interests'
    WHEN trending_score > 0.8 THEN 'Trending content in your area'
    WHEN novelty_score > 0.7 THEN 'New topic for you to explore'
    ELSE 'Related to your reading patterns'
  END as recommendation_reason

FROM content_recommendations
WHERE recommendation_score > 0.75
ORDER BY user_id, recommendation_score DESC;

-- Advanced analytics for content optimization and performance monitoring
WITH vector_search_analytics AS (
  SELECT 
    -- Search performance metrics
    sl.query,
    COUNT(*) as search_frequency,
    AVG(sl.resultsMetadata.totalResults) as avg_results_count,
    AVG(sl.resultsMetadata.averageSimilarity) as avg_similarity_score,
    AVG(sl.resultsMetadata.searchTime) as avg_search_time_ms,

    -- Query characteristics
    AI_ANALYZE_QUERY_INTENT(sl.query) as query_intent,
    AI_EXTRACT_ENTITIES(sl.query) as query_entities,
    LENGTH(sl.query) as query_length,

    -- Result quality metrics
    AVG(
      (SELECT COUNT(*) FROM JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.topCategories))
    ) as category_diversity,

    -- User engagement with results
    COALESCE(
      (
        SELECT AVG(ui.rating) 
        FROM user_interactions ui
        WHERE ui.session_id = sl.sessionId
        AND ui.interaction_timestamp >= sl.timestamp
        AND ui.interaction_timestamp <= sl.timestamp + INTERVAL '1 hour'
      ), 0
    ) as result_satisfaction_score

  FROM search_logs sl
  WHERE sl.timestamp >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY sl.query, AI_ANALYZE_QUERY_INTENT(sl.query), AI_EXTRACT_ENTITIES(sl.query), LENGTH(sl.query)
),
content_performance_analysis AS (
  SELECT 
    d.document_id,
    d.title,
    d.category,
    d.createdAt,

    -- Discoverability metrics
    COUNT(sl.query) as times_found_in_search,
    AVG(sl.resultsMetadata.averageSimilarity) as avg_search_relevance,

    -- Engagement metrics
    COUNT(ui.interaction_id) as total_interactions,
    COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view') as view_count,
    COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download') as download_count,
    AVG(ui.rating) FILTER (WHERE ui.rating IS NOT NULL) as avg_rating,

    -- Content optimization recommendations
    CASE 
      WHEN COUNT(sl.query) < 5 THEN 'Low discoverability - consider SEO optimization'
      WHEN AVG(sl.resultsMetadata.averageSimilarity) < 0.7 THEN 'Low relevance - review content structure'
      WHEN COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'download') / 
           NULLIF(COUNT(ui.interaction_id) FILTER (WHERE ui.interaction_type = 'view'), 0) < 0.1 
        THEN 'Low conversion - improve content value proposition'
      ELSE 'Performance within normal parameters'
    END as optimization_recommendation

  FROM documents d
  LEFT JOIN search_logs sl ON d.document_id = ANY(
    SELECT JSON_ARRAY_ELEMENTS_TEXT(sl.resultsMetadata.resultIds)::UUID
  )
  LEFT JOIN user_interactions ui ON d.document_id = ui.document_id
  WHERE d.createdAt >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY d.document_id, d.title, d.category, d.createdAt
)
SELECT 
  -- Search analytics summary
  vsa.query,
  vsa.search_frequency,
  vsa.query_intent,
  vsa.query_entities,
  ROUND(vsa.avg_similarity_score, 3) as avg_relevance,
  ROUND(vsa.avg_search_time_ms, 1) as avg_response_time_ms,
  ROUND(vsa.result_satisfaction_score, 2) as user_satisfaction,

  -- Content performance insights
  cpa.title as top_performing_content,
  cpa.times_found_in_search,
  cpa.total_interactions,
  cpa.optimization_recommendation,

  -- Improvement recommendations
  CASE 
    WHEN vsa.avg_search_time_ms > 1000 THEN 'Consider index optimization'
    WHEN vsa.avg_similarity_score < 0.7 THEN 'Review embedding model performance'
    WHEN vsa.result_satisfaction_score < 3.0 THEN 'Improve result quality and relevance'
    ELSE 'Search performance is optimal'
  END as search_optimization_recommendation

FROM vector_search_analytics vsa
LEFT JOIN content_performance_analysis cpa ON true
WHERE vsa.search_frequency > 10  -- Focus on frequently searched queries
ORDER BY vsa.search_frequency DESC, vsa.result_satisfaction_score DESC
LIMIT 50;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. Native vector similarity search with advanced embedding models
-- 2. Hybrid scoring combining semantic and traditional text search
-- 3. Personalized recommendations based on user interest embeddings  
-- 4. Multi-language semantic search with cross-language understanding
-- 5. Real-time content recommendations and discovery systems
-- 6. Advanced analytics for search optimization and content performance
-- 7. AI-powered content classification and concept extraction
-- 8. Production-ready vector indexing with performance optimization
-- 9. Comprehensive search logging and user behavior analysis
-- 10. SQL-familiar syntax for complex vector operations and AI workflows

Best Practices for Production Vector Search Implementation

Embedding Strategy and Model Selection

Essential principles for effective MongoDB Vector Search deployment (a brief code sketch follows the list):

  1. Model Selection: Choose appropriate embedding models based on content type, domain, and language requirements
  2. Embedding Quality: Implement comprehensive content preparation and preprocessing for optimal embedding generation
  3. Index Optimization: Configure vector indexes with appropriate similarity metrics and performance parameters
  4. Hybrid Approach: Combine vector similarity with traditional text search for comprehensive relevance scoring
  5. Personalization: Implement user profile embeddings for personalized search and recommendation experiences
  6. Performance Monitoring: Track search performance, result quality, and user satisfaction metrics continuously
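
As a compact sketch of the index and hybrid-approach principles above, the example below defines a vector index with filter fields and runs a filtered $vectorSearch whose similarity score is blended with a small recency boost. The collection name, dimensions, threshold, and weights are assumptions chosen for illustration:

// Sketch: filterable vector index plus a threshold-gated, lightly hybrid $vectorSearch query
async function searchArticles(db, queryEmbedding) {
  const articles = db.collection('articles');

  // Index creation is normally done once, ahead of time; Atlas builds it asynchronously
  await articles.createSearchIndex({
    name: 'articles_vector_index',
    type: 'vectorSearch',
    definition: {
      fields: [
        { type: 'vector', path: 'embedding', numDimensions: 3072, similarity: 'cosine' },
        { type: 'filter', path: 'category' },
        { type: 'filter', path: 'createdAt' }
      ]
    }
  });

  return articles.aggregate([
    {
      $vectorSearch: {
        index: 'articles_vector_index',
        path: 'embedding',
        queryVector: queryEmbedding,
        numCandidates: 500,
        limit: 50,
        filter: { category: { $eq: 'research' } }
      }
    },
    { $addFields: { similarity: { $meta: 'vectorSearchScore' } } },
    { $match: { similarity: { $gte: 0.7 } } },
    {
      $addFields: {
        // Blend vector similarity with a modest recency boost for hybrid-style ranking
        blendedScore: {
          $add: [
            { $multiply: ['$similarity', 0.9] },
            { $cond: [{ $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] }, 0.1, 0] }
          ]
        }
      }
    },
    { $sort: { blendedScore: -1 } },
    { $limit: 20 }
  ]).toArray();
}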

Scalability and Performance Optimization

Optimize vector search deployments for production-scale requirements (see the batching sketch after this list):

  1. Index Strategy: Design efficient vector indexes with appropriate dimensionality and similarity algorithms
  2. Caching Implementation: Implement multi-tier caching for embeddings, queries, and search results
  3. Batch Processing: Optimize embedding generation and indexing through intelligent batch processing
  4. Query Optimization: Implement query expansion, rewriting, and adaptive similarity thresholds
  5. Resource Management: Monitor and optimize computational resources for embedding generation and vector operations
  6. Distribution Strategy: Design sharding and replication strategies for large-scale vector collections
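
The caching and batch-processing guidance above can be approximated with a small helper that reuses cached embeddings and sends only the remaining texts to the embeddings API in batches. The batch size, model name, and in-memory cache are illustrative assumptions:

// Sketch: batched embedding generation with a simple content-hash cache
const crypto = require('crypto');

async function embedInBatches(openai, texts, { model = 'text-embedding-3-large', batchSize = 50, cache = new Map() } = {}) {
  const results = new Array(texts.length);
  const pending = []; // texts that still need an API call

  texts.forEach((text, index) => {
    const key = crypto.createHash('sha256').update(text).digest('hex');
    if (cache.has(key)) {
      results[index] = cache.get(key); // cache hit, no API call needed
    } else {
      pending.push({ index, key, text });
    }
  });

  for (let start = 0; start < pending.length; start += batchSize) {
    const batch = pending.slice(start, start + batchSize);
    const response = await openai.embeddings.create({
      model,
      input: batch.map(item => item.text) // the embeddings endpoint accepts an array of inputs
    });

    response.data.forEach((item, i) => {
      const { index, key } = batch[i];
      results[index] = item.embedding;
      cache.set(key, item.embedding);
    });
  }

  return results;
}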

Conclusion

MongoDB Vector Search provides comprehensive AI-powered semantic search capabilities that enable natural language queries, intelligent content discovery, and sophisticated recommendation systems through high-dimensional vector embeddings and advanced similarity algorithms. The native MongoDB integration ensures that vector search benefits from the same scalability, performance, and operational features as traditional database operations.

Key MongoDB Vector Search benefits include:

  • Semantic Understanding: AI-powered semantic search that understands meaning and context beyond keyword matching
  • Advanced Similarity: Sophisticated vector similarity algorithms with cosine similarity and approximate nearest neighbor search
  • Hybrid Capabilities: Seamless integration of vector similarity with traditional text search and metadata filtering
  • Personalization: User profile embeddings for personalized search results and intelligent recommendations
  • Multi-Modal Support: Vector search across text, images, audio, and multi-modal content with unified similarity operations
  • Production Ready: High-performance vector indexing with automatic optimization and comprehensive analytics

Whether you're building AI-powered search applications, recommendation engines, content discovery platforms, or intelligent document retrieval systems, MongoDB Vector Search with QueryLeaf's familiar SQL interface provides the foundation for sophisticated semantic capabilities.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Vector Search operations while providing SQL-familiar syntax for vector similarity queries, embedding generation, and AI-powered content discovery. Advanced vector search patterns, personalization algorithms, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI capabilities accessible to SQL-oriented development teams.

The combination of MongoDB's robust vector search capabilities with SQL-style AI operations makes it an ideal platform for modern AI applications that require both advanced semantic understanding and familiar database management patterns, ensuring your AI-powered search solutions can scale efficiently while remaining maintainable and feature-rich.