
MongoDB Query Optimization and Explain Plans: Advanced Performance Analysis for High-Performance Database Operations

Database performance optimization is critical for applications that demand fast response times and efficient resource utilization. Poor query performance can lead to degraded user experience, increased infrastructure costs, and system bottlenecks that become increasingly problematic as data volumes and user loads grow.

MongoDB's query optimizer and explain plan system provide detailed insight into query execution strategies, enabling developers and database administrators to identify performance bottlenecks, optimize index usage, and fine-tune queries for maximum efficiency. The explain functionality reports execution statistics, index usage patterns, and even the rejected candidate plans at several verbosity levels, supporting both development and production performance tuning.
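
As a minimal sketch of that explain interface (the 'orders' collection, its fields, and the connection string are illustrative), the same query can be explained at the three verbosity levels directly from the Node.js driver:

// Minimal explain sketch - collection, fields, and connection string are illustrative
const { MongoClient } = require('mongodb');

async function explainAtEachVerbosity() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  const filter = { status: 'completed' };
  const sort = { createdAt: -1 };

  // 'queryPlanner': shows the winning and rejected plans without executing the query
  const planOnly = await orders.find(filter).sort(sort).limit(10).explain('queryPlanner');

  // 'executionStats': executes the query and reports timing plus keys/docs examined
  const stats = await orders.find(filter).sort(sort).limit(10).explain('executionStats');
  console.log(stats.executionStats.executionTimeMillis,
              stats.executionStats.totalKeysExamined,
              stats.executionStats.totalDocsExamined,
              stats.executionStats.nReturned);

  // 'allPlansExecution': additionally reports trial statistics for each candidate plan
  const allPlans = await orders.find(filter).sort(sort).limit(10).explain('allPlansExecution');
  console.log(Object.keys(planOnly), Object.keys(allPlans));

  await client.close();
}

explainAtEachVerbosity().catch(console.error);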

The Traditional Query Analysis Challenge

Conventional database systems often provide limited query analysis capabilities that make performance optimization difficult:

-- Traditional PostgreSQL query analysis with limited optimization insights

-- Basic EXPLAIN output with limited actionable information
EXPLAIN ANALYZE
SELECT 
  u.user_id,
  u.email,
  u.first_name,
  u.last_name,
  u.created_at,
  COUNT(o.order_id) as order_count,
  SUM(o.total_amount) as total_spent,
  AVG(o.total_amount) as avg_order_value,
  MAX(o.created_at) as last_order_date
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active'
  AND u.country IN ('US', 'CA', 'UK')
  AND u.created_at >= '2023-01-01'
  AND (o.status = 'completed' OR o.status IS NULL)
GROUP BY u.user_id, u.email, u.first_name, u.last_name, u.created_at
HAVING COUNT(o.order_id) > 0 OR u.created_at >= '2024-01-01'
ORDER BY total_spent DESC, order_count DESC
LIMIT 100;

-- PostgreSQL EXPLAIN output (simplified representation):
--
-- Limit  (cost=15234.45..15234.70 rows=100 width=64) (actual time=245.123..245.167 rows=100 loops=1)
--   ->  Sort  (cost=15234.45..15489.78 rows=102133 width=64) (actual time=245.121..245.138 rows=100 loops=1)
--         Sort Key: (sum(o.total_amount)) DESC, (count(o.order_id)) DESC  
--         Sort Method: top-N heapsort  Memory: 40kB
--         ->  HashAggregate  (cost=11234.56..12456.89 rows=102133 width=64) (actual time=198.456..223.789 rows=45678 loops=1)
--               Group Key: u.user_id, u.email, u.first_name, u.last_name, u.created_at
--               ->  Hash Left Join  (cost=2345.67..8901.23 rows=345678 width=48) (actual time=12.456..89.123 rows=123456 loops=1)
--                     Hash Cond: (u.user_id = o.user_id)
--                     ->  Bitmap Heap Scan on users u  (cost=234.56..1789.45 rows=12345 width=32) (actual time=3.456..15.789 rows=8901 loops=1)
--                           Recheck Cond: ((status = 'active'::text) AND (country = ANY ('{US,CA,UK}'::text[])) AND (created_at >= '2023-01-01'::date))
--                           Heap Blocks: exact=234
--                           ->  BitmapOr  (cost=234.56..234.56 rows=12345 width=0) (actual time=2.890..2.891 rows=0 loops=1)
--                                 ->  Bitmap Index Scan on idx_users_status  (cost=0.00..78.12 rows=4567 width=0) (actual time=0.890..0.890 rows=3456 loops=1)
--                                       Index Cond: (status = 'active'::text)
--                     ->  Hash  (cost=1890.45..1890.45 rows=17890 width=24) (actual time=8.567..8.567 rows=14567 loops=1)
--                           Buckets: 32768  Batches: 1  Memory Usage: 798kB
--                           ->  Seq Scan on orders o  (cost=0.00..1890.45 rows=17890 width=24) (actual time=0.123..5.456 rows=14567 loops=1)
--                                 Filter: ((status = 'completed'::text) OR (status IS NULL))
--                                 Rows Removed by Filter: 3456
-- Planning Time: 2.456 ms
-- Execution Time: 245.678 ms

-- Problems with traditional PostgreSQL EXPLAIN:
-- 1. Complex output format that's difficult to interpret quickly
-- 2. Limited insights into index selection reasoning and alternatives
-- 3. No built-in recommendations for performance improvements
-- 4. Difficult to compare execution plans across different query variations
-- 5. Buffer and memory details require extra options (e.g. BUFFERS) and are scattered across verbose per-node output
-- 6. No integration with query optimization recommendations or automated tuning
-- 7. Verbose output that makes it hard to identify key performance bottlenecks
-- 8. Limited historical explain plan tracking and performance trend analysis

-- Alternative PostgreSQL analysis approaches
-- Using pg_stat_statements for query analysis (requires extension)
SELECT 
  query,
  calls,
  total_exec_time,   -- total_time on PostgreSQL 12 and earlier
  mean_exec_time,    -- mean_time on PostgreSQL 12 and earlier
  rows,
  100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements 
WHERE query LIKE '%users%orders%'
ORDER BY mean_time DESC
LIMIT 10;

-- Problems with pg_stat_statements:
-- - Requires additional configuration and extensions
-- - Limited detail about specific execution patterns
-- - No real-time optimization recommendations
-- - Difficult correlation between query patterns and index usage
-- - Limited integration with application performance monitoring

-- MySQL approach (even more limited)
EXPLAIN FORMAT=JSON
SELECT u.user_id, u.email, COUNT(o.order_id) as orders
FROM users u 
LEFT JOIN orders o ON u.user_id = o.user_id 
WHERE u.status = 'active'
GROUP BY u.user_id, u.email;

-- MySQL EXPLAIN FORMAT=JSON output (simplified representation):
-- {
--   "query_block": {
--     "select_id": 1,
--     "cost_info": {
--       "query_cost": "1234.56"
--     },
--     "grouping_operation": {
--       "using_filesort": false,
--       "nested_loop": [
--         {
--           "table": {
--             "table_name": "u",
--             "access_type": "range",
--             "possible_keys": ["idx_status"],
--             "key": "idx_status",
--             "used_key_parts": ["status"],
--             "key_length": "767",
--             "rows_examined_per_scan": 1000,
--             "rows_produced_per_join": 1000,
--             "cost_info": {
--               "read_cost": "200.00",
--               "eval_cost": "100.00",
--               "prefix_cost": "300.00",
--               "data_read_per_join": "64K"
--             }
--           }
--         }
--       ]
--     }
--   }
-- }

-- MySQL EXPLAIN problems:
-- - Very basic cost model with limited accuracy
-- - Actual execution statistics require EXPLAIN ANALYZE (MySQL 8.0.18+); plain EXPLAIN shows estimates only
-- - Limited index optimization recommendations  
-- - Basic JSON format that's difficult to analyze programmatically
-- - No integration with performance monitoring or automated optimization
-- - Limited support for complex query patterns and aggregations
-- - Minimal historical performance tracking capabilities

MongoDB provides comprehensive query analysis and optimization tools:

// MongoDB Advanced Query Optimization - comprehensive explain plans and performance analysis
const { MongoClient } = require('mongodb');

// Depending on driver version, call await client.connect() before issuing queries
const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_analytics');

// Advanced query optimization and explain plan analysis system
class MongoQueryOptimizer {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics')
    };

    // Performance analysis configuration
    this.performanceTargets = {
      maxExecutionTimeMs: 100,
      maxDocsExamined: 10000,
      minIndexHitRate: 0.95,
      maxMemoryUsageMB: 32
    };

    this.optimizationStrategies = new Map();
    this.explainCache = new Map();
  }

  async analyzeQueryPerformance(collection, pipeline, options = {}) {
    console.log('Analyzing query performance with comprehensive explain plans...');

    const {
      verbosity = 'executionStats', // 'queryPlanner', 'executionStats', 'allPlansExecution'
      includeRecommendations = true,
      compareAlternatives = true,
      trackMetrics = true
    } = options;

    // Get the collection reference
    const coll = typeof collection === 'string' ? this.collections[collection] : collection;

    // Execute explain with comprehensive analysis
    const explainResult = await this.performComprehensiveExplain(coll, pipeline, verbosity);

    // Analyze explain plan for optimization opportunities
    const analysis = this.analyzeExplainPlan(explainResult);

    // Generate optimization recommendations
    const recommendations = includeRecommendations ? 
      await this.generateOptimizationRecommendations(coll, pipeline, explainResult, analysis) : [];

    // Compare with alternative query strategies
    const alternatives = compareAlternatives ? 
      await this.generateQueryAlternatives(coll, pipeline, explainResult) : [];

    // Track performance metrics for historical analysis
    if (trackMetrics) {
      await this.recordPerformanceMetrics(coll.collectionName, pipeline, explainResult, analysis);
    }

    const performanceReport = {
      query: {
        collection: coll.collectionName,
        pipeline: pipeline,
        timestamp: new Date()
      },

      execution: {
        totalTimeMs: explainResult.executionStats?.executionTimeMillis || 0,
        totalDocsExamined: explainResult.executionStats?.totalDocsExamined || 0,
        totalDocsReturned: explainResult.executionStats?.nReturned || 0,
        executionSuccess: explainResult.executionStats?.executionSuccess || false,
        indexesUsed: this.extractIndexesUsed(explainResult),
        memoryUsage: this.calculateMemoryUsage(explainResult)
      },

      performance: {
        efficiency: this.calculateQueryEfficiency(explainResult),
        indexHitRate: this.calculateIndexHitRate(explainResult),
        selectivity: this.calculateSelectivity(explainResult),
        performanceGrade: this.assignPerformanceGrade(explainResult),
        bottlenecks: analysis.bottlenecks,
        strengths: analysis.strengths
      },

      optimization: {
        recommendations: recommendations,
        alternatives: alternatives,
        estimatedImprovement: this.estimateOptimizationImpact(recommendations),
        prioritizedActions: this.prioritizeOptimizations(recommendations)
      },

      explainDetails: explainResult
    };

    console.log(`Query analysis completed - Performance Grade: ${performanceReport.performance.performanceGrade}`);
    console.log(`Execution Time: ${performanceReport.execution.totalTimeMs}ms`);
    console.log(`Documents Examined: ${performanceReport.execution.totalDocsExamined}`);
    console.log(`Documents Returned: ${performanceReport.execution.totalDocsReturned}`);
    console.log(`Index Hit Rate: ${(performanceReport.performance.indexHitRate * 100).toFixed(1)}%`);

    return performanceReport;
  }

  async performComprehensiveExplain(collection, pipeline, verbosity) {
    console.log(`Executing explain with verbosity: ${verbosity}`);

    try {
      // Handle different query types
      if (Array.isArray(pipeline)) {
        // Aggregation pipeline
        const cursor = collection.aggregate(pipeline);
        return await cursor.explain(verbosity);
      } else if (typeof pipeline === 'object' && pipeline.find) {
        // Find query
        const cursor = collection.find(pipeline.find, pipeline.options || {});
        if (pipeline.sort) cursor.sort(pipeline.sort);
        if (pipeline.limit) cursor.limit(pipeline.limit);
        if (pipeline.skip) cursor.skip(pipeline.skip);

        return await cursor.explain(verbosity);
      } else {
        // Simple find query
        const cursor = collection.find(pipeline);
        return await cursor.explain(verbosity);
      }
    } catch (error) {
      console.error('Explain execution failed:', error);
      return {
        error: error.message,
        executionSuccess: false,
        executionTimeMillis: 0
      };
    }
  }
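
  // Illustrative input shapes accepted by performComprehensiveExplain above
  // (the field names are examples only):
  //   - Aggregation pipeline:  [{ $match: { status: 'active' } }, { $count: 'total' }]
  //   - Structured find query: { find: { status: 'active' }, sort: { createdAt: -1 }, limit: 100 }
  //   - Plain filter document: { status: 'active' }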

  analyzeExplainPlan(explainResult) {
    console.log('Analyzing explain plan for performance insights...');

    const analysis = {
      queryType: this.identifyQueryType(explainResult),
      executionPattern: this.analyzeExecutionPattern(explainResult),
      indexUsage: this.analyzeIndexUsage(explainResult),
      bottlenecks: [],
      strengths: [],
      riskFactors: [],
      optimizationOpportunities: []
    };

    // Identify performance bottlenecks
    analysis.bottlenecks = this.identifyBottlenecks(explainResult);

    // Identify query strengths
    analysis.strengths = this.identifyStrengths(explainResult);

    // Identify risk factors
    analysis.riskFactors = this.identifyRiskFactors(explainResult);

    // Identify optimization opportunities
    analysis.optimizationOpportunities = this.identifyOptimizationOpportunities(explainResult);

    return analysis;
  }

  identifyBottlenecks(explainResult) {
    const bottlenecks = [];
    const stats = explainResult.executionStats;

    if (!stats) return bottlenecks;

    // Collection scan bottleneck
    if (this.hasCollectionScan(explainResult)) {
      bottlenecks.push({
        type: 'COLLECTION_SCAN',
        severity: 'HIGH',
        description: 'Query performs collection scan instead of using index',
        impact: 'High CPU and I/O usage, poor scalability',
        docsExamined: stats.totalDocsExamined
      });
    }

    // Poor index selectivity
    const selectivity = this.calculateSelectivity(explainResult);
    if (selectivity < 0.1) {
      bottlenecks.push({
        type: 'POOR_SELECTIVITY',
        severity: 'MEDIUM',
        description: 'Index selectivity is poor, examining many unnecessary documents',
        impact: 'Increased I/O and processing time',
        selectivity: selectivity,
        docsExamined: stats.totalDocsExamined,
        docsReturned: stats.nReturned
      });
    }

    // High execution time
    if (stats.executionTimeMillis > this.performanceTargets.maxExecutionTimeMs) {
      bottlenecks.push({
        type: 'HIGH_EXECUTION_TIME',
        severity: 'HIGH',
        description: 'Query execution time exceeds performance target',
        impact: 'User experience degradation, resource contention',
        executionTime: stats.executionTimeMillis,
        target: this.performanceTargets.maxExecutionTimeMs
      });
    }

    // Sort without index
    if (this.hasSortWithoutIndex(explainResult)) {
      bottlenecks.push({
        type: 'SORT_WITHOUT_INDEX',
        severity: 'MEDIUM',
        description: 'Sort operation performed in memory without index support',
        impact: 'High memory usage, slower sort performance',
        memoryUsage: this.calculateSortMemoryUsage(explainResult)
      });
    }

    // Large result set without limit
    if (stats.nReturned > 1000 && !this.hasLimit(explainResult)) {
      bottlenecks.push({
        type: 'LARGE_RESULT_SET',
        severity: 'MEDIUM',
        description: 'Query returns large number of documents without limit',
        impact: 'High memory usage, network overhead',
        docsReturned: stats.nReturned
      });
    }

    return bottlenecks;
  }

  identifyStrengths(explainResult) {
    const strengths = [];
    const stats = explainResult.executionStats;

    if (!stats) return strengths;

    // Efficient index usage
    if (this.hasEfficientIndexUsage(explainResult)) {
      strengths.push({
        type: 'EFFICIENT_INDEX_USAGE',
        description: 'Query uses indexes efficiently with good selectivity',
        indexesUsed: this.extractIndexesUsed(explainResult),
        selectivity: this.calculateSelectivity(explainResult)
      });
    }

    // Fast execution time
    if (stats.executionTimeMillis < this.performanceTargets.maxExecutionTimeMs * 0.5) {
      strengths.push({
        type: 'FAST_EXECUTION',
        description: 'Query executes well below performance targets',
        executionTime: stats.executionTimeMillis,
        target: this.performanceTargets.maxExecutionTimeMs
      });
    }

    // Covered query
    if (this.isCoveredQuery(explainResult)) {
      strengths.push({
        type: 'COVERED_QUERY',
        description: 'Query is covered entirely by index, no document retrieval needed',
        indexesUsed: this.extractIndexesUsed(explainResult)
      });
    }

    // Good result set size management
    if (stats.nReturned < 100 || this.hasLimit(explainResult)) {
      strengths.push({
        type: 'APPROPRIATE_RESULT_SIZE',
        description: 'Query returns appropriate number of documents',
        docsReturned: stats.nReturned,
        hasLimit: this.hasLimit(explainResult)
      });
    }

    return strengths;
  }

  async generateOptimizationRecommendations(collection, pipeline, explainResult, analysis) {
    console.log('Generating optimization recommendations...');

    const recommendations = [];

    // Index recommendations based on bottlenecks
    for (const bottleneck of analysis.bottlenecks) {
      switch (bottleneck.type) {
        case 'COLLECTION_SCAN':
          recommendations.push({
            type: 'CREATE_INDEX',
            priority: 'HIGH',
            description: 'Create index to eliminate collection scan',
            action: await this.suggestIndexForQuery(collection, pipeline, explainResult),
            estimatedImprovement: '80-95% reduction in execution time',
            implementation: 'Create compound index on filtered and sorted fields'
          });
          break;

        case 'POOR_SELECTIVITY':
          recommendations.push({
            type: 'IMPROVE_INDEX_SELECTIVITY',
            priority: 'MEDIUM',
            description: 'Improve index selectivity with partial index or compound index',
            action: await this.suggestSelectivityImprovement(collection, pipeline, explainResult),
            estimatedImprovement: '30-60% reduction in documents examined',
            implementation: 'Add partial filter or reorganize compound index field order'
          });
          break;

        case 'SORT_WITHOUT_INDEX':
          recommendations.push({
            type: 'INDEX_FOR_SORT',
            priority: 'MEDIUM',
            description: 'Create or modify index to support sort operation',
            action: await this.suggestSortIndex(collection, pipeline, explainResult),
            estimatedImprovement: '50-80% reduction in memory usage and sort time',
            implementation: 'Include sort fields in compound index following ESR pattern'
          });
          break;

        case 'LARGE_RESULT_SET':
          recommendations.push({
            type: 'LIMIT_RESULT_SET',
            priority: 'LOW',
            description: 'Add pagination or result limiting to reduce memory usage',
            action: 'Add $limit stage or implement pagination',
            estimatedImprovement: 'Reduced memory usage and network overhead',
            implementation: 'Implement cursor-based pagination or reasonable limits'
          });
          break;
      }
    }

    // Query restructuring recommendations
    const structuralRecs = await this.suggestQueryRestructuring(collection, pipeline, explainResult);
    recommendations.push(...structuralRecs);

    // Aggregation pipeline optimization
    if (Array.isArray(pipeline)) {
      const pipelineRecs = await this.suggestPipelineOptimizations(pipeline, explainResult);
      recommendations.push(...pipelineRecs);
    }

    return recommendations;
  }

  async generateQueryAlternatives(collection, pipeline, explainResult) {
    console.log('Generating alternative query strategies...');

    const alternatives = [];

    // Test different index hints
    const indexAlternatives = await this.testIndexAlternatives(collection, pipeline);
    alternatives.push(...indexAlternatives);

    // Test different aggregation pipeline orders
    if (Array.isArray(pipeline)) {
      const pipelineAlternatives = await this.testPipelineAlternatives(collection, pipeline);
      alternatives.push(...pipelineAlternatives);
    }

    // Test query restructuring alternatives
    const structuralAlternatives = await this.testStructuralAlternatives(collection, pipeline);
    alternatives.push(...structuralAlternatives);

    return alternatives;
  }

  async suggestIndexForQuery(collection, pipeline, explainResult) {
    // Analyze query pattern to suggest optimal index
    const queryFields = this.extractQueryFields(pipeline);
    const sortFields = this.extractSortFields(pipeline);

    const indexSuggestion = {
      fields: {},
      options: {}
    };

    // Apply ESR (Equality, Sort, Range) pattern
    const equalityFields = queryFields.equality || [];
    const rangeFields = queryFields.range || [];

    // Add equality fields first
    equalityFields.forEach(field => {
      indexSuggestion.fields[field] = 1;
    });

    // Add sort fields
    if (sortFields) {
      Object.entries(sortFields).forEach(([field, direction]) => {
        indexSuggestion.fields[field] = direction;
      });
    }

    // Add range fields last
    rangeFields.forEach(field => {
      if (!indexSuggestion.fields[field]) {
        indexSuggestion.fields[field] = 1;
      }
    });

    // Suggest partial index if selective filters present
    if (queryFields.selective && queryFields.selective.length > 0) {
      indexSuggestion.options.partialFilterExpression = this.buildPartialFilter(queryFields.selective);
    }

    return {
      indexSpec: indexSuggestion.fields,
      indexOptions: indexSuggestion.options,
      createCommand: `db.${collection.collectionName}.createIndex(${JSON.stringify(indexSuggestion.fields)}, ${JSON.stringify(indexSuggestion.options)})`,
      explanation: this.explainIndexSuggestion(indexSuggestion, queryFields, sortFields)
    };
  }
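
  // Illustrative ESR outcome: for a query matching { status: 'active', total: { $gte: 100 } }
  // and sorting by { createdAt: -1 }, the logic above suggests the compound index
  // { status: 1, createdAt: -1, total: 1 } (Equality fields, then Sort fields, then Range fields).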

  calculateQueryEfficiency(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    const docsExamined = stats.totalDocsExamined || 0;
    const docsReturned = stats.nReturned || 0;

    if (docsExamined === 0) return 1;

    return Math.min(1, docsReturned / docsExamined);
  }

  calculateIndexHitRate(explainResult) {
    if (this.hasCollectionScan(explainResult)) return 0;

    const indexUsage = this.analyzeIndexUsage(explainResult);
    return indexUsage.effectiveness || 0.5;
  }

  calculateSelectivity(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    const docsExamined = stats.totalDocsExamined || 0;
    const docsReturned = stats.nReturned || 0;

    if (docsExamined === 0) return 1;

    return docsReturned / docsExamined;
  }

  assignPerformanceGrade(explainResult) {
    const efficiency = this.calculateQueryEfficiency(explainResult);
    const indexHitRate = this.calculateIndexHitRate(explainResult);
    const stats = explainResult.executionStats;
    const executionTime = stats?.executionTimeMillis || 0;

    let score = 0;

    // Efficiency scoring (40% weight)
    if (efficiency >= 0.9) score += 40;
    else if (efficiency >= 0.7) score += 30;
    else if (efficiency >= 0.5) score += 20;
    else if (efficiency >= 0.2) score += 10;

    // Index usage scoring (35% weight)
    if (indexHitRate >= 0.95) score += 35;
    else if (indexHitRate >= 0.8) score += 25;
    else if (indexHitRate >= 0.5) score += 15;
    else if (indexHitRate >= 0.2) score += 5;

    // Execution time scoring (25% weight)
    if (executionTime <= 50) score += 25;
    else if (executionTime <= 100) score += 20;
    else if (executionTime <= 250) score += 15;
    else if (executionTime <= 500) score += 10;
    else if (executionTime <= 1000) score += 5;

    // Convert to letter grade
    if (score >= 85) return 'A';
    else if (score >= 75) return 'B';
    else if (score >= 65) return 'C';
    else if (score >= 50) return 'D';
    else return 'F';
  }
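
  // Worked example of the grading above: efficiency 0.95 (40 pts) + index hit rate 0.90 (25 pts)
  // + 40ms execution time (25 pts) = 90 points, which maps to grade 'A'.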

  // Helper methods for detailed analysis

  hasCollectionScan(explainResult) {
    return this.findStageInPlan(explainResult, 'COLLSCAN') !== null;
  }

  hasSortWithoutIndex(explainResult) {
    // A blocking SORT stage in the winning plan means the sort runs in memory;
    // when an index satisfies the sort order, no SORT stage appears at all.
    return this.findStageInPlan(explainResult, 'SORT') !== null;
  }

  hasLimit(explainResult) {
    return this.findStageInPlan(explainResult, 'LIMIT') !== null;
  }

  isCoveredQuery(explainResult) {
    // Check if query is covered by examining projection and index keys
    const projectionStage = this.findStageInPlan(explainResult, 'PROJECTION_COVERED');
    return projectionStage !== null;
  }

  hasEfficientIndexUsage(explainResult) {
    const selectivity = this.calculateSelectivity(explainResult);
    const indexHitRate = this.calculateIndexHitRate(explainResult);
    return selectivity > 0.1 && indexHitRate > 0.8;
  }

  findStageInPlan(explainResult, stageName) {
    // Recursively search through execution plan for specific stage
    const searchStage = (stage) => {
      if (!stage) return null;

      if (stage.stage === stageName) return stage;

      if (stage.inputStage) {
        const result = searchStage(stage.inputStage);
        if (result) return result;
      }

      if (stage.inputStages) {
        for (const inputStage of stage.inputStages) {
          const result = searchStage(inputStage);
          if (result) return result;
        }
      }

      return null;
    };

    const executionStats = explainResult.executionStats;
    if (executionStats?.executionStages) {
      return searchStage(executionStats.executionStages);
    }

    return null;
  }

  extractIndexesUsed(explainResult) {
    const indexes = new Set();

    const findIndexes = (stage) => {
      if (!stage) return;

      if (stage.indexName) {
        indexes.add(stage.indexName);
      }

      if (stage.inputStage) {
        findIndexes(stage.inputStage);
      }

      if (stage.inputStages) {
        stage.inputStages.forEach(inputStage => findIndexes(inputStage));
      }
    };

    const executionStats = explainResult.executionStats;
    if (executionStats?.executionStages) {
      findIndexes(executionStats.executionStages);
    }

    return Array.from(indexes);
  }

  extractQueryFields(pipeline) {
    // Extract fields used in query conditions
    const fields = {
      equality: [],
      range: [],
      selective: []
    };

    if (Array.isArray(pipeline)) {
      // Aggregation pipeline
      pipeline.forEach(stage => {
        if (stage.$match) {
          this.extractFieldsFromMatch(stage.$match, fields);
        }
      });
    } else if (typeof pipeline === 'object') {
      // Find query
      if (pipeline.find) {
        this.extractFieldsFromMatch(pipeline.find, fields);
      } else {
        this.extractFieldsFromMatch(pipeline, fields);
      }
    }

    return fields;
  }

  extractFieldsFromMatch(matchStage, fields) {
    Object.entries(matchStage).forEach(([field, condition]) => {
      if (field.startsWith('$')) return; // Skip operators

      if (typeof condition === 'object' && condition !== null) {
        const operators = Object.keys(condition);
        if (operators.some(op => ['$gt', '$gte', '$lt', '$lte'].includes(op))) {
          fields.range.push(field);
        } else if (operators.includes('$in')) {
          if (condition.$in.length <= 5) {
            fields.selective.push(field);
          } else {
            fields.equality.push(field);
          }
        } else {
          fields.equality.push(field);
        }
      } else {
        fields.equality.push(field);
      }
    });
  }

  extractSortFields(pipeline) {
    if (Array.isArray(pipeline)) {
      for (const stage of pipeline) {
        if (stage.$sort) {
          return stage.$sort;
        }
      }
    } else if (pipeline.sort) {
      return pipeline.sort;
    }

    return null;
  }

  async recordPerformanceMetrics(collectionName, pipeline, explainResult, analysis) {
    try {
      const metrics = {
        timestamp: new Date(),
        collection: collectionName,
        queryHash: this.generateQueryHash(pipeline),
        pipeline: pipeline,

        execution: {
          timeMs: explainResult.executionStats?.executionTimeMillis || 0,
          docsExamined: explainResult.executionStats?.totalDocsExamined || 0,
          docsReturned: explainResult.executionStats?.nReturned || 0,
          indexesUsed: this.extractIndexesUsed(explainResult),
          success: explainResult.executionStats?.executionSuccess !== false
        },

        performance: {
          efficiency: this.calculateQueryEfficiency(explainResult),
          indexHitRate: this.calculateIndexHitRate(explainResult),
          selectivity: this.calculateSelectivity(explainResult),
          grade: this.assignPerformanceGrade(explainResult)
        },

        analysis: {
          bottleneckCount: analysis.bottlenecks.length,
          strengthCount: analysis.strengths.length,
          queryType: analysis.queryType,
          riskLevel: this.calculateRiskLevel(analysis.riskFactors)
        }
      };

      await this.collections.analytics.insertOne(metrics);
    } catch (error) {
      console.warn('Failed to record performance metrics:', error.message);
    }
  }

  generateQueryHash(pipeline) {
    // Generate a consistent hash for query pattern identification.
    // Note: JSON.stringify is key-order sensitive, so queries should be built
    // with a stable key order for the hash to group identical patterns.
    const queryString = JSON.stringify(pipeline);
    return require('crypto').createHash('md5').update(queryString).digest('hex');
  }

  calculateMemoryUsage(explainResult) {
    // Estimate memory usage from explain plan
    let memoryUsage = 0;

    const sortStage = this.findStageInPlan(explainResult, 'SORT');
    if (sortStage) {
      // Estimate sort memory usage
      memoryUsage += (explainResult.executionStats?.totalDocsExamined || 0) * 0.001; // Rough estimate
    }

    return memoryUsage;
  }

  calculateSortMemoryUsage(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    // Estimate memory usage for in-memory sort
    const avgDocSize = 1024; // Estimated average document size in bytes
    const docsToSort = stats.totalDocsExamined || 0;

    return (docsToSort * avgDocSize) / (1024 * 1024); // Convert to MB
  }
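
  // Worked example: sorting 10,000 examined documents at the assumed 1 KB average size
  // is estimated at 10,000 * 1024 / (1024 * 1024), roughly 9.8 MB of sort memory.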

  async performBatchQueryAnalysis(queries) {
    console.log(`Analyzing batch of ${queries.length} queries...`);

    const results = [];
    const batchMetrics = {
      totalQueries: queries.length,
      analyzedSuccessfully: 0,
      averageExecutionTime: 0,
      averageEfficiency: 0,
      gradeDistribution: { A: 0, B: 0, C: 0, D: 0, F: 0 },
      commonBottlenecks: new Map(),
      recommendationFrequency: new Map()
    };

    for (let i = 0; i < queries.length; i++) {
      const query = queries[i];
      console.log(`Analyzing query ${i + 1}/${queries.length}: ${query.name || 'Unnamed'}`);

      try {
        const analysis = await this.analyzeQueryPerformance(query.collection, query.pipeline, query.options);
        results.push({
          queryIndex: i,
          queryName: query.name || `Query_${i + 1}`,
          analysis: analysis,
          success: true
        });

        // Update batch metrics
        batchMetrics.analyzedSuccessfully++;
        batchMetrics.averageExecutionTime += analysis.execution.totalTimeMs;
        batchMetrics.averageEfficiency += analysis.performance.efficiency;
        batchMetrics.gradeDistribution[analysis.performance.performanceGrade]++;

        // Track common bottlenecks
        analysis.performance.bottlenecks.forEach(bottleneck => {
          const count = batchMetrics.commonBottlenecks.get(bottleneck.type) || 0;
          batchMetrics.commonBottlenecks.set(bottleneck.type, count + 1);
        });

        // Track recommendation frequency
        analysis.optimization.recommendations.forEach(rec => {
          const count = batchMetrics.recommendationFrequency.get(rec.type) || 0;
          batchMetrics.recommendationFrequency.set(rec.type, count + 1);
        });

      } catch (error) {
        console.error(`Query ${i + 1} analysis failed:`, error.message);
        results.push({
          queryIndex: i,
          queryName: query.name || `Query_${i + 1}`,
          error: error.message,
          success: false
        });
      }
    }

    // Calculate final batch metrics
    if (batchMetrics.analyzedSuccessfully > 0) {
      batchMetrics.averageExecutionTime /= batchMetrics.analyzedSuccessfully;
      batchMetrics.averageEfficiency /= batchMetrics.analyzedSuccessfully;
    }

    // Convert Maps to Objects for JSON serialization
    batchMetrics.commonBottlenecks = Object.fromEntries(batchMetrics.commonBottlenecks);
    batchMetrics.recommendationFrequency = Object.fromEntries(batchMetrics.recommendationFrequency);

    console.log(`Batch analysis completed: ${batchMetrics.analyzedSuccessfully}/${batchMetrics.totalQueries} queries analyzed successfully`);
    console.log(`Average execution time: ${batchMetrics.averageExecutionTime.toFixed(2)}ms`);
    console.log(`Average efficiency: ${(batchMetrics.averageEfficiency * 100).toFixed(1)}%`);

    return {
      results: results,
      batchMetrics: batchMetrics,
      summary: {
        totalAnalyzed: batchMetrics.analyzedSuccessfully,
        averagePerformance: batchMetrics.averageEfficiency,
        mostCommonBottleneck: this.getMostCommon(batchMetrics.commonBottlenecks),
        mostCommonRecommendation: this.getMostCommon(batchMetrics.recommendationFrequency),
        performanceDistribution: batchMetrics.gradeDistribution
      }
    };
  }

  getMostCommon(frequency) {
    let maxCount = 0;
    let mostCommon = null;

    Object.entries(frequency).forEach(([key, count]) => {
      if (count > maxCount) {
        maxCount = count;
        mostCommon = key;
      }
    });

    return { type: mostCommon, count: maxCount };
  }

  // Additional helper methods for comprehensive analysis...

  identifyQueryType(explainResult) {
    if (this.findStageInPlan(explainResult, 'GROUP')) return 'aggregation';
    if (this.findStageInPlan(explainResult, 'SORT')) return 'sorted_query';
    if (this.hasLimit(explainResult)) return 'limited_query';
    return 'simple_query';
  }

  analyzeExecutionPattern(explainResult) {
    const pattern = {
      hasIndexScan: this.findStageInPlan(explainResult, 'IXSCAN') !== null,
      hasCollectionScan: this.hasCollectionScan(explainResult),
      hasSort: this.findStageInPlan(explainResult, 'SORT') !== null,
      hasGroup: this.findStageInPlan(explainResult, 'GROUP') !== null,
      hasLimit: this.hasLimit(explainResult)
    };

    return pattern;
  }

  analyzeIndexUsage(explainResult) {
    const indexesUsed = this.extractIndexesUsed(explainResult);
    const hasCollScan = this.hasCollectionScan(explainResult);

    return {
      indexCount: indexesUsed.length,
      indexes: indexesUsed,
      hasCollectionScan: hasCollScan,
      effectiveness: hasCollScan ? 0 : Math.min(1, this.calculateSelectivity(explainResult))
    };
  }

  identifyRiskFactors(explainResult) {
    const risks = [];
    const stats = explainResult.executionStats;

    if (stats?.totalDocsExamined > 100000) {
      risks.push({
        type: 'HIGH_DOCUMENT_EXAMINATION',
        description: 'Query examines very large number of documents',
        impact: 'Scalability concerns, resource intensive'
      });
    }

    if (this.hasCollectionScan(explainResult)) {
      risks.push({
        type: 'COLLECTION_SCAN_SCALING',
        description: 'Collection scan will degrade with data growth',
        impact: 'Linear performance degradation as data grows'
      });
    }

    return risks;
  }

  identifyOptimizationOpportunities(explainResult) {
    const opportunities = [];

    if (this.hasCollectionScan(explainResult)) {
      opportunities.push({
        type: 'INDEX_CREATION',
        description: 'Create appropriate indexes to eliminate collection scans',
        impact: 'Significant performance improvement'
      });
    }

    if (this.hasSortWithoutIndex(explainResult)) {
      opportunities.push({
        type: 'SORT_OPTIMIZATION',
        description: 'Optimize index to support sort operations',
        impact: 'Reduced memory usage and faster sorting'
      });
    }

    return opportunities;
  }

  calculateRiskLevel(riskFactors) {
    if (riskFactors.length === 0) return 'LOW';
    if (riskFactors.some(r => r.type.includes('HIGH') || r.type.includes('CRITICAL'))) return 'HIGH';
    if (riskFactors.length > 2) return 'MEDIUM';
    return 'LOW';
  }
}

// Benefits of MongoDB Query Optimization and Explain Plans:
// - Comprehensive execution plan analysis with detailed performance metrics
// - Automatic bottleneck identification and optimization recommendations
// - Advanced index usage analysis and index suggestion algorithms
// - Real-time query performance monitoring and historical trending
// - Intelligent query alternative generation and comparative analysis
// - Integration with aggregation pipeline optimization techniques
// - Detailed memory usage analysis and resource consumption tracking
// - Batch query analysis capabilities for application-wide performance review
// - Automated performance grading and risk assessment
// - Production-ready performance monitoring and alerting integration

module.exports = {
  MongoQueryOptimizer
};
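
As a brief usage sketch, the class above could be driven as follows; the connection string, database, collection, and query shape are illustrative, the require path is hypothetical, and the sketch assumes the helper methods referenced but not shown above (for example suggestQueryRestructuring, testIndexAlternatives, and estimateOptimizationImpact) are implemented:

// Illustrative usage of MongoQueryOptimizer - names, paths, and the query are examples only
const { MongoClient } = require('mongodb');
const { MongoQueryOptimizer } = require('./mongo-query-optimizer'); // hypothetical module path

async function runAnalysis() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const optimizer = new MongoQueryOptimizer(client.db('ecommerce_analytics'));

  // Analyze a structured find query against the orders collection
  const report = await optimizer.analyzeQueryPerformance('orders', {
    find: { status: 'completed', createdAt: { $gte: new Date('2024-01-01') } },
    sort: { totalAmount: -1 },
    limit: 100
  }, { verbosity: 'executionStats' });

  console.log(report.performance.performanceGrade, report.execution.totalTimeMs);
  console.log(report.optimization.recommendations);

  await client.close();
}

runAnalysis().catch(console.error);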

Understanding MongoDB Query Optimization Architecture

Advanced Query Analysis Techniques and Performance Tuning

Implement sophisticated query analysis patterns for production optimization:

// Advanced query optimization patterns and performance monitoring
class AdvancedQueryAnalyzer {
  constructor(db) {
    this.db = db;
    this.performanceHistory = new Map();
    this.optimizationRules = new Map();
    this.alertThresholds = {
      executionTimeMs: 1000,
      docsExaminedRatio: 10,
      indexHitRate: 0.8
    };
  }

  async implementRealTimePerformanceMonitoring(collections) {
    console.log('Setting up real-time query performance monitoring...');

    // Enable database profiling for detailed query analysis.
    // Level 2 profiles all operations; slowms and sampleRate primarily govern
    // level 1 profiling and slow-query logging, so level 1 is usually the safer
    // choice on busy production systems.
    await this.db.runCommand({
      profile: 2,
      slowms: 100,
      sampleRate: 0.1
    });

    // Create performance monitoring aggregation pipeline
    const monitoringPipeline = [
      {
        $match: {
          ts: { $gte: new Date(Date.now() - 60000) }, // Last minute
          ns: { $in: collections.map(col => `${this.db.databaseName}.${col}`) },
          command: { $exists: true }
        }
      },
      {
        $addFields: {
          queryType: {
            $switch: {
              branches: [
                { case: { $ne: ['$command.find', null] }, then: 'find' },
                { case: { $ne: ['$command.aggregate', null] }, then: 'aggregate' },
                { case: { $ne: ['$command.update', null] }, then: 'update' },
                { case: { $ne: ['$command.delete', null] }, then: 'delete' }
              ],
              default: 'other'
            }
          },

          // Extract query shape for pattern analysis
          queryShape: {
            $switch: {
              branches: [
                {
                  case: { $ne: ['$command.find', null] },
                  then: { $objectToArray: { $ifNull: ['$command.filter', {}] } }
                },
                {
                  case: { $ne: ['$command.aggregate', null] },
                  then: { $arrayElemAt: ['$command.pipeline', 0] }
                }
              ],
              default: {}
            }
          },

          // Performance metrics calculation
          efficiency: {
            $cond: {
              if: { $gt: ['$docsExamined', 0] },
              then: { $divide: ['$nreturned', '$docsExamined'] },
              else: 1
            }
          },

          // Index usage assessment
          indexUsed: {
            $cond: {
              if: { $ne: ['$planSummary', null] },
              then: { $not: { $regexMatch: { input: '$planSummary', regex: 'COLLSCAN' } } },
              else: false
            }
          }
        }
      },
      {
        $group: {
          _id: {
            collection: { $arrayElemAt: [{ $split: ['$ns', '.'] }, 1] },
            queryType: '$queryType',
            queryShape: '$queryShape'
          },

          // Aggregated performance metrics
          avgExecutionTime: { $avg: '$millis' },
          maxExecutionTime: { $max: '$millis' },
          totalQueries: { $sum: 1 },
          avgEfficiency: { $avg: '$efficiency' },
          avgDocsExamined: { $avg: '$docsExamined' },
          avgDocsReturned: { $avg: '$nreturned' },
          indexUsageRate: { $avg: { $cond: ['$indexUsed', 1, 0] } },

          // Query examples for further analysis
          sampleQueries: { $push: { command: '$command', millis: '$millis' } }
        }
      },
      {
        $match: {
          $or: [
            { avgExecutionTime: { $gt: this.alertThresholds.executionTimeMs } },
            { avgEfficiency: { $lt: 0.1 } },
            { indexUsageRate: { $lt: this.alertThresholds.indexHitRate } }
          ]
        }
      },
      {
        $sort: { avgExecutionTime: -1 }
      }
    ];

    try {
      const performanceIssues = await this.db.collection('system.profile')
        .aggregate(monitoringPipeline).toArray();

      // Process identified performance issues
      for (const issue of performanceIssues) {
        await this.processPerformanceIssue(issue);
      }

      console.log(`Performance monitoring identified ${performanceIssues.length} potential issues`);
      return performanceIssues;

    } catch (error) {
      console.error('Performance monitoring failed:', error);
      return [];
    }
  }

  async processPerformanceIssue(issue) {
    const issueSignature = this.generateIssueSignature(issue);

    // Check if this issue has been seen before
    if (this.performanceHistory.has(issueSignature)) {
      const history = this.performanceHistory.get(issueSignature);
      history.occurrences++;
      history.lastSeen = new Date();

      // Escalate if recurring issue
      if (history.occurrences > 5) {
        await this.escalatePerformanceIssue(issue, history);
      }
    } else {
      // New issue, add to tracking
      this.performanceHistory.set(issueSignature, {
        firstSeen: new Date(),
        lastSeen: new Date(),
        occurrences: 1,
        issue: issue
      });
    }

    // Generate optimization recommendations
    const recommendations = await this.generateRealtimeRecommendations(issue);

    // Log performance alert
    await this.logPerformanceAlert({
      timestamp: new Date(),
      collection: issue._id.collection,
      queryType: issue._id.queryType,
      severity: this.calculateSeverity(issue),
      metrics: {
        avgExecutionTime: issue.avgExecutionTime,
        avgEfficiency: issue.avgEfficiency,
        indexUsageRate: issue.indexUsageRate,
        totalQueries: issue.totalQueries
      },
      recommendations: recommendations,
      issueSignature: issueSignature
    });
  }

  async generateRealtimeRecommendations(issue) {
    const recommendations = [];

    // Low index usage rate
    if (issue.indexUsageRate < this.alertThresholds.indexHitRate) {
      recommendations.push({
        type: 'INDEX_OPTIMIZATION',
        priority: 'HIGH',
        description: `Collection ${issue._id.collection} has low index usage rate (${(issue.indexUsageRate * 100).toFixed(1)}%)`,
        action: 'Analyze query patterns and create appropriate indexes',
        queryType: issue._id.queryType
      });
    }

    // High execution time
    if (issue.avgExecutionTime > this.alertThresholds.executionTimeMs) {
      recommendations.push({
        type: 'PERFORMANCE_OPTIMIZATION',
        priority: 'HIGH',
        description: `Queries on ${issue._id.collection} averaging ${issue.avgExecutionTime.toFixed(2)}ms execution time`,
        action: 'Review query structure and index strategy',
        queryType: issue._id.queryType
      });
    }

    // Poor efficiency
    if (issue.avgEfficiency < 0.1) {
      recommendations.push({
        type: 'SELECTIVITY_IMPROVEMENT',
        priority: 'MEDIUM',
        description: `Poor query selectivity detected (${(issue.avgEfficiency * 100).toFixed(1)}% efficiency)`,
        action: 'Implement more selective query filters or partial indexes',
        queryType: issue._id.queryType
      });
    }

    return recommendations;
  }

  async performHistoricalPerformanceAnalysis(timeRange = '7d') {
    console.log(`Performing historical performance analysis for ${timeRange}...`);

    const timeRangeMs = this.parseTimeRange(timeRange);
    const startDate = new Date(Date.now() - timeRangeMs);

    const historicalAnalysis = await this.db.collection('system.profile').aggregate([
      {
        $match: {
          ts: { $gte: startDate },
          command: { $exists: true },
          millis: { $exists: true }
        }
      },
      {
        $addFields: {
          hour: { $dateToString: { format: '%Y-%m-%d-%H', date: '$ts' } },
          collection: { $arrayElemAt: [{ $split: ['$ns', '.'] }, 1] },
          queryType: {
            $switch: {
              branches: [
                { case: { $ne: ['$command.find', null] }, then: 'find' },
                { case: { $ne: ['$command.aggregate', null] }, then: 'aggregate' },
                { case: { $ne: ['$command.update', null] }, then: 'update' }
              ],
              default: 'other'
            }
          }
        }
      },
      {
        $group: {
          _id: {
            hour: '$hour',
            collection: '$collection',
            queryType: '$queryType'
          },

          // Time-based metrics
          queryCount: { $sum: 1 },
          avgLatency: { $avg: '$millis' },
          maxLatency: { $max: '$millis' },
          // $percentile requires MongoDB 7.0 or newer
          p95Latency: {
            $percentile: { 
              input: '$millis', 
              p: [0.95], 
              method: 'approximate' 
            }
          },

          // Efficiency metrics
          totalDocsExamined: { $sum: '$docsExamined' },
          totalDocsReturned: { $sum: '$nreturned' },
          avgEfficiency: {
            $avg: {
              $cond: {
                if: { $gt: ['$docsExamined', 0] },
                then: { $divide: ['$nreturned', '$docsExamined'] },
                else: 1
              }
            }
          },

          // Index usage tracking
          collectionScans: {
            $sum: {
              $cond: [
                { $regexMatch: { input: { $ifNull: ['$planSummary', ''] }, regex: 'COLLSCAN' } },
                1,
                0
              ]
            }
          }
        }
      },
      {
        $addFields: {
          indexUsageRate: {
            $subtract: [1, { $divide: ['$collectionScans', '$queryCount'] }]
          },

          // Performance trend calculation
          performanceScore: {
            $add: [
              { $multiply: [{ $min: [1, { $divide: [1000, '$avgLatency'] }] }, 0.4] },
              { $multiply: ['$avgEfficiency', 0.3] },
              { $multiply: ['$indexUsageRate', 0.3] }
            ]
          }
        }
      },
      {
        $sort: { '_id.hour': 1, performanceScore: 1 }
      }
    ]).toArray();

    // Analyze trends and patterns
    const trendAnalysis = this.analyzePerformanceTrends(historicalAnalysis);
    const recommendations = this.generateHistoricalRecommendations(trendAnalysis);

    return {
      timeRange: timeRange,
      analysis: historicalAnalysis,
      trends: trendAnalysis,
      recommendations: recommendations,
      summary: {
        totalHours: new Set(historicalAnalysis.map(h => h._id.hour)).size,
        collectionsAnalyzed: new Set(historicalAnalysis.map(h => h._id.collection)).size,
        avgPerformanceScore: historicalAnalysis.reduce((sum, h) => sum + h.performanceScore, 0) / historicalAnalysis.length,
        worstPerformingHour: [...historicalAnalysis].sort((a, b) => a.performanceScore - b.performanceScore)[0],
        bestPerformingHour: [...historicalAnalysis].sort((a, b) => b.performanceScore - a.performanceScore)[0]
      }
    };
  }

  analyzePerformanceTrends(historicalData) {
    const trends = {
      latencyTrend: this.calculateTrend(historicalData, 'avgLatency'),
      throughputTrend: this.calculateTrend(historicalData, 'queryCount'),
      efficiencyTrend: this.calculateTrend(historicalData, 'avgEfficiency'),
      indexUsageTrend: this.calculateTrend(historicalData, 'indexUsageRate'),

      // Peak usage analysis
      peakHours: this.identifyPeakHours(historicalData),

      // Performance degradation detection
      degradationPeriods: this.identifyDegradationPeriods(historicalData),

      // Collection-specific trends
      collectionTrends: this.analyzeCollectionTrends(historicalData)
    };

    return trends;
  }

  calculateTrend(data, metric) {
    if (data.length < 2) return { direction: 'stable', magnitude: 0 };

    const values = data.map(d => d[metric]).filter(v => v != null);
    const n = values.length;

    if (n < 2) return { direction: 'stable', magnitude: 0 };

    // Simple linear regression for trend calculation
    const xSum = (n * (n + 1)) / 2;
    const ySum = values.reduce((sum, val) => sum + val, 0);
    const xySum = values.reduce((sum, val, i) => sum + val * (i + 1), 0);
    const x2Sum = (n * (n + 1) * (2 * n + 1)) / 6;

    const slope = (n * xySum - xSum * ySum) / (n * x2Sum - xSum * xSum);
    const magnitude = Math.abs(slope);

    // Treat slopes smaller than 1% of the mean value per interval as noise.
    // Whether a rising trend is good or bad depends on the metric: rising latency
    // is a degradation, while a rising index usage rate is an improvement.
    const mean = ySum / n;
    const threshold = Math.abs(mean) * 0.01;

    let direction = 'stable';
    if (slope > threshold) direction = 'increasing';
    else if (slope < -threshold) direction = 'decreasing';

    return { direction, magnitude, slope };
  }

  async implementAutomatedOptimization(collectionName, optimizationRules) {
    console.log(`Implementing automated optimization for ${collectionName}...`);

    const collection = this.db.collection(collectionName);
    const optimizationResults = [];

    for (const rule of optimizationRules) {
      try {
        switch (rule.type) {
          case 'AUTO_INDEX_CREATION':
            const indexResult = await this.createOptimizedIndex(collection, rule);
            optimizationResults.push(indexResult);
            break;

          case 'QUERY_REWRITE':
            const rewriteResult = await this.implementQueryRewrite(collection, rule);
            optimizationResults.push(rewriteResult);
            break;

          case 'AGGREGATION_OPTIMIZATION':
            const aggResult = await this.optimizeAggregationPipeline(collection, rule);
            optimizationResults.push(aggResult);
            break;

          default:
            console.warn(`Unknown optimization rule type: ${rule.type}`);
        }
      } catch (error) {
        console.error(`Optimization rule ${rule.type} failed:`, error);
        optimizationResults.push({
          rule: rule.type,
          success: false,
          error: error.message
        });
      }
    }

    // Validate optimization effectiveness
    const validationResults = await this.validateOptimizations(collection, optimizationResults);

    return {
      collection: collectionName,
      optimizationsApplied: optimizationResults,
      validation: validationResults,
      summary: {
        totalRules: optimizationRules.length,
        successful: optimizationResults.filter(r => r.success).length,
        failed: optimizationResults.filter(r => !r.success).length
      }
    };
  }

  async createOptimizedIndex(collection, rule) {
    console.log(`Creating optimized index: ${rule.indexName}`);

    try {
      const indexSpec = rule.indexSpec;
      const indexOptions = rule.indexOptions || {};

      // Note: the background option is ignored on MongoDB 4.2+, where all index
      // builds already use an optimized build process; it is kept for older deployments
      indexOptions.background = true;

      await collection.createIndex(indexSpec, {
        name: rule.indexName,
        ...indexOptions
      });

      // Test index effectiveness
      const testResult = await this.testIndexEffectiveness(collection, rule);

      return {
        rule: 'AUTO_INDEX_CREATION',
        indexName: rule.indexName,
        indexSpec: indexSpec,
        success: true,
        effectiveness: testResult,
        message: `Index ${rule.indexName} created successfully`
      };

    } catch (error) {
      return {
        rule: 'AUTO_INDEX_CREATION',
        indexName: rule.indexName,
        success: false,
        error: error.message
      };
    }
  }

  async testIndexEffectiveness(collection, rule) {
    if (!rule.testQuery) return { tested: false };

    try {
      // Execute test query with explain
      const explainResult = await collection.find(rule.testQuery).explain('executionStats');

      const effectiveness = {
        tested: true,
        indexUsed: !this.hasCollectionScan(explainResult),
        executionTimeMs: explainResult.executionStats?.executionTimeMillis || 0,
        docsExamined: explainResult.executionStats?.totalDocsExamined || 0,
        docsReturned: explainResult.executionStats?.nReturned || 0,
        efficiency: this.calculateQueryEfficiency(explainResult)
      };

      return effectiveness;

    } catch (error) {
      return {
        tested: false,
        error: error.message
      };
    }
  }

  // Additional helper methods...

  generateIssueSignature(issue) {
    const key = JSON.stringify({
      collection: issue._id.collection,
      queryType: issue._id.queryType,
      queryShape: issue._id.queryShape
    });
    return require('crypto').createHash('md5').update(key).digest('hex');
  }

  calculateSeverity(issue) {
    let score = 0;

    if (issue.avgExecutionTime > 2000) score += 3;
    else if (issue.avgExecutionTime > 1000) score += 2;
    else if (issue.avgExecutionTime > 500) score += 1;

    if (issue.avgEfficiency < 0.05) score += 3;
    else if (issue.avgEfficiency < 0.1) score += 2;
    else if (issue.avgEfficiency < 0.2) score += 1;

    if (issue.indexUsageRate < 0.5) score += 2;
    else if (issue.indexUsageRate < 0.8) score += 1;

    if (score >= 6) return 'CRITICAL';
    else if (score >= 4) return 'HIGH';
    else if (score >= 2) return 'MEDIUM';
    else return 'LOW';
  }
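
  // Worked example: 1,500ms average latency (+2) with 8% efficiency (+2) and a 60%
  // index usage rate (+1) scores 5, which the thresholds above classify as 'HIGH'.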

  parseTimeRange(timeRange) {
    const units = {
      'd': 24 * 60 * 60 * 1000,
      'h': 60 * 60 * 1000,
      'm': 60 * 1000
    };

    const match = timeRange.match(/(\d+)([dhm])/);
    if (!match) return 7 * 24 * 60 * 60 * 1000; // Default 7 days

    const [, amount, unit] = match;
    return parseInt(amount) * units[unit];
  }

  async logPerformanceAlert(alert) {
    try {
      await this.db.collection('performance_alerts').insertOne(alert);
    } catch (error) {
      console.warn('Failed to log performance alert:', error.message);
    }
  }
}
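
A short wiring sketch for the analyzer above; the collections, time range, and the index rule are illustrative, and it assumes the helper methods referenced but not shown (such as identifyPeakHours, generateHistoricalRecommendations, validateOptimizations, and the hasCollectionScan/calculateQueryEfficiency helpers reused from MongoQueryOptimizer) are implemented:

// Illustrative wiring of AdvancedQueryAnalyzer - names and the index rule are examples only
async function monitorAndOptimize(db) {
  const analyzer = new AdvancedQueryAnalyzer(db);

  // Watch the busiest collections for slow or index-starved query patterns
  const issues = await analyzer.implementRealTimePerformanceMonitoring(['users', 'orders']);
  console.log(`Identified ${issues.length} query patterns needing attention`);

  // Review the last week of profiler data for latency and index usage trends
  const history = await analyzer.performHistoricalPerformanceAnalysis('7d');
  console.log(history.summary);

  // Apply a single automated index rule as an example
  const result = await analyzer.implementAutomatedOptimization('orders', [{
    type: 'AUTO_INDEX_CREATION',
    indexName: 'orders_user_status_created_idx',
    indexSpec: { userId: 1, status: 1, createdAt: -1 },
    testQuery: { userId: 12345, status: 'completed' }
  }]);
  console.log(result.summary);
}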

SQL-Style Query Analysis with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB query optimization and explain plan analysis:

-- QueryLeaf query optimization with SQL-familiar EXPLAIN syntax

-- Basic query explain with performance analysis
EXPLAIN (ANALYZE true, BUFFERS true, TIMING true)
SELECT 
  user_id,
  email,
  first_name,
  last_name,
  status,
  created_at
FROM users 
WHERE status = 'active' 
  AND country IN ('US', 'CA', 'UK')
  AND created_at >= CURRENT_DATE - INTERVAL '1 year'
ORDER BY created_at DESC
LIMIT 100;

-- Advanced aggregation explain with optimization recommendations  
EXPLAIN (ANALYZE true, COSTS true, VERBOSE true, FORMAT JSON)
WITH user_activity_summary AS (
  SELECT 
    u.user_id,
    u.email,
    u.first_name,
    u.last_name,
    u.country,
    u.status,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.created_at) as last_order_date,

    -- Customer value segmentation
    CASE 
      WHEN SUM(o.total_amount) > 1000 THEN 'high_value'
      WHEN SUM(o.total_amount) > 100 THEN 'medium_value'
      ELSE 'low_value'
    END as value_segment,

    -- Activity recency scoring
    CASE 
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '30 days' THEN 'recent'
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '90 days' THEN 'moderate' 
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '1 year' THEN 'old'
      ELSE 'inactive'
    END as activity_segment

  FROM users u
  LEFT JOIN orders o ON u.user_id = o.user_id 
  WHERE u.status = 'active'
    AND u.country IN ('US', 'CA', 'UK', 'AU', 'DE')
    AND u.created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND (o.status = 'completed' OR o.status IS NULL)
  GROUP BY u.user_id, u.email, u.first_name, u.last_name, u.country, u.status
  HAVING COUNT(o.order_id) > 0 OR u.created_at >= CURRENT_DATE - INTERVAL '6 months'
),

customer_insights AS (
  SELECT 
    country,
    value_segment,
    activity_segment,
    COUNT(*) as customer_count,
    AVG(total_spent) as avg_customer_value,
    SUM(order_count) as total_orders,

    -- Geographic performance metrics
    AVG(order_count) as avg_orders_per_customer,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_spent) as median_customer_value,
    STDDEV(total_spent) as customer_value_stddev,

    -- Customer concentration analysis
    COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY country) as segment_concentration,

    -- Activity trend indicators
    COUNT(*) FILTER (WHERE activity_segment = 'recent') as recent_active_customers,
    COUNT(*) FILTER (WHERE activity_segment IN ('moderate', 'old')) as declining_customers

  FROM user_activity_summary
  GROUP BY country, value_segment, activity_segment
)

SELECT 
  country,
  value_segment,
  activity_segment,
  customer_count,
  ROUND(avg_customer_value::numeric, 2) as avg_customer_ltv,
  total_orders,
  ROUND(avg_orders_per_customer::numeric, 1) as avg_orders_per_customer,
  ROUND(median_customer_value::numeric, 2) as median_ltv,
  ROUND(segment_concentration::numeric, 4) as market_concentration,

  -- Performance indicators
  CASE 
    WHEN recent_active_customers > declining_customers THEN 'growing'
    WHEN recent_active_customers < declining_customers * 0.5 THEN 'declining'
    ELSE 'stable'
  END as segment_trend,

  -- Business intelligence insights
  CASE
    WHEN value_segment = 'high_value' AND activity_segment = 'recent' THEN 'premium_active'
    WHEN value_segment = 'high_value' AND activity_segment != 'recent' THEN 'at_risk_premium'
    WHEN value_segment != 'low_value' AND activity_segment = 'recent' THEN 'growth_opportunity'
    WHEN activity_segment = 'inactive' THEN 'reactivation_target'
    ELSE 'standard_segment'
  END as strategic_priority,

  -- Ranking within country
  ROW_NUMBER() OVER (
    PARTITION BY country 
    ORDER BY avg_customer_value DESC, customer_count DESC
  ) as country_segment_rank

FROM customer_insights
WHERE customer_count >= 10  -- Filter small segments
ORDER BY country, avg_customer_value DESC, customer_count DESC;

-- QueryLeaf EXPLAIN output with optimization insights:
-- {
--   "queryType": "aggregation",
--   "executionTimeMillis": 245,
--   "totalDocsExamined": 45678,
--   "totalDocsReturned": 1245,
--   "efficiency": 0.027,
--   "indexUsage": {
--     "indexes": ["users_status_country_idx", "orders_user_status_idx"],
--     "effectiveness": 0.78,
--     "missingIndexes": ["users_created_at_idx", "orders_completed_date_idx"]
--   },
--   "stages": [
--     {
--       "stage": "$match",
--       "inputStage": "IXSCAN",
--       "indexName": "users_status_country_idx",
--       "keysExamined": 12456,
--       "docsExamined": 8901,
--       "executionTimeMillis": 45,
--       "optimization": "GOOD - Using compound index efficiently"
--     },
--     {
--       "stage": "$lookup", 
--       "inputStage": "IXSCAN",
--       "indexName": "orders_user_status_idx",
--       "executionTimeMillis": 156,
--       "optimization": "NEEDS_IMPROVEMENT - Consider creating index on (user_id, status, created_at)"
--     },
--     {
--       "stage": "$group",
--       "executionTimeMillis": 34,
--       "memoryUsageMB": 12.3,
--       "spilledToDisk": false,
--       "optimization": "GOOD - Group operation within memory limits"
--     },
--     {
--       "stage": "$sort",
--       "executionTimeMillis": 10,
--       "memoryUsageMB": 2.1,
--       "optimization": "EXCELLENT - Sort using index order"
--     }
--   ],
--   "recommendations": [
--     {
--       "type": "CREATE_INDEX",
--       "priority": "HIGH",
--       "description": "Create compound index to improve JOIN performance",
--       "suggestedIndex": "CREATE INDEX orders_user_status_date_idx ON orders (user_id, status, created_at DESC)",
--       "estimatedImprovement": "60-80% reduction in lookup time"
--     },
--     {
--       "type": "QUERY_RESTRUCTURE",
--       "priority": "MEDIUM", 
--       "description": "Consider splitting complex aggregation into smaller stages",
--       "estimatedImprovement": "20-40% better resource utilization"
--     }
--   ],
--   "performanceGrade": "C+",
--   "bottlenecks": [
--     {
--       "stage": "$lookup",
--       "issue": "Examining too many documents in joined collection",
--       "impact": "63% of total execution time"
--     }
--   ]
-- }

-- Performance monitoring and optimization tracking
WITH query_performance_analysis AS (
  SELECT 
    DATE_TRUNC('hour', execution_timestamp) as hour_bucket,
    collection_name,
    query_type,

    -- Performance metrics
    COUNT(*) as query_count,
    AVG(execution_time_ms) as avg_execution_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time,
    MAX(execution_time_ms) as max_execution_time,

    -- Resource utilization
    AVG(docs_examined) as avg_docs_examined,
    AVG(docs_returned) as avg_docs_returned,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as avg_scan_ratio,

    -- Index effectiveness
    COUNT(*) FILTER (WHERE index_used = true) as queries_with_index,
    AVG(CASE WHEN index_used THEN 1.0 ELSE 0.0 END) as index_hit_rate,
    STRING_AGG(DISTINCT index_name, ', ') as indexes_used,

    -- Error tracking
    COUNT(*) FILTER (WHERE execution_success = false) as failed_queries,
    STRING_AGG(DISTINCT error_type, '; ') FILTER (WHERE error_type IS NOT NULL) as error_types,

    -- Memory and I/O metrics
    AVG(memory_usage_mb) as avg_memory_usage,
    MAX(memory_usage_mb) as peak_memory_usage,
    COUNT(*) FILTER (WHERE spilled_to_disk = true) as queries_spilled_to_disk

  FROM query_execution_log
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND collection_name IN ('users', 'orders', 'products', 'analytics')
  GROUP BY DATE_TRUNC('hour', execution_timestamp), collection_name, query_type
),

performance_scoring AS (
  SELECT 
    *,
    -- Performance score calculation (0-100)
    LEAST(100, GREATEST(0,
      -- Execution time score (40% weight)
      (CASE 
        WHEN avg_execution_time <= 50 THEN 40
        WHEN avg_execution_time <= 100 THEN 30
        WHEN avg_execution_time <= 250 THEN 20
        WHEN avg_execution_time <= 500 THEN 10
        ELSE 0
      END) +

      -- Index usage score (35% weight)
      (index_hit_rate * 35) +

      -- Scan efficiency score (25% weight)  
      (CASE
        WHEN avg_scan_ratio <= 1.1 THEN 25
        WHEN avg_scan_ratio <= 2.0 THEN 20
        WHEN avg_scan_ratio <= 5.0 THEN 15
        WHEN avg_scan_ratio <= 10.0 THEN 10
        ELSE 0
      END)
    )) as performance_score,

    -- Performance grade assignment
    CASE 
      WHEN avg_execution_time <= 50 AND index_hit_rate >= 0.9 AND avg_scan_ratio <= 1.5 THEN 'A'
      WHEN avg_execution_time <= 100 AND index_hit_rate >= 0.8 AND avg_scan_ratio <= 3.0 THEN 'B'
      WHEN avg_execution_time <= 250 AND index_hit_rate >= 0.6 AND avg_scan_ratio <= 10.0 THEN 'C'
      WHEN avg_execution_time <= 500 AND index_hit_rate >= 0.4 THEN 'D'
      ELSE 'F'
    END as performance_grade,

    -- Trend analysis (comparing with previous period)
    LAG(avg_execution_time) OVER (
      PARTITION BY collection_name, query_type 
      ORDER BY hour_bucket
    ) as prev_avg_execution_time,

    LAG(index_hit_rate) OVER (
      PARTITION BY collection_name, query_type
      ORDER BY hour_bucket
    ) as prev_index_hit_rate,

    LAG(performance_score) OVER (
      PARTITION BY collection_name, query_type
      ORDER BY hour_bucket  
    ) as prev_performance_score

  FROM query_performance_analysis
),

optimization_recommendations AS (
  SELECT 
    collection_name,
    query_type,
    hour_bucket,
    performance_grade,
    performance_score,

    -- Performance trend indicators
    CASE 
      WHEN prev_performance_score IS NOT NULL THEN
        CASE 
          WHEN performance_score > prev_performance_score + 10 THEN 'IMPROVING'
          WHEN performance_score < prev_performance_score - 10 THEN 'DEGRADING'
          ELSE 'STABLE'
        END
      ELSE 'NEW'
    END as performance_trend,

    -- Specific optimization recommendations
    ARRAY_REMOVE(ARRAY[
      CASE 
        WHEN index_hit_rate < 0.8 THEN 'CREATE_MISSING_INDEXES'
        ELSE NULL
      END,
      CASE
        WHEN avg_scan_ratio > 10 THEN 'IMPROVE_QUERY_SELECTIVITY' 
        ELSE NULL
      END,
      CASE
        WHEN avg_execution_time > 500 THEN 'OPTIMIZE_QUERY_STRUCTURE'
        ELSE NULL
      END,
      CASE
        WHEN failed_queries > query_count * 0.05 THEN 'INVESTIGATE_QUERY_FAILURES'
        ELSE NULL
      END,
      CASE
        WHEN queries_spilled_to_disk > 0 THEN 'REDUCE_MEMORY_USAGE'
        ELSE NULL
      END
    ], NULL) as optimization_actions,

    -- Priority calculation
    CASE
      WHEN performance_grade IN ('D', 'F') AND query_count > 100 THEN 'CRITICAL'
      WHEN performance_grade = 'C' AND query_count > 500 THEN 'HIGH'
      WHEN performance_grade IN ('C', 'D') AND query_count > 50 THEN 'MEDIUM'
      ELSE 'LOW'
    END as optimization_priority,

    -- Detailed metrics for analysis
    query_count,
    avg_execution_time,
    p95_execution_time,
    index_hit_rate,
    avg_scan_ratio,
    failed_queries,
    indexes_used,
    error_types

  FROM performance_scoring
  WHERE query_count >= 5  -- Filter low-volume queries
)

SELECT 
  collection_name,
  query_type,
  performance_grade,
  ROUND(performance_score::numeric, 1) as performance_score,
  performance_trend,
  optimization_priority,

  -- Key performance indicators
  query_count as hourly_query_count,
  ROUND(avg_execution_time::numeric, 2) as avg_latency_ms,
  ROUND(p95_execution_time::numeric, 2) as p95_latency_ms,
  ROUND((index_hit_rate * 100)::numeric, 1) as index_hit_rate_pct,
  ROUND(avg_scan_ratio::numeric, 2) as avg_selectivity_ratio,

  -- Optimization guidance  
  CASE
    WHEN ARRAY_LENGTH(optimization_actions, 1) > 0 THEN
      'Recommended actions: ' || ARRAY_TO_STRING(optimization_actions, ', ')
    ELSE 'Performance within acceptable parameters'
  END as optimization_guidance,

  -- Resource impact assessment
  CASE
    WHEN query_count > 1000 AND performance_grade IN ('D', 'F') THEN 'HIGH_IMPACT'
    WHEN query_count > 500 AND performance_grade = 'C' THEN 'MEDIUM_IMPACT'
    ELSE 'LOW_IMPACT'
  END as resource_impact,

  -- Technical details
  indexes_used,
  error_types,
  hour_bucket as analysis_hour

FROM optimization_recommendations
WHERE optimization_priority IN ('CRITICAL', 'HIGH', 'MEDIUM')
   OR performance_trend = 'DEGRADING'
ORDER BY 
  CASE optimization_priority
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2  
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END,
  performance_score ASC,
  query_count DESC;

-- Real-time query optimization with automated recommendations
CREATE OR REPLACE VIEW query_optimization_dashboard AS
WITH current_performance AS (
  SELECT 
    collection_name,
    query_hash,
    query_pattern,

    -- Recent performance metrics (last hour)
    COUNT(*) as recent_executions,
    AVG(execution_time_ms) as current_avg_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as current_p95_time,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as current_scan_ratio,

    -- Index usage analysis
    BOOL_AND(index_used) as all_queries_use_index,
    COUNT(DISTINCT index_name) as unique_indexes_used,
    MODE() WITHIN GROUP (ORDER BY index_name) as most_common_index,

    -- Error rate tracking
    AVG(CASE WHEN execution_success THEN 1.0 ELSE 0.0 END) as success_rate

  FROM query_execution_log
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, query_hash, query_pattern
  HAVING COUNT(*) >= 5  -- Minimum threshold for analysis
),

historical_baseline AS (
  SELECT 
    collection_name,
    query_hash,

    -- Historical baseline metrics (previous 24 hours, excluding last hour)
    AVG(execution_time_ms) as baseline_avg_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as baseline_p95_time,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as baseline_scan_ratio,
    AVG(CASE WHEN execution_success THEN 1.0 ELSE 0.0 END) as baseline_success_rate

  FROM query_execution_log  
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '25 hours'
    AND execution_timestamp < CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, query_hash
  HAVING COUNT(*) >= 20  -- Sufficient historical data
)

SELECT 
  cp.collection_name,
  cp.query_pattern,
  cp.recent_executions,

  -- Performance comparison
  ROUND(cp.current_avg_time::numeric, 2) as current_avg_latency_ms,
  ROUND(hb.baseline_avg_time::numeric, 2) as baseline_avg_latency_ms,
  ROUND(((cp.current_avg_time - hb.baseline_avg_time) / hb.baseline_avg_time * 100)::numeric, 1) as latency_change_pct,

  -- Performance status classification
  CASE 
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.5 THEN 'DEGRADED'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.2 THEN 'SLOWER'
    WHEN cp.current_avg_time < hb.baseline_avg_time * 0.8 THEN 'IMPROVED'
    ELSE 'STABLE'
  END as performance_status,

  -- Index utilization
  cp.all_queries_use_index,
  cp.unique_indexes_used,
  cp.most_common_index,

  -- Scan efficiency
  ROUND(cp.current_scan_ratio::numeric, 2) as current_scan_ratio,
  ROUND(hb.baseline_scan_ratio::numeric, 2) as baseline_scan_ratio,

  -- Reliability metrics
  ROUND((cp.success_rate * 100)::numeric, 2) as success_rate_pct,
  ROUND((hb.baseline_success_rate * 100)::numeric, 2) as baseline_success_rate_pct,

  -- Automated optimization recommendations
  CASE
    WHEN NOT cp.all_queries_use_index THEN 'CRITICAL: Create missing indexes for consistent performance'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 2 THEN 'HIGH: Investigate severe performance regression'
    WHEN cp.current_scan_ratio > hb.baseline_scan_ratio * 2 THEN 'MEDIUM: Review query selectivity and filters'
    WHEN cp.success_rate < 0.95 THEN 'MEDIUM: Address query reliability issues'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.2 THEN 'LOW: Monitor for continued degradation'
    ELSE 'No immediate action required'
  END as recommended_action,

  -- Alert priority
  CASE 
    WHEN NOT cp.all_queries_use_index OR cp.current_avg_time > hb.baseline_avg_time * 2 THEN 'ALERT'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.5 OR cp.success_rate < 0.9 THEN 'WARNING'
    ELSE 'INFO'
  END as alert_level

FROM current_performance cp
LEFT JOIN historical_baseline hb ON cp.collection_name = hb.collection_name 
                                 AND cp.query_hash = hb.query_hash
ORDER BY 
  CASE 
    WHEN NOT cp.all_queries_use_index OR cp.current_avg_time > COALESCE(hb.baseline_avg_time * 2, 1000) THEN 1
    WHEN cp.current_avg_time > COALESCE(hb.baseline_avg_time * 1.5, 500) THEN 2
    ELSE 3
  END,
  cp.recent_executions DESC;

-- QueryLeaf provides comprehensive query optimization capabilities:
-- 1. SQL-familiar EXPLAIN syntax with detailed execution plan analysis
-- 2. Advanced performance monitoring with historical trend analysis
-- 3. Automated index recommendations based on query patterns
-- 4. Real-time performance alerts and degradation detection
-- 5. Comprehensive bottleneck identification and optimization guidance
-- 6. Resource usage tracking and capacity planning insights
-- 7. Query efficiency scoring and performance grading systems
-- 8. Integration with MongoDB's native explain plan functionality
-- 9. Batch query analysis for application-wide performance review
-- 10. Production-ready monitoring dashboards and optimization workflows

Best Practices for Query Optimization Implementation

Query Analysis Strategy

Essential principles for effective MongoDB query optimization:

  1. Regular Monitoring: Implement continuous query performance monitoring and alerting
  2. Index Strategy: Design indexes based on actual query patterns and performance data
  3. Explain Plan Analysis: Use comprehensive explain plan analysis to identify bottlenecks (see the sketch after this list)
  4. Historical Tracking: Maintain historical performance data to identify trends and regressions
  5. Automated Optimization: Implement automated optimization recommendations and validation
  6. Production Safety: Test all optimizations thoroughly before applying to production systems
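
The explain plan analysis step can be scripted directly against the Node.js driver. The sketch below is a minimal example rather than a full monitoring tool: the connection string, database name, and query shape are assumed for illustration, and the efficiency threshold is arbitrary.

// Minimal explain-plan check with the Node.js driver (connection string, database
// name, and query shape are assumptions; thresholds are illustrative only)
const { MongoClient } = require('mongodb');

async function checkQueryEfficiency() {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    await client.connect();
    const users = client.db('app').collection('users');

    // Ask for execution statistics instead of result documents
    const plan = await users
      .find({ status: 'active', country: { $in: ['US', 'CA', 'UK'] } })
      .sort({ created_at: -1 })
      .limit(100)
      .explain('executionStats');

    const stats = plan.executionStats;
    const efficiency = stats.nReturned / Math.max(stats.totalDocsExamined, 1);

    console.log({
      executionTimeMillis: stats.executionTimeMillis,
      totalKeysExamined: stats.totalKeysExamined,
      totalDocsExamined: stats.totalDocsExamined,
      nReturned: stats.nReturned,
      efficiency: Number(efficiency.toFixed(3))
    });

    // Flag collection scans and low selectivity for follow-up
    const usedCollectionScan = JSON.stringify(plan.queryPlanner.winningPlan).includes('COLLSCAN');
    if (usedCollectionScan || efficiency < 0.1) {
      console.warn('Query examines far more documents than it returns - review index coverage');
    }
  } finally {
    await client.close();
  }
}

checkQueryEfficiency().catch(console.error);

Running a check like this against a representative query before and after an index change gives a quick, repeatable signal that the optimization actually reduced the number of documents examined.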

Performance Tuning Workflow

Optimize MongoDB queries systematically:

  1. Performance Baseline: Establish performance baselines and targets for all critical queries (a baseline-capture sketch follows this list)
  2. Bottleneck Identification: Use explain plans to identify specific performance bottlenecks
  3. Optimization Implementation: Apply optimizations following proven patterns and best practices
  4. Validation Testing: Validate optimization effectiveness with comprehensive testing
  5. Monitoring Setup: Implement ongoing monitoring to track optimization impact
  6. Continuous Improvement: Regular review and refinement of optimization strategies
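
The baseline step in this workflow can be prototyped by persisting explain summaries and comparing later runs against the most recent one. The sketch below assumes a query_baselines collection and an arbitrary regression tolerance; it is a starting point, not a production monitoring pipeline.

// Sketch: store an explain summary as a baseline and compare later runs against it
// (the query_baselines collection name and the tolerance value are assumptions)
async function captureBaseline(db, collectionName, label, filter) {
  const plan = await db.collection(collectionName).find(filter).explain('executionStats');
  const stats = plan.executionStats;

  const baseline = {
    label,                       // e.g. 'users_active_by_country'
    collection: collectionName,
    captured_at: new Date(),
    execution_time_ms: stats.executionTimeMillis,
    docs_examined: stats.totalDocsExamined,
    docs_returned: stats.nReturned,
    scan_ratio: stats.totalDocsExamined / Math.max(stats.nReturned, 1)
  };

  await db.collection('query_baselines').insertOne(baseline);
  return baseline;
}

async function compareToBaseline(db, collectionName, label, filter, tolerance = 1.5) {
  // Fetch the most recent stored baseline for this labelled query shape
  const baseline = await db.collection('query_baselines')
    .find({ label })
    .sort({ captured_at: -1 })
    .limit(1)
    .next();
  if (!baseline) return null;

  // Each run is stored as well, so every comparison is against the previous run
  const current = await captureBaseline(db, collectionName, label, filter);
  return {
    baseline,
    current,
    regressed: current.execution_time_ms > baseline.execution_time_ms * tolerance
  };
}

Comparing p95 latencies over many runs, as in the SQL dashboard above, is more robust than a single explain, but even this minimal version catches obvious regressions after schema or index changes.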

Conclusion

MongoDB's advanced query optimization and explain plan system provides comprehensive tools for identifying performance bottlenecks, analyzing query execution patterns, and implementing effective optimization strategies. The sophisticated explain functionality offers detailed insights that enable both development and production performance tuning with automated recommendations and historical analysis capabilities.

Key MongoDB Query Optimization benefits include:

  • Comprehensive Analysis: Detailed execution plan analysis with performance metrics and bottleneck identification
  • Automated Recommendations: Intelligent optimization suggestions based on query patterns and performance data
  • Real-time Monitoring: Continuous performance monitoring with alerting and trend analysis
  • Production-Ready Tools: Sophisticated analysis tools designed for production database optimization
  • Historical Intelligence: Performance trend analysis and regression detection capabilities
  • Integration-Friendly: Seamless integration with existing monitoring and alerting infrastructure

Whether you're optimizing application queries, managing database performance, or implementing automated optimization workflows, MongoDB's query optimization tools with QueryLeaf's familiar SQL interface provide the foundation for high-performance database operations.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB query optimization while providing SQL-familiar explain plan syntax, performance analysis functions, and optimization recommendations. Advanced query analysis patterns, automated optimization workflows, and comprehensive performance monitoring are seamlessly handled through familiar SQL constructs, making sophisticated database optimization both powerful and accessible to SQL-oriented development teams.

The combination of comprehensive query analysis capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance queries and familiar database optimization patterns, ensuring your applications achieve optimal performance while remaining maintainable as they scale and evolve.

MongoDB Document Validation and Schema Enforcement: Building Data Integrity with Flexible Schema Design and SQL-Style Constraints

Modern applications require the flexibility of document databases while maintaining data integrity and consistency that traditional relational systems provide through rigid schemas and constraints. MongoDB's document validation system bridges this gap by offering configurable schema enforcement that adapts to evolving business requirements without sacrificing data quality.

MongoDB Document Validation provides rule-based data validation that can enforce structure, data types, value ranges, and business logic constraints at the database level. Unlike rigid relational schemas that require expensive migrations for changes, MongoDB validation rules can evolve incrementally, supporting both strict schema enforcement and flexible document structures within the same database.
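
As a small illustration of that incremental evolution, the sketch below attaches a validator when a collection is created and later extends it with collMod; the database and collection names are made up for the example, and 'moderate' validation is used so existing documents are not retroactively rejected.

// Sketch: create a collection with a validator, then evolve the rules in place
// (database and collection names here are hypothetical)
const { MongoClient } = require('mongodb');

async function evolveValidationRules() {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    await client.connect();
    const db = client.db('demo');

    // Initial rules: only a well-formed email is required
    await db.createCollection('customers', {
      validator: {
        $jsonSchema: {
          bsonType: 'object',
          required: ['email'],
          properties: {
            email: { bsonType: 'string', pattern: '^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$' }
          }
        }
      },
      validationLevel: 'strict',
      validationAction: 'error'
    });

    // Later release: also require a status field, without rewriting old documents.
    // validationLevel 'moderate' applies the new rules only to inserts and to
    // updates of documents that already satisfy them.
    await db.command({
      collMod: 'customers',
      validator: {
        $jsonSchema: {
          bsonType: 'object',
          required: ['email', 'status'],
          properties: {
            email: { bsonType: 'string', pattern: '^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$' },
            status: { enum: ['active', 'inactive', 'pending'] }
          }
        }
      },
      validationLevel: 'moderate'
    });
  } finally {
    await client.close();
  }
}

evolveValidationRules().catch(console.error);

The full example later in this section shows how detailed these rules can become; the point here is only that tightening them is an online operation rather than a schema migration.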

The Traditional Schema Rigidity Challenge

Conventional relational database approaches impose inflexible schema constraints that become obstacles to application evolution:

-- Traditional PostgreSQL schema with rigid constraints and migration challenges

-- User table with fixed schema structure
CREATE TABLE users (
  user_id BIGSERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  username VARCHAR(50) NOT NULL UNIQUE,
  password_hash VARCHAR(255) NOT NULL,
  first_name VARCHAR(100) NOT NULL,
  last_name VARCHAR(100) NOT NULL,
  birth_date DATE,
  phone_number VARCHAR(20),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Rigid constraints that are difficult to modify
  CONSTRAINT users_email_format CHECK (email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
  CONSTRAINT users_phone_format CHECK (phone_number ~* '^\+?[1-9]\d{1,14}$'),
  CONSTRAINT users_birth_date_range CHECK (birth_date >= '1900-01-01' AND birth_date <= CURRENT_DATE),
  CONSTRAINT users_name_length CHECK (LENGTH(first_name) >= 2 AND LENGTH(last_name) >= 2)
);

-- User profile table with limited JSON support
CREATE TABLE user_profiles (
  profile_id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
  bio TEXT,
  avatar_url VARCHAR(500),
  social_links JSONB,
  preferences JSONB,
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Limited JSON validation capabilities
  CONSTRAINT profile_bio_length CHECK (LENGTH(bio) <= 1000),
  CONSTRAINT profile_avatar_url_format CHECK (avatar_url ~* '^https?://.*'),
  -- Only the JSON type can be checked here; limits such as "at most 10 keys"
  -- cannot be expressed in a CHECK constraint and need trigger-based enforcement
  CONSTRAINT profile_social_links_structure CHECK (
    social_links IS NULL OR jsonb_typeof(social_links) = 'object'
  )
);

-- User settings table with enum constraints
CREATE TYPE notification_frequency AS ENUM ('immediate', 'hourly', 'daily', 'weekly', 'never');
CREATE TYPE privacy_level AS ENUM ('public', 'friends', 'private');
CREATE TYPE theme_preference AS ENUM ('light', 'dark', 'auto');

CREATE TABLE user_settings (
  setting_id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
  email_notifications notification_frequency DEFAULT 'daily',
  push_notifications notification_frequency DEFAULT 'immediate',
  privacy_level privacy_level DEFAULT 'friends',
  theme theme_preference DEFAULT 'auto',
  language_code VARCHAR(5) DEFAULT 'en-US',
  timezone VARCHAR(50) DEFAULT 'UTC',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Rigid enum constraints that require schema changes
  CONSTRAINT settings_language_format CHECK (language_code ~* '^[a-z]{2}(-[A-Z]{2})?$')
  -- Timezone values cannot be validated against pg_timezone_names in a CHECK
  -- constraint (subqueries are not allowed), so enforcement requires a lookup
  -- table with a foreign key or a trigger
);

-- Complex data insertion with rigid validation
INSERT INTO users (
  email, username, password_hash, first_name, last_name, birth_date, phone_number
) VALUES (
  '[email protected]',
  'johndoe123',
  '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBxJzybKlJNcX.',
  'John',
  'Doe', 
  '1990-05-15',
  '+1-555-123-4567'
);

-- Profile insertion with limited JSON flexibility
INSERT INTO user_profiles (
  user_id, bio, avatar_url, social_links, preferences, metadata
) VALUES (
  1,
  'Software engineer passionate about technology and innovation.',
  'https://example.com/avatars/johndoe.jpg',
  '{"twitter": "@johndoe", "linkedin": "john-doe-dev", "github": "johndoe"}',
  '{"newsletter": true, "marketing_emails": false, "beta_features": true}',
  '{"account_type": "premium", "registration_source": "web", "referral_code": "FRIEND123"}'
);

-- Settings insertion with enum constraints
INSERT INTO user_settings (
  user_id, email_notifications, push_notifications, privacy_level, theme, language_code, timezone
) VALUES (
  1, 'daily', 'immediate', 'friends', 'dark', 'en-US', 'America/New_York'
);

-- Complex query with multiple table joins and JSON operations
WITH user_analysis AS (
  SELECT 
    u.user_id,
    u.email,
    u.username,
    u.first_name,
    u.last_name,
    u.created_at as registration_date,

    -- Profile information with JSON extraction
    up.bio,
    up.avatar_url,
    jsonb_extract_path_text(up.social_links, 'twitter') as twitter_handle,
    jsonb_extract_path_text(up.social_links, 'github') as github_username,

    -- Preferences with type casting
    CAST(jsonb_extract_path_text(up.preferences, 'newsletter') AS BOOLEAN) as newsletter_subscription,
    CAST(jsonb_extract_path_text(up.preferences, 'beta_features') AS BOOLEAN) as beta_participant,

    -- Metadata extraction
    jsonb_extract_path_text(up.metadata, 'account_type') as account_type,
    jsonb_extract_path_text(up.metadata, 'registration_source') as registration_source,

    -- Settings information
    us.email_notifications,
    us.push_notifications,
    us.privacy_level,
    us.theme,
    us.language_code,
    us.timezone,

    -- Calculated fields
    EXTRACT(YEAR FROM AGE(u.birth_date)) as age,
    EXTRACT(DAY FROM (NOW() - u.created_at)) as days_since_registration,

    -- Count the keys present in the social_links object
    (SELECT COUNT(*) FROM jsonb_object_keys(COALESCE(up.social_links, '{}'::jsonb))) as social_link_count,

    -- Complex JSON validation checking
    CASE 
      WHEN up.preferences IS NULL THEN 'incomplete'
      WHEN jsonb_typeof(up.preferences) != 'object' THEN 'invalid'
      WHEN NOT up.preferences ? 'newsletter' THEN 'missing_required'
      ELSE 'valid'
    END as preferences_status

  FROM users u
  LEFT JOIN user_profiles up ON u.user_id = up.user_id
  LEFT JOIN user_settings us ON u.user_id = us.user_id
  WHERE u.created_at >= NOW() - INTERVAL '1 year'
)

SELECT 
  user_id,
  email,
  username,
  first_name || ' ' || last_name as full_name,
  registration_date,
  bio,
  twitter_handle,
  github_username,
  account_type,
  registration_source,
  age,
  days_since_registration,

  -- User categorization based on engagement
  CASE 
    WHEN beta_participant AND newsletter_subscription THEN 'highly_engaged'
    WHEN newsletter_subscription OR social_link_count > 2 THEN 'moderately_engaged' 
    WHEN days_since_registration < 30 THEN 'new_user'
    ELSE 'basic_user'
  END as engagement_level,

  -- Notification preference summary
  CASE 
    WHEN email_notifications = 'immediate' AND push_notifications = 'immediate' THEN 'high_frequency'
    WHEN email_notifications IN ('daily', 'hourly') OR push_notifications IN ('daily', 'hourly') THEN 'moderate_frequency'
    ELSE 'low_frequency'
  END as notification_preference,

  -- Data completeness assessment
  CASE 
    WHEN bio IS NOT NULL AND avatar_url IS NOT NULL AND social_link_count > 0 THEN 'complete'
    WHEN bio IS NOT NULL OR avatar_url IS NOT NULL THEN 'partial'
    ELSE 'minimal'
  END as profile_completeness,

  preferences_status

FROM user_analysis
WHERE preferences_status = 'valid'
ORDER BY 
  CASE engagement_level
    WHEN 'highly_engaged' THEN 1
    WHEN 'moderately_engaged' THEN 2  
    WHEN 'new_user' THEN 3
    ELSE 4
  END,
  days_since_registration DESC;

-- Schema evolution challenges with traditional approaches:
-- 1. Adding new fields requires ALTER TABLE statements with potential downtime
-- 2. Changing data types requires complex migrations and data conversion
-- 3. Enum modifications require dropping and recreating types
-- 4. JSON structure changes are difficult to validate and enforce
-- 5. Cross-table constraints become complex to maintain
-- 6. Schema changes require coordinated application deployments
-- 7. Rollback of schema changes is complex and often impossible
-- 8. Performance impact during large table alterations
-- 9. Limited flexibility for storing varying document structures
-- 10. Complex validation logic requires triggers or application-level enforcement

-- MySQL approach with even more limitations
CREATE TABLE mysql_users (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  username VARCHAR(50) NOT NULL UNIQUE,
  profile_data JSON,
  settings JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- Basic JSON validation (limited in older versions)
  CONSTRAINT email_format CHECK (email REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$')
);

-- Simple query with limited JSON capabilities
SELECT 
  id,
  email,
  username,
  JSON_EXTRACT(profile_data, '$.first_name') as first_name,
  JSON_EXTRACT(profile_data, '$.last_name') as last_name,
  JSON_EXTRACT(settings, '$.theme') as theme_preference
FROM mysql_users
WHERE JSON_EXTRACT(profile_data, '$.account_type') = 'premium';

-- MySQL limitations:
-- - Very limited JSON validation and constraint capabilities
-- - Basic JSON functions with poor performance on large datasets
-- - No sophisticated document structure validation
-- - Minimal support for nested object validation
-- - Limited flexibility for evolving JSON schemas
-- - Poor indexing support for JSON fields
-- - Basic constraint checking without complex business logic

MongoDB Document Validation provides flexible, powerful schema enforcement:

// MongoDB Document Validation - flexible schema enforcement with powerful validation rules
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('user_management_platform');

// Comprehensive document validation and schema management system
class MongoDBValidationManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.validationRules = new Map();
    this.migrationHistory = [];
  }

  async initializeCollectionsWithValidation() {
    console.log('Initializing collections with comprehensive document validation...');

    // Create users collection with sophisticated validation rules
    try {
      await this.db.createCollection('users', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['email', 'username', 'password_hash', 'profile', 'created_at'],
            additionalProperties: false,
            properties: {
              _id: {
                bsonType: 'objectId'
              },

              // Core identity fields with validation
              email: {
                bsonType: 'string',
                pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$',
                description: 'Valid email address required'
              },

              username: {
                bsonType: 'string',
                minLength: 3,
                maxLength: 30,
                pattern: '^[a-zA-Z0-9_-]+$',
                description: 'Username must be 3-30 characters, alphanumeric with underscore/dash'
              },

              password_hash: {
                bsonType: 'string',
                minLength: 60,
                maxLength: 60,
                description: 'BCrypt hash must be exactly 60 characters'
              },

              // Nested profile object with detailed validation
              profile: {
                bsonType: 'object',
                required: ['first_name', 'last_name'],
                additionalProperties: true,
                properties: {
                  first_name: {
                    bsonType: 'string',
                    minLength: 1,
                    maxLength: 100,
                    description: 'First name is required'
                  },

                  last_name: {
                    bsonType: 'string',
                    minLength: 1,
                    maxLength: 100,
                    description: 'Last name is required'
                  },

                  middle_name: {
                    bsonType: ['string', 'null'],
                    maxLength: 100
                  },

                  birth_date: {
                    bsonType: ['date', 'null'],
                    description: 'Birth date must be a valid date or null'
                  },

                  phone_number: {
                    bsonType: ['string', 'null'],
                    pattern: '^\\+?[1-9]\\d{1,14}$',
                    description: 'Valid international phone number format'
                  },

                  bio: {
                    bsonType: ['string', 'null'],
                    maxLength: 1000,
                    description: 'Bio must not exceed 1000 characters'
                  },

                  avatar_url: {
                    bsonType: ['string', 'null'],
                    pattern: '^https?://.*\\.(jpg|jpeg|png|gif|webp)$',
                    description: 'Avatar must be a valid image URL'
                  },

                  // Social links with nested validation
                  social_links: {
                    bsonType: ['object', 'null'],
                    additionalProperties: false,
                    properties: {
                      twitter: {
                        bsonType: 'string',
                        pattern: '^@?[a-zA-Z0-9_]{1,15}$'
                      },
                      linkedin: {
                        bsonType: 'string',
                        pattern: '^[a-zA-Z0-9-]{3,100}$'
                      },
                      github: {
                        bsonType: 'string',
                        pattern: '^[a-zA-Z0-9-]{1,39}$'
                      },
                      website: {
                        bsonType: 'string',
                        pattern: '^https?://[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}.*$'
                      },
                      instagram: {
                        bsonType: 'string',
                        pattern: '^@?[a-zA-Z0-9_.]{1,30}$'
                      }
                    }
                  },

                  // Address with geolocation support
                  address: {
                    bsonType: ['object', 'null'],
                    properties: {
                      street: { bsonType: 'string', maxLength: 200 },
                      city: { bsonType: 'string', maxLength: 100 },
                      state: { bsonType: 'string', maxLength: 100 },
                      postal_code: { bsonType: 'string', maxLength: 20 },
                      country: { bsonType: 'string', maxLength: 100 },
                      coordinates: {
                        bsonType: ['object', 'null'],
                        properties: {
                          type: { enum: ['Point'] },
                          coordinates: {
                            bsonType: 'array',
                            minItems: 2,
                            maxItems: 2,
                            items: { bsonType: 'number' }
                          }
                        }
                      }
                    }
                  }
                }
              },

              // User preferences with detailed validation
              preferences: {
                bsonType: 'object',
                additionalProperties: true,
                properties: {
                  notifications: {
                    bsonType: 'object',
                    properties: {
                      email: {
                        bsonType: 'object',
                        properties: {
                          marketing: { bsonType: 'bool' },
                          security: { bsonType: 'bool' },
                          product_updates: { bsonType: 'bool' },
                          frequency: { enum: ['immediate', 'daily', 'weekly', 'never'] }
                        }
                      },
                      push: {
                        bsonType: 'object',
                        properties: {
                          enabled: { bsonType: 'bool' },
                          sound: { bsonType: 'bool' },
                          vibration: { bsonType: 'bool' },
                          frequency: { enum: ['immediate', 'hourly', 'daily', 'never'] }
                        }
                      }
                    }
                  },

                  privacy: {
                    bsonType: 'object',
                    properties: {
                      profile_visibility: { enum: ['public', 'friends', 'private'] },
                      search_visibility: { bsonType: 'bool' },
                      activity_status: { bsonType: 'bool' },
                      data_collection: { bsonType: 'bool' }
                    }
                  },

                  interface: {
                    bsonType: 'object',
                    properties: {
                      theme: { enum: ['light', 'dark', 'auto'] },
                      language: {
                        bsonType: 'string',
                        pattern: '^[a-z]{2}(-[A-Z]{2})?$'
                      },
                      timezone: {
                        bsonType: 'string',
                        description: 'Valid IANA timezone'
                      },
                      date_format: { enum: ['MM/DD/YYYY', 'DD/MM/YYYY', 'YYYY-MM-DD'] },
                      time_format: { enum: ['12h', '24h'] }
                    }
                  }
                }
              },

              // Account status and metadata
              account: {
                bsonType: 'object',
                required: ['status', 'type', 'verification'],
                properties: {
                  status: { enum: ['active', 'inactive', 'suspended', 'pending'] },
                  type: { enum: ['free', 'premium', 'enterprise', 'admin'] },
                  subscription_expires_at: { bsonType: ['date', 'null'] },

                  verification: {
                    bsonType: 'object',
                    properties: {
                      email_verified: { bsonType: 'bool' },
                      email_verified_at: { bsonType: ['date', 'null'] },
                      phone_verified: { bsonType: 'bool' },
                      phone_verified_at: { bsonType: ['date', 'null'] },
                      identity_verified: { bsonType: 'bool' },
                      identity_verified_at: { bsonType: ['date', 'null'] },
                      verification_level: { enum: ['none', 'email', 'phone', 'identity', 'full'] }
                    }
                  },

                  security: {
                    bsonType: 'object',
                    properties: {
                      two_factor_enabled: { bsonType: 'bool' },
                      two_factor_method: { enum: ['none', 'sms', 'app', 'email'] },
                      password_changed_at: { bsonType: 'date' },
                      last_password_reset: { bsonType: ['date', 'null'] },
                      failed_login_attempts: { bsonType: 'int', minimum: 0, maximum: 10 },
                      account_locked_until: { bsonType: ['date', 'null'] }
                    }
                  }
                }
              },

              // Activity tracking
              activity: {
                bsonType: 'object',
                properties: {
                  last_login_at: { bsonType: ['date', 'null'] },
                  last_activity_at: { bsonType: ['date', 'null'] },
                  login_count: { bsonType: 'int', minimum: 0 },
                  session_count: { bsonType: 'int', minimum: 0 },
                  ip_address: {
                    bsonType: ['string', 'null'],
                    pattern: '^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$'
                  },
                  user_agent: { bsonType: ['string', 'null'], maxLength: 500 }
                }
              },

              // Flexible metadata for application-specific data
              metadata: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                properties: {
                  registration_source: {
                    enum: ['web', 'mobile_app', 'api', 'admin', 'import', 'social_oauth']
                  },
                  referral_code: {
                    bsonType: ['string', 'null'],
                    pattern: '^[A-Z0-9]{6,12}$'
                  },
                  campaign_id: { bsonType: ['string', 'null'] },
                  utm_source: { bsonType: ['string', 'null'] },
                  utm_medium: { bsonType: ['string', 'null'] },
                  utm_campaign: { bsonType: ['string', 'null'] },
                  affiliate_id: { bsonType: ['string', 'null'] }
                }
              },

              // Audit timestamps
              created_at: {
                bsonType: 'date',
                description: 'Account creation timestamp required'
              },

              updated_at: {
                bsonType: 'date',
                description: 'Last update timestamp'
              },

              deleted_at: {
                bsonType: ['date', 'null'],
                description: 'Soft delete timestamp'
              }
            }
          }
        },
        validationLevel: 'strict',
        validationAction: 'error'
      });

      console.log('Created users collection with comprehensive validation');
      this.collections.set('users', this.db.collection('users'));

    } catch (error) {
      if (error.code !== 48) { // Collection already exists
        throw error;
      }
      console.log('Users collection already exists');
      this.collections.set('users', this.db.collection('users'));
    }

    // Create additional collections with validation
    await this.createSessionsCollection();
    await this.createAuditLogCollection();
    await this.createNotificationsCollection();

    // Create indexes optimized for validation and queries
    await this.createOptimizedIndexes();

    return Array.from(this.collections.keys());
  }

  async createSessionsCollection() {
    try {
      await this.db.createCollection('user_sessions', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'session_token', 'created_at', 'expires_at', 'is_active'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: 'objectId',
                description: 'Reference to user document'
              },

              session_token: {
                bsonType: 'string',
                minLength: 32,
                maxLength: 128,
                description: 'Secure session token'
              },

              refresh_token: {
                bsonType: ['string', 'null'],
                minLength: 32,
                maxLength: 128
              },

              device_info: {
                bsonType: 'object',
                properties: {
                  device_type: { enum: ['desktop', 'mobile', 'tablet', 'unknown'] },
                  browser: { bsonType: 'string', maxLength: 100 },
                  os: { bsonType: 'string', maxLength: 100 },
                  ip_address: { bsonType: 'string' },
                  user_agent: { bsonType: 'string', maxLength: 500 }
                }
              },

              location: {
                bsonType: ['object', 'null'],
                properties: {
                  country: { bsonType: 'string', maxLength: 100 },
                  region: { bsonType: 'string', maxLength: 100 },
                  city: { bsonType: 'string', maxLength: 100 },
                  coordinates: {
                    bsonType: 'array',
                    minItems: 2,
                    maxItems: 2,
                    items: { bsonType: 'number' }
                  }
                }
              },

              is_active: { bsonType: 'bool' },

              created_at: { bsonType: 'date' },
              updated_at: { bsonType: 'date' },
              expires_at: { bsonType: 'date' },
              last_activity_at: { bsonType: ['date', 'null'] }
            }
          }
        },
        validationLevel: 'strict'
      });

      // Create TTL index for automatic session cleanup
      await this.db.collection('user_sessions').createIndex(
        { expires_at: 1 }, 
        { expireAfterSeconds: 0 }
      );

      this.collections.set('user_sessions', this.db.collection('user_sessions'));
      console.log('Created user_sessions collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('user_sessions', this.db.collection('user_sessions'));
    }
  }

  async createAuditLogCollection() {
    try {
      await this.db.createCollection('audit_log', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'action', 'resource_type', 'timestamp'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: ['objectId', 'null'],
                description: 'User who performed the action'
              },

              action: {
                enum: [
                  'create', 'read', 'update', 'delete',
                  'login', 'logout', 'password_change', 'email_change',
                  'profile_update', 'settings_change', 'verification',
                  'admin_action', 'api_access', 'export_data'
                ],
                description: 'Type of action performed'
              },

              resource_type: {
                bsonType: 'string',
                maxLength: 100,
                description: 'Type of resource affected'
              },

              resource_id: {
                bsonType: ['string', 'objectId', 'null'],
                description: 'ID of the affected resource'
              },

              details: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                description: 'Additional action details'
              },

              changes: {
                bsonType: ['object', 'null'],
                properties: {
                  before: { bsonType: ['object', 'null'] },
                  after: { bsonType: ['object', 'null'] },
                  fields_changed: {
                    bsonType: 'array',
                    items: { bsonType: 'string' }
                  }
                }
              },

              request_info: {
                bsonType: ['object', 'null'],
                properties: {
                  ip_address: { bsonType: 'string' },
                  user_agent: { bsonType: 'string', maxLength: 500 },
                  method: { enum: ['GET', 'POST', 'PUT', 'PATCH', 'DELETE'] },
                  endpoint: { bsonType: 'string', maxLength: 200 },
                  session_id: { bsonType: ['string', 'null'] }
                }
              },

              result: {
                bsonType: 'object',
                properties: {
                  success: { bsonType: 'bool' },
                  error_message: { bsonType: ['string', 'null'] },
                  error_code: { bsonType: ['string', 'null'] },
                  duration_ms: { bsonType: 'int', minimum: 0 }
                }
              },

              timestamp: { bsonType: 'date' }
            }
          }
        }
      });

      this.collections.set('audit_log', this.db.collection('audit_log'));
      console.log('Created audit_log collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('audit_log', this.db.collection('audit_log'));
    }
  }

  async createNotificationsCollection() {
    try {
      await this.db.createCollection('notifications', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'type', 'title', 'content', 'status', 'created_at'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: 'objectId',
                description: 'Target user for notification'
              },

              type: {
                enum: [
                  'security_alert', 'account_update', 'welcome', 'verification',
                  'password_reset', 'login_alert', 'subscription', 'feature_announcement',
                  'maintenance', 'privacy_update', 'marketing', 'system'
                ],
                description: 'Notification category'
              },

              priority: {
                enum: ['low', 'normal', 'high', 'urgent'],
                description: 'Notification priority level'
              },

              title: {
                bsonType: 'string',
                minLength: 1,
                maxLength: 200,
                description: 'Notification title'
              },

              content: {
                bsonType: 'string',
                minLength: 1,
                maxLength: 2000,
                description: 'Notification message content'
              },

              action: {
                bsonType: ['object', 'null'],
                properties: {
                  label: { bsonType: 'string', maxLength: 50 },
                  url: { bsonType: 'string', maxLength: 500 },
                  action_type: { enum: ['link', 'button', 'dismiss', 'confirm'] }
                }
              },

              channels: {
                bsonType: 'array',
                items: {
                  enum: ['email', 'push', 'in_app', 'sms', 'webhook']
                },
                description: 'Delivery channels for notification'
              },

              delivery: {
                bsonType: 'object',
                properties: {
                  email: {
                    bsonType: ['object', 'null'],
                    properties: {
                      sent_at: { bsonType: ['date', 'null'] },
                      delivered_at: { bsonType: ['date', 'null'] },
                      opened_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      bounced: { bsonType: 'bool' },
                      error_message: { bsonType: ['string', 'null'] }
                    }
                  },
                  push: {
                    bsonType: ['object', 'null'],
                    properties: {
                      sent_at: { bsonType: ['date', 'null'] },
                      delivered_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      error_message: { bsonType: ['string', 'null'] }
                    }
                  },
                  in_app: {
                    bsonType: ['object', 'null'],
                    properties: {
                      shown_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      dismissed_at: { bsonType: ['date', 'null'] }
                    }
                  }
                }
              },

              status: {
                enum: ['pending', 'sent', 'delivered', 'read', 'dismissed', 'failed'],
                description: 'Current notification status'
              },

              metadata: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                description: 'Additional notification metadata'
              },

              expires_at: {
                bsonType: ['date', 'null'],
                description: 'Notification expiration date'
              },

              created_at: { bsonType: 'date' },
              updated_at: { bsonType: 'date' }
            }
          }
        }
      });

      this.collections.set('notifications', this.db.collection('notifications'));
      console.log('Created notifications collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('notifications', this.db.collection('notifications'));
    }
  }

  async createOptimizedIndexes() {
    console.log('Creating optimized indexes for validated collections...');

    const users = this.collections.get('users');
    const sessions = this.collections.get('user_sessions');
    const audit = this.collections.get('audit_log');
    const notifications = this.collections.get('notifications');

    // User collection indexes
    const userIndexes = [
      { email: 1 },
      { username: 1 },
      { 'account.status': 1 },
      { 'account.type': 1 },
      { created_at: -1 },
      { 'activity.last_login_at': -1 },
      { 'profile.phone_number': 1 },
      { 'account.verification.email_verified': 1 },
      { 'metadata.registration_source': 1 },

      // Compound indexes for common queries
      { 'account.status': 1, 'account.type': 1 },
      { 'account.type': 1, created_at: -1 },
      { 'account.verification.verification_level': 1, created_at: -1 }
    ];

    for (const indexSpec of userIndexes) {
      try {
        await users.createIndex(indexSpec, { background: true });
      } catch (error) {
        console.warn('Index creation warning:', error.message);
      }
    }

    // Session collection indexes
    await sessions.createIndex({ user_id: 1, is_active: 1 }, { background: true });
    await sessions.createIndex({ session_token: 1 }, { unique: true, background: true });
    await sessions.createIndex({ created_at: -1 }, { background: true });

    // Audit log indexes
    await audit.createIndex({ user_id: 1, timestamp: -1 }, { background: true });
    await audit.createIndex({ action: 1, timestamp: -1 }, { background: true });
    await audit.createIndex({ resource_type: 1, resource_id: 1 }, { background: true });

    // Notification indexes
    await notifications.createIndex({ user_id: 1, status: 1 }, { background: true });
    await notifications.createIndex({ type: 1, created_at: -1 }, { background: true });
    await notifications.createIndex({ expires_at: 1 }, { expireAfterSeconds: 0 });

    console.log('Optimized indexes created successfully');
  }

  async insertValidatedUserData(userData) {
    console.log('Inserting user data with comprehensive validation...');

    const users = this.collections.get('users');
    const currentTime = new Date();

    // Prepare validated user document
    const validatedUser = {
      email: userData.email,
      username: userData.username,
      password_hash: userData.password_hash,

      profile: {
        first_name: userData.profile.first_name,
        last_name: userData.profile.last_name,
        middle_name: userData.profile.middle_name || null,
        birth_date: userData.profile.birth_date ? new Date(userData.profile.birth_date) : null,
        phone_number: userData.profile.phone_number || null,
        bio: userData.profile.bio || null,
        avatar_url: userData.profile.avatar_url || null,

        social_links: userData.profile.social_links || null,

        address: userData.profile.address ? {
          street: userData.profile.address.street,
          city: userData.profile.address.city,
          state: userData.profile.address.state,
          postal_code: userData.profile.address.postal_code,
          country: userData.profile.address.country,
          coordinates: userData.profile.address.coordinates ? {
            type: 'Point',
            coordinates: userData.profile.address.coordinates
          } : null
        } : null
      },

      preferences: {
        notifications: {
          email: {
            marketing: userData.preferences?.notifications?.email?.marketing ?? false,
            security: userData.preferences?.notifications?.email?.security ?? true,
            product_updates: userData.preferences?.notifications?.email?.product_updates ?? true,
            frequency: userData.preferences?.notifications?.email?.frequency || 'daily'
          },
          push: {
            enabled: userData.preferences?.notifications?.push?.enabled ?? true,
            sound: userData.preferences?.notifications?.push?.sound ?? true,
            vibration: userData.preferences?.notifications?.push?.vibration ?? true,
            frequency: userData.preferences?.notifications?.push?.frequency || 'immediate'
          }
        },

        privacy: {
          profile_visibility: userData.preferences?.privacy?.profile_visibility || 'friends',
          search_visibility: userData.preferences?.privacy?.search_visibility ?? true,
          activity_status: userData.preferences?.privacy?.activity_status ?? true,
          data_collection: userData.preferences?.privacy?.data_collection ?? true
        },

        interface: {
          theme: userData.preferences?.interface?.theme || 'auto',
          language: userData.preferences?.interface?.language || 'en-US',
          timezone: userData.preferences?.interface?.timezone || 'UTC',
          date_format: userData.preferences?.interface?.date_format || 'MM/DD/YYYY',
          time_format: userData.preferences?.interface?.time_format || '12h'
        }
      },

      account: {
        status: userData.account?.status || 'active',
        type: userData.account?.type || 'free',
        subscription_expires_at: userData.account?.subscription_expires_at ? 
          new Date(userData.account.subscription_expires_at) : null,

        verification: {
          email_verified: false,
          email_verified_at: null,
          phone_verified: false,
          phone_verified_at: null,
          identity_verified: false,
          identity_verified_at: null,
          verification_level: 'none'
        },

        security: {
          two_factor_enabled: false,
          two_factor_method: 'none',
          password_changed_at: currentTime,
          last_password_reset: null,
          failed_login_attempts: 0,
          account_locked_until: null
        }
      },

      activity: {
        last_login_at: null,
        last_activity_at: null,
        login_count: 0,
        session_count: 0,
        ip_address: userData.activity?.ip_address || null,
        user_agent: userData.activity?.user_agent || null
      },

      metadata: userData.metadata || null,

      created_at: currentTime,
      updated_at: currentTime,
      deleted_at: null
    };

    try {
      const result = await users.insertOne(validatedUser);

      // Log successful user creation
      await this.logAuditEvent({
        user_id: result.insertedId,
        action: 'create',
        resource_type: 'user',
        resource_id: result.insertedId.toString(),
        details: {
          username: validatedUser.username,
          email: validatedUser.email,
          account_type: validatedUser.account.type
        },
        request_info: {
          ip_address: validatedUser.activity.ip_address,
          user_agent: validatedUser.activity.user_agent
        },
        result: {
          success: true,
          duration_ms: 0 // Would be calculated in real implementation
        },
        timestamp: currentTime
      });

      console.log(`User created successfully with ID: ${result.insertedId}`);
      return result;

    } catch (validationError) {
      console.error('User validation failed:', validationError);

      // Log failed user creation attempt
      await this.logAuditEvent({
        user_id: null,
        action: 'create',
        resource_type: 'user',
        details: {
          attempted_email: userData.email,
          attempted_username: userData.username
        },
        result: {
          success: false,
          error_message: validationError.message,
          error_code: validationError.code?.toString()
        },
        timestamp: currentTime
      });

      throw validationError;
    }
  }

  async logAuditEvent(eventData) {
    const auditLog = this.collections.get('audit_log');

    try {
      await auditLog.insertOne(eventData);
    } catch (error) {
      console.warn('Failed to log audit event:', error.message);
    }
  }

  async performValidationMigration(collectionName, newValidationRules, options = {}) {
    console.log(`Performing validation migration for collection: ${collectionName}`);

    const {
      validationLevel = 'strict',
      validationAction = 'error',
      dryRun = false,
      batchSize = 1000
    } = options;

    const collection = this.db.collection(collectionName);

    if (dryRun) {
      // Test validation rules against existing documents
      console.log('Running dry run validation test...');

      const validationErrors = [];
      let processedCount = 0;

      const cursor = collection.find({}).limit(batchSize);

      for await (const document of cursor) {
        try {
          // Test document against new validation rules (simplified)
          const testResult = await this.testDocumentValidation(document, newValidationRules, collectionName);

          if (!testResult.valid) {
            validationErrors.push({
              documentId: document._id,
              errors: testResult.errors
            });
          }

          processedCount++;

        } catch (error) {
          validationErrors.push({
            documentId: document._id,
            errors: [error.message]
          });
        }
      }

      console.log(`Dry run completed: ${processedCount} documents tested, ${validationErrors.length} validation errors found`);

      return {
        dryRun: true,
        documentsProcessed: processedCount,
        validationErrors: validationErrors,
        migrationFeasible: validationErrors.length === 0
      };
    }

    // Apply new validation rules
    try {
      await this.db.runCommand({
        collMod: collectionName,
        validator: newValidationRules,
        validationLevel: validationLevel,
        validationAction: validationAction
      });

      // Record migration in history
      this.migrationHistory.push({
        collection: collectionName,
        timestamp: new Date(),
        validationRules: newValidationRules,
        validationLevel: validationLevel,
        validationAction: validationAction,
        success: true
      });

      console.log(`Validation migration completed successfully for ${collectionName}`);

      return {
        success: true,
        collection: collectionName,
        timestamp: new Date(),
        validationLevel: validationLevel,
        validationAction: validationAction
      };

    } catch (error) {
      console.error('Validation migration failed:', error);

      this.migrationHistory.push({
        collection: collectionName,
        timestamp: new Date(),
        success: false,
        error: error.message
      });

      throw error;
    }
  }

  async testDocumentValidation(document, validationRules, collectionName = null) {
    // Validation rules are query predicates: a stored document passes when a find()
    // scoped to its _id and filtered by the rules still matches it
    try {
      if (!collectionName) {
        return { valid: true, errors: [] }; // No collection to test against; permissive default
      }
      const match = await this.db.collection(collectionName).findOne(
        { $and: [{ _id: document._id }, validationRules] },
        { projection: { _id: 1 } }
      );
      return match
        ? { valid: true, errors: [] }
        : { valid: false, errors: ['Document does not satisfy the proposed validation rules'] };
    } catch (error) {
      return { valid: false, errors: [error.message] };
    }
  }

  async generateValidationReport() {
    console.log('Generating comprehensive validation report...');

    const report = {
      collections: new Map(),
      summary: {
        totalCollections: 0,
        validatedCollections: 0,
        totalDocuments: 0,
        validationCoverage: 0
      },
      recommendations: []
    };

    for (const [collectionName, collection] of this.collections) {
      console.log(`Analyzing validation for collection: ${collectionName}`);

      try {
        // Get collection info including validation rules
        const collectionInfo = await this.db.runCommand({ listCollections: 1, filter: { name: collectionName } });
        const stats = await collection.stats();

        const collectionData = {
          name: collectionName,
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          hasValidation: false,
          validationLevel: null,
          validationAction: null,
          validationRules: null
        };

        // Check if validation is configured
        if (collectionInfo.cursor.firstBatch[0]?.options?.validator) {
          collectionData.hasValidation = true;
          collectionData.validationLevel = collectionInfo.cursor.firstBatch[0].options.validationLevel || 'strict';
          collectionData.validationAction = collectionInfo.cursor.firstBatch[0].options.validationAction || 'error';
          collectionData.validationRules = collectionInfo.cursor.firstBatch[0].options.validator;
        }

        report.collections.set(collectionName, collectionData);
        report.summary.totalCollections++;
        report.summary.totalDocuments += stats.count;

        if (collectionData.hasValidation) {
          report.summary.validatedCollections++;
        }

        // Generate recommendations
        if (!collectionData.hasValidation && stats.count > 1000) {
          report.recommendations.push(`Consider adding validation rules to ${collectionName} (${stats.count} documents)`);
        }

        if (collectionData.hasValidation && collectionData.validationLevel === 'moderate') {
          report.recommendations.push(`Consider upgrading ${collectionName} to strict validation for better data integrity`);
        }

      } catch (error) {
        console.warn(`Could not analyze collection ${collectionName}:`, error.message);
      }
    }

    report.summary.validationCoverage = report.summary.totalCollections > 0 ? 
      (report.summary.validatedCollections / report.summary.totalCollections * 100) : 0;

    console.log('Validation report generated successfully');
    return report;
  }
}

// Benefits of MongoDB Document Validation:
// - Flexible schema evolution without complex migrations or downtime
// - Rich validation rules supporting nested objects, arrays, and complex business logic
// - Configurable validation levels (strict, moderate, off) for different environments
// - JSON Schema standard compliance with MongoDB-specific extensions
// - Integration with MongoDB's native indexing and query optimization
// - Support for custom validation logic and conditional constraints
// - Gradual validation enforcement for existing data migration scenarios
// - Real-time validation feedback during development and testing
// - Audit trail capabilities for tracking schema changes and validation events
// - Performance optimizations that leverage MongoDB's document-oriented architecture

module.exports = {
  MongoDBValidationManager
};

Understanding MongoDB Document Validation Architecture

Advanced Validation Patterns and Schema Evolution

Implement sophisticated validation strategies for production applications with evolving requirements:

// Advanced document validation patterns and schema evolution strategies
class AdvancedValidationManager {
  constructor(db) {
    this.db = db;
    this.schemaVersions = new Map();
    this.validationProfiles = new Map();
    this.migrationQueue = [];
  }

  async implementConditionalValidation(collectionName, validationProfiles) {
    console.log(`Implementing conditional validation for ${collectionName}`);

    // Create validation rules that adapt based on document type or version
    const conditionalValidator = {
      $or: validationProfiles.map(profile => ({
        $and: [
          profile.condition,
          { $jsonSchema: profile.schema }
        ]
      }))
    };

    await this.db.runCommand({
      collMod: collectionName,
      validator: conditionalValidator,
      validationLevel: 'strict'
    });

    this.validationProfiles.set(collectionName, validationProfiles);
    return conditionalValidator;
  }

  async implementVersionedValidation(collectionName, versions) {
    console.log(`Setting up versioned validation for ${collectionName}`);

    const versionedValidator = {
      $or: versions.map(version => ({
        $and: [
          { schema_version: { $eq: version.version } },
          { $jsonSchema: version.schema }
        ]
      }))
    };

    // Store version history
    this.schemaVersions.set(collectionName, {
      current: Math.max(...versions.map(v => v.version)),
      versions: versions,
      created_at: new Date()
    });

    await this.db.runCommand({
      collMod: collectionName,
      validator: versionedValidator,
      validationLevel: 'strict'
    });

    return versionedValidator;
  }

  async performGradualMigration(collectionName, targetValidation, options = {}) {
    console.log(`Starting gradual migration for ${collectionName}`);

    const {
      batchSize = 1000,
      delayMs = 100,
      validationMode = 'warn_then_error'
    } = options;

    // Phase 1: Warning mode
    if (validationMode === 'warn_then_error') {
      console.log('Phase 1: Enabling validation in warning mode');
      await this.db.runCommand({
        collMod: collectionName,
        validator: targetValidation,
        validationLevel: 'moderate',
        validationAction: 'warn'
      });

      // Allow time for monitoring and fixing validation warnings before tightening rules
      console.log('Warning mode enabled; monitor validation warnings (e.g. for 24 hours) before strict enforcement');
      // In production, this would be an actual monitoring period rather than a log statement
    }

    // Phase 2: Strict enforcement
    console.log('Phase 2: Enabling strict validation');
    await this.db.runCommand({
      collMod: collectionName,
      validator: targetValidation,
      validationLevel: 'strict',
      validationAction: 'error'
    });

    console.log('Gradual migration completed successfully');
    return { success: true, phases: 2 };
  }

  generateBusinessLogicValidation(rules) {
    // Convert business rules into MongoDB validation expressions
    const validationExpressions = [];

    for (const rule of rules) {
      switch (rule.type) {
        case 'date_range':
          validationExpressions.push({
            [rule.field]: {
              $gte: new Date(rule.min),
              $lte: new Date(rule.max)
            }
          });
          break;

        case 'conditional_required':
          validationExpressions.push({
            $or: [
              { [rule.condition.field]: { $ne: rule.condition.value } },
              { [rule.requiredField]: { $exists: true, $ne: null } }
            ]
          });
          break;

        case 'mutual_exclusion':
          // At most one of the listed fields may be present with a non-null value
          validationExpressions.push({
            $expr: {
              $lte: [
                { $size: { $filter: {
                  // Map field names to field paths ("status" -> "$status") so $filter
                  // inspects document values rather than literal strings
                  input: rule.fields.map(field => `$${field}`),
                  cond: { $ne: ['$$this', null] }
                }}},
                1
              ]
            }
          });
          break;

        case 'cross_field_validation':
          validationExpressions.push({
            $expr: {
              [rule.operator]: [
                `$${rule.field1}`,
                `$${rule.field2}`
              ]
            }
          });
          break;
      }
    }

    return validationExpressions.length > 0 ? { $and: validationExpressions } : {};
  }

  async validateDataQuality(collectionName, qualityRules) {
    console.log(`Running data quality validation for ${collectionName}`);

    const collection = this.db.collection(collectionName);
    const qualityReport = {
      collection: collectionName,
      totalDocuments: await collection.countDocuments(),
      qualityIssues: [],
      qualityScore: 0
    };

    for (const rule of qualityRules) {
      const issueCount = await collection.countDocuments(rule.condition);

      if (issueCount > 0) {
        qualityReport.qualityIssues.push({
          rule: rule.name,
          description: rule.description,
          affectedDocuments: issueCount,
          severity: rule.severity,
          suggestion: rule.suggestion
        });
      }
    }

    // Calculate quality score (guarding against an empty collection)
    const totalIssues = qualityReport.qualityIssues.reduce((sum, issue) => sum + issue.affectedDocuments, 0);
    qualityReport.qualityScore = qualityReport.totalDocuments > 0
      ? Math.max(0, 100 - (totalIssues / qualityReport.totalDocuments * 100))
      : 100;

    return qualityReport;
  }
}
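
// The validation profiles passed to implementConditionalValidation pair a match
// condition with a JSON Schema branch. A minimal usage sketch, assuming an existing
// db handle and hypothetical 'person' / 'company' customer document types:
async function setupCustomerValidation(db) {
  const manager = new AdvancedValidationManager(db);
  return manager.implementConditionalValidation('customers', [
    {
      condition: { customer_type: { $eq: 'person' } },
      schema: { bsonType: 'object', required: ['first_name', 'last_name'] }
    },
    {
      condition: { customer_type: { $eq: 'company' } },
      schema: { bsonType: 'object', required: ['company_name', 'tax_id'] }
    }
  ]);
}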

SQL-Style Document Validation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB document validation and schema management:

-- QueryLeaf document validation with SQL-familiar constraints

-- Create table with comprehensive validation rules
CREATE TABLE users (
  _id ObjectId PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE 
    CHECK (email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'),
  username VARCHAR(30) NOT NULL UNIQUE 
    CHECK (username REGEXP '^[a-zA-Z0-9_-]+$' AND LENGTH(username) >= 3),
  password_hash CHAR(60) NOT NULL,

  -- Nested object validation with JSON schema
  profile JSONB NOT NULL CHECK (
    JSON_VALID(profile) AND
    JSON_EXTRACT(profile, '$.first_name') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.last_name') IS NOT NULL AND
    LENGTH(JSON_UNQUOTE(JSON_EXTRACT(profile, '$.first_name'))) >= 1 AND
    LENGTH(JSON_UNQUOTE(JSON_EXTRACT(profile, '$.last_name'))) >= 1
  ),

  -- Complex nested preferences with validation
  preferences JSONB CHECK (
    JSON_VALID(preferences) AND
    JSON_EXTRACT(preferences, '$.notifications.email.frequency') IN ('immediate', 'daily', 'weekly', 'never') AND
    JSON_EXTRACT(preferences, '$.privacy.profile_visibility') IN ('public', 'friends', 'private') AND
    JSON_EXTRACT(preferences, '$.interface.theme') IN ('light', 'dark', 'auto')
  ),

  -- Account information with business logic validation
  account JSONB NOT NULL CHECK (
    JSON_VALID(account) AND
    JSON_EXTRACT(account, '$.status') IN ('active', 'inactive', 'suspended', 'pending') AND
    JSON_EXTRACT(account, '$.type') IN ('free', 'premium', 'enterprise', 'admin') AND
    (
      JSON_EXTRACT(account, '$.type') != 'premium' OR 
      JSON_EXTRACT(account, '$.subscription_expires_at') IS NOT NULL
    )
  ),

  -- Audit timestamps with constraints
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  deleted_at TIMESTAMP NULL,

  -- Complex business logic constraints
  CONSTRAINT valid_birth_date CHECK (
    JSON_EXTRACT(profile, '$.birth_date') IS NULL OR
    JSON_EXTRACT(profile, '$.birth_date') <= CURRENT_DATE
  ),

  CONSTRAINT profile_completeness CHECK (
    (JSON_EXTRACT(account, '$.type') != 'premium') OR
    (
      JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL AND
      JSON_EXTRACT(profile, '$.bio') IS NOT NULL
    )
  ),

  -- Conditional validation based on account type
  CONSTRAINT admin_verification CHECK (
    (JSON_EXTRACT(account, '$.type') != 'admin') OR
    (JSON_EXTRACT(account, '$.verification.identity_verified') = true)
  )
) WITH (
  validation_level = 'strict',
  validation_action = 'error'
);

-- Insert data with comprehensive validation
INSERT INTO users (
  email, username, password_hash, profile, preferences, account
) VALUES (
  '[email protected]',
  'johndoe123', 
  '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBxJzybKlJNcX.',
  JSON_OBJECT(
    'first_name', 'John',
    'last_name', 'Doe',
    'birth_date', '1990-05-15',
    'phone_number', '+1-555-123-4567',
    'bio', 'Software engineer passionate about technology',
    'social_links', JSON_OBJECT(
      'twitter', '@johndoe',
      'github', 'johndoe',
      'linkedin', 'john-doe-dev'
    )
  ),
  JSON_OBJECT(
    'notifications', JSON_OBJECT(
      'email', JSON_OBJECT(
        'marketing', false,
        'security', true,
        'frequency', 'daily'
      ),
      'push', JSON_OBJECT(
        'enabled', true,
        'frequency', 'immediate'
      )
    ),
    'privacy', JSON_OBJECT(
      'profile_visibility', 'friends',
      'search_visibility', true
    ),
    'interface', JSON_OBJECT(
      'theme', 'dark',
      'language', 'en-US',
      'timezone', 'America/New_York'
    )
  ),
  JSON_OBJECT(
    'status', 'active',
    'type', 'free',
    'verification', JSON_OBJECT(
      'email_verified', false,
      'verification_level', 'none'
    ),
    'security', JSON_OBJECT(
      'two_factor_enabled', false,
      'failed_login_attempts', 0
    )
  )
);

-- Advanced validation queries and data quality checks
WITH validation_analysis AS (
  SELECT 
    _id,
    email,
    username,

    -- Profile completeness scoring
    CASE 
      WHEN JSON_EXTRACT(profile, '$.bio') IS NOT NULL 
           AND JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL
           AND JSON_EXTRACT(profile, '$.social_links') IS NOT NULL THEN 100
      WHEN JSON_EXTRACT(profile, '$.bio') IS NOT NULL 
           OR JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL THEN 70
      WHEN JSON_EXTRACT(profile, '$.first_name') IS NOT NULL 
           AND JSON_EXTRACT(profile, '$.last_name') IS NOT NULL THEN 40
      ELSE 20
    END as profile_completeness_score,

    -- Preference configuration analysis
    CASE 
      WHEN JSON_EXTRACT(preferences, '$.notifications') IS NOT NULL
           AND JSON_EXTRACT(preferences, '$.privacy') IS NOT NULL
           AND JSON_EXTRACT(preferences, '$.interface') IS NOT NULL THEN 'complete'
      WHEN JSON_EXTRACT(preferences, '$.notifications') IS NOT NULL THEN 'partial'
      ELSE 'minimal'
    END as preferences_status,

    -- Account validation status
    JSON_EXTRACT(account, '$.status') as account_status,
    JSON_EXTRACT(account, '$.type') as account_type,
    JSON_EXTRACT(account, '$.verification.verification_level') as verification_level,

    -- Data quality flags
    JSON_VALID(profile) as profile_valid,
    JSON_VALID(preferences) as preferences_valid,
    JSON_VALID(account) as account_valid,

    -- Business rule compliance
    CASE 
      WHEN JSON_EXTRACT(account, '$.type') = 'premium' 
           AND JSON_EXTRACT(account, '$.subscription_expires_at') IS NULL THEN false
      ELSE true
    END as subscription_rule_compliant,

    created_at,
    updated_at

  FROM users
  WHERE deleted_at IS NULL
),

data_quality_report AS (
  SELECT 
    COUNT(*) as total_users,

    -- Profile quality metrics
    AVG(profile_completeness_score) as avg_profile_completeness,
    COUNT(*) FILTER (WHERE profile_completeness_score >= 80) as high_quality_profiles,
    COUNT(*) FILTER (WHERE profile_completeness_score < 50) as low_quality_profiles,

    -- Validation compliance
    COUNT(*) FILTER (WHERE profile_valid = false) as invalid_profiles,
    COUNT(*) FILTER (WHERE preferences_valid = false) as invalid_preferences,
    COUNT(*) FILTER (WHERE account_valid = false) as invalid_accounts,

    -- Business rule compliance
    COUNT(*) FILTER (WHERE subscription_rule_compliant = false) as subscription_violations,

    -- Account distribution
    COUNT(*) FILTER (WHERE account_type = 'free') as free_accounts,
    COUNT(*) FILTER (WHERE account_type = 'premium') as premium_accounts,
    COUNT(*) FILTER (WHERE account_type = 'enterprise') as enterprise_accounts,

    -- Verification status
    COUNT(*) FILTER (WHERE verification_level = 'none') as unverified_users,
    COUNT(*) FILTER (WHERE verification_level IN ('email', 'phone', 'identity', 'full')) as verified_users,

    -- Recent activity
    COUNT(*) FILTER (WHERE created_at >= CURRENT_DATE - INTERVAL '30 days') as new_users_30d,
    COUNT(*) FILTER (WHERE updated_at >= CURRENT_DATE - INTERVAL '7 days') as active_users_7d

  FROM validation_analysis
)

SELECT 
  total_users,
  ROUND(avg_profile_completeness, 1) as avg_profile_quality,
  ROUND((high_quality_profiles / total_users::float * 100), 1) as high_quality_pct,
  ROUND((low_quality_profiles / total_users::float * 100), 1) as low_quality_pct,

  -- Data integrity summary
  CASE 
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) = 0 THEN 'excellent'
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) < total_users * 0.01 THEN 'good'
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) < total_users * 0.05 THEN 'acceptable'
    ELSE 'poor'
  END as data_integrity_status,

  -- Business rule compliance
  CASE 
    WHEN subscription_violations = 0 THEN 'compliant'
    WHEN subscription_violations < total_users * 0.01 THEN 'minor_issues'
    ELSE 'major_violations'
  END as business_rule_compliance,

  -- Account distribution summary
  JSON_OBJECT(
    'free', free_accounts,
    'premium', premium_accounts, 
    'enterprise', enterprise_accounts
  ) as account_distribution,

  -- Verification summary
  ROUND((verified_users / total_users::float * 100), 1) as verification_rate_pct,

  -- Growth metrics
  new_users_30d,
  active_users_7d,

  -- Recommendations
  CASE 
    WHEN low_quality_profiles > total_users * 0.3 THEN 'Focus on profile completion campaigns'
    WHEN unverified_users > total_users * 0.5 THEN 'Improve verification processes'
    WHEN subscription_violations > 0 THEN 'Review premium account management'
    ELSE 'Data quality is good'
  END as primary_recommendation

FROM data_quality_report;

-- Schema evolution with validation migration
-- Add new validation rules with backward compatibility
ALTER TABLE users 
ADD CONSTRAINT enhanced_email_validation CHECK (
  email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' AND
  email NOT LIKE '%@example.com' AND
  email NOT LIKE '%@test.%' AND
  LENGTH(email) >= 5 AND
  LENGTH(email) <= 254
);

-- Modify existing constraints with migration support
ALTER TABLE users 
MODIFY CONSTRAINT profile_completeness CHECK (
  (JSON_EXTRACT(account, '$.type') NOT IN ('premium', 'enterprise')) OR
  (
    JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.bio') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.social_links') IS NOT NULL
  )
);

-- Add conditional validation based on account age
ALTER TABLE users
ADD CONSTRAINT mature_account_validation CHECK (
  (DATEDIFF(CURRENT_DATE, created_at) < 30) OR
  (
    JSON_EXTRACT(account, '$.verification.email_verified') = true AND
    profile_completeness_score >= 60
  )
);

-- Create validation monitoring view
CREATE VIEW user_validation_status AS
SELECT 
  _id,
  email,
  username,
  JSON_EXTRACT(account, '$.status') as status,
  JSON_EXTRACT(account, '$.type') as type,

  -- Validation status flags
  JSON_VALID(profile) as profile_structure_valid,
  JSON_VALID(preferences) as preferences_structure_valid,
  JSON_VALID(account) as account_structure_valid,

  -- Business rule compliance checks
  (
    JSON_EXTRACT(account, '$.type') != 'premium' OR 
    JSON_EXTRACT(account, '$.subscription_expires_at') IS NOT NULL
  ) as subscription_valid,

  (
    JSON_EXTRACT(account, '$.type') != 'admin' OR
    JSON_EXTRACT(account, '$.verification.identity_verified') = true
  ) as admin_verification_valid,

  -- Data completeness assessment  
  CASE 
    WHEN JSON_EXTRACT(profile, '$.first_name') IS NULL THEN 'missing_required_profile_data'
    WHEN JSON_EXTRACT(profile, '$.phone_number') IS NULL 
         AND JSON_EXTRACT(account, '$.type') IN ('premium', 'enterprise') THEN 'incomplete_premium_profile'
    WHEN email NOT REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'invalid_email_format'
    ELSE 'valid'
  END as validation_status,

  created_at,
  updated_at

FROM users
WHERE deleted_at IS NULL;

-- QueryLeaf provides comprehensive document validation capabilities:
-- 1. SQL-familiar constraint syntax with CHECK clauses and business logic
-- 2. JSON validation functions for nested object and array validation  
-- 3. Conditional validation based on field values and account types
-- 4. Complex business rule enforcement through constraint expressions
-- 5. Schema evolution support with backward compatibility options
-- 6. Data quality monitoring and validation status reporting
-- 7. Integration with MongoDB's native document validation features
-- 8. Familiar SQL patterns for constraint management and modification
-- 9. Real-time validation feedback and error handling
-- 10. Comprehensive validation reporting and compliance tracking

Best Practices for Document Validation Implementation

Validation Strategy Design

Essential principles for effective MongoDB document validation:

  1. Progressive Validation: Start with loose validation and progressively tighten rules as data quality improves
  2. Business Rule Integration: Embed business logic directly into validation rules for consistency
  3. Schema Versioning: Implement versioning strategies for smooth schema evolution
  4. Performance Consideration: Balance validation thoroughness with insertion performance
  5. Error Handling: Design clear, actionable error messages for validation failures (see the sketch after this list)
  6. Testing Strategy: Thoroughly test validation rules with edge cases and invalid data
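
For error handling in particular, recent MongoDB releases (5.0+) attach structured failure details to validation errors that can be turned into actionable messages. The sketch below is a minimal illustration for the Node.js driver; the helper name is hypothetical, and the availability of errInfo details depends on server and driver versions.

// Hedged sketch: surfacing document validation failure details on insert
async function insertWithValidationFeedback(collection, doc) {
  try {
    return await collection.insertOne(doc);
  } catch (error) {
    // Error code 121 = DocumentValidationFailure; on MongoDB 5.0+ the server
    // includes errInfo.details describing which validation rules failed
    if (error.code === 121) {
      console.error('Validation failed:', JSON.stringify(error.errInfo?.details ?? {}, null, 2));
    }
    throw error;
  }
}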

Production Implementation

Optimize MongoDB document validation for production environments:

  1. Validation Levels: Use appropriate validation levels (strict, moderate, off) for different environments (see the sketch after this list)
  2. Migration Planning: Plan validation changes with proper testing and rollback strategies
  3. Performance Monitoring: Monitor validation impact on write performance and throughput
  4. Data Quality Tracking: Implement comprehensive data quality monitoring and alerting
  5. Documentation: Maintain clear documentation of validation rules and business logic
  6. Compliance Integration: Align validation rules with regulatory and compliance requirements
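
As a concrete illustration of the first point, validation settings can be driven by environment configuration. The sketch below uses hypothetical environment names and a hypothetical helper; db.command is the Node.js driver's counterpart to the shell's runCommand.

// Hedged sketch: environment-specific validation settings applied via collMod
async function applyEnvironmentValidation(db, collectionName, validator, env) {
  const settings = {
    development: { validationLevel: 'moderate', validationAction: 'warn' },
    staging: { validationLevel: 'strict', validationAction: 'warn' },
    production: { validationLevel: 'strict', validationAction: 'error' }
  }[env] || { validationLevel: 'strict', validationAction: 'error' };

  return db.command({ collMod: collectionName, validator, ...settings });
}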

Conclusion

MongoDB Document Validation provides the perfect balance between schema flexibility and data integrity, enabling applications to evolve rapidly while maintaining data quality and consistency. The powerful validation system supports complex business logic, nested object validation, and gradual schema evolution without the rigid constraints and expensive migrations of traditional relational systems.

Key MongoDB Document Validation benefits include:

  • Flexible Schema Evolution: Modify validation rules without downtime or complex migrations
  • Rich Validation Logic: Support for complex business rules, nested objects, and conditional constraints
  • JSON Schema Standard: Industry-standard validation with MongoDB-specific enhancements
  • Performance Integration: Validation optimizations that work with MongoDB's document architecture
  • Development Agility: Real-time validation feedback that accelerates development cycles
  • Data Quality Assurance: Comprehensive validation reporting and quality monitoring capabilities

Whether you're building user management systems, e-commerce platforms, content management applications, or any system requiring reliable data integrity with flexible schema design, MongoDB Document Validation with QueryLeaf's familiar SQL interface provides the foundation for robust, maintainable data validation.

QueryLeaf Integration: QueryLeaf automatically handles MongoDB document validation while providing SQL-familiar constraint syntax, validation functions, and schema management operations. Complex validation rules, business logic constraints, and data quality monitoring are seamlessly managed through familiar SQL constructs, making sophisticated document validation both powerful and accessible to SQL-oriented development teams.

The combination of flexible document validation with SQL-style operations makes MongoDB an ideal platform for applications requiring both rigorous data integrity and rapid schema evolution, ensuring your applications can adapt to changing requirements while maintaining the highest standards of data quality and consistency.

MongoDB Indexing Strategies and Performance Optimization: Advanced Techniques for High-Performance Database Operations

High-performance database applications depend heavily on strategic indexing to deliver fast query response times, efficient data retrieval, and optimal resource utilization. Poor indexing decisions can lead to slow queries, excessive memory usage, and degraded application performance that becomes increasingly problematic as data volumes grow.

MongoDB's flexible indexing system provides powerful capabilities for optimizing query performance across diverse data patterns and access scenarios. Unlike rigid relational indexing approaches, MongoDB indexes support complex document structures, array fields, geospatial data, and text search, enabling sophisticated optimization strategies that align with modern application requirements while maintaining query performance at scale.

The Traditional Database Indexing Limitations

Conventional relational database indexing approaches have significant constraints for modern application patterns:

-- Traditional PostgreSQL indexing - rigid structure with limited flexibility

-- Basic single-column indexes with limited optimization potential
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_created_at ON users(created_at);
CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_users_country ON users(country);

-- Simple compound index with fixed column order
CREATE INDEX idx_users_country_status_created ON users(country, status, created_at);

-- Basic partial index (PostgreSQL specific)
CREATE INDEX idx_active_users_email ON users(email) 
WHERE status = 'active';

-- Limited text search capabilities
CREATE INDEX idx_users_name_fts ON users 
USING GIN(to_tsvector('english', first_name || ' ' || last_name));

-- Complex query with multiple conditions
WITH user_search AS (
  SELECT 
    user_id,
    email,
    first_name,
    last_name,
    status,
    country,
    created_at,
    last_login_at,

    -- Multiple index usage may not be optimal
    CASE 
      WHEN status = 'active' AND last_login_at >= CURRENT_DATE - INTERVAL '30 days' THEN 'active_recent'
      WHEN status = 'active' AND last_login_at < CURRENT_DATE - INTERVAL '30 days' THEN 'active_stale'
      WHEN status = 'inactive' THEN 'inactive'
      ELSE 'pending'
    END as user_category,

    -- Basic scoring for relevance
    CASE country
      WHEN 'US' THEN 3
      WHEN 'CA' THEN 2  
      WHEN 'UK' THEN 2
      ELSE 1
    END as priority_score

  FROM users
  WHERE 
    -- Multiple WHERE conditions that may require different indexes
    status IN ('active', 'inactive') 
    AND country IN ('US', 'CA', 'UK', 'AU', 'DE')
    AND created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND (
      email ILIKE '%@company.com' OR 
      first_name ILIKE 'John%' OR
      last_name ILIKE 'Smith%'
    )
),

user_enrichment AS (
  SELECT 
    us.*,

    -- Subquery requiring additional index
    (SELECT COUNT(*) 
     FROM orders o 
     WHERE o.user_id = us.user_id 
       AND o.created_at >= CURRENT_DATE - INTERVAL '1 year'
    ) as orders_last_year,

    -- Another subquery with different access pattern
    (SELECT SUM(total_amount) 
     FROM orders o 
     WHERE o.user_id = us.user_id 
       AND o.status = 'completed'
    ) as total_spent,

    -- JSON field access (limited optimization)
    preferences->>'theme' as preferred_theme,
    preferences->>'language' as preferred_language,

    -- Array field contains check (poor performance without GIN)
    CASE 
      WHEN tags && ARRAY['premium', 'vip'] THEN true 
      ELSE false 
    END as is_premium_user

  FROM user_search us
),

final_results AS (
  SELECT 
    ue.user_id,
    ue.email,
    ue.first_name,
    ue.last_name,
    ue.status,
    ue.country,
    ue.user_category,
    ue.priority_score,
    ue.orders_last_year,
    ue.total_spent,
    ue.preferred_theme,
    ue.preferred_language,
    ue.is_premium_user,

    -- Complex ranking calculation
    (ue.priority_score * 0.3 + 
     CASE 
       WHEN ue.orders_last_year > 10 THEN 5
       WHEN ue.orders_last_year > 5 THEN 3
       WHEN ue.orders_last_year > 0 THEN 1
       ELSE 0
     END * 0.4 +
     CASE
       WHEN ue.total_spent > 1000 THEN 5
       WHEN ue.total_spent > 500 THEN 3
       WHEN ue.total_spent > 100 THEN 1
       ELSE 0
     END * 0.3
    ) as relevance_score,

    -- Row number for pagination
    ROW_NUMBER() OVER (
      ORDER BY 
        ue.priority_score DESC,
        ue.orders_last_year DESC,
        ue.total_spent DESC,
        ue.created_at DESC
    ) as row_num,

    COUNT(*) OVER () as total_results

  FROM user_enrichment ue
  WHERE ue.orders_last_year > 0 OR ue.total_spent > 50
)

SELECT 
  user_id,
  email,
  first_name || ' ' || last_name as full_name,
  status,
  country,
  user_category,
  orders_last_year,
  ROUND(total_spent::numeric, 2) as total_spent,
  is_premium_user,
  ROUND(relevance_score::numeric, 2) as relevance_score,
  row_num,
  total_results

FROM final_results
WHERE row_num BETWEEN 1 AND 50
ORDER BY relevance_score DESC, row_num ASC;

-- PostgreSQL indexing problems:
-- 1. Fixed column order in compound indexes limits query flexibility
-- 2. Limited support for JSON field indexing and optimization  
-- 3. Poor performance with array field operations and contains queries
-- 4. Complex partial index syntax with limited conditional logic
-- 5. Inefficient handling of multi-field text search scenarios
-- 6. Index maintenance overhead increases significantly with table size
-- 7. Limited support for dynamic query patterns and field combinations
-- 8. Poor integration with application-level data structures
-- 9. Complex index selection logic requires deep database expertise
-- 10. Inflexible index types for specialized data patterns (geo, time-series)

-- Additional index requirements for above query
CREATE INDEX idx_users_compound_search ON users(status, country, created_at) 
WHERE status IN ('active', 'inactive');

CREATE INDEX idx_users_email_pattern ON users(email) 
WHERE email LIKE '%@company.com';

CREATE INDEX idx_users_name_pattern ON users(first_name, last_name) 
WHERE first_name LIKE 'John%' OR last_name LIKE 'Smith%';

CREATE INDEX idx_orders_user_recent ON orders(user_id, created_at) 
WHERE created_at >= CURRENT_DATE - INTERVAL '1 year';

CREATE INDEX idx_orders_user_completed ON orders(user_id, total_amount) 
WHERE status = 'completed';

-- JSON field indexing (limited capabilities)
CREATE INDEX idx_users_preferences_gin ON users USING GIN(preferences);

-- Array field indexing  
CREATE INDEX idx_users_tags_gin ON users USING GIN(tags);

-- MySQL approach (even more limited)
-- Basic indexes only
CREATE INDEX idx_mysql_users_email ON mysql_users(email);
CREATE INDEX idx_mysql_users_status_country ON mysql_users(status, country);
CREATE INDEX idx_mysql_users_created ON mysql_users(created_at);

-- Limited JSON support in older versions
-- ALTER TABLE mysql_users ADD INDEX idx_preferences ((preferences->>'$.theme'));

-- Basic query with limited optimization
SELECT 
  user_id,
  email,
  first_name,
  last_name,
  status,
  country,
  created_at
FROM mysql_users
WHERE status = 'active' 
  AND country IN ('US', 'CA')
  AND created_at >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
ORDER BY created_at DESC
LIMIT 50;

-- MySQL limitations:
-- - Very limited JSON indexing capabilities
-- - No partial indexes or conditional indexing
-- - Basic compound index support with rigid column ordering
-- - Poor performance with complex queries and joins
-- - Limited text search capabilities without additional engines
-- - Minimal support for array operations and specialized data types
-- - Simple index optimization with limited query planner sophistication

MongoDB's advanced indexing system provides comprehensive optimization capabilities:

// MongoDB Advanced Indexing - flexible, powerful, and application-optimized
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('user_analytics_platform');

// Advanced MongoDB indexing strategy manager
class MongoDBIndexingManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics'),
      indexMetrics: db.collection('index_metrics')
    };
    this.indexingStrategies = new Map();
    this.performanceTargets = {
      maxQueryTime: 100, // milliseconds
      maxIndexSize: 1024, // MB
      minSelectivity: 0.01 // 1% selectivity threshold
    };
  }

  async createComprehensiveIndexingStrategy() {
    console.log('Creating comprehensive MongoDB indexing strategy...');

    // 1. Single field indexes for basic queries
    await this.createSingleFieldIndexes();

    // 2. Compound indexes for complex multi-field queries
    await this.createCompoundIndexes();

    // 3. Partial indexes for filtered queries
    await this.createPartialIndexes();

    // 4. Text indexes for search functionality
    await this.createTextSearchIndexes();

    // 5. Geospatial indexes for location-based queries
    await this.createGeospatialIndexes();

    // 6. Sparse indexes for optional fields
    await this.createSparseIndexes();

    // 7. TTL indexes for data expiration
    await this.createTTLIndexes();

    // 8. Wildcard indexes for flexible schemas
    await this.createWildcardIndexes();

    console.log('Comprehensive indexing strategy implemented successfully');
  }

  async createSingleFieldIndexes() {
    console.log('Creating optimized single field indexes...');

    const userIndexes = [
      // High-cardinality unique fields
      { email: 1 }, // Unique identifier, high selectivity
      { username: 1 }, // Unique identifier, high selectivity

      // High-frequency filter fields
      { status: 1 }, // Limited values but frequently queried
      { country: 1 }, // Geographic filtering
      { accountType: 1 }, // User segmentation

      // Temporal fields for range queries
      { createdAt: 1 }, // Registration date queries
      { lastLoginAt: 1 }, // Activity-based filtering
      { subscriptionExpiresAt: 1 }, // Subscription management

      // Numerical fields for range and sort operations
      { totalSpent: -1 }, // Customer value analysis (descending)
      { loyaltyPoints: -1 }, // Rewards program queries
      { riskScore: 1 } // Security and fraud detection
    ];

    for (const indexSpec of userIndexes) {
      const fieldName = Object.keys(indexSpec)[0];
      const indexName = `idx_users_${fieldName}`;

      try {
        await this.collections.users.createIndex(indexSpec, {
          name: indexName,
          background: true,
          // Add performance hints
          partialFilterExpression: this.getPartialFilterForField(fieldName)
        });

        console.log(`Created single field index: ${indexName}`);
        await this.recordIndexMetrics(indexName, 'single_field', indexSpec);

      } catch (error) {
        console.error(`Failed to create index ${indexName}:`, error);
      }
    }

    // Order indexes for e-commerce scenarios
    const orderIndexes = [
      { userId: 1 }, // Customer order lookup
      { status: 1 }, // Order status filtering
      { createdAt: -1 }, // Recent orders first
      { totalAmount: -1 }, // High-value orders
      { paymentStatus: 1 }, // Payment tracking
      { shippingMethod: 1 } // Fulfillment queries
    ];

    for (const indexSpec of orderIndexes) {
      const fieldName = Object.keys(indexSpec)[0];
      const indexName = `idx_orders_${fieldName}`;

      await this.collections.orders.createIndex(indexSpec, {
        name: indexName,
        background: true
      });

      console.log(`Created order index: ${indexName}`);
    }
  }

  async createCompoundIndexes() {
    console.log('Creating optimized compound indexes...');

    // User compound indexes following ESR (Equality, Sort, Range) rule
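    // Illustrative mapping (assumed query shape): a query such as
    //   find({ country: 'US', status: 'active', createdAt: { $gte: cutoff } }).sort({ createdAt: -1 })
    // places its equality predicates (country, status) on the leading index fields and its
    // sort/range field (createdAt) last, matching idx_users_country_status_created below.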
    const userCompoundIndexes = [
      {
        name: 'idx_users_country_status_created',
        spec: { country: 1, status: 1, createdAt: -1 },
        purpose: 'Geographic user filtering with status and recency',
        queryPatterns: ['country + status filters', 'country + status + date range']
      },
      {
        name: 'idx_users_status_activity_spent',
        spec: { status: 1, lastLoginAt: -1, totalSpent: -1 },
        purpose: 'Active user analysis with spending patterns',
        queryPatterns: ['status + activity analysis', 'customer value segmentation']
      },
      {
        name: 'idx_users_type_tier_points',
        spec: { accountType: 1, loyaltyTier: 1, loyaltyPoints: -1 },
        purpose: 'Customer segmentation and loyalty program queries',
        queryPatterns: ['loyalty program analysis', 'customer tier management']
      },
      {
        name: 'idx_users_email_verification_created',
        spec: { 'verification.email': 1, 'verification.phone': 1, createdAt: -1 },
        purpose: 'User verification status with registration timeline',
        queryPatterns: ['verification status queries', 'onboarding analytics']
      },
      {
        name: 'idx_users_preferences_activity',
        spec: { 'preferences.marketing': 1, 'preferences.notifications': 1, lastLoginAt: -1 },
        purpose: 'Marketing segmentation with activity correlation',
        queryPatterns: ['marketing campaign targeting', 'notification preferences']
      }
    ];

    for (const indexConfig of userCompoundIndexes) {
      try {
        await this.collections.users.createIndex(indexConfig.spec, {
          name: indexConfig.name,
          background: true
        });

        console.log(`Created compound index: ${indexConfig.name}`);
        console.log(`  Purpose: ${indexConfig.purpose}`);
        console.log(`  Query patterns: ${indexConfig.queryPatterns.join(', ')}`);

        await this.recordIndexMetrics(indexConfig.name, 'compound', indexConfig.spec, {
          purpose: indexConfig.purpose,
          queryPatterns: indexConfig.queryPatterns
        });

      } catch (error) {
        console.error(`Failed to create compound index ${indexConfig.name}:`, error);
      }
    }

    // Order compound indexes for e-commerce analytics
    const orderCompoundIndexes = [
      {
        name: 'idx_orders_user_status_date',
        spec: { userId: 1, status: 1, createdAt: -1 },
        purpose: 'Customer order history with status filtering'
      },
      {
        name: 'idx_orders_status_payment_amount',
        spec: { status: 1, paymentStatus: 1, totalAmount: -1 },
        purpose: 'Revenue analysis and payment processing queries'
      },
      {
        name: 'idx_orders_product_date_amount',
        spec: { 'items.productId': 1, createdAt: -1, totalAmount: -1 },
        purpose: 'Product performance analysis with sales trends'
      },
      {
        name: 'idx_orders_shipping_region_date',
        spec: { 'shippingAddress.country': 1, 'shippingAddress.state': 1, createdAt: -1 },
        purpose: 'Geographic sales analysis and shipping optimization'
      }
    ];

    for (const indexConfig of orderCompoundIndexes) {
      await this.collections.orders.createIndex(indexConfig.spec, {
        name: indexConfig.name,
        background: true
      });

      console.log(`Created order compound index: ${indexConfig.name}`);
    }
  }

  async createPartialIndexes() {
    console.log('Creating partial indexes for filtered queries...');

    const partialIndexes = [
      {
        name: 'idx_users_active_email',
        collection: 'users',
        spec: { email: 1 },
        filter: { status: 'active' },
        purpose: 'Active user email lookups (reduces index size by ~70%)'
      },
      {
        name: 'idx_users_premium_spending',
        collection: 'users', 
        spec: { totalSpent: -1, loyaltyPoints: -1 },
        filter: { accountType: 'premium' },
        purpose: 'Premium customer analysis and loyalty tracking'
      },
      {
        name: 'idx_users_recent_active',
        collection: 'users',
        spec: { lastLoginAt: -1, country: 1 },
        filter: { 
          status: 'active',
          lastLoginAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
        },
        purpose: 'Recently active users for engagement campaigns'
      },
      {
        name: 'idx_orders_high_value_completed',
        collection: 'orders',
        spec: { totalAmount: -1, createdAt: -1 },
        filter: { 
          status: 'completed',
          totalAmount: { $gte: 500 }
        },
        purpose: 'High-value completed orders for VIP customer analysis'
      },
      {
        name: 'idx_orders_pending_payment',
        collection: 'orders',
        spec: { createdAt: 1, userId: 1 },
        filter: {
          status: { $in: ['pending', 'processing'] },
          paymentStatus: 'pending'
        },
        purpose: 'Orders requiring payment processing attention'
      },
      {
        name: 'idx_users_verification_required',
        collection: 'users',
        spec: { createdAt: 1, riskScore: -1 },
        filter: {
          $or: [
            { 'verification.email': false },
            { 'verification.phone': false },
            { 'verification.identity': false }
          ]
        },
        purpose: 'Users requiring additional verification steps'
      }
    ];

    for (const partialIndex of partialIndexes) {
      try {
        const collection = this.collections[partialIndex.collection];

        await collection.createIndex(partialIndex.spec, {
          name: partialIndex.name,
          partialFilterExpression: partialIndex.filter,
          background: true
        });

        console.log(`Created partial index: ${partialIndex.name}`);
        console.log(`  Filter: ${JSON.stringify(partialIndex.filter)}`);
        console.log(`  Purpose: ${partialIndex.purpose}`);

        // Measure index size reduction
        const fullIndexStats = await this.estimateIndexSize(partialIndex.spec);
        const partialIndexStats = await collection.aggregate([
          { $match: partialIndex.filter },
          { $count: "documentCount" }
        ]).toArray();

        const reductionPercent = ((1 - (partialIndexStats[0]?.documentCount || 0) / fullIndexStats.documentCount) * 100).toFixed(1);
        console.log(`  Index size reduction: ~${reductionPercent}%`);

      } catch (error) {
        console.error(`Failed to create partial index ${partialIndex.name}:`, error);
      }
    }
  }
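
  // Usage note (illustrative sketch, not part of the original class): the planner only
  // selects a partial index when the query predicate implies the partial filter, so the
  // status condition must appear in the query itself.
  async findActiveUserByEmail(email) {
    // Matches the { status: 'active' } partial filter, making idx_users_active_email eligible
    return this.collections.users.findOne({ email: email, status: 'active' });
  }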

  async createTextSearchIndexes() {
    console.log('Creating text search indexes for full-text search...');

    const textIndexes = [
      {
        name: 'idx_users_fulltext_search',
        collection: 'users',
        spec: {
          firstName: 'text',
          lastName: 'text',
          email: 'text',
          'profile.bio': 'text',
          'profile.company': 'text'
        },
        weights: {
          firstName: 10,
          lastName: 10,
          email: 5,
          'profile.bio': 1,
          'profile.company': 3
        },
        purpose: 'Comprehensive user search across name, email, and profile data'
      },
      {
        name: 'idx_products_search',
        collection: 'products',
        spec: {
          name: 'text',
          description: 'text',
          brand: 'text',
          'tags': 'text',
          'specifications.features': 'text'
        },
        weights: {
          name: 20,
          brand: 15,
          tags: 10,
          description: 5,
          'specifications.features': 3
        },
        purpose: 'Product catalog search with relevance weighting'
      },
      {
        name: 'idx_orders_search',
        collection: 'orders',
        spec: {
          orderNumber: 'text',
          'customer.email': 'text',
          'items.productName': 'text',
          'shippingAddress.street': 'text',
          'shippingAddress.city': 'text'
        },
        weights: {
          orderNumber: 20,
          'customer.email': 15,
          'items.productName': 10,
          'shippingAddress.street': 3,
          'shippingAddress.city': 5
        },
        purpose: 'Order search by number, customer, products, or shipping details'
      }
    ];

    for (const textIndex of textIndexes) {
      try {
        const collection = this.collections[textIndex.collection];

        await collection.createIndex(textIndex.spec, {
          name: textIndex.name,
          weights: textIndex.weights,
          background: true,
          // Configure text search options
          default_language: 'english',
          language_override: 'language' // Field name for document language
        });

        console.log(`Created text search index: ${textIndex.name}`);
        console.log(`  Purpose: ${textIndex.purpose}`);
        console.log(`  Weighted fields: ${Object.keys(textIndex.weights).join(', ')}`);

      } catch (error) {
        console.error(`Failed to create text index ${textIndex.name}:`, error);
      }
    }
  }
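
  // Usage note (illustrative sketch): text indexes are queried with $text and ranked
  // by textScore metadata; searchUsers is a hypothetical helper.
  async searchUsers(searchTerm, limit = 20) {
    return this.collections.users
      .find(
        { $text: { $search: searchTerm } },
        { projection: { score: { $meta: 'textScore' }, firstName: 1, lastName: 1, email: 1 } }
      )
      .sort({ score: { $meta: 'textScore' } })
      .limit(limit)
      .toArray();
  }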

  async createGeospatialIndexes() {
    console.log('Creating geospatial indexes for location-based queries...');

    const geoIndexes = [
      {
        name: 'idx_users_location_2dsphere',
        collection: 'users',
        spec: { 'location.coordinates': '2dsphere' },
        purpose: 'User location queries for proximity and regional analysis'
      },
      {
        name: 'idx_orders_shipping_location',
        collection: 'orders',
        spec: { 'shippingAddress.coordinates': '2dsphere' },
        purpose: 'Shipping destination analysis and route optimization'
      },
      {
        name: 'idx_stores_location_2dsphere',
        collection: 'stores',
        spec: { 'address.coordinates': '2dsphere' },
        purpose: 'Store locator and catchment area analysis'
      }
    ];

    for (const geoIndex of geoIndexes) {
      try {
        const collection = this.collections[geoIndex.collection] || this.db.collection(geoIndex.collection);

        await collection.createIndex(geoIndex.spec, {
          name: geoIndex.name,
          background: true,
          // 2dsphere specific options
          '2dsphereIndexVersion': 3 // Use latest version
        });

        console.log(`Created geospatial index: ${geoIndex.name}`);
        console.log(`  Purpose: ${geoIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create geo index ${geoIndex.name}:`, error);
      }
    }
  }
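
  // Usage note (illustrative sketch): 2dsphere indexes support $near and $geoWithin
  // queries; findUsersNear is a hypothetical helper and distances are in meters.
  async findUsersNear(longitude, latitude, maxDistanceMeters = 5000) {
    return this.collections.users.find({
      'location.coordinates': {
        $near: {
          $geometry: { type: 'Point', coordinates: [longitude, latitude] },
          $maxDistance: maxDistanceMeters
        }
      }
    }).limit(100).toArray();
  }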

  async createSparseIndexes() {
    console.log('Creating sparse indexes for optional fields...');

    const sparseIndexes = [
      {
        name: 'idx_users_social_profiles_sparse',
        collection: 'users',
        spec: { 'socialProfiles.twitter': 1, 'socialProfiles.linkedin': 1 },
        purpose: 'Social media profile lookups (only for users with social profiles)'
      },
      {
        name: 'idx_users_subscription_sparse',
        collection: 'users',
        spec: { 'subscription.planId': 1, 'subscription.renewsAt': 1 },
        purpose: 'Subscription management (only for subscribed users)'
      },
      {
        name: 'idx_users_referral_sparse',
        collection: 'users',
        spec: { 'referral.code': 1, 'referral.referredBy': 1 },
        purpose: 'Referral program tracking (only for users in referral program)'
      },
      {
        name: 'idx_orders_tracking_sparse',
        collection: 'orders',
        spec: { 'shipping.trackingNumber': 1, 'shipping.carrier': 1 },
        purpose: 'Package tracking (only for shipped orders)'
      }
    ];

    for (const sparseIndex of sparseIndexes) {
      try {
        const collection = this.collections[sparseIndex.collection];

        await collection.createIndex(sparseIndex.spec, {
          name: sparseIndex.name,
          sparse: true, // Skip documents where indexed fields are missing
          background: true
        });

        console.log(`Created sparse index: ${sparseIndex.name}`);
        console.log(`  Purpose: ${sparseIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create sparse index ${sparseIndex.name}:`, error);
      }
    }
  }
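
  // Note: a sparse index only contains entries for documents that have the indexed
  // fields, so queries or sorts that must also return documents missing those fields
  // cannot rely on the sparse index alone.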

  async createTTLIndexes() {
    console.log('Creating TTL indexes for automatic data expiration...');

    const ttlIndexes = [
      {
        name: 'idx_analytics_events_ttl',
        collection: 'analytics',
        spec: { createdAt: 1 },
        expireAfterSeconds: 30 * 24 * 60 * 60, // 30 days
        purpose: 'Automatic cleanup of analytics events after 30 days'
      },
      {
        name: 'idx_user_sessions_ttl',
        collection: 'userSessions',
        spec: { lastActivity: 1 },
        expireAfterSeconds: 7 * 24 * 60 * 60, // 7 days
        purpose: 'Session cleanup after 7 days of inactivity'
      },
      {
        name: 'idx_password_resets_ttl',
        collection: 'passwordResets',
        spec: { createdAt: 1 },
        expireAfterSeconds: 24 * 60 * 60, // 24 hours
        purpose: 'Password reset token expiration after 24 hours'
      },
      {
        name: 'idx_email_verification_ttl',
        collection: 'emailVerifications',
        spec: { createdAt: 1 },
        expireAfterSeconds: 7 * 24 * 60 * 60, // 7 days
        purpose: 'Email verification token cleanup after 7 days'
      }
    ];

    for (const ttlIndex of ttlIndexes) {
      try {
        const collection = this.db.collection(ttlIndex.collection);

        await collection.createIndex(ttlIndex.spec, {
          name: ttlIndex.name,
          expireAfterSeconds: ttlIndex.expireAfterSeconds,
          background: true
        });

        const expireDays = Math.round(ttlIndex.expireAfterSeconds / (24 * 60 * 60));
        console.log(`Created TTL index: ${ttlIndex.name} (expires after ${expireDays} days)`);
        console.log(`  Purpose: ${ttlIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create TTL index ${ttlIndex.name}:`, error);
      }
    }
  }

  async createWildcardIndexes() {
    console.log('Creating wildcard indexes for flexible schema queries...');

    const wildcardIndexes = [
      {
        name: 'idx_users_metadata_wildcard',
        collection: 'users',
        spec: { 'metadata.$**': 1 },
        purpose: 'Flexible querying of user metadata fields with varying schemas'
      },
      {
        name: 'idx_products_attributes_wildcard',
        collection: 'products',
        spec: { 'attributes.$**': 1 },
        purpose: 'Dynamic product attribute queries for catalog flexibility'
      },
      {
        name: 'idx_orders_customFields_wildcard',
        collection: 'orders',
        spec: { 'customFields.$**': 1 },
        purpose: 'Custom order fields for different business verticals'
      }
    ];

    for (const wildcardIndex of wildcardIndexes) {
      try {
        const collection = this.collections[wildcardIndex.collection] || this.db.collection(wildcardIndex.collection);

        await collection.createIndex(wildcardIndex.spec, {
          name: wildcardIndex.name,
          background: true
          // Note: the wildcardProjection option is only valid for all-field
          // wildcard indexes ({ "$**": 1 }), so it is omitted for these
          // field-scoped wildcard specs
        });

        console.log(`Created wildcard index: ${wildcardIndex.name}`);
        console.log(`  Purpose: ${wildcardIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create wildcard index ${wildcardIndex.name}:`, error);
      }
    }
  }

  async performQueryOptimizationAnalysis() {
    console.log('Performing comprehensive query optimization analysis...');

    const analysisResults = {
      slowQueries: [],
      indexUsage: [],
      recommendedIndexes: [],
      performanceMetrics: {}
    };

    // 1. Analyze slow queries from profiler data
    const slowQueries = await this.db.collection('system.profile').find({
      ts: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }, // Last 24 hours
      millis: { $gte: 100 } // Queries taking > 100ms
    }).sort({ millis: -1 }).limit(50).toArray();

    analysisResults.slowQueries = slowQueries.map(query => ({
      namespace: query.ns,
      duration: query.millis,
      command: query.command,
      executionStats: query.execStats,
      timestamp: query.ts,
      recommendation: this.generateOptimizationRecommendation(query)
    }));

    // 2. Analyze index usage statistics
    for (const collectionName of Object.keys(this.collections)) {
      const collection = this.collections[collectionName];

      try {
        const indexStats = await collection.aggregate([
          { $indexStats: {} }
        ]).toArray();

        // $indexStats reports usage counters only; per-index sizes come from collStats
        const collStats = await collection.stats();

        const indexUsage = indexStats.map(stat => {
          const size = collStats.indexSizes?.[stat.name] || 0;
          return {
            collection: collectionName,
            indexName: stat.name,
            usageCount: stat.accesses.ops,
            lastUsed: stat.accesses.since, // when usage counting started
            size,
            efficiency: this.calculateIndexEfficiency({ ...stat, size })
          };
        });

        analysisResults.indexUsage.push(...indexUsage);

      } catch (error) {
        console.warn(`Could not get index stats for ${collectionName}:`, error.message);
      }
    }

    // 3. Generate index recommendations
    analysisResults.recommendedIndexes = await this.generateIndexRecommendations(analysisResults.slowQueries);

    // 4. Calculate performance metrics
    analysisResults.performanceMetrics = await this.calculatePerformanceMetrics();

    console.log('Query optimization analysis completed');

    // Store analysis results for historical tracking
    await this.collections.indexMetrics.insertOne({
      analysisType: 'query_optimization',
      timestamp: new Date(),
      results: analysisResults
    });

    return analysisResults;
  }

  generateOptimizationRecommendation(slowQuery) {
    const recommendations = [];

    // The profiler document exposes the winning plan's stage tree in execStats
    // and top-level counters such as keysExamined and nreturned
    const hasStage = (stage, name) =>
      !!stage && (stage.stage === name || hasStage(stage.inputStage, name));

    // Check for missing indexes based on query pattern
    if (hasStage(slowQuery.execStats, 'COLLSCAN')) {
      recommendations.push('Query requires collection scan - consider adding index');
    }

    // Poor selectivity: many index keys examined relative to documents returned
    if (hasStage(slowQuery.execStats, 'IXSCAN') &&
        (slowQuery.keysExamined || 0) > Math.max(slowQuery.nreturned || 0, 1) * 10) {
      recommendations.push('Index selectivity is poor - consider compound index or partial index');
    }

    // Check for sort optimization: an in-memory SORT stage means the sort is not covered by an index
    if (slowQuery.command?.sort && hasStage(slowQuery.execStats, 'SORT')) {
      recommendations.push('Sort operation not using index - add sort fields to index');
    }

    // Check for projection optimization
    if (slowQuery.command?.projection && Object.keys(slowQuery.command.projection).length < 5) {
      recommendations.push('Consider covered query with projection fields in index');
    }

    return recommendations.length > 0 ? recommendations : ['Query performance acceptable'];
  }

  calculateIndexEfficiency(indexStat) {
    // Calculate index efficiency based on usage patterns
    const size = indexStat.size || 0;
    const usage = indexStat.accesses?.ops || 0;
    // $indexStats records when usage counting started (accesses.since), not creation time
    const daysSinceTracking = (Date.now() - new Date(indexStat.accesses?.since || Date.now())) / (24 * 60 * 60 * 1000);

    // Efficiency metric: usage per day per MB
    const efficiency = usage / Math.max(daysSinceTracking, 1) / Math.max(size / (1024 * 1024), 1);

    return Math.round(efficiency * 100) / 100;
  }

  async generateIndexRecommendations(slowQueries) {
    const recommendations = [];
    const queryPatterns = new Map();

    // Analyze query patterns to suggest indexes
    for (const query of slowQueries) {
      const command = query.command;
      if (!command?.find && !command?.aggregate) continue;

      const collection = query.namespace.split('.')[1];
      const filter = command.find ? command.filter :
                    command.pipeline?.[0]?.$match; // for aggregate commands, the pipeline holds the leading $match

      if (filter) {
        const pattern = this.extractQueryPattern(filter);
        const key = `${collection}:${pattern}`;

        if (!queryPatterns.has(key)) {
          queryPatterns.set(key, {
            collection,
            pattern,
            frequency: 0,
            avgDuration: 0,
            queries: []
          });
        }

        const patternData = queryPatterns.get(key);
        patternData.frequency++;
        patternData.avgDuration = (patternData.avgDuration * (patternData.frequency - 1) + query.duration) / patternData.frequency;
        patternData.queries.push(query);
      }
    }

    // Generate recommendations based on frequent slow patterns
    for (const [key, patternData] of queryPatterns) {
      if (patternData.frequency >= 3 && patternData.avgDuration >= 100) {
        const recommendedIndex = this.generateIndexSpecFromPattern(patternData.pattern);

        recommendations.push({
          collection: patternData.collection,
          recommendedIndex,
          reason: `Frequent slow queries (${patternData.frequency} occurrences, avg ${patternData.avgDuration}ms)`,
          queryPattern: patternData.pattern,
          estimatedImprovement: this.estimatePerformanceImprovement(patternData)
        });
      }
    }

    return recommendations;
  }

  extractQueryPattern(filter) {
    // Extract query pattern for index recommendation
    const pattern = {};

    for (const [field, condition] of Object.entries(filter)) {
      if (field === '$and' || field === '$or') {
        // Handle logical operators
        pattern[field] = 'logical_operator';
      } else if (typeof condition === 'object' && condition !== null) {
        // Handle range/comparison queries
        const operators = Object.keys(condition);
        if (operators.some(op => ['$gt', '$gte', '$lt', '$lte'].includes(op))) {
          pattern[field] = 'range';
        } else if (operators.includes('$in')) {
          pattern[field] = 'in_list';
        } else if (operators.includes('$regex')) {
          pattern[field] = 'regex';
        } else {
          pattern[field] = 'equality';
        }
      } else {
        pattern[field] = 'equality';
      }
    }

    return JSON.stringify(pattern);
  }

  generateIndexSpecFromPattern(patternStr) {
    const pattern = JSON.parse(patternStr);
    const indexSpec = {};

    // Apply ESR (Equality, Sort, Range) rule
    const equalityFields = [];
    const rangeFields = [];

    for (const [field, type] of Object.entries(pattern)) {
      if (type === 'equality' || type === 'in_list') {
        equalityFields.push(field);
      } else if (type === 'range') {
        rangeFields.push(field);
      }
    }

    // Build index spec: equality fields first, then range fields
    for (const field of equalityFields) {
      indexSpec[field] = 1;
    }
    for (const field of rangeFields) {
      indexSpec[field] = 1;
    }

    return indexSpec;
  }

  estimatePerformanceImprovement(patternData) {
    // Estimate performance improvement based on query characteristics
    const baseImprovement = 50; // Base 50% improvement assumption

    // Higher improvement for collection scans
    if (patternData.queries.some(q => q.executionStats?.stage === 'COLLSCAN')) {
      return Math.min(90, baseImprovement + 30);
    }

    // Moderate improvement for index scans with poor selectivity
    if (patternData.avgDuration > 500) {
      return Math.min(80, baseImprovement + 20);
    }

    return baseImprovement;
  }

  async calculatePerformanceMetrics() {
    const metrics = {};

    try {
      // Get database stats
      const dbStats = await this.db.stats();
      metrics.totalIndexSize = dbStats.indexSize;
      metrics.totalDataSize = dbStats.dataSize;
      metrics.indexToDataRatio = (dbStats.indexSize / dbStats.dataSize * 100).toFixed(1) + '%';

      // Get collection-level metrics
      for (const collectionName of Object.keys(this.collections)) {
        const collection = this.collections[collectionName];
        const stats = await collection.stats();

        metrics[collectionName] = {
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          totalIndexSize: stats.totalIndexSize,
          indexSizeRatio: (stats.totalIndexSize / stats.size * 100).toFixed(1) + '%'
        };
      }

    } catch (error) {
      console.warn('Could not calculate all performance metrics:', error.message);
    }

    return metrics;
  }

  async recordIndexMetrics(indexName, indexType, indexSpec, metadata = {}) {
    try {
      await this.collections.indexMetrics.insertOne({
        indexName,
        indexType,
        indexSpec,
        metadata,
        createdAt: new Date(),
        status: 'active'
      });
    } catch (error) {
      console.warn('Failed to record index metrics:', error.message);
    }
  }

  getPartialFilterForField(fieldName) {
    // Return appropriate partial filter expressions for common fields
    const partialFilters = {
      email: { email: { $exists: true, $ne: null } },
      lastLoginAt: { lastLoginAt: { $exists: true } },
      totalSpent: { totalSpent: { $gt: 0 } },
      riskScore: { riskScore: { $exists: true } }
    };

    return partialFilters[fieldName] || null;
  }

  async estimateIndexSize(indexSpec) {
    // Estimate index size based on collection statistics
    try {
      const collection = this.collections.users; // Default to users collection
      const sampleDoc = await collection.findOne();
      const stats = await collection.stats();

      if (sampleDoc && stats) {
        const avgDocSize = stats.avgObjSize;
        const fieldSize = this.estimateFieldSize(sampleDoc, Object.keys(indexSpec));
        const indexOverhead = fieldSize * 1.2; // 20% overhead for B-tree structure

        return {
          documentCount: stats.count,
          estimatedIndexSize: indexOverhead * stats.count,
          avgFieldSize: fieldSize
        };
      }
    } catch (error) {
      console.warn('Could not estimate index size:', error.message);
    }

    return { documentCount: 0, estimatedIndexSize: 0, avgFieldSize: 0 };
  }

  estimateFieldSize(document, fieldNames) {
    let totalSize = 0;

    for (const fieldName of fieldNames) {
      const value = this.getNestedValue(document, fieldName);
      totalSize += this.calculateValueSize(value);
    }

    return totalSize;
  }

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current?.[key], obj);
  }

  calculateValueSize(value) {
    if (value === null || value === undefined) return 0;
    if (typeof value === 'string') return value.length * 2; // rough estimate: ~2 bytes per character
    if (typeof value === 'number') return 8; // 64-bit numbers
    if (typeof value === 'boolean') return 1;
    if (value instanceof Date) return 8;
    if (Array.isArray(value)) return value.reduce((sum, item) => sum + this.calculateValueSize(item), 0);
    if (typeof value === 'object') return Object.values(value).reduce((sum, val) => sum + this.calculateValueSize(val), 0);

    return 50; // Default estimate for unknown types
  }

  async optimizeExistingIndexes() {
    console.log('Optimizing existing indexes...');

    const optimizationResults = {
      rebuiltIndexes: [],
      droppedIndexes: [],
      recommendations: []
    };

    for (const collectionName of Object.keys(this.collections)) {
      const collection = this.collections[collectionName];

      try {
        // Get current indexes
        const indexes = await collection.indexes();
        const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

        for (const index of indexes) {
          if (index.name === '_id_') continue; // Skip default _id index

          const stat = indexStats.find(s => s.name === index.name);
          const usage = stat?.accesses?.ops || 0;
          const daysSinceTracking = stat ? (Date.now() - stat.accesses.since) / (24 * 60 * 60 * 1000) : 0;

          // Check for unused indexes (no recorded usage since tracking began 30+ days ago)
          if (daysSinceTracking > 30 && usage === 0) {
            console.log(`Dropping unused index: ${index.name} in ${collectionName}`);
            await collection.dropIndex(index.name);
            optimizationResults.droppedIndexes.push({
              collection: collectionName,
              indexName: index.name,
              reason: 'Unused for 30+ days'
            });
          }

          // Check for low-efficiency indexes
          const efficiency = stat ? this.calculateIndexEfficiency(stat) : 0;
          if (efficiency < 0.1 && usage > 0) {
            optimizationResults.recommendations.push({
              collection: collectionName,
              indexName: index.name,
              recommendation: 'Low efficiency - consider redesigning or adding partial filter',
              currentEfficiency: efficiency
            });
          }
        }

      } catch (error) {
        console.error(`Error optimizing indexes for ${collectionName}:`, error);
      }
    }

    console.log('Index optimization completed');
    return optimizationResults;
  }
}

// Benefits of MongoDB Advanced Indexing:
// - Flexible compound indexes with optimal field ordering for complex queries
// - Partial indexes that dramatically reduce index size and improve performance
// - Text search indexes with weighted relevance and language support
// - Geospatial indexes for location-based queries and proximity searches
// - Sparse indexes for optional fields that save storage and improve efficiency
// - TTL indexes for automatic data lifecycle management
// - Wildcard indexes for dynamic schema flexibility
// - Real-time index usage analysis and optimization recommendations
// - Integration with query profiler for performance bottleneck identification
// - Sophisticated index strategy management with automated optimization

module.exports = {
  MongoDBIndexingManager
};
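
To put the manager to work, a minimal driver sketch might look like the following. The connection string, database name, and module path are placeholders, the constructor is assumed to accept a connected Db instance (as used throughout the class), and the analysis step assumes the database profiler is enabled so that system.profile contains data:

// Minimal usage sketch - connection string, database name, and module path are placeholders
const { MongoClient } = require('mongodb');
const { MongoDBIndexingManager } = require('./mongodb-indexing-manager');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const manager = new MongoDBIndexingManager(client.db('ecommerce'));

  // Run the optimization analysis and print the suggested indexes
  const analysis = await manager.performQueryOptimizationAnalysis();
  for (const rec of analysis.recommendedIndexes) {
    console.log(`${rec.collection}: ${JSON.stringify(rec.recommendedIndex)} (${rec.reason})`);
  }

  await client.close();
}

main().catch(console.error);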

Understanding MongoDB Indexing Architecture

Advanced Index Design Patterns and Strategies

Implement sophisticated indexing patterns for optimal query performance:

// Advanced indexing patterns for specialized use cases
class AdvancedIndexingPatterns {
  constructor(db) {
    this.db = db;
    this.performanceTargets = {
      maxQueryTime: 50, // milliseconds for standard queries
      maxComplexQueryTime: 200, // milliseconds for complex analytical queries
      maxIndexSizeRatio: 0.3 // Index size should not exceed 30% of data size
    };
  }

  async implementCoveredQueryOptimization() {
    console.log('Implementing covered query optimization patterns...');

    // Covered queries that can be satisfied entirely from index
    const coveredQueryIndexes = [
      {
        name: 'idx_user_dashboard_covered',
        collection: 'users',
        spec: { 
          status: 1, 
          country: 1, 
          email: 1, 
          firstName: 1, 
          lastName: 1, 
          totalSpent: 1,
          loyaltyPoints: 1,
          createdAt: 1 
        },
        purpose: 'Cover user dashboard queries without document retrieval',
        coveredQueries: [
          'User listing with basic info and spending',
          'Geographic user distribution',
          'Customer segmentation queries'
        ]
      },
      {
        name: 'idx_order_summary_covered',
        collection: 'orders', 
        spec: {
          userId: 1,
          status: 1,
          totalAmount: 1,
          createdAt: 1,
          paymentStatus: 1,
          'shipping.method': 1
        },
        purpose: 'Cover order summary queries for customer service',
        coveredQueries: [
          'Customer order history summaries',
          'Revenue reporting by status and date',
          'Shipping method analysis'
        ]
      }
    ];

    for (const coveredIndex of coveredQueryIndexes) {
      const collection = this.db.collection(coveredIndex.collection);

      await collection.createIndex(coveredIndex.spec, {
        name: coveredIndex.name,
        background: true
      });

      console.log(`Created covered query index: ${coveredIndex.name}`);
      console.log(`  Covered queries: ${coveredIndex.coveredQueries.join(', ')}`);
    }
  }

  async implementHashedIndexingStrategy() {
    console.log('Implementing hashed indexing for sharded collections...');

    // Hashed indexes for even distribution across shards
    const hashedIndexes = [
      {
        name: 'idx_users_id_hashed',
        collection: 'users',
        spec: { _id: 'hashed' },
        purpose: 'Even distribution of users across shards'
      },
      {
        name: 'idx_orders_customer_hashed', 
        collection: 'orders',
        spec: { userId: 'hashed' },
        purpose: 'Distribute customer orders evenly across shards'
      },
      {
        name: 'idx_analytics_session_hashed',
        collection: 'analytics',
        spec: { sessionId: 'hashed' },
        purpose: 'Balance analytics data across sharded cluster'
      }
    ];

    for (const hashedIndex of hashedIndexes) {
      const collection = this.db.collection(hashedIndex.collection);

      await collection.createIndex(hashedIndex.spec, {
        name: hashedIndex.name,
        background: true
      });

      console.log(`Created hashed index: ${hashedIndex.name}`);
    }
  }

  async implementMultikeyIndexOptimization() {
    console.log('Implementing multikey index optimization for arrays...');

    // Optimized indexes for array fields
    const multikeyIndexes = [
      {
        name: 'idx_users_tags_interests',
        collection: 'users',
        spec: { tags: 1, 'interests.category': 1 },
        purpose: 'User segmentation by tags and interest categories'
      },
      {
        name: 'idx_products_categories_brands',
        collection: 'products',
        spec: { categories: 1, brand: 1, status: 1 },
        purpose: 'Product catalog queries with category and brand filtering'
      },
      {
        name: 'idx_orders_product_items',
        collection: 'orders',
        spec: { 'items.productId': 1, 'items.category': 1, status: 1 },
        purpose: 'Product performance analysis across orders'
      }
    ];

    for (const multikeyIndex of multikeyIndexes) {
      const collection = this.db.collection(multikeyIndex.collection);

      // Check if index involves multiple array fields (compound multikey limitation)
      const sampleDoc = await collection.findOne();
      const arrayFields = this.identifyArrayFields(sampleDoc, Object.keys(multikeyIndex.spec));

      if (arrayFields.length > 1) {
        console.warn(`Index ${multikeyIndex.name} may have compound multikey limitations`);
        // Create alternative single-array indexes
        for (const arrayField of arrayFields) {
          const alternativeSpec = { [arrayField]: 1 };
          await collection.createIndex(alternativeSpec, {
            name: `${multikeyIndex.name}_${arrayField}`,
            background: true
          });
        }
      } else {
        await collection.createIndex(multikeyIndex.spec, {
          name: multikeyIndex.name,
          background: true
        });
      }

      console.log(`Created multikey index: ${multikeyIndex.name}`);
    }
  }

  identifyArrayFields(document, fieldNames) {
    const arrayFields = [];

    for (const fieldName of fieldNames) {
      const value = this.getNestedValue(document, fieldName);
      if (Array.isArray(value)) {
        arrayFields.push(fieldName);
      }
    }

    return arrayFields;
  }

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current?.[key], obj);
  }

  async implementIndexIntersectionStrategies() {
    console.log('Implementing index intersection strategies...');

    // Design indexes that work well together for intersection
    const intersectionIndexes = [
      {
        name: 'idx_users_status_single',
        collection: 'users',
        spec: { status: 1 },
        purpose: 'Status filtering for intersection'
      },
      {
        name: 'idx_users_country_single',
        collection: 'users', 
        spec: { country: 1 },
        purpose: 'Geographic filtering for intersection'
      },
      {
        name: 'idx_users_activity_single',
        collection: 'users',
        spec: { lastLoginAt: -1 },
        purpose: 'Activity-based filtering for intersection'
      },
      {
        name: 'idx_users_spending_single',
        collection: 'users',
        spec: { totalSpent: -1 },
        purpose: 'Spending analysis for intersection'
      }
    ];

    // Create single-field indexes that can be intersected
    for (const index of intersectionIndexes) {
      const collection = this.db.collection(index.collection);

      await collection.createIndex(index.spec, {
        name: index.name,
        background: true
      });

      console.log(`Created intersection index: ${index.name}`);
    }

    // Test intersection performance
    await this.testIndexIntersectionPerformance();
  }

  async testIndexIntersectionPerformance() {
    console.log('Testing index intersection performance...');

    const collection = this.db.collection('users');

    // Query that should use index intersection
    const intersectionQuery = {
      status: 'active',
      country: 'US', 
      lastLoginAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
      totalSpent: { $gte: 100 }
    };

    const explain = await collection.find(intersectionQuery).explain('executionStats');

    // Intersection stages (AND_HASH / AND_SORTED) usually sit beneath a FETCH stage
    const rootStage = explain.executionStats.executionStages;
    const planStages = [rootStage.stage, rootStage.inputStage?.stage].filter(Boolean);

    if (planStages.includes('AND_HASH') || planStages.includes('AND_SORTED')) {
      console.log('✅ Query successfully using index intersection');
      console.log(`Execution time: ${explain.executionStats.executionTimeMillis}ms`);
    } else {
      console.log('❌ Query not using index intersection, consider compound index');
      console.log(`Current plan stages: ${planStages.join(' -> ')}`);
    }
  }

  async implementTimeSeriesIndexing() {
    console.log('Implementing time-series optimized indexing...');

    const timeSeriesIndexes = [
      {
        name: 'idx_metrics_time_metric',
        collection: 'metrics',
        spec: { timestamp: 1, metricType: 1, value: 1 },
        purpose: 'Time-series metrics queries with metric type filtering'
      },
      {
        name: 'idx_events_time_user',
        collection: 'events',
        spec: { timestamp: 1, userId: 1, eventType: 1 },
        purpose: 'User activity timeline and event analysis'
      },
      {
        name: 'idx_logs_time_level',
        collection: 'logs', 
        spec: { timestamp: 1, level: 1, service: 1 },
        purpose: 'Log analysis with severity and service filtering'
      }
    ];

    for (const tsIndex of timeSeriesIndexes) {
      const collection = this.db.collection(tsIndex.collection);

      await collection.createIndex(tsIndex.spec, {
        name: tsIndex.name,
        background: true
      });

      console.log(`Created time-series index: ${tsIndex.name}`);
    }

    // Create time-based partial indexes for recent data
    const recentDataIndexes = [
      {
        name: 'idx_metrics_recent_hot',
        collection: 'metrics',
        spec: { timestamp: 1, metricType: 1, userId: 1 },
        filter: { 
          timestamp: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }
        },
        purpose: 'Hot data access for recent metrics (last 7 days)'
      },
      {
        name: 'idx_events_recent_active',
        collection: 'events',
        spec: { userId: 1, eventType: 1, timestamp: -1 },
        filter: {
          timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
        },
        purpose: 'Recent user activity (last 24 hours)'
      }
    ];

    for (const recentIndex of recentDataIndexes) {
      const collection = this.db.collection(recentIndex.collection);

      await collection.createIndex(recentIndex.spec, {
        name: recentIndex.name,
        partialFilterExpression: recentIndex.filter,
        background: true
      });

      console.log(`Created recent data index: ${recentIndex.name}`);
    }
  }

  async monitorIndexPerformanceMetrics() {
    console.log('Monitoring index performance metrics...');

    const performanceMetrics = {
      collections: {},
      globalMetrics: {},
      recommendations: []
    };

    for (const collectionName of ['users', 'orders', 'products', 'analytics']) {
      const collection = this.db.collection(collectionName);

      try {
        // Get collection statistics
        const stats = await collection.stats();
        const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

        performanceMetrics.collections[collectionName] = {
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          dataSize: stats.size,
          indexCount: stats.nindexes,
          totalIndexSize: stats.totalIndexSize,
          indexSizeRatio: (stats.totalIndexSize / stats.size).toFixed(3),
          indexes: indexStats.map(stat => {
            // $indexStats has no size field; per-index sizes come from collStats.indexSizes
            const size = stats.indexSizes?.[stat.name] || 0;
            return {
              name: stat.name,
              size,
              usageCount: stat.accesses?.ops || 0,
              lastUsed: stat.accesses?.since, // when usage counting started
              efficiency: this.calculateIndexEfficiency({ ...stat, size }, stats)
            };
          })
        };

        // Generate recommendations
        const collectionRecommendations = this.generateCollectionIndexRecommendations(
          collectionName, 
          performanceMetrics.collections[collectionName]
        );
        performanceMetrics.recommendations.push(...collectionRecommendations);

      } catch (error) {
        console.warn(`Could not analyze ${collectionName}:`, error.message);
      }
    }

    // Calculate global metrics
    const totalDataSize = Object.values(performanceMetrics.collections)
      .reduce((sum, col) => sum + col.dataSize, 0);
    const totalIndexSize = Object.values(performanceMetrics.collections)
      .reduce((sum, col) => sum + col.totalIndexSize, 0);

    performanceMetrics.globalMetrics = {
      totalDataSize,
      totalIndexSize,
      globalIndexRatio: (totalIndexSize / totalDataSize).toFixed(3),
      totalIndexCount: Object.values(performanceMetrics.collections)
        .reduce((sum, col) => sum + col.indexCount, 0),
      avgIndexEfficiency: this.calculateAverageIndexEfficiency(performanceMetrics.collections)
    };

    console.log('Index performance monitoring completed');
    console.log(`Global index ratio: ${performanceMetrics.globalMetrics.globalIndexRatio}`);
    console.log(`Total indexes: ${performanceMetrics.globalMetrics.totalIndexCount}`);
    console.log(`Recommendations generated: ${performanceMetrics.recommendations.length}`);

    return performanceMetrics;
  }

  calculateIndexEfficiency(indexStat, collectionStats) {
    const usagePerMB = (indexStat.accesses?.ops || 0) / Math.max(indexStat.size / (1024 * 1024), 0.1);
    const sizeRatio = indexStat.size / collectionStats.size;
    const daysSinceLastUse = indexStat.accesses?.since ? 
      (Date.now() - indexStat.accesses.since) / (24 * 60 * 60 * 1000) : 999;

    // Efficiency score: usage frequency weighted by size efficiency and recency
    const efficiencyScore = (usagePerMB * 0.5) + 
                           ((1 - sizeRatio) * 50 * 0.3) + 
                           (Math.max(0, 30 - daysSinceLastUse) * 0.2);

    return Math.round(efficiencyScore * 100) / 100;
  }

  calculateAverageIndexEfficiency(collections) {
    let totalEfficiency = 0;
    let indexCount = 0;

    for (const collection of Object.values(collections)) {
      for (const index of collection.indexes) {
        if (index.name !== '_id_') { // Exclude default _id index
          totalEfficiency += index.efficiency;
          indexCount++;
        }
      }
    }

    return indexCount > 0 ? (totalEfficiency / indexCount).toFixed(2) : 0;
  }

  generateCollectionIndexRecommendations(collectionName, collectionData) {
    const recommendations = [];

    // Check for high index-to-data ratio
    if (parseFloat(collectionData.indexSizeRatio) > this.performanceTargets.maxIndexSizeRatio) {
      recommendations.push({
        collection: collectionName,
        type: 'SIZE_WARNING',
        message: `Index size ratio (${collectionData.indexSizeRatio}) exceeds recommended threshold`,
        suggestion: 'Review index necessity and consider partial indexes'
      });
    }

    // Check for unused indexes
    const unusedIndexes = collectionData.indexes.filter(idx => 
      idx.name !== '_id_' && idx.usageCount === 0
    );

    if (unusedIndexes.length > 0) {
      recommendations.push({
        collection: collectionName,
        type: 'UNUSED_INDEXES',
        message: `Found ${unusedIndexes.length} unused indexes`,
        suggestion: `Consider dropping: ${unusedIndexes.map(idx => idx.name).join(', ')}`
      });
    }

    // Check for low-efficiency indexes
    const inefficientIndexes = collectionData.indexes.filter(idx => 
      idx.name !== '_id_' && idx.efficiency < 1.0
    );

    if (inefficientIndexes.length > 0) {
      recommendations.push({
        collection: collectionName,
        type: 'LOW_EFFICIENCY',
        message: `Found ${inefficientIndexes.length} low-efficiency indexes`,
        suggestion: 'Review usage patterns and consider redesigning or adding partial filters'
      });
    }

    // Check for missing compound indexes (heuristic)
    if (collectionData.indexCount < 3 && collectionData.documentCount > 10000) {
      recommendations.push({
        collection: collectionName,
        type: 'MISSING_COMPOUND_INDEXES',
        message: 'Large collection with few indexes may benefit from compound indexes',
        suggestion: 'Analyze query patterns and create compound indexes for frequently combined filters'
      });
    }

    return recommendations;
  }
}
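
The class above defines the patterns but is not driven anywhere in this article; a minimal sketch of how it might be invoked, assuming db is an already-connected Db instance from the MongoDB Node.js driver, is shown below:

// Minimal usage sketch - assumes `db` is a connected Db instance
async function runAdvancedIndexingPatterns(db) {
  const patterns = new AdvancedIndexingPatterns(db);

  await patterns.implementCoveredQueryOptimization();
  await patterns.implementMultikeyIndexOptimization();
  await patterns.implementIndexIntersectionStrategies();
  await patterns.implementTimeSeriesIndexing();

  // Periodic health check; feed the recommendations into an index review process
  const metrics = await patterns.monitorIndexPerformanceMetrics();
  console.table(metrics.recommendations);
}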

SQL-Style Index Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB index operations:

-- QueryLeaf index management with SQL-familiar syntax

-- Create single-field indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_users_country ON users(country);
CREATE INDEX idx_users_created_at ON users(created_at DESC); -- Descending sort

-- Create compound indexes following ESR (Equality, Sort, Range) principle
CREATE INDEX idx_users_compound_esr ON users(
  status,           -- Equality: exact match filters
  country,          -- Equality: exact match filters  
  total_spent DESC, -- Sort: ordering field
  created_at        -- Range: range queries
);

-- Create partial indexes with conditions
CREATE INDEX idx_users_active_email ON users(email)
WHERE status = 'active';

CREATE INDEX idx_users_premium_spending ON users(total_spent DESC, loyalty_points DESC)
WHERE account_type = 'premium' AND total_spent > 100;

CREATE INDEX idx_orders_recent_high_value ON orders(total_amount DESC, created_at DESC)
WHERE status = 'completed' 
  AND created_at >= CURRENT_TIMESTAMP - INTERVAL '90 days'
  AND total_amount >= 500;

-- Create text search indexes with weights
CREATE TEXT INDEX idx_users_search ON users(
  first_name WEIGHT 10,
  last_name WEIGHT 10,
  email WEIGHT 5,
  company WEIGHT 3,
  bio WEIGHT 1
) WITH (
  default_language = 'english',
  language_override = 'language'
);

CREATE TEXT INDEX idx_products_search ON products(
  name WEIGHT 20,
  brand WEIGHT 15,
  tags WEIGHT 10,
  description WEIGHT 5,
  features WEIGHT 3
);

-- Create geospatial indexes
CREATE INDEX idx_users_location ON users(location) USING GEO2DSPHERE;
CREATE INDEX idx_stores_address ON stores(address.coordinates) USING GEO2DSPHERE;

-- Create sparse indexes for optional fields
CREATE INDEX idx_users_social_profiles ON users(
  social_profiles.twitter,
  social_profiles.linkedin
) WITH SPARSE;

CREATE INDEX idx_users_subscription ON users(
  subscription.plan_id,
  subscription.expires_at
) WITH SPARSE;

-- Create TTL indexes for automatic data expiration
CREATE INDEX idx_sessions_ttl ON user_sessions(last_activity)
WITH TTL = '7 days';

CREATE INDEX idx_analytics_ttl ON analytics_events(created_at) 
WITH TTL = '30 days';

CREATE INDEX idx_password_resets_ttl ON password_resets(created_at)
WITH TTL = '24 hours';

-- Create wildcard indexes for flexible schemas
CREATE INDEX idx_users_metadata ON users("metadata.$**");
CREATE INDEX idx_products_attributes ON products("attributes.$**");
CREATE INDEX idx_orders_custom_fields ON orders("custom_fields.$**");

-- Advanced compound index patterns
WITH user_activity_analysis AS (
  SELECT 
    user_id,
    status,
    country,
    DATE_TRUNC('month', created_at) as signup_month,
    last_login_at,
    total_spent,
    loyalty_tier,

    -- User categorization
    CASE 
      WHEN total_spent > 1000 THEN 'high_value'
      WHEN total_spent > 100 THEN 'medium_value' 
      ELSE 'low_value'
    END as value_segment,

    CASE
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'active'
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'recent'
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'inactive'
      ELSE 'dormant'
    END as activity_segment

  FROM users
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2 years'
),

index_optimization_analysis AS (
  SELECT 
    -- Query pattern analysis for index design
    COUNT(*) as total_queries,
    COUNT(*) FILTER (WHERE status = 'active') as active_user_queries,
    COUNT(*) FILTER (WHERE country IN ('US', 'CA', 'UK')) as geographic_queries,
    COUNT(*) FILTER (WHERE total_spent > 100) as spending_queries,
    COUNT(*) FILTER (WHERE last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_activity_queries,

    -- Compound query patterns
    COUNT(*) FILTER (WHERE status = 'active' AND country = 'US') as status_country_queries,
    COUNT(*) FILTER (WHERE status = 'active' AND total_spent > 100) as status_spending_queries,
    COUNT(*) FILTER (WHERE country = 'US' AND total_spent > 500) as country_spending_queries,

    -- Complex filtering patterns
    COUNT(*) FILTER (
      WHERE status = 'active' 
        AND country IN ('US', 'CA') 
        AND total_spent > 100
        AND last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    ) as complex_filter_queries,

    -- Sorting patterns (approximated by presence of the sort field; real counts
    -- would come from a query log rather than the user rows themselves)
    COUNT(*) FILTER (WHERE created_at IS NOT NULL) as date_sort_queries,
    COUNT(*) FILTER (WHERE total_spent IS NOT NULL) as spending_sort_queries,
    COUNT(*) FILTER (WHERE last_login_at IS NOT NULL) as activity_sort_queries,

    -- Range query patterns  
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year') as date_range_queries,
    COUNT(*) FILTER (WHERE total_spent BETWEEN 100 AND 1000) as spending_range_queries

  FROM user_activity_analysis
)

-- Optimal index recommendations based on query patterns
SELECT 
  'CREATE INDEX idx_users_status_country_spending ON users(status, country, total_spent DESC)' as recommended_index,
  'High frequency status + country + spending queries' as justification,
  status_country_queries + country_spending_queries as query_frequency
FROM index_optimization_analysis
WHERE status_country_queries > 100 OR country_spending_queries > 100

UNION ALL

SELECT 
  'CREATE INDEX idx_users_active_recent_spending ON users(status, last_login_at DESC, total_spent DESC) WHERE status = ''active''',
  'Active user analysis with recent activity and spending',
  active_user_queries + recent_activity_queries
FROM index_optimization_analysis  
WHERE active_user_queries > 50

UNION ALL

SELECT 
  'CREATE INDEX idx_users_geographic_value ON users(country, value_segment, activity_segment)',
  'Geographic segmentation with customer value analysis',
  geographic_queries
FROM index_optimization_analysis
WHERE geographic_queries > 75;

-- Index performance monitoring and optimization
WITH index_usage_stats AS (
  SELECT 
    collection_name,
    index_name,
    index_size_mb,
    usage_count,
    last_used,

    -- Calculate index efficiency metrics
    usage_count / GREATEST(index_size_mb, 1) as usage_per_mb,
    EXTRACT(DAYS FROM (CURRENT_TIMESTAMP - last_used)) as days_since_last_use,

    -- Index selectivity estimation
    CASE 
      WHEN index_name LIKE '%email%' THEN 'high'      -- Unique fields
      WHEN index_name LIKE '%status%' THEN 'low'      -- Few distinct values
      WHEN index_name LIKE '%country%' THEN 'medium'  -- Geographic distribution
      WHEN index_name LIKE '%created_at%' THEN 'high' -- Timestamp fields
      ELSE 'unknown'
    END as estimated_selectivity,

    -- Index type classification
    CASE 
      WHEN index_name LIKE '%compound%' OR index_name LIKE '%\_%\_%' ESCAPE '\' THEN 'compound' -- escape underscores so they match literally
      WHEN index_name LIKE '%text%' OR index_name LIKE '%search%' THEN 'text'
      WHEN index_name LIKE '%geo%' OR index_name LIKE '%location%' THEN 'geospatial'
      WHEN index_name LIKE '%ttl%' THEN 'ttl'
      ELSE 'single_field'
    END as index_type

  FROM mongodb_index_stats  -- Hypothetical system table
  WHERE collection_name IN ('users', 'orders', 'products', 'analytics')
),

index_health_assessment AS (
  SELECT 
    collection_name,
    index_name,
    index_type,
    usage_per_mb,
    days_since_last_use,
    estimated_selectivity,

    -- Health score calculation
    CASE 
      WHEN days_since_last_use > 30 AND usage_count = 0 THEN 'UNUSED'
      WHEN usage_per_mb < 0.1 THEN 'LOW_EFFICIENCY' 
      WHEN usage_per_mb > 10 AND estimated_selectivity = 'high' THEN 'OPTIMAL'
      WHEN usage_per_mb > 5 AND estimated_selectivity = 'medium' THEN 'GOOD'
      WHEN usage_per_mb > 1 THEN 'ACCEPTABLE'
      ELSE 'NEEDS_REVIEW'
    END as health_status,

    -- Optimization recommendations
    CASE 
      WHEN days_since_last_use > 30 THEN 'Consider dropping unused index'
      WHEN usage_per_mb < 0.1 AND estimated_selectivity = 'low' THEN 'Add partial filter to improve selectivity'
      WHEN index_type = 'single_field' AND usage_per_mb > 5 THEN 'Consider compound index for better coverage'
      WHEN index_size_mb > 100 AND usage_per_mb < 1 THEN 'Large index with low usage - review necessity'
      ELSE 'Index performing within acceptable parameters'
    END as optimization_recommendation

  FROM index_usage_stats
)

SELECT 
  collection_name,
  index_name,
  index_type,
  health_status,
  ROUND(usage_per_mb, 2) as usage_efficiency,
  days_since_last_use,
  optimization_recommendation,

  -- Priority scoring for optimization
  CASE health_status
    WHEN 'UNUSED' THEN 100
    WHEN 'LOW_EFFICIENCY' THEN 80
    WHEN 'NEEDS_REVIEW' THEN 60
    WHEN 'ACCEPTABLE' THEN 20
    ELSE 0
  END as optimization_priority

FROM index_health_assessment
WHERE health_status != 'OPTIMAL'
ORDER BY optimization_priority DESC, collection_name, index_name;

-- Real-time query performance analysis with index recommendations
WITH slow_queries AS (
  SELECT 
    collection_name,
    query_pattern,
    avg_execution_time_ms,
    query_count,
    index_used,
    documents_examined,
    documents_returned,

    -- Calculate query efficiency metrics  
    documents_examined / GREATEST(documents_returned, 1) as scan_efficiency,
    query_count * avg_execution_time_ms as total_time_impact,

    -- Identify optimization opportunities
    CASE 
      WHEN index_used IS NULL OR index_used = 'COLLSCAN' THEN 'MISSING_INDEX'
      WHEN scan_efficiency > 100 THEN 'POOR_SELECTIVITY'
      WHEN avg_execution_time_ms > 100 THEN 'SLOW_QUERY'
      ELSE 'ACCEPTABLE'
    END as performance_issue

  FROM query_performance_log  -- Hypothetical query log table
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND avg_execution_time_ms > 50
),

index_recommendations AS (
  SELECT 
    collection_name,
    query_pattern,
    performance_issue,
    total_time_impact,

    -- Generate specific index recommendations
    CASE performance_issue
      WHEN 'MISSING_INDEX' THEN 
        'CREATE INDEX ON ' || collection_name || ' FOR: ' || query_pattern
      WHEN 'POOR_SELECTIVITY' THEN
        'CREATE PARTIAL INDEX ON ' || collection_name || ' WITH SELECTIVE FILTER'  
      WHEN 'SLOW_QUERY' THEN
        'OPTIMIZE INDEX ON ' || collection_name || ' FOR QUERY: ' || query_pattern
      ELSE 'No immediate action required'
    END as recommended_action,

    -- Estimate performance improvement
    CASE performance_issue
      WHEN 'MISSING_INDEX' THEN LEAST(avg_execution_time_ms * 0.8, 50) -- 80% improvement
      WHEN 'POOR_SELECTIVITY' THEN LEAST(avg_execution_time_ms * 0.6, 30) -- 60% improvement  
      WHEN 'SLOW_QUERY' THEN LEAST(avg_execution_time_ms * 0.4, 20) -- 40% improvement
      ELSE 0
    END as estimated_improvement_ms

  FROM slow_queries
  WHERE performance_issue != 'ACCEPTABLE'
)

SELECT 
  collection_name,
  recommended_action,
  COUNT(*) as affected_query_patterns,
  SUM(total_time_impact) as total_performance_impact,
  ROUND(AVG(estimated_improvement_ms), 1) as avg_improvement_ms,

  -- Calculate ROI for optimization effort
  ROUND(SUM(total_time_impact * estimated_improvement_ms / 1000), 2) as optimization_value_score,

  -- Priority ranking
  ROW_NUMBER() OVER (ORDER BY SUM(total_time_impact) DESC) as optimization_priority

FROM index_recommendations  
GROUP BY collection_name, recommended_action
HAVING COUNT(*) >= 3  -- Focus on patterns affecting multiple queries
ORDER BY optimization_priority ASC;

-- QueryLeaf provides comprehensive index management capabilities:
-- 1. SQL-familiar index creation syntax with advanced options
-- 2. Partial indexes with complex conditional expressions  
-- 3. Text search indexes with customizable weights and language support
-- 4. Geospatial indexing for location-based queries and analysis
-- 5. TTL indexes with flexible expiration rules and time units
-- 6. Compound index optimization following ESR principles
-- 7. Real-time index performance monitoring and health assessment
-- 8. Automated index recommendations based on query patterns
-- 9. Index usage analytics and optimization priority scoring
-- 10. Integration with MongoDB's native indexing optimizations

Best Practices for MongoDB Index Implementation

Index Design Guidelines

Essential principles for optimal MongoDB index design:

  1. ESR Rule: Design compound indexes following Equality, Sort, Range field ordering (see the worked example after this list)
  2. Selectivity Focus: Prioritize high-selectivity fields early in compound indexes
  3. Query Pattern Analysis: Design indexes based on actual application query patterns
  4. Partial Index Usage: Use partial indexes to reduce size and improve performance
  5. Index Intersection: Consider single-field indexes that can be intersected efficiently
  6. Covered Queries: Design indexes to cover frequently executed queries entirely
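
To make the ESR rule concrete, the sketch below pairs a hypothetical query shape with a compound index that serves it. It assumes an async context with a connected db handle, and the collection and field names (orders, status, customerId, createdAt) are illustrative rather than taken from a specific schema above:

// ESR in practice: equality keys first, then the sort key, then range keys
const customerId = 'hypothetical-customer-id';
const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);

const query = {
  status: 'completed',            // Equality
  customerId: customerId,         // Equality
  createdAt: { $gte: cutoff }     // Range
};
const sort = { createdAt: -1 };   // Sort

// Matching compound index: createdAt serves both the sort and the range predicate
await db.collection('orders').createIndex(
  { status: 1, customerId: 1, createdAt: -1 },
  { name: 'idx_orders_status_customer_created' }
);

// Verify the winning plan uses the index and avoids an in-memory SORT stage
const plan = await db.collection('orders')
  .find(query)
  .sort(sort)
  .explain('executionStats');
console.log(plan.queryPlanner.winningPlan);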

Performance and Maintenance

Optimize MongoDB indexes for production workloads:

  1. Regular Monitoring: Implement continuous index usage and performance monitoring
  2. Size Management: Keep total index size reasonable relative to data size
  3. Index Builds: MongoDB 4.2+ ignores the legacy background option and uses an optimized build process; schedule large builds for low-traffic windows and consider rolling builds on replica sets
  4. Usage Analysis: Regularly review and remove unused or inefficient indexes (see the sketch after this list)
  5. Testing Strategy: Test index changes thoroughly before production deployment
  6. Documentation: Maintain clear documentation of index purpose and query patterns
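
As a companion to the usage-analysis and testing points above, the sketch below reads $indexStats for a candidate index and hides it via collMod (MongoDB 4.4+) instead of dropping it outright, so the change can be reverted if latency regresses; the collection and index names are placeholders:

// Minimal sketch: review usage, then hide (rather than drop) a candidate index
async function reviewAndHideUnusedIndex(db, collectionName, indexName) {
  const stats = await db.collection(collectionName)
    .aggregate([{ $indexStats: {} }])
    .toArray();

  const target = stats.find(s => s.name === indexName);
  console.log(`${indexName}: ${target?.accesses?.ops ?? 0} ops since ${target?.accesses?.since}`);

  if ((target?.accesses?.ops ?? 0) === 0) {
    // Hidden indexes are still maintained but ignored by the query planner,
    // which keeps the removal decision reversible
    await db.command({
      collMod: collectionName,
      index: { name: indexName, hidden: true }
    });
    console.log(`Hid ${indexName}; monitor workloads before dropping it permanently`);
  }
}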

Conclusion

MongoDB's advanced indexing capabilities provide comprehensive optimization strategies that address many of the limitations of traditional relational indexing approaches. The flexible indexing system supports complex document structures, dynamic schemas, and specialized data types while delivering strong query performance at scale.

Key MongoDB Indexing benefits include:

  • Flexible Index Types: Support for compound, partial, text, geospatial, sparse, TTL, and wildcard indexes
  • Advanced Query Optimization: Sophisticated query planner with index intersection and covered query support
  • Dynamic Schema Support: Indexing capabilities that adapt to evolving document structures
  • Specialized Data Support: Native indexing for arrays, embedded documents, and geospatial data
  • Performance Analytics: Comprehensive index usage monitoring and optimization recommendations
  • Scalable Architecture: Index strategies that work across replica sets and sharded clusters

Whether you're optimizing query performance, managing large-scale data operations, or building applications with complex data access patterns, MongoDB's indexing system with QueryLeaf's familiar SQL interface provides the foundation for high-performance database operations.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB indexing operations while providing SQL-familiar index creation, optimization, and monitoring syntax. Advanced indexing patterns, performance analysis, and automated recommendations are seamlessly handled through familiar SQL constructs, making sophisticated database optimization both powerful and accessible to SQL-oriented development teams.

The integration of native MongoDB indexing capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both flexible data modeling and familiar database optimization patterns, ensuring your applications achieve optimal performance while remaining maintainable as they scale and evolve.

MongoDB Vector Search for Semantic Applications: Building AI-Powered Search with SQL-Style Vector Operations

Modern applications increasingly require intelligent search capabilities that understand semantic meaning rather than just keyword matching. Traditional text-based search approaches struggle with understanding context, handling synonyms, and providing relevant results for complex queries that require conceptual understanding rather than exact text matches.

MongoDB Atlas Vector Search provides native vector database capabilities that enable semantic similarity search, recommendation systems, and retrieval-augmented generation (RAG) applications. Unlike standalone vector databases that require separate infrastructure, Atlas Vector Search integrates seamlessly with MongoDB's document model, allowing developers to combine traditional database operations with advanced AI-powered search in a single, unified platform.
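
For orientation, a minimal sketch of what a semantic query looks like with the $vectorSearch aggregation stage is shown below. The index name (content_vector_index) and vector field (contentVector) match the configuration introduced later in this article, and the query embedding is assumed to come from an embedding model such as text-embedding-ada-002 via an elided helper:

// Minimal $vectorSearch sketch - queryEmbedding is a 1536-dimension array
// produced by an embedding model (generateEmbedding helper elided here)
const queryEmbedding = await generateEmbedding('machine learning algorithms');

const results = await db.collection('documents').aggregate([
  {
    $vectorSearch: {
      index: 'content_vector_index',
      path: 'contentVector',
      queryVector: queryEmbedding,
      numCandidates: 200,
      limit: 10,
      filter: { category: 'AI' }
    }
  },
  {
    $project: {
      title: 1,
      category: 1,
      score: { $meta: 'vectorSearchScore' }
    }
  }
]).toArray();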

The Traditional Search Limitations Challenge

Conventional approaches to search and content discovery have significant limitations for modern intelligent applications:

-- Traditional relational search - limited semantic understanding

-- PostgreSQL full-text search with performance and relevance challenges
CREATE TABLE documents (
  document_id SERIAL PRIMARY KEY,
  title VARCHAR(500) NOT NULL,
  content TEXT NOT NULL,
  category VARCHAR(100),
  tags TEXT[],
  author VARCHAR(200),
  created_at TIMESTAMP DEFAULT NOW(),

  -- Full-text search vector (keyword-based only)
  search_vector tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', content), 'B') ||
    setweight(to_tsvector('english', array_to_string(tags, ' ')), 'C')
  ) STORED
);

-- Create full-text search index
CREATE INDEX idx_documents_fts ON documents USING GIN(search_vector);

-- Additional indexes for filtering
CREATE INDEX idx_documents_category ON documents(category);
CREATE INDEX idx_documents_created_at ON documents(created_at DESC);
CREATE INDEX idx_documents_author ON documents(author);

-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
  SELECT 
    document_id,
    title,
    content,
    category,
    author,
    created_at,

    -- Basic relevance scoring (keyword-based only)
    ts_rank_cd(search_vector, plainto_tsquery('english', 'machine learning algorithms')) as relevance_score,

    -- Highlight matching text
    ts_headline('english', content, plainto_tsquery('english', 'machine learning algorithms'), 
                'MaxWords=50, MinWords=20, ShortWord=3, HighlightAll=false') as highlighted_content,

    -- Basic similarity using trigram matching (very limited)
    similarity(title, 'machine learning algorithms') as title_similarity,

    -- Category boosting (manual relevance adjustment)
    CASE category 
      WHEN 'AI' THEN 1.5 
      WHEN 'Technology' THEN 1.2 
      ELSE 1.0 
    END as category_boost

  FROM documents
  WHERE search_vector @@ plainto_tsquery('english', 'machine learning algorithms')
     OR similarity(title, 'machine learning algorithms') > 0.1
),

ranked_results AS (
  SELECT 
    *,
    -- Combined relevance scoring (still keyword-dependent)
    (relevance_score * category_boost * 
     CASE WHEN title_similarity > 0.3 THEN 2.0 ELSE 1.0 END) as final_score,

    -- Manual semantic grouping (limited effectiveness)
    CASE 
      WHEN content ILIKE '%neural network%' OR content ILIKE '%deep learning%' THEN 'Deep Learning'
      WHEN content ILIKE '%statistics%' OR content ILIKE '%data science%' THEN 'Data Science' 
      WHEN content ILIKE '%algorithm%' OR content ILIKE '%optimization%' THEN 'Algorithms'
      ELSE 'General'
    END as semantic_category,

    -- Time decay factor
    CASE 
      WHEN created_at >= NOW() - INTERVAL '30 days' THEN 1.2
      WHEN created_at >= NOW() - INTERVAL '90 days' THEN 1.0
      WHEN created_at >= NOW() - INTERVAL '1 year' THEN 0.8
      ELSE 0.6
    END as recency_boost

  FROM search_query
  WHERE relevance_score > 0.01
),

related_documents AS (
  -- Attempt to find related documents (very basic approach)
  SELECT DISTINCT
    r1.document_id,
    r2.document_id as related_id,
    r2.title as related_title,

    -- Basic relatedness calculation
    (array_length(array(SELECT UNNEST(r1.tags) INTERSECT SELECT UNNEST(r2.tags)), 1) / 
     GREATEST(array_length(r1.tags, 1), array_length(r2.tags, 1))::numeric) as tag_similarity,

    CASE WHEN r1.category = r2.category THEN 0.3 ELSE 0 END as category_match,
    CASE WHEN r1.author = r2.author THEN 0.2 ELSE 0 END as author_match

  FROM ranked_results r1
  JOIN documents r2 ON r1.document_id != r2.document_id
  WHERE r1.final_score > 0.5
),

final_results AS (
  SELECT 
    r.document_id,
    r.title,
    LEFT(r.content, 200) || '...' as content_preview,
    r.highlighted_content,
    r.category,
    r.semantic_category,
    r.author,
    r.created_at,

    -- Final ranking with all factors
    ROUND((r.final_score * r.recency_boost)::numeric, 4) as final_relevance_score,

    -- Related documents (limited by keyword overlap)
    COALESCE(
      (SELECT json_agg(json_build_object(
        'id', related_id,
        'title', related_title,
        'similarity', ROUND((tag_similarity + category_match + author_match)::numeric, 3)
      )) FROM related_documents rd 
       WHERE rd.document_id = r.document_id 
         AND (tag_similarity + category_match + author_match) > 0.1
       LIMIT 5),
      '[]'::json
    ) as related_documents

  FROM ranked_results r
)

SELECT 
  document_id,
  title,
  content_preview,
  highlighted_content,
  category,
  semantic_category,
  author,
  final_relevance_score,
  related_documents,

  -- Search result metadata
  COUNT(*) OVER () as total_results,
  ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as result_rank

FROM final_results
ORDER BY final_relevance_score DESC, created_at DESC
LIMIT 20;

-- Problems with traditional keyword-based search:
-- 1. No understanding of semantic meaning or context
-- 2. Cannot handle synonyms, related concepts, or conceptual queries
-- 3. Limited relevance scoring based only on keyword frequency and position  
-- 4. Poor handling of multilingual content and cross-language search
-- 5. No support for similarity search across different content types
-- 6. Manual and error-prone relevance tuning with limited effectiveness
-- 7. Cannot understand user intent beyond explicit keyword matches
-- 8. Poor recommendation capabilities based only on metadata overlap
-- 9. Limited support for complex search patterns and AI-powered features
-- 10. No integration with modern machine learning and embedding models

-- MySQL approach (even more limited)
SELECT 
  document_id,
  title,
  content,
  category,

  -- Basic full-text search (MySQL limitations)
  MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE) as relevance,

  -- Simple keyword highlighting
  REPLACE(
    REPLACE(title, 'machine', '<mark>machine</mark>'), 
    'learning', '<mark>learning</mark>'
  ) as highlighted_title

FROM mysql_documents
WHERE MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC
LIMIT 10;

-- MySQL limitations:
-- - Very basic full-text search with limited relevance algorithms
-- - No semantic understanding or contextual matching
-- - Limited text processing and language support
-- - Basic relevance scoring without advanced ranking factors
-- - No support for vector embeddings or similarity search
-- - Limited customization of search behavior and ranking
-- - Poor performance with large text corpuses
-- - No integration with modern AI/ML search techniques

MongoDB Atlas Vector Search provides intelligent semantic search capabilities:

// MongoDB Atlas Vector Search - AI-powered semantic search and similarity matching
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb+srv://your-cluster.mongodb.net/');
const db = client.db('intelligent_search_platform');

// Advanced vector search and semantic similarity platform
class VectorSearchManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      documents: db.collection('documents'),
      vectorIndex: db.collection('vector_index_metadata'),
      searchAnalytics: db.collection('search_analytics'),
      userProfiles: db.collection('user_profiles'),
      recommendations: db.collection('recommendations')
    };

    // Vector search configuration
    this.vectorConfig = {
      dimensions: 1536, // OpenAI text-embedding-ada-002
      similarity: 'cosine',
      indexType: 'knnVector'
    };

    this.embeddingModel = 'text-embedding-ada-002'; // Can be configured for different models
  }

  async initializeVectorSearchIndexes() {
    console.log('Initializing Atlas Vector Search indexes...');

    // Create vector search index for document content
    const contentVectorIndex = {
      name: 'content_vector_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'filter',
            path: 'category'
          },
          {
            type: 'filter', 
            path: 'tags'
          },
          {
            type: 'filter',
            path: 'publishedDate'
          },
          {
            type: 'filter',
            path: 'author'
          },
          {
            type: 'filter',
            path: 'contentType'
          }
        ]
      }
    };

    // Create vector search index for title embeddings
    const titleVectorIndex = {
      name: 'title_vector_index', 
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'titleVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          }
        ]
      }
    };

    // Create hybrid search index combining vector and text search
    const hybridSearchIndex = {
      name: 'hybrid_search_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'autocomplete',
            path: 'title',
            tokenization: 'edgeGram',
            minGrams: 2,
            maxGrams: 15
          },
          {
            type: 'text',
            path: 'content',
            analyzer: 'lucene.standard'
          },
          {
            type: 'text',
            path: 'tags',
            analyzer: 'lucene.keyword'
          }
        ]
      }
    };

    try {
      // Note: In practice, vector search indexes are created through MongoDB Atlas UI
      // or MongoDB CLI. This code shows the structure for reference.
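      // Recent Node.js drivers (6.6+) against Atlas can also create these indexes
      // programmatically; treat the call below as an assumption to verify against
      // your driver and cluster versions:
      //
      //   await this.collections.documents.createSearchIndex({
      //     name: contentVectorIndex.name,
      //     type: 'vectorSearch',
      //     definition: contentVectorIndex.definition
      //   });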
      console.log('Vector search indexes configured:');
      console.log('- Content Vector Index:', contentVectorIndex.name);
      console.log('- Title Vector Index:', titleVectorIndex.name); 
      console.log('- Hybrid Search Index:', hybridSearchIndex.name);

      // Store index metadata for application reference
      await this.collections.vectorIndex.insertMany([
        { ...contentVectorIndex, createdAt: new Date(), status: 'active' },
        { ...titleVectorIndex, createdAt: new Date(), status: 'active' },
        { ...hybridSearchIndex, createdAt: new Date(), status: 'active' }
      ]);

      return {
        contentVectorIndex: contentVectorIndex.name,
        titleVectorIndex: titleVectorIndex.name,
        hybridSearchIndex: hybridSearchIndex.name
      };

    } catch (error) {
      console.error('Vector index initialization failed:', error);
      throw error;
    }
  }

  async ingestDocumentsWithVectorization(documents) {
    console.log(`Processing ${documents.length} documents for vector search ingestion...`);

    const processedDocuments = [];
    const batchSize = 10;

    // Process documents in batches to manage API rate limits
    for (let i = 0; i < documents.length; i += batchSize) {
      const batch = documents.slice(i, i + batchSize);

      console.log(`Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(documents.length / batchSize)}`);

      const batchPromises = batch.map(async (doc) => {
        try {
          // Generate embeddings for title and content
          const [titleEmbedding, contentEmbedding] = await Promise.all([
            this.generateEmbedding(doc.title),
            this.generateEmbedding(doc.content)
          ]);

          // Extract key phrases and entities for enhanced searchability
          const extractedEntities = await this.extractEntities(doc.content);
          const keyPhrases = await this.extractKeyPhrases(doc.content);

          // Calculate content characteristics for better matching
          const contentCharacteristics = this.analyzeContentCharacteristics(doc.content);

          return {
            _id: doc._id || new ObjectId(),

            // Original document content
            title: doc.title,
            content: doc.content,
            summary: doc.summary || this.generateSummary(doc.content),

            // Document metadata
            category: doc.category,
            tags: doc.tags || [],
            author: doc.author,
            publishedDate: doc.publishedDate || new Date(),
            contentType: doc.contentType || 'article',
            language: doc.language || 'en',

            // Vector embeddings for semantic search
            titleVector: titleEmbedding,
            contentVector: contentEmbedding,

            // Enhanced searchability features
            entities: extractedEntities,
            keyPhrases: keyPhrases,
            contentCharacteristics: contentCharacteristics,

            // Search optimization metadata
            searchMetadata: {
              wordCount: doc.content.split(/\s+/).length,
              readingTime: Math.ceil(doc.content.split(/\s+/).length / 200), // minutes
              complexity: contentCharacteristics.complexity,
              topicDistribution: contentCharacteristics.topics,
              sentimentScore: contentCharacteristics.sentiment
            },

            // Document quality and authority signals
            qualitySignals: {
              authorityScore: doc.authorityScore || 0.5,
              freshnessScore: this.calculateFreshnessScore(doc.publishedDate || new Date()),
              engagementScore: doc.engagementScore || 0.5,
              accuracyScore: doc.accuracyScore || 0.8
            },

            // Indexing and processing metadata
            indexed: true,
            indexedAt: new Date(),
            vectorModelVersion: this.embeddingModel,
            processingVersion: '1.0'
          };

        } catch (error) {
          console.error(`Failed to process document ${doc._id}:`, error);
          return null;
        }
      });

      const batchResults = await Promise.all(batchPromises);
      const validResults = batchResults.filter(result => result !== null);
      processedDocuments.push(...validResults);

      // Rate limiting pause between batches
      if (i + batchSize < documents.length) {
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    // Bulk insert processed documents
    if (processedDocuments.length > 0) {
      const insertResult = await this.collections.documents.insertMany(processedDocuments, {
        ordered: false
      });

      console.log(`Successfully indexed ${insertResult.insertedCount} documents with vector embeddings`);

      return {
        totalProcessed: documents.length,
        successfullyIndexed: insertResult.insertedCount,
        failed: documents.length - processedDocuments.length,
        indexedDocuments: processedDocuments
      };
    }

    return {
      totalProcessed: documents.length,
      successfullyIndexed: 0,
      failed: documents.length,
      indexedDocuments: []
    };
  }

  async performSemanticSearch(query, options = {}) {
    console.log(`Performing semantic search for: "${query}"`);

    const {
      limit = 20,
      filters = {},
      includeScore = true,
      similarityThreshold = 0.7,
      searchType = 'semantic', // 'semantic', 'hybrid', 'keyword'
      userContext = null
    } = options;

    try {
      // Generate query embedding for semantic search
      const queryEmbedding = await this.generateEmbedding(query);

      let pipeline = [];

      if (searchType === 'semantic' || searchType === 'hybrid') {
        // Vector similarity search stage
        pipeline.push({
          $vectorSearch: {
            index: 'content_vector_index',
            path: 'contentVector',
            queryVector: queryEmbedding,
            numCandidates: limit * 10, // Search more candidates for better results
            limit: limit * 2, // Get more results for reranking
            filter: this.buildFilterExpression(filters)
          }
        });

        // Add vector search score
        pipeline.push({
          $addFields: {
            vectorScore: { $meta: 'vectorSearchScore' },
            searchMethod: 'vector'
          }
        });
      }

      if (searchType === 'hybrid') {
        // Combine with text search for hybrid approach
        pipeline.push({
          $unionWith: {
            coll: 'documents',
            pipeline: [
              {
                $search: {
                  index: 'hybrid_search_index',
                  compound: {
                    should: [
                      {
                        text: {
                          query: query,
                          path: ['title', 'content'],
                          score: { boost: { value: 2.0 } }
                        }
                      },
                      {
                        autocomplete: {
                          query: query,
                          path: 'title',
                          score: { boost: { value: 1.5 } }
                        }
                      }
                    ],
                    filter: this.buildSearchFilterClauses(filters)
                  }
                }
              },
              {
                $addFields: {
                  textScore: { $meta: 'searchScore' },
                  searchMethod: 'text'
                }
              },
              { $limit: limit }
            ]
          }
        });
      }

      // Enhanced result processing and ranking
      pipeline.push({
        $addFields: {
          // Calculate comprehensive relevance score
          relevanceScore: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$searchMethod', 'vector'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$vectorScore', 0] },
                      { $add: [
                        { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.2] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.1] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.15] },
                        0.55 // Base score weight
                      ]}
                    ]
                  }
                },
                {
                  case: { $eq: ['$searchMethod', 'text'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$textScore', 0] },
                      0.8 // Weight text search lower than semantic
                    ]
                  }
                }
              ],
              default: 0
            }
          },

          // Extract relevant snippets
          contentSnippet: {
            $substrCP: [
              '$content', 
              0, 
              300
            ]
          },

          // Calculate query-document semantic similarity
          semanticRelevance: {
            $cond: {
              if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold] },
              then: 'high',
              else: {
                $cond: {
                  if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold * 0.8] },
                  then: 'medium',
                  else: 'low'
                }
              }
            }
          }
        }
      });

      // User personalization if context provided
      if (userContext) {
        pipeline.push({
          $addFields: {
            personalizedScore: {
              $multiply: [
                '$relevanceScore',
                {
                  $add: [
                    // Category preference boost
                    {
                      $cond: {
                        if: { $in: ['$category', userContext.preferredCategories || []] },
                        then: 0.2,
                        else: 0
                      }
                    },
                    // Author preference boost  
                    {
                      $cond: {
                        if: { $in: ['$author', userContext.followedAuthors || []] },
                        then: 0.15,
                        else: 0
                      }
                    },
                    // Language preference
                    {
                      $cond: {
                        if: { $eq: ['$language', userContext.preferredLanguage || 'en'] },
                        then: 0.1,
                        else: -0.05
                      }
                    },
                    1.0 // Base multiplier
                  ]
                }
              ]
            }
          }
        });
      }

      // Filter by similarity threshold and finalize results
      pipeline.push(
        {
          $match: {
            relevanceScore: { $gte: similarityThreshold * 0.5 }
          }
        },
        {
          $sort: {
            [userContext ? 'personalizedScore' : 'relevanceScore']: -1,
            publishedDate: -1
          }
        },
        {
          $limit: limit
        },
        {
          $project: {
            _id: 1,
            title: 1,
            contentSnippet: 1,
            category: 1,
            tags: 1,
            author: 1,
            publishedDate: 1,
            contentType: 1,
            language: 1,
            entities: 1,
            keyPhrases: 1,
            searchMetadata: 1,
            // A $project cannot mix 0 and 1 for non-_id fields, so use $$REMOVE
            // to drop the score fields when includeScore is false
            relevanceScore: includeScore ? 1 : '$$REMOVE',
            personalizedScore: (includeScore && userContext) ? 1 : '$$REMOVE',
            vectorScore: includeScore ? 1 : '$$REMOVE',
            textScore: includeScore ? 1 : '$$REMOVE',
            semanticRelevance: 1,
            searchMethod: 1
          }
        }
      );

      const searchStart = Date.now();
      const results = await this.collections.documents.aggregate(pipeline).toArray();
      const searchTime = Date.now() - searchStart;

      // Log search analytics
      await this.logSearchAnalytics({
        query: query,
        searchType: searchType,
        filters: filters,
        resultCount: results.length,
        searchTime: searchTime,
        userContext: userContext,
        timestamp: new Date()
      });

      console.log(`Semantic search completed in ${searchTime}ms, found ${results.length} results`);

      return {
        query: query,
        searchType: searchType,
        results: results,
        metadata: {
          totalResults: results.length,
          searchTime: searchTime,
          similarityThreshold: similarityThreshold,
          filtersApplied: Object.keys(filters).length > 0
        }
      };

    } catch (error) {
      console.error('Semantic search failed:', error);
      throw error;
    }
  }

  async findSimilarDocuments(documentId, options = {}) {
    console.log(`Finding documents similar to: ${documentId}`);

    const {
      limit = 10,
      similarityThreshold = 0.75,
      excludeCategories = [],
      includeScore = true
    } = options;

    // Get the source document and its vector
    const sourceDocument = await this.collections.documents.findOne(
      { _id: documentId },
      { projection: { contentVector: 1, title: 1, category: 1, tags: 1 } }
    );

    if (!sourceDocument || !sourceDocument.contentVector) {
      throw new Error('Source document not found or not vectorized');
    }

    // Find similar documents using vector search
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: sourceDocument.contentVector,
          numCandidates: limit * 20,
          limit: limit * 2,
          filter: {
            $and: [
              { _id: { $ne: documentId } }, // Exclude source document
              // Only add the category exclusion when categories were supplied
              ...(excludeCategories.length > 0 ?
                [{ category: { $nin: excludeCategories } }] :
                [])
            ]
          }
        }
      },
      {
        $addFields: {
          similarityScore: { $meta: 'vectorSearchScore' },

          // Calculate additional similarity factors
          tagSimilarity: {
            $let: {
              vars: {
                commonTags: {
                  $size: {
                    $setIntersection: ['$tags', sourceDocument.tags || []]
                  }
                },
                totalTags: {
                  $add: [
                    { $size: { $ifNull: ['$tags', []] } },
                    { $size: { $ifNull: [sourceDocument.tags, []] } }
                  ]
                }
              },
              in: {
                $cond: {
                  if: { $gt: ['$$totalTags', 0] },
                  then: { $divide: ['$$commonTags', '$$totalTags'] },
                  else: 0
                }
              }
            }
          },

          categorySimilarity: {
            $cond: {
              if: { $eq: ['$category', sourceDocument.category] },
              then: 0.2,
              else: 0
            }
          }
        }
      },
      {
        $addFields: {
          combinedSimilarity: {
            $add: [
              { $multiply: ['$similarityScore', 0.7] },
              { $multiply: ['$tagSimilarity', 0.2] },
              '$categorySimilarity'
            ]
          }
        }
      },
      {
        $match: {
          combinedSimilarity: { $gte: similarityThreshold }
        }
      },
      {
        $sort: { combinedSimilarity: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          tags: 1,
          author: 1,
          publishedDate: 1,
          // Use $$REMOVE so the projection remains a valid inclusion when scores are hidden
          similarityScore: includeScore ? 1 : '$$REMOVE',
          combinedSimilarity: includeScore ? 1 : '$$REMOVE',
          searchMetadata: 1
        }
      }
    ];

    const similarDocuments = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      sourceDocumentId: documentId,
      sourceTitle: sourceDocument.title,
      similarDocuments: similarDocuments,
      metadata: {
        totalSimilar: similarDocuments.length,
        similarityThreshold: similarityThreshold,
        searchMethod: 'vector_similarity'
      }
    };
  }

  async generateRecommendations(userId, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    const {
      limit = 15,
      diversityFactor = 0.3,
      includeExplanations = true
    } = options;

    // Get user profile and interaction history
    const userProfile = await this.collections.userProfiles.findOne({ userId: userId });

    if (!userProfile) {
      console.log('User profile not found, using general recommendations');
      return this.generateGeneralRecommendations(limit);
    }

    // Build user preference vector from interaction history
    const userVector = await this.buildUserPreferenceVector(userProfile);

    if (!userVector) {
      return this.generateGeneralRecommendations(limit);
    }

    // Find documents matching user preferences
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: userVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            $and: [
              // Exclude already read documents
              { _id: { $nin: userProfile.readDocuments || [] } },

              // Include preferred categories only when the user has any
              ...(userProfile.preferredCategories && userProfile.preferredCategories.length > 0 ?
                [{ category: { $in: userProfile.preferredCategories } }] :
                []),

              // Fresh content preference
              {
                publishedDate: {
                  $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) // Last 90 days
                }
              }
            ]
          }
        }
      },
      {
        $addFields: {
          preferenceScore: { $meta: 'vectorSearchScore' },

          // Category affinity scoring ($switch requires at least one branch,
          // so fall back to a neutral constant when no affinities are recorded)
          categoryScore: (userProfile.categoryAffinities || []).length > 0 ? {
            $switch: {
              branches: userProfile.categoryAffinities.map(affinity => ({
                case: { $eq: ['$category', affinity.category] },
                then: affinity.score
              })),
              default: 0.5
            }
          } : 0.5,

          // Author following boost
          authorScore: {
            $cond: {
              if: { $in: ['$author', userProfile.followedAuthors || []] },
              then: 0.8,
              else: 0.4
            }
          },

          // Freshness scoring: document age as a fraction of a 30-day window
          // ($subtract needs date operands, so use new Date() rather than Date.now())
          freshnessScore: {
            $divide: [
              { $subtract: [new Date(), '$publishedDate'] },
              (30 * 24 * 60 * 60 * 1000) // 30 days in milliseconds
            ]
          }
        }
      },
      {
        $addFields: {
          recommendationScore: {
            $add: [
              { $multiply: ['$preferenceScore', 0.4] },
              { $multiply: ['$categoryScore', 0.25] },
              { $multiply: ['$authorScore', 0.2] },
              { $multiply: [{ $max: [0, { $subtract: [1, '$freshnessScore'] }] }, 0.15] }
            ]
          }
        }
      }
    ];

    // Apply diversity to avoid filter bubble
    if (diversityFactor > 0) {
      pipeline.push({
        $group: {
          _id: '$category',
          documents: {
            $push: {
              _id: '$_id',
              title: '$title',
              recommendationScore: '$recommendationScore',
              category: '$category',
              author: '$author',
              publishedDate: '$publishedDate',
              tags: '$tags'
            }
          },
          maxScore: { $max: '$recommendationScore' }
        }
      });

      pipeline.push({
        $sort: { maxScore: -1 }
      });

      // Select diverse recommendations
      pipeline.push({
        $project: {
          documents: {
            $slice: [
              { $sortArray: { input: '$documents', sortBy: { recommendationScore: -1 } } },
              Math.ceil(limit * diversityFactor)
            ]
          }
        }
      });

      pipeline.push({
        $unwind: '$documents'
      });

      pipeline.push({
        $replaceRoot: { newRoot: '$documents' }
      });
    }

    pipeline.push(
      {
        $sort: { recommendationScore: -1 }
      },
      {
        $limit: limit
      }
    );

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    // Generate explanations if requested
    if (includeExplanations) {
      for (const rec of recommendations) {
        rec.explanation = this.generateRecommendationExplanation(rec, userProfile);
      }
    }

    // Store recommendations for future analysis
    await this.collections.recommendations.insertOne({
      userId: userId,
      recommendations: recommendations.map(r => ({
        documentId: r._id,
        score: r.recommendationScore,
        explanation: r.explanation
      })),
      generatedAt: new Date(),
      algorithm: 'vector_preference_matching',
      diversityFactor: diversityFactor
    });

    return {
      userId: userId,
      recommendations: recommendations,
      metadata: {
        totalRecommendations: recommendations.length,
        algorithm: 'vector_preference_matching',
        diversityApplied: diversityFactor > 0,
        generatedAt: new Date()
      }
    };
  }

  // Helper methods for vector search operations

  async generateEmbedding(text) {
    // In production, this would call OpenAI API or other embedding service
    // For this example, we'll simulate embeddings
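    // A production sketch, assuming the official `openai` npm package (v4+)
    // and an OPENAI_API_KEY environment variable:
    //
    //   const OpenAI = require('openai');
    //   const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    //   const response = await openai.embeddings.create({
    //     model: this.embeddingModel,
    //     input: text
    //   });
    //   return response.data[0].embedding;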

    // Simulate API call delay
    await new Promise(resolve => setTimeout(resolve, 100));

    // Generate mock embedding vector (in production, use actual embedding API)
    const mockEmbedding = Array.from({ length: this.vectorConfig.dimensions }, () => 
      Math.random() * 2 - 1 // Values between -1 and 1
    );

    return mockEmbedding;
  }

  async extractEntities(text) {
    // Simulate entity extraction (in production, use NLP service)
    const entities = [];

    // Basic keyword extraction simulation
    const words = text.toLowerCase().split(/\W+/);
    const entityKeywords = ['mongodb', 'database', 'javascript', 'python', 'ai', 'machine learning'];

    entityKeywords.forEach(keyword => {
      if (words.includes(keyword) || words.includes(keyword.replace(' ', ''))) {
        entities.push({
          text: keyword,
          type: 'technology',
          confidence: 0.8
        });
      }
    });

    return entities;
  }

  async extractKeyPhrases(text) {
    // Simulate key phrase extraction
    const sentences = text.split(/[.!?]+/);
    const keyPhrases = [];

    sentences.forEach(sentence => {
      const words = sentence.trim().split(/\s+/);
      if (words.length >= 3 && words.length <= 8) {
        keyPhrases.push({
          phrase: sentence.trim(),
          relevance: Math.random()
        });
      }
    });

    return keyPhrases.sort((a, b) => b.relevance - a.relevance).slice(0, 10);
  }

  analyzeContentCharacteristics(content) {
    const wordCount = content.split(/\s+/).length;
    const sentenceCount = content.split(/[.!?]+/).length;
    const avgWordsPerSentence = wordCount / sentenceCount;

    return {
      complexity: avgWordsPerSentence > 20 ? 'high' : avgWordsPerSentence > 15 ? 'medium' : 'low',
      topics: ['general'], // Would use topic modeling in production
      sentiment: Math.random() * 2 - 1, // -1 to 1 scale
      readabilityScore: Math.max(0, Math.min(100, 100 - (avgWordsPerSentence * 2)))
    };
  }

  calculateFreshnessScore(publishedDate) {
    const ageInDays = (Date.now() - publishedDate.getTime()) / (24 * 60 * 60 * 1000);
    return Math.max(0, Math.min(1, 1 - (ageInDays / 365))); // Decay over 1 year
  }

  generateSummary(content) {
    // Simple summary generation (first 200 characters)
    return content.length > 200 ? content.substring(0, 197) + '...' : content;
  }

  buildFilterExpression(filters) {
    const filterExpression = { $and: [] };

    if (filters.category) {
      filterExpression.$and.push({ category: { $eq: filters.category } });
    }

    if (filters.author) {
      filterExpression.$and.push({ author: { $eq: filters.author } });
    }

    if (filters.tags && filters.tags.length > 0) {
      filterExpression.$and.push({ tags: { $in: filters.tags } });
    }

    if (filters.dateRange) {
      filterExpression.$and.push({ 
        publishedDate: {
          $gte: new Date(filters.dateRange.start),
          $lte: new Date(filters.dateRange.end)
        }
      });
    }

    return filterExpression.$and.length > 0 ? filterExpression : {};
  }

  buildSearchFilterClauses(filters) {
    const clauses = [];

    if (filters.category) {
      clauses.push({ equals: { path: 'category', value: filters.category } });
    }

    if (filters.tags && filters.tags.length > 0) {
      clauses.push({ in: { path: 'tags', value: filters.tags } });
    }

    return clauses;
  }

  async logSearchAnalytics(analyticsData) {
    try {
      await this.collections.searchAnalytics.insertOne({
        ...analyticsData,
        sessionId: analyticsData.userContext?.sessionId,
        userId: analyticsData.userContext?.userId
      });
    } catch (error) {
      console.warn('Failed to log search analytics:', error.message);
    }
  }

  async buildUserPreferenceVector(userProfile) {
    if (!userProfile.interactionHistory || userProfile.interactionHistory.length === 0) {
      return null;
    }

    // Get vectors for user's previously interacted documents
    const interactedDocuments = await this.collections.documents.find(
      { 
        _id: { $in: userProfile.interactionHistory.slice(-20).map(h => h.documentId) } 
      },
      { projection: { contentVector: 1 } }
    ).toArray();

    if (interactedDocuments.length === 0) {
      return null;
    }

    // Calculate weighted average vector based on interaction types
    const weightedVectors = interactedDocuments.map((doc, index) => {
      const interaction = userProfile.interactionHistory.find(h => 
        h.documentId.toString() === doc._id.toString()
      );

      const weight = this.getInteractionWeight(interaction.type);
      return doc.contentVector.map(val => val * weight);
    });

    // Average the vectors
    const dimensions = weightedVectors[0].length;
    const avgVector = Array(dimensions).fill(0);

    weightedVectors.forEach(vector => {
      vector.forEach((val, i) => {
        avgVector[i] += val;
      });
    });

    return avgVector.map(val => val / weightedVectors.length);
  }

  getInteractionWeight(interactionType) {
    const weights = {
      'view': 0.1,
      'like': 0.3,
      'share': 0.5,
      'bookmark': 0.7,
      'comment': 0.8
    };
    return weights[interactionType] || 0.1;
  }

  generateRecommendationExplanation(recommendation, userProfile) {
    const explanations = [];

    if (userProfile.preferredCategories && userProfile.preferredCategories.includes(recommendation.category)) {
      explanations.push(`Matches your interest in ${recommendation.category}`);
    }

    if (userProfile.followedAuthors && userProfile.followedAuthors.includes(recommendation.author)) {
      explanations.push(`By ${recommendation.author}, an author you follow`);
    }

    if (recommendation.tags) {
      const matchingTags = recommendation.tags.filter(tag => 
        userProfile.interests && userProfile.interests.includes(tag)
      );
      if (matchingTags.length > 0) {
        explanations.push(`Related to ${matchingTags.slice(0, 2).join(' and ')}`);
      }
    }

    if (explanations.length === 0) {
      explanations.push('Similar to content you\'ve previously engaged with');
    }

    return explanations.join('; ');
  }

  async generateGeneralRecommendations(limit) {
    // Fallback recommendations based on popularity and quality
    const pipeline = [
      {
        $addFields: {
          popularityScore: {
            $add: [
              { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.4] },
              { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.3] },
              { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.3] }
            ]
          }
        }
      },
      {
        $sort: { popularityScore: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          author: 1,
          publishedDate: 1,
          popularityScore: 1
        }
      }
    ];

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      recommendations: recommendations,
      metadata: {
        algorithm: 'popularity_based',
        totalRecommendations: recommendations.length
      }
    };
  }
}

// Benefits of MongoDB Atlas Vector Search:
// - Native vector database capabilities within MongoDB Atlas infrastructure
// - Seamless integration with existing MongoDB documents and operations  
// - Support for multiple vector similarity algorithms (cosine, euclidean, dot product)
// - Hybrid search combining vector similarity with traditional text search
// - Scalable vector indexing with automatic optimization and maintenance
// - Built-in filtering capabilities for combining semantic search with metadata filters
// - Real-time vector search with sub-second response times at scale
// - Integration with popular embedding models (OpenAI, Cohere, Hugging Face)
// - Support for multiple vector dimensions and embedding types
// - Advanced ranking and personalization capabilities for AI-powered applications

module.exports = {
  VectorSearchManager
};
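
The manager class is easiest to evaluate end to end with a small driver. The sketch below reuses the client and db handles defined at the top of the example along with the mock embedding generator; the query string, category filter, and threshold are illustrative values rather than part of any real dataset.

// Usage sketch for VectorSearchManager (illustrative values only)
async function runSemanticSearchExample() {
  await client.connect();

  const searchManager = new VectorSearchManager(db);

  const response = await searchManager.performSemanticSearch(
    'vector databases for semantic search',
    {
      limit: 10,
      searchType: 'semantic',
      filters: { category: 'Technology' },
      similarityThreshold: 0.7
    }
  );

  // Print ranked results with their computed relevance scores
  response.results.forEach((doc, index) => {
    console.log(`${index + 1}. ${doc.title} (score: ${doc.relevanceScore?.toFixed(3)})`);
  });

  await client.close();
}

runSemanticSearchExample().catch(console.error);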

Understanding MongoDB Vector Search Architecture

Advanced Vector Search Patterns and Optimization

Implement sophisticated vector search optimization techniques for production applications:

// Advanced vector search optimization and performance tuning
class VectorSearchOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexStrategies = {
      exactSearch: { type: 'exactSearch', precision: 1.0, speed: 'slow' },
      approximateSearch: { type: 'approximateSearch', precision: 0.95, speed: 'fast' },
      hierarchicalSearch: { type: 'hierarchicalSearch', precision: 0.98, speed: 'medium' }
    };
  }

  async optimizeVectorIndexConfiguration(collectionName, vectorField, options = {}) {
    console.log(`Optimizing vector index configuration for ${collectionName}.${vectorField}`);

    const {
      dimensions = 1536,
      similarityMetric = 'cosine',
      numCandidates = 1000,
      performanceTarget = 'balanced' // 'speed', 'accuracy', 'balanced'
    } = options;

    // Analyze existing data distribution
    const dataAnalysis = await this.analyzeVectorDataDistribution(collectionName, vectorField);

    // Determine optimal index configuration
    const indexConfig = this.calculateOptimalIndexConfig(
      dataAnalysis, 
      performanceTarget, 
      dimensions
    );

    // Create optimized vector search index configuration
    const optimizedIndex = {
      name: `optimized_${vectorField}_index`,
      definition: {
        fields: [
          {
            type: 'vector',
            path: vectorField,
            numDimensions: dimensions,
            similarity: similarityMetric
          },
          // Add filter fields based on common query patterns
          ...this.generateFilterFieldsFromAnalysis(dataAnalysis)
        ]
      },
      configuration: {
        // Advanced tuning parameters
        numCandidates: this.calculateOptimalCandidates(dataAnalysis.documentCount),
        ef: indexConfig.ef, // Search accuracy parameter
        efConstruction: indexConfig.efConstruction, // Build-time parameter
        maxConnections: indexConfig.maxConnections, // Graph connectivity

        // Performance optimizations
        vectorCompression: indexConfig.compressionEnabled,
        quantization: indexConfig.quantizationLevel,
        cachingStrategy: indexConfig.cachingStrategy
      }
    };

    console.log('Optimized vector index configuration:', optimizedIndex);

    return optimizedIndex;
  }

  async performVectorSearchBenchmark(collectionName, testQueries, indexConfigurations) {
    console.log(`Benchmarking vector search performance with ${testQueries.length} test queries`);

    const benchmarkResults = [];
    const benchmarkStart = Date.now();

    for (const config of indexConfigurations) {
      console.log(`Testing configuration: ${config.name}`);

      const configResults = {
        configurationName: config.name,
        queryResults: [],
        performanceMetrics: {
          avgLatency: 0,
          p95Latency: 0,
          p99Latency: 0,
          throughput: 0,
          accuracy: 0
        }
      };

      const latencies = [];
      const accuracyScores = [];

      const startTime = Date.now();

      for (let i = 0; i < testQueries.length; i++) {
        const query = testQueries[i];

        const queryStart = Date.now();

        try {
          const results = await this.db.collection(collectionName).aggregate([
            {
              $vectorSearch: {
                index: config.indexName,
                path: config.vectorField,
                queryVector: query.vector,
                numCandidates: config.numCandidates || 100,
                limit: query.limit || 10
              }
            },
            {
              $addFields: {
                score: { $meta: 'vectorSearchScore' }
              }
            }
          ]).toArray();

          const queryLatency = Date.now() - queryStart;
          latencies.push(queryLatency);

          // Calculate accuracy if ground truth available
          if (query.expectedResults) {
            const accuracy = this.calculateSearchAccuracy(results, query.expectedResults);
            accuracyScores.push(accuracy);
          }

          configResults.queryResults.push({
            queryIndex: i,
            resultCount: results.length,
            latency: queryLatency,
            topScore: results[0]?.score || 0
          });

        } catch (error) {
          console.error(`Query ${i} failed:`, error.message);
          configResults.queryResults.push({
            queryIndex: i,
            error: error.message,
            latency: null
          });
        }
      }

      const totalTime = Date.now() - startTime;

      // Calculate performance metrics
      const validLatencies = latencies.filter(l => l !== null);
      if (validLatencies.length > 0) {
        configResults.performanceMetrics.avgLatency = 
          validLatencies.reduce((sum, l) => sum + l, 0) / validLatencies.length;

        const sortedLatencies = validLatencies.sort((a, b) => a - b);
        configResults.performanceMetrics.p95Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.95)];
        configResults.performanceMetrics.p99Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.99)];

        configResults.performanceMetrics.throughput = 
          (validLatencies.length / totalTime) * 1000; // queries per second
      }

      if (accuracyScores.length > 0) {
        configResults.performanceMetrics.accuracy = 
          accuracyScores.reduce((sum, a) => sum + a, 0) / accuracyScores.length;
      }

      benchmarkResults.push(configResults);
    }

    // Analyze and rank configurations
    const rankedConfigurations = this.rankConfigurationsByPerformance(benchmarkResults);

    return {
      benchmarkResults: benchmarkResults,
      recommendations: rankedConfigurations,
      testMetadata: {
        totalQueries: testQueries.length,
        configurationsTested: indexConfigurations.length,
        benchmarkDuration: Date.now() - benchmarkStart
      }
    };
  }

  async implementAdvancedVectorSearchPatterns(collectionName, searchPattern, options = {}) {
    console.log(`Implementing advanced vector search pattern: ${searchPattern}`);

    // Only multiModalSearch and temporalVectorSearch are implemented in this example;
    // the remaining entries show where additional pattern implementations would plug in.
    const patterns = {
      multiModalSearch: () => this.implementMultiModalSearch(collectionName, options),
      hierarchicalSearch: () => this.implementHierarchicalSearch(collectionName, options),
      temporalVectorSearch: () => this.implementTemporalVectorSearch(collectionName, options),
      facetedVectorSearch: () => this.implementFacetedVectorSearch(collectionName, options),
      clusterBasedSearch: () => this.implementClusterBasedSearch(collectionName, options)
    };

    if (!patterns[searchPattern]) {
      throw new Error(`Unknown search pattern: ${searchPattern}`);
    }

    return await patterns[searchPattern]();
  }

  async implementMultiModalSearch(collectionName, options) {
    // Multi-modal search combining text, image, and other vector embeddings
    const {
      textVector,
      imageVector,
      audioVector,
      weights = { text: 0.5, image: 0.3, audio: 0.2 },
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);

    // Combine multiple vector searches
    const pipeline = [
      {
        $vectorSearch: {
          index: 'multi_modal_index',
          path: 'textVector',
          queryVector: textVector,
          numCandidates: limit * 5,
          limit: limit * 2
        }
      },
      {
        $addFields: {
          textScore: { $meta: 'vectorSearchScore' }
        }
      }
    ];

    if (imageVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'image_vector_index',
                path: 'imageVector',
                queryVector: imageVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                imageScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    if (audioVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'audio_vector_index', 
                path: 'audioVector',
                queryVector: audioVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                audioScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    // Combine scores from different modalities
    pipeline.push({
      $group: {
        _id: '$_id',
        doc: { $first: '$$ROOT' },
        textScore: { $max: { $ifNull: ['$textScore', 0] } },
        imageScore: { $max: { $ifNull: ['$imageScore', 0] } },
        audioScore: { $max: { $ifNull: ['$audioScore', 0] } }
      }
    });

    pipeline.push({
      $addFields: {
        combinedScore: {
          $add: [
            { $multiply: ['$textScore', weights.text] },
            { $multiply: ['$imageScore', weights.image] },
            { $multiply: ['$audioScore', weights.audio] }
          ]
        }
      }
    });

    pipeline.push({
      $sort: { combinedScore: -1 }
    });

    pipeline.push({
      $limit: limit
    });

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'multi_modal',
      results: results,
      weights: weights,
      metadata: {
        modalities: Object.keys(weights).filter(k => options[k + 'Vector']),
        totalResults: results.length
      }
    };
  }

  async implementTemporalVectorSearch(collectionName, options) {
    // Time-aware vector search with temporal relevance
    const {
      queryVector,
      timeWindow = { days: 30 },
      temporalWeight = 0.3,
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);
    const cutoffDate = new Date(Date.now() - timeWindow.days * 24 * 60 * 60 * 1000);

    const pipeline = [
      {
        $vectorSearch: {
          index: 'temporal_vector_index',
          path: 'contentVector',
          queryVector: queryVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            publishedDate: { $gte: cutoffDate }
          }
        }
      },
      {
        $addFields: {
          vectorScore: { $meta: 'vectorSearchScore' },

          // Calculate temporal relevance
          temporalScore: {
            $divide: [
              { $subtract: ['$publishedDate', cutoffDate] },
              { $subtract: [new Date(), cutoffDate] }
            ]
          }
        }
      },
      {
        $addFields: {
          combinedScore: {
            $add: [
              { $multiply: ['$vectorScore', 1 - temporalWeight] },
              { $multiply: ['$temporalScore', temporalWeight] }
            ]
          }
        }
      },
      {
        $sort: { combinedScore: -1 }
      },
      {
        $limit: limit
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'temporal_vector',
      results: results,
      temporalWindow: timeWindow,
      temporalWeight: temporalWeight
    };
  }

  // Helper methods for vector search optimization

  async analyzeVectorDataDistribution(collectionName, vectorField) {
    const collection = this.db.collection(collectionName);

    // Sample documents to analyze distribution
    const sampleSize = 1000;
    const pipeline = [
      { $sample: { size: sampleSize } },
      {
        $project: {
          vectorLength: { $size: `$${vectorField}` },
          vectorMagnitude: {
            $sqrt: {
              $reduce: {
                input: `$${vectorField}`,
                initialValue: 0,
                in: { $add: ['$$value', { $multiply: ['$$this', '$$this'] }] }
              }
            }
          }
        }
      }
    ];

    const samples = await collection.aggregate(pipeline).toArray();

    const totalDocs = await collection.countDocuments();
    const avgMagnitude = samples.reduce((sum, doc) => sum + doc.vectorMagnitude, 0) / samples.length;

    return {
      documentCount: totalDocs,
      sampleSize: samples.length,
      avgVectorMagnitude: avgMagnitude,
      vectorDimensions: samples[0]?.vectorLength || 0,
      magnitudeDistribution: this.calculateDistributionStats(
        samples.map(s => s.vectorMagnitude)
      )
    };
  }

  calculateOptimalIndexConfig(dataAnalysis, performanceTarget, dimensions) {
    const baseConfig = {
      ef: 200,
      efConstruction: 400,
      maxConnections: 32,
      compressionEnabled: false,
      quantizationLevel: 'none',
      cachingStrategy: 'adaptive'
    };

    // Adjust based on data characteristics and performance target
    if (dataAnalysis.documentCount > 1000000) {
      baseConfig.compressionEnabled = true;
      baseConfig.quantizationLevel = 'int8';
    }

    switch (performanceTarget) {
      case 'speed':
        baseConfig.ef = 100;
        baseConfig.efConstruction = 200;
        baseConfig.quantizationLevel = 'int8';
        break;
      case 'accuracy':
        baseConfig.ef = 400;
        baseConfig.efConstruction = 800;
        baseConfig.maxConnections = 64;
        break;
      case 'balanced':
      default:
        // Use base configuration
        break;
    }

    return baseConfig;
  }

  generateFilterFieldsFromAnalysis(dataAnalysis) {
    // Generate common filter fields based on data analysis
    return [
      { type: 'filter', path: 'category' },
      { type: 'filter', path: 'publishedDate' },
      { type: 'filter', path: 'tags' }
    ];
  }

  calculateOptimalCandidates(documentCount) {
    // Calculate optimal numCandidates based on collection size
    if (documentCount < 10000) return Math.min(documentCount, 100);
    if (documentCount < 100000) return 200;
    if (documentCount < 1000000) return 500;
    return 1000;
  }

  calculateSearchAccuracy(results, expectedResults) {
    // Calculate precision@k accuracy metric
    const actualIds = new Set(results.map(r => r._id.toString()));
    const expectedIds = new Set(expectedResults.map(r => r._id.toString()));

    let matches = 0;
    for (const id of actualIds) {
      if (expectedIds.has(id)) matches++;
    }

    return matches / Math.min(results.length, expectedResults.length);
  }

  rankConfigurationsByPerformance(benchmarkResults) {
    // Rank configurations based on composite performance score
    return benchmarkResults
      .map(result => ({
        ...result,
        compositeScore: this.calculateCompositeScore(result.performanceMetrics)
      }))
      .sort((a, b) => b.compositeScore - a.compositeScore)
      .map((result, index) => ({
        rank: index + 1,
        configurationName: result.configurationName,
        compositeScore: result.compositeScore,
        metrics: result.performanceMetrics,
        recommendation: this.generateConfigurationRecommendation(result)
      }));
  }

  calculateCompositeScore(metrics) {
    // Weighted composite score combining latency, throughput, and accuracy
    const latencyScore = metrics.avgLatency ? Math.max(0, 1 - (metrics.avgLatency / 1000)) : 0;
    const throughputScore = Math.min(1, metrics.throughput / 100);
    const accuracyScore = metrics.accuracy || 0.8;

    return (latencyScore * 0.4 + throughputScore * 0.3 + accuracyScore * 0.3);
  }

  generateConfigurationRecommendation(result) {
    const metrics = result.performanceMetrics;
    const recommendations = [];

    if (metrics.avgLatency > 500) {
      recommendations.push('Consider reducing numCandidates or enabling quantization for better latency');
    }

    if (metrics.accuracy < 0.8) {
      recommendations.push('Increase ef parameter or numCandidates to improve search accuracy');
    }

    if (metrics.throughput < 10) {
      recommendations.push('Optimize index configuration or consider horizontal scaling');
    }

    return recommendations.length > 0 ? recommendations : ['Configuration performs within acceptable parameters'];
  }

  calculateDistributionStats(values) {
    const sorted = values.slice().sort((a, b) => a - b);
    const mean = values.reduce((sum, val) => sum + val, 0) / values.length;

    return {
      mean: mean,
      median: sorted[Math.floor(sorted.length / 2)],
      min: sorted[0],
      max: sorted[sorted.length - 1],
      stddev: Math.sqrt(values.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / values.length)
    };
  }
}
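
As with the search manager, the optimizer is clearest with a short driver. The sketch below reuses the earlier client and db handles; the configuration names, index name, and randomly generated query vectors are placeholder assumptions, and a real benchmark would use logged queries with known expected results so accuracy can be scored.

// Usage sketch for VectorSearchOptimizer (placeholder configurations and queries)
async function runBenchmarkExample() {
  await client.connect();

  const optimizer = new VectorSearchOptimizer(db);

  // Random query vectors stand in for real embedded queries taken from search logs
  const testQueries = Array.from({ length: 20 }, () => ({
    vector: Array.from({ length: 1536 }, () => Math.random() * 2 - 1),
    limit: 10
  }));

  const report = await optimizer.performVectorSearchBenchmark('documents', testQueries, [
    { name: 'baseline', indexName: 'content_vector_index', vectorField: 'contentVector', numCandidates: 100 },
    { name: 'wide_candidates', indexName: 'content_vector_index', vectorField: 'contentVector', numCandidates: 500 }
  ]);

  report.recommendations.forEach(rec => {
    console.log(`#${rec.rank} ${rec.configurationName}: composite score ${rec.compositeScore.toFixed(3)}`);
  });

  await client.close();
}

runBenchmarkExample().catch(console.error);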

SQL-Style Vector Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB vector search operations:

-- QueryLeaf vector search operations with SQL-familiar syntax

-- Create vector search index with SQL DDL
CREATE VECTOR INDEX content_embeddings_idx ON documents (
  content_vector VECTOR(1536) USING cosine_similarity
  WITH (
    num_candidates = 1000,
    index_type = 'hnsw',
    ef_construction = 400,
    max_connections = 32
  )
) 
INCLUDE (category, tags, published_date, author) AS filters;

-- Advanced semantic search with SQL-style vector operations
WITH semantic_query AS (
  -- Generate query embedding (integrated with embedding services)
  SELECT embed_text('machine learning algorithms for natural language processing') as query_vector
),

vector_search_results AS (
  SELECT 
    d.document_id,
    d.title,
    d.content,
    d.category,
    d.tags,
    d.author,
    d.published_date,

    -- Vector similarity search with cosine similarity
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') as similarity_score,

    -- Vector distance calculations
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'euclidean') as euclidean_distance,
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'manhattan') as manhattan_distance,

    -- Vector magnitude and normalization
    VECTOR_MAGNITUDE(d.content_vector) as vector_magnitude,
    VECTOR_NORMALIZE(d.content_vector) as normalized_vector

  FROM documents d
  CROSS JOIN semantic_query sq
  WHERE 
    -- Vector similarity threshold filtering
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') > 0.75

    -- Traditional filters combined with vector search
    AND d.category IN ('AI', 'Technology', 'Data Science')
    AND d.published_date >= CURRENT_DATE - INTERVAL '1 year'

    -- Vector search with K-nearest neighbors
    AND d.document_id IN (
      SELECT document_id 
      FROM VECTOR_KNN_SEARCH(
        table_name => 'documents',
        vector_column => 'content_vector', 
        query_vector => sq.query_vector,
        k => 50,
        distance_function => 'cosine'
      )
    )
),

enhanced_results AS (
  SELECT 
    vsr.*,

    -- Advanced similarity calculations
    VECTOR_DOT_PRODUCT(vsr.normalized_vector, sq.query_vector) as dot_product_similarity,

    -- Multi-vector comparison for hybrid matching
    GREATEST(
      VECTOR_SIMILARITY(d.title_vector, sq.query_vector, 'cosine'),
      vsr.similarity_score * 0.8
    ) as hybrid_similarity_score,

    -- Vector clustering and topic modeling
    VECTOR_CLUSTER_ID(vsr.content_vector, 'kmeans', 10) as topic_cluster,
    VECTOR_TOPIC_PROBABILITY(vsr.content_vector, ARRAY['AI', 'ML', 'NLP', 'Data Science']) as topic_probabilities,

    -- Temporal vector decay for freshness
    vsr.similarity_score * EXP(-0.1 * EXTRACT(DAYS FROM (CURRENT_DATE - vsr.published_date))) as time_decayed_similarity,

    -- Content quality boosting based on vector characteristics
    vsr.similarity_score * (1 + LOG(GREATEST(1, ARRAY_LENGTH(vsr.tags, 1)) / 10.0)) as quality_boosted_similarity,

    -- Personalization using user preference vectors
    COALESCE(
      VECTOR_SIMILARITY(vsr.content_vector, user_preference_vector('user_123'), 'cosine') * 0.3,
      0
    ) as personalization_boost

  FROM vector_search_results vsr
  CROSS JOIN semantic_query sq
  LEFT JOIN documents d ON vsr.document_id = d.document_id
  WHERE vsr.similarity_score > 0.70
),

final_ranked_results AS (
  SELECT 
    document_id,
    title,
    SUBSTRING(content, 1, 300) || '...' as content_preview,
    category,
    tags,
    author,
    published_date,

    -- Comprehensive relevance scoring
    ROUND((
      hybrid_similarity_score * 0.4 +
      time_decayed_similarity * 0.25 +
      quality_boosted_similarity * 0.2 +
      personalization_boost * 0.15
    )::numeric, 4) as final_relevance_score,

    -- Individual score components for analysis
    ROUND(similarity_score::numeric, 4) as base_similarity,
    ROUND(hybrid_similarity_score::numeric, 4) as hybrid_score,
    ROUND(time_decayed_similarity::numeric, 4) as freshness_score,
    ROUND(personalization_boost::numeric, 4) as personal_score,

    -- Vector metadata
    topic_cluster,
    topic_probabilities,
    vector_magnitude,

    -- Search result ranking
    ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as search_rank,
    COUNT(*) OVER () as total_results

  FROM enhanced_results
  WHERE (
    hybrid_similarity_score * 0.4 +
    time_decayed_similarity * 0.25 +
    quality_boosted_similarity * 0.2 +
    personalization_boost * 0.15
  ) > 0.6
)

SELECT 
  search_rank,
  document_id,
  title,
  content_preview,
  category,
  STRING_AGG(DISTINCT tag, ', ' ORDER BY tag) as tags_summary,
  author,
  published_date,
  final_relevance_score,

  -- Explanation of ranking factors
  JSON_BUILD_OBJECT(
    'base_similarity', base_similarity,
    'hybrid_boost', hybrid_score - base_similarity,
    'freshness_impact', freshness_score - base_similarity,
    'personalization_impact', personal_score,
    'topic_cluster', topic_cluster,
    'primary_topics', (
      SELECT ARRAY_AGG(topic ORDER BY probability DESC)
      FROM UNNEST(topic_probabilities) WITH ORDINALITY AS t(probability, topic)
      WHERE probability > 0.1
      LIMIT 3
    )
  ) as ranking_explanation

FROM final_ranked_results
CROSS JOIN UNNEST(tags) as tag
GROUP BY search_rank, document_id, title, content_preview, category, author, 
         published_date, final_relevance_score, base_similarity, hybrid_score, 
         freshness_score, personal_score, topic_cluster, topic_probabilities
ORDER BY final_relevance_score DESC
LIMIT 20;

-- Advanced vector aggregation and analytics
WITH vector_analysis AS (
  SELECT 
    category,
    author,
    DATE_TRUNC('month', published_date) as month_bucket,

    -- Vector aggregation functions
    VECTOR_AVG(content_vector) as category_centroid_vector,
    VECTOR_STDDEV(content_vector) as vector_spread,

    -- Vector clustering within groups
    VECTOR_KMEANS_CENTROIDS(content_vector, 5) as sub_clusters,

    -- Similarity analysis within categories
    AVG(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as avg_internal_similarity,
    MIN(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as min_internal_similarity,
    MAX(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as max_internal_similarity,

    -- Document count and metadata
    COUNT(*) as document_count,
    AVG(ARRAY_LENGTH(tags, 1)) as avg_tags_per_doc,
    AVG(LENGTH(content)) as avg_content_length,

    -- Vector quality metrics
    AVG(VECTOR_MAGNITUDE(content_vector)) as avg_vector_magnitude,
    STDDEV(VECTOR_MAGNITUDE(content_vector)) as vector_magnitude_stddev

  FROM documents
  WHERE published_date >= CURRENT_DATE - INTERVAL '2 years'
    AND content_vector IS NOT NULL
  GROUP BY category, author, DATE_TRUNC('month', published_date)
),

cross_category_analysis AS (
  SELECT 
    va1.category as category_a,
    va2.category as category_b,

    -- Cross-category vector similarity
    VECTOR_SIMILARITY(va1.category_centroid_vector, va2.category_centroid_vector, 'cosine') as category_similarity,

    -- Content overlap analysis
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'tags') as tag_overlap,
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'authors') as author_overlap,

    -- Temporal correlation
    CORRELATION(va1.document_count, va2.document_count) OVER (
      PARTITION BY va1.category, va2.category 
      ORDER BY va1.month_bucket
    ) as temporal_correlation

  FROM vector_analysis va1
  CROSS JOIN vector_analysis va2
  WHERE va1.category != va2.category
    AND va1.month_bucket = va2.month_bucket
    AND va1.document_count >= 5
    AND va2.document_count >= 5
),

semantic_recommendations AS (
  SELECT 
    category,

    -- Find most similar categories for recommendation
    ARRAY_AGG(
      category_b ORDER BY category_similarity DESC
    ) FILTER (WHERE category_similarity > 0.7) as similar_categories,

    -- Trending analysis
    CASE 
      WHEN temporal_correlation > 0.8 THEN 'strongly_correlated'
      WHEN temporal_correlation > 0.5 THEN 'moderately_correlated' 
      WHEN temporal_correlation < -0.5 THEN 'inversely_correlated'
      ELSE 'independent'
    END as trend_relationship,

    -- Content strategy recommendations
    CASE
      WHEN AVG(category_similarity) > 0.8 THEN 'High content overlap - consider specialization'
      WHEN AVG(category_similarity) < 0.3 THEN 'Low overlap - good content differentiation'
      ELSE 'Moderate overlap - balanced content strategy'
    END as content_strategy_recommendation

  FROM cross_category_analysis
  GROUP BY category_a, temporal_correlation
)

SELECT 
  va.category,
  va.document_count,
  ROUND(va.avg_internal_similarity::numeric, 3) as content_consistency_score,
  ROUND(va.avg_vector_magnitude::numeric, 3) as content_richness_score,

  -- Vector-based content insights
  CASE 
    WHEN va.avg_internal_similarity > 0.8 THEN 'Highly consistent content'
    WHEN va.avg_internal_similarity > 0.6 THEN 'Moderately consistent content'
    ELSE 'Diverse content range'
  END as content_consistency_assessment,

  -- Similar categories for cross-promotion
  sr.similar_categories,
  sr.trend_relationship,
  sr.content_strategy_recommendation,

  -- Growth and engagement potential
  CASE
    WHEN va.document_count > LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket
    ) THEN 'Growing'
    WHEN va.document_count < LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket  
    ) THEN 'Declining'
    ELSE 'Stable'
  END as content_trend,

  -- Vector search optimization recommendations
  CASE
    WHEN va.vector_magnitude_stddev > 0.5 THEN 'Consider vector normalization for consistent search performance'
    WHEN va.avg_vector_magnitude < 0.1 THEN 'Low vector magnitudes may indicate embedding quality issues'
    ELSE 'Vector embeddings appear well-distributed'
  END as search_optimization_advice

FROM vector_analysis va
LEFT JOIN semantic_recommendations sr ON va.category = sr.category
WHERE va.month_bucket >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months')
ORDER BY va.document_count DESC, va.avg_internal_similarity DESC;

-- Real-time vector search performance monitoring
WITH search_performance_metrics AS (
  SELECT 
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,
    search_type,

    -- Query performance metrics
    COUNT(*) as total_searches,
    AVG(response_time_ms) as avg_response_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95_response_time,
    MAX(response_time_ms) as max_response_time,

    -- Result quality metrics
    AVG(result_count) as avg_results_returned,
    AVG(CASE WHEN result_count > 0 THEN top_similarity_score ELSE NULL END) as avg_top_similarity,
    AVG(user_satisfaction_score) as avg_user_satisfaction,

    -- Vector search specific metrics
    AVG(vector_candidates_examined) as avg_candidates_examined,
    AVG(vector_index_hit_ratio) as avg_index_hit_ratio,
    COUNT(*) FILTER (WHERE similarity_threshold_met = true) as threshold_met_count,

    -- Error and timeout analysis
    COUNT(*) FILTER (WHERE search_timeout = true) as timeout_count,
    COUNT(*) FILTER (WHERE search_error IS NOT NULL) as error_count,
    STRING_AGG(DISTINCT search_error, '; ') as error_types

  FROM vector_search_log
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),

performance_alerts AS (
  SELECT 
    hour_bucket,
    search_type,
    total_searches,
    avg_response_time,
    p95_response_time,
    avg_user_satisfaction,

    -- Performance alerting logic
    CASE 
      WHEN avg_response_time > 1000 THEN 'CRITICAL - High average latency'
      WHEN p95_response_time > 2000 THEN 'WARNING - High P95 latency'
      WHEN avg_user_satisfaction < 0.7 THEN 'WARNING - Low user satisfaction'
      WHEN timeout_count > total_searches * 0.05 THEN 'WARNING - High timeout rate'
      ELSE 'NORMAL'
    END as performance_status,

    -- Optimization recommendations
    CASE
      WHEN avg_candidates_examined > 10000 THEN 'Consider reducing numCandidates for better performance'
      WHEN avg_index_hit_ratio < 0.8 THEN 'Index may need rebuilding - low hit ratio detected'
      WHEN error_count > 0 THEN 'Investigate errors: ' || error_types
      ELSE 'Performance within normal parameters'
    END as optimization_recommendation,

    -- Trending analysis
    avg_response_time - LAG(avg_response_time) OVER (
      PARTITION BY search_type 
      ORDER BY hour_bucket
    ) as latency_trend,

    total_searches - LAG(total_searches) OVER (
      PARTITION BY search_type
      ORDER BY hour_bucket  
    ) as volume_trend

  FROM search_performance_metrics
)

SELECT 
  hour_bucket,
  search_type,
  total_searches,
  ROUND(avg_response_time::numeric, 1) as avg_latency_ms,
  ROUND(p95_response_time::numeric, 1) as p95_latency_ms,
  ROUND(avg_user_satisfaction::numeric, 2) as satisfaction_score,
  performance_status,
  optimization_recommendation,

  -- Trend indicators
  CASE 
    WHEN latency_trend > 200 THEN 'DEGRADING'
    WHEN latency_trend < -200 THEN 'IMPROVING' 
    ELSE 'STABLE'
  END as latency_trend_status,

  CASE
    WHEN volume_trend > total_searches * 0.2 THEN 'HIGH_GROWTH'
    WHEN volume_trend > total_searches * 0.1 THEN 'GROWING'
    WHEN volume_trend < -total_searches * 0.1 THEN 'DECLINING'
    ELSE 'STABLE'
  END as volume_trend_status

FROM performance_alerts
WHERE performance_status != 'NORMAL' OR hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
ORDER BY hour_bucket DESC, total_searches DESC;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. SQL-familiar vector operations with VECTOR_SIMILARITY, VECTOR_DISTANCE functions
-- 2. Advanced K-nearest neighbors search with customizable distance functions
-- 3. Hybrid search combining vector similarity with traditional text search
-- 4. Vector aggregation functions for analytics and clustering
-- 5. Real-time performance monitoring and optimization recommendations
-- 6. Multi-modal vector search across text, image, and audio embeddings
-- 7. Temporal vector search with time-aware relevance scoring
-- 8. Vector-based recommendation systems with personalization
-- 9. Integration with MongoDB's native vector search optimizations
-- 10. Familiar SQL patterns for complex vector analytics and reporting

Best Practices for Vector Search Implementation

Vector Index Design Strategy

Essential principles for optimal MongoDB vector search design (see the index definition sketch after this list):

  1. Embedding Selection: Choose appropriate embedding models based on content type and use case requirements
  2. Index Configuration: Optimize vector index parameters for the balance of accuracy and performance needed
  3. Filtering Strategy: Design metadata filters to narrow search space before vector similarity calculations
  4. Dimensionality Management: Select optimal embedding dimensions based on content complexity and performance requirements
  5. Update Patterns: Plan for efficient vector updates and re-indexing as content changes
  6. Quality Assurance: Implement vector quality validation and monitoring for embedding consistency
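
These index design points map directly onto an Atlas Vector Search index definition. The sketch below is one possible shape, assuming a documents collection with a content_vector embedding field as in the earlier examples, a 1536-dimension embedding model, and a driver version that supports createSearchIndex with the vectorSearch type; the database name and dimension count are illustrative assumptions rather than fixed requirements.

// Hypothetical sketch: Atlas Vector Search index for a `documents` collection.
// Field names follow the earlier examples; numDimensions must match your embedding model.
const { MongoClient } = require('mongodb');

async function createDocumentVectorIndex(uri) {
  const client = new MongoClient(uri);
  try {
    const documents = client.db('content_platform').collection('documents');

    await documents.createSearchIndex({
      name: 'document_vector_index',
      type: 'vectorSearch',
      definition: {
        fields: [
          // The embedding field: similarity metric and dimensionality drive the accuracy/performance balance
          { type: 'vector', path: 'content_vector', numDimensions: 1536, similarity: 'cosine' },
          // Filter fields narrow the candidate set before similarity scoring
          { type: 'filter', path: 'category' },
          { type: 'filter', path: 'published_date' }
        ]
      }
    });
  } finally {
    await client.close();
  }
}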

Performance and Scalability

Optimize MongoDB vector search for production workloads (see the query-tuning sketch after this list):

  1. Index Optimization: Monitor and tune vector index parameters based on actual query patterns
  2. Hybrid Search: Combine vector and traditional search for optimal relevance and performance
  3. Caching Strategy: Implement intelligent caching for frequently accessed vectors and query results
  4. Resource Planning: Plan memory and compute resources for vector search operations at scale
  5. Monitoring Setup: Implement comprehensive vector search performance and quality monitoring
  6. Testing Strategy: Develop thorough testing for vector search accuracy and performance characteristics
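
A query-side counterpart to these recommendations is tuning numCandidates against latency, which is the same trade-off the monitoring queries above track through avg_candidates_examined. The sketch below is a hedged example assuming the index defined earlier; the index name, field names, and filter value are illustrative.

// Hypothetical sketch: $vectorSearch query with explicit candidate tuning and a metadata pre-filter.
async function searchSimilarDocuments(documents, queryVector, category) {
  return documents.aggregate([
    {
      $vectorSearch: {
        index: 'document_vector_index',
        path: 'content_vector',
        queryVector: queryVector,                 // embedding of the user's query text
        numCandidates: 200,                       // raise for recall, lower for latency
        limit: 20,
        filter: { category: { $eq: category } }   // pre-filter on an indexed filter field
      }
    },
    {
      $project: {
        title: 1,
        category: 1,
        published_date: 1,
        score: { $meta: 'vectorSearchScore' }     // similarity score for ranking and telemetry
      }
    }
  ]).toArray();
}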

Conclusion

MongoDB Atlas Vector Search provides native vector database capabilities that eliminate the complexity and infrastructure overhead of separate vector databases while enabling sophisticated semantic search and AI-powered applications. The seamless integration with MongoDB's document model allows developers to combine traditional database operations with advanced vector search in a unified platform.

Key MongoDB Vector Search benefits include:

  • Native Integration: Built-in vector search capabilities within MongoDB Atlas infrastructure
  • Semantic Understanding: Advanced similarity search that understands meaning and context
  • Hybrid Search: Combining vector similarity with traditional text search and metadata filtering
  • Scalable Performance: Production-ready vector indexing with sub-second response times
  • AI-Ready Platform: Direct integration with popular embedding models and AI frameworks
  • Familiar Operations: Vector search operations integrated with standard MongoDB query patterns

Whether you're building recommendation systems, semantic search applications, RAG implementations, or any application requiring intelligent content discovery, MongoDB Atlas Vector Search with QueryLeaf's familiar SQL interface provides the foundation for modern AI-powered applications.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB vector search operations while providing SQL-familiar vector query syntax, similarity functions, and performance optimization. Advanced vector search patterns, multi-modal search, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI-powered search both powerful and accessible to SQL-oriented development teams.

The integration of native vector search capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both intelligent semantic search and familiar database interaction patterns, ensuring your AI-powered applications remain both innovative and maintainable as they scale and evolve.

MongoDB Time-Series Collections for IoT and Analytics: High-Performance Data Management with SQL-Style Time-Series Operations

Modern IoT applications, sensor networks, and real-time analytics systems generate massive volumes of time-series data that require specialized storage and query optimization to maintain performance at scale. Traditional relational databases struggle with the high ingestion rates, storage efficiency, and specialized query patterns typical of time-series workloads.

MongoDB Time-Series Collections provide purpose-built optimization for temporal data storage and retrieval, enabling efficient handling of high-frequency sensor data, metrics, logs, and analytics with automatic bucketing, compression, and time-based indexing. Unlike generic document storage that treats all data equally, time-series collections optimize for temporal access patterns, data compression, and analytical aggregations.
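
For orientation before the detailed examples below: a time-series collection is declared at creation time with a time field, an optional metadata field, a granularity hint, and, if desired, an automatic retention window. The following is a minimal sketch; the database, collection, and retention values are illustrative assumptions.

// Minimal sketch: declaring a time-series collection with built-in retention.
const { MongoClient } = require('mongodb');

async function createReadingsCollection(uri) {
  const client = new MongoClient(uri);
  try {
    const db = client.db('iot_platform');
    await db.createCollection('sensor_readings', {
      timeseries: {
        timeField: 'timestamp',   // required: BSON date present on every measurement
        metaField: 'metadata',    // optional: groups measurements from the same source into buckets
        granularity: 'minutes'    // bucket-sizing hint: 'seconds' | 'minutes' | 'hours'
      },
      expireAfterSeconds: 60 * 60 * 24 * 90  // automatically expire measurements older than ~90 days
    });
  } finally {
    await client.close();
  }
}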

The Traditional Time-Series Data Challenge

Conventional approaches to managing high-volume time-series data face significant scalability and performance limitations:

-- Traditional relational approach - poor performance with high-volume time-series data

-- PostgreSQL time-series table with performance challenges
CREATE TABLE sensor_readings (
  id BIGSERIAL PRIMARY KEY,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
  value NUMERIC(15,6) NOT NULL,
  unit VARCHAR(20),
  location_lat NUMERIC(10,8),
  location_lng NUMERIC(11,8),
  quality_score INTEGER,
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Indexes for time-series queries (heavy overhead)
CREATE INDEX idx_sensor_device_time ON sensor_readings(device_id, timestamp DESC);
CREATE INDEX idx_sensor_type_time ON sensor_readings(sensor_type, timestamp DESC);
CREATE INDEX idx_sensor_time_range ON sensor_readings(timestamp DESC);
CREATE INDEX idx_sensor_location ON sensor_readings USING GIST(location_lat, location_lng);

-- High-frequency data insertion challenges
INSERT INTO sensor_readings (device_id, sensor_type, timestamp, value, unit, location_lat, location_lng, quality_score, metadata)
SELECT 
  'device_' || (i % 1000)::text,
  CASE (i % 5)
    WHEN 0 THEN 'temperature'
    WHEN 1 THEN 'humidity'
    WHEN 2 THEN 'pressure'
    WHEN 3 THEN 'light'
    ELSE 'motion'
  END,
  NOW() - (i || ' seconds')::interval,
  RANDOM() * 100,
  CASE (i % 5)
    WHEN 0 THEN 'celsius'
    WHEN 1 THEN 'percent'
    WHEN 2 THEN 'pascal'
    WHEN 3 THEN 'lux'
    ELSE 'boolean'
  END,
  40.7128 + (RANDOM() - 0.5) * 0.1,
  -74.0060 + (RANDOM() - 0.5) * 0.1,
  (RANDOM() * 100)::integer,
  ('{"source": "sensor_' || (i % 50)::text || '", "batch_id": "' || (i / 1000)::text || '"}')::jsonb
FROM generate_series(1, 1000000) as i;

-- Complex time-series aggregation with performance issues
WITH hourly_aggregates AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Basic aggregations (expensive with large datasets)
    COUNT(*) as reading_count,
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV(value) as std_deviation,

    -- Percentile calculations (very expensive)
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) as p99,

    -- Quality metrics
    AVG(quality_score) as avg_quality,
    COUNT(*) FILTER (WHERE quality_score > 90) as high_quality_readings,

    -- Data completeness analysis
    COUNT(DISTINCT EXTRACT(MINUTE FROM timestamp)) as minutes_with_data,
    (COUNT(DISTINCT EXTRACT(MINUTE FROM timestamp)) / 60.0 * 100) as data_completeness_percent,

    -- Location analysis (expensive with geographic functions)
    AVG(location_lat) as avg_lat,
    AVG(location_lng) as avg_lng,
    ST_ConvexHull(ST_Collect(ST_Point(location_lng, location_lat))) as reading_area

  FROM sensor_readings 
  WHERE timestamp >= NOW() - INTERVAL '7 days'
    AND timestamp < NOW()
    AND quality_score > 50
  GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

daily_trends AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('day', hour_bucket) as day_bucket,

    -- Daily aggregations from hourly data
    SUM(reading_count) as daily_reading_count,
    AVG(avg_value) as daily_avg_value,
    MIN(min_value) as daily_min_value,
    MAX(max_value) as daily_max_value,

    -- Trend analysis (complex calculations)
    REGR_SLOPE(avg_value, EXTRACT(HOUR FROM hour_bucket)) as hourly_trend_slope,
    REGR_R2(avg_value, EXTRACT(HOUR FROM hour_bucket)) as trend_correlation,

    -- Volatility analysis
    STDDEV(avg_value) as daily_volatility,
    (MAX(avg_value) - MIN(avg_value)) as daily_range,

    -- Peak hour identification
    (array_agg(EXTRACT(HOUR FROM hour_bucket) ORDER BY avg_value DESC))[1] as peak_hour,
    (array_agg(avg_value ORDER BY avg_value DESC))[1] as peak_value,

    -- Data quality metrics
    AVG(avg_quality) as daily_avg_quality,
    AVG(data_completeness_percent) as avg_completeness

  FROM hourly_aggregates
  GROUP BY device_id, sensor_type, DATE_TRUNC('day', hour_bucket)
),

sensor_performance_analysis AS (
  SELECT 
    s.device_id,
    s.sensor_type,

    -- Performance metrics over analysis period
    COUNT(*) as total_readings,
    AVG(s.value) as overall_avg_value,
    STDDEV(s.value) as overall_std_deviation,

    -- Operational metrics
    EXTRACT(EPOCH FROM (MAX(s.timestamp) - MIN(s.timestamp))) / 3600 as hours_active,
    COUNT(*) / NULLIF(EXTRACT(EPOCH FROM (MAX(s.timestamp) - MIN(s.timestamp))) / 3600, 0) as avg_readings_per_hour,

    -- Reliability analysis
    COUNT(*) FILTER (WHERE s.quality_score > 90) / COUNT(*)::float as high_quality_ratio,
    COUNT(*) FILTER (WHERE s.value IS NULL) / COUNT(*)::float as null_value_ratio,

    -- Geographic consistency
    STDDEV(s.location_lat) as lat_consistency,
    STDDEV(s.location_lng) as lng_consistency,

    -- Recent performance vs historical
    AVG(s.value) FILTER (WHERE s.timestamp >= NOW() - INTERVAL '1 day') as recent_avg,
    AVG(s.value) FILTER (WHERE s.timestamp < NOW() - INTERVAL '1 day') as historical_avg,

    -- Anomaly detection (simplified illustration; window functions cannot be nested
    -- inside aggregate FILTER clauses, so a real implementation needs an extra subquery pass)
    COUNT(*) FILTER (WHERE ABS(s.value - AVG(s.value) OVER (PARTITION BY s.device_id, s.sensor_type)) > 3 * STDDEV(s.value) OVER (PARTITION BY s.device_id, s.sensor_type)) as anomaly_count

  FROM sensor_readings s
  WHERE s.timestamp >= NOW() - INTERVAL '7 days'
  GROUP BY s.device_id, s.sensor_type
)

SELECT 
  spa.device_id,
  spa.sensor_type,
  spa.total_readings,
  ROUND(spa.overall_avg_value::numeric, 3) as avg_value,
  ROUND(spa.overall_std_deviation::numeric, 3) as std_deviation,
  ROUND(spa.hours_active::numeric, 1) as hours_active,
  ROUND(spa.avg_readings_per_hour::numeric, 1) as readings_per_hour,
  ROUND(spa.high_quality_ratio::numeric * 100, 1) as quality_percent,
  spa.anomaly_count,

  -- Daily trend summary
  ROUND(AVG(dt.daily_avg_value)::numeric, 3) as avg_daily_value,
  ROUND(STDDEV(dt.daily_avg_value)::numeric, 3) as daily_volatility,
  ROUND(AVG(dt.hourly_trend_slope)::numeric, 6) as avg_hourly_trend,

  -- Performance assessment
  CASE 
    WHEN spa.high_quality_ratio > 0.95 AND spa.avg_readings_per_hour > 50 THEN 'excellent'
    WHEN spa.high_quality_ratio > 0.90 AND spa.avg_readings_per_hour > 20 THEN 'good'
    WHEN spa.high_quality_ratio > 0.75 AND spa.avg_readings_per_hour > 5 THEN 'acceptable'
    ELSE 'poor'
  END as performance_rating,

  -- Alerting flags
  spa.anomaly_count > spa.total_readings * 0.05 as high_anomaly_rate,
  ABS(spa.recent_avg - spa.historical_avg) > spa.overall_std_deviation * 2 as significant_recent_change,
  spa.avg_readings_per_hour < 1 as low_frequency_readings

FROM sensor_performance_analysis spa
LEFT JOIN daily_trends dt ON spa.device_id = dt.device_id AND spa.sensor_type = dt.sensor_type
GROUP BY spa.device_id, spa.sensor_type, spa.total_readings, spa.overall_avg_value, 
         spa.overall_std_deviation, spa.hours_active, spa.avg_readings_per_hour, 
         spa.high_quality_ratio, spa.anomaly_count, spa.recent_avg, spa.historical_avg
ORDER BY spa.total_readings DESC, spa.avg_readings_per_hour DESC;

-- Problems with traditional time-series approaches:
-- 1. Poor insertion performance due to index maintenance overhead
-- 2. Inefficient storage with high space usage for repetitive time-series data
-- 3. Complex partitioning strategies required for time-based data management
-- 4. Expensive aggregation queries across large time ranges
-- 5. Limited built-in optimization for temporal access patterns
-- 6. Manual compression and archival strategies needed
-- 7. Poor performance with high-cardinality device/sensor combinations
-- 8. Complex schema evolution for changing sensor types and metadata
-- 9. Difficulty with real-time analytics on streaming time-series data
-- 10. Limited support for time-based bucketing and automatic rollups

-- MySQL time-series approach (even more limitations)
CREATE TABLE mysql_sensor_data (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  reading_time DATETIME(3) NOT NULL,
  sensor_value DECIMAL(15,6),
  metadata JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  INDEX idx_device_time (device_id, reading_time),
  INDEX idx_sensor_time (sensor_type, reading_time)
) ENGINE=InnoDB;

-- Basic time-series aggregation with MySQL limitations
SELECT 
  device_id,
  sensor_type,
  DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00') as hour_bucket,
  COUNT(*) as reading_count,
  AVG(sensor_value) as avg_value,
  MIN(sensor_value) as min_value,
  MAX(sensor_value) as max_value,
  STDDEV(sensor_value) as std_deviation
FROM mysql_sensor_data
WHERE reading_time >= DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY device_id, sensor_type, DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00')
ORDER BY device_id, sensor_type, hour_bucket;

-- MySQL limitations:
-- - Limited JSON support for sensor metadata and flexible schemas
-- - Basic time functions without sophisticated temporal operations
-- - Poor performance with large time-series datasets
-- - No native time-series optimizations or automatic bucketing
-- - Limited aggregation and windowing functions
-- - Simple partitioning options for time-based data
-- - Minimal support for real-time analytics patterns

MongoDB Time-Series Collections provide optimized temporal data management:

// MongoDB Time-Series Collections - optimized for high-performance temporal data
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('iot_platform');

// Advanced time-series data management and analytics platform
class TimeSeriesDataManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.compressionConfig = {
      blockSize: 4096,
      compressionLevel: 9,
      bucketing: 'automatic'
    };
    this.indexingStrategy = {
      timeField: 'timestamp',
      metaField: 'metadata',
      granularity: 'minutes'
    };
  }

  async initializeTimeSeriesCollections() {
    console.log('Initializing optimized time-series collections...');

    // Create time-series collection for sensor data with optimal configuration
    try {
      await this.db.createCollection('sensor_readings', {
        timeseries: {
          timeField: 'timestamp',
          metaField: 'metadata',  // Groups related time-series together
          granularity: 'minutes'  // Optimize for minute-level bucketing
        },
        storageEngine: {
          wiredTiger: {
            configString: 'block_compressor=zstd'  // High compression for time-series data
          }
        }
      });

      console.log('Created time-series collection: sensor_readings');
      this.collections.set('sensor_readings', this.db.collection('sensor_readings'));

    } catch (error) {
      if (error.code !== 48) { // Collection already exists
        throw error;
      }
      console.log('Time-series collection sensor_readings already exists');
      this.collections.set('sensor_readings', this.db.collection('sensor_readings'));
    }

    // Create additional optimized time-series collections for different data types
    const timeSeriesCollections = [
      {
        name: 'device_metrics',
        granularity: 'seconds',  // High-frequency system metrics
        metaField: 'device'
      },
      {
        name: 'environmental_data',
        granularity: 'minutes',  // Environmental sensor data
        metaField: 'location'
      },
      {
        name: 'application_logs',
        granularity: 'seconds',  // Application performance logs
        metaField: 'application'
      },
      {
        name: 'financial_ticks',
        granularity: 'seconds',  // Financial market data
        metaField: 'symbol'
      }
    ];

    for (const config of timeSeriesCollections) {
      try {
        await this.db.createCollection(config.name, {
          timeseries: {
            timeField: 'timestamp',
            metaField: config.metaField,
            granularity: config.granularity
          },
          storageEngine: {
            wiredTiger: {
              configString: 'block_compressor=zstd'
            }
          }
        });

        this.collections.set(config.name, this.db.collection(config.name));
        console.log(`Created time-series collection: ${config.name}`);

      } catch (error) {
        if (error.code !== 48) {
          throw error;
        }
        this.collections.set(config.name, this.db.collection(config.name));
      }
    }

    // Create optimal indexes for time-series queries
    await this.createTimeSeriesIndexes();

    return Array.from(this.collections.keys());
  }

  async createTimeSeriesIndexes() {
    console.log('Creating optimized time-series indexes...');

    const sensorReadings = this.collections.get('sensor_readings');

    // Compound indexes optimized for common time-series query patterns
    const indexSpecs = [
      // Primary access pattern: device + time range
      { 'metadata.deviceId': 1, 'timestamp': 1 },

      // Sensor type + time pattern
      { 'metadata.sensorType': 1, 'timestamp': 1 },

      // Location-based queries with time
      { 'metadata.location': '2dsphere', 'timestamp': 1 },

      // Quality-based filtering with time
      { 'metadata.qualityScore': 1, 'timestamp': 1 },

      // Multi-device aggregation patterns
      { 'metadata.deviceGroup': 1, 'metadata.sensorType': 1, 'timestamp': 1 },

      // Real-time queries (recent data first)
      { 'timestamp': -1 },

      // Data source tracking
      { 'metadata.source': 1, 'timestamp': 1 }
    ];

    for (const indexSpec of indexSpecs) {
      try {
        await sensorReadings.createIndex(indexSpec, {
          background: true,
          partialFilterExpression: { 
            'metadata.qualityScore': { $gt: 0 } // Only index quality data
          }
        });
      } catch (error) {
        console.warn(`Index creation warning for ${JSON.stringify(indexSpec)}:`, error.message);
      }
    }

    console.log('Time-series indexes created successfully');
  }

  async ingestHighFrequencyData(sensorData) {
    console.log(`Ingesting ${sensorData.length} high-frequency sensor readings...`);

    const sensorReadings = this.collections.get('sensor_readings');
    const batchSize = 1000;
    const batches = [];

    // Prepare data with time-series optimized structure
    const optimizedData = sensorData.map(reading => ({
      timestamp: new Date(reading.timestamp),
      value: reading.value,

      // Metadata field for grouping and filtering
      metadata: {
        deviceId: reading.deviceId,
        sensorType: reading.sensorType,
        deviceGroup: reading.deviceGroup || 'default',
        location: {
          type: 'Point',
          coordinates: [reading.longitude, reading.latitude]
        },
        unit: reading.unit,
        qualityScore: reading.qualityScore || 100,
        source: reading.source || 'unknown',
        firmware: reading.firmware,
        calibrationDate: reading.calibrationDate,

        // Additional contextual metadata
        environment: {
          temperature: reading.ambientTemperature,
          humidity: reading.ambientHumidity,
          pressure: reading.ambientPressure
        },

        // Operational metadata
        batteryLevel: reading.batteryLevel,
        signalStrength: reading.signalStrength,
        networkLatency: reading.networkLatency
      },

      // Optional: Additional measurement fields for multi-sensor devices
      ...(reading.additionalMeasurements && {
        measurements: reading.additionalMeasurements
      })
    }));

    // Split into batches for optimal insertion performance
    for (let i = 0; i < optimizedData.length; i += batchSize) {
      batches.push(optimizedData.slice(i, i + batchSize));
    }

    // Insert batches with optimal write concern for time-series data
    let totalInserted = 0;
    const insertionStart = Date.now();

    for (const batch of batches) {
      try {
        const result = await sensorReadings.insertMany(batch, {
          ordered: false,  // Allow partial success for high-throughput ingestion
          writeConcern: { w: 1, j: false }  // Optimize for speed over durability for sensor data
        });

        totalInserted += result.insertedCount;

      } catch (error) {
        console.error('Batch insertion error:', error.message);

        // Handle partial batch failures gracefully
        if (error.result && error.result.insertedCount) {
          totalInserted += error.result.insertedCount;
          console.log(`Partial batch success: ${error.result.insertedCount} documents inserted`);
        }
      }
    }

    const insertionTime = Date.now() - insertionStart;
    const throughput = Math.round(totalInserted / (insertionTime / 1000));

    console.log(`High-frequency ingestion completed: ${totalInserted} documents in ${insertionTime}ms (${throughput} docs/sec)`);

    return {
      totalInserted,
      insertionTime,
      throughput,
      batchCount: batches.length
    };
  }

  async performTimeSeriesAnalytics(deviceId, timeRange, analysisType = 'comprehensive') {
    console.log(`Performing ${analysisType} time-series analytics for device: ${deviceId}`);

    const sensorReadings = this.collections.get('sensor_readings');
    const startTime = new Date(Date.now() - timeRange.hours * 60 * 60 * 1000);
    const endTime = new Date();

    // Comprehensive time-series aggregation pipeline
    const pipeline = [
      // Stage 1: Time range filtering with index utilization
      {
        $match: {
          'metadata.deviceId': deviceId,
          timestamp: {
            $gte: startTime,
            $lte: endTime
          },
          'metadata.qualityScore': { $gt: 50 }  // Filter low-quality readings
        }
      },

      // Stage 2: Add time-based bucketing fields
      {
        $addFields: {
          hourBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'hour'
            }
          },
          minuteBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute'
            }
          },
          dayOfWeek: { $dayOfWeek: '$timestamp' },
          hourOfDay: { $hour: '$timestamp' },

          // Minutes elapsed since the start of the analysis window (basis for trend regression)
          timeIndex: {
            $divide: [
              { $subtract: ['$timestamp', startTime] },
              1000 * 60  // Convert to minutes
            ]
          }
        }
      },

      // Stage 3: Group by time buckets and sensor type for detailed analytics
      {
        $group: {
          _id: {
            sensorType: '$metadata.sensorType',
            hourBucket: '$hourBucket',
            deviceId: '$metadata.deviceId'
          },

          // Basic statistical measures
          readingCount: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          stdDev: { $stdDevPop: '$value' },

          // Percentile calculations for distribution analysis
          valueArray: { $push: '$value' },

          // Quality metrics
          avgQualityScore: { $avg: '$metadata.qualityScore' },
          highQualityCount: {
            $sum: {
              $cond: [{ $gt: ['$metadata.qualityScore', 90] }, 1, 0]
            }
          },

          // Operational metrics
          avgBatteryLevel: { $avg: '$metadata.batteryLevel' },
          avgSignalStrength: { $avg: '$metadata.signalStrength' },
          avgNetworkLatency: { $avg: '$metadata.networkLatency' },

          // Environmental context
          avgAmbientTemp: { $avg: '$metadata.environment.temperature' },
          avgAmbientHumidity: { $avg: '$metadata.environment.humidity' },
          avgAmbientPressure: { $avg: '$metadata.environment.pressure' },

          // Time distribution analysis
          firstReading: { $min: '$timestamp' },
          lastReading: { $max: '$timestamp' },
          timeSpread: { $stdDevPop: '$timeIndex' },

          // Data completeness tracking
          uniqueMinutes: { $addToSet: '$minuteBucket' },

          // Trend analysis preparation
          timeValuePairs: {
            $push: {
              time: '$timeIndex',
              value: '$value'
            }
          }
        }
      },

      // Stage 4: Calculate advanced analytics and derived metrics
      {
        $addFields: {
          // Statistical analysis
          valueRange: { $subtract: ['$maxValue', '$minValue'] },
          coefficientOfVariation: {
            $cond: {
              if: { $gt: ['$avgValue', 0] },
              then: { $divide: ['$stdDev', '$avgValue'] },
              else: 0
            }
          },

          // Percentile calculations (sort the collected values so the positional index
          // actually returns the requested percentile; $sortArray requires MongoDB 5.2+)
          median: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.5] } }
            ]
          },
          p95: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.95] } }
            ]
          },
          p99: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.99] } }
            ]
          },

          // Data quality assessment
          qualityRatio: {
            $divide: ['$highQualityCount', '$readingCount']
          },

          // Data completeness calculation
          dataCompleteness: {
            $divide: [
              { $size: '$uniqueMinutes' },
              {
                $divide: [
                  { $subtract: ['$lastReading', '$firstReading'] },
                  60000  // Minutes in milliseconds
                ]
              }
            ]
          },

          // Operational health scoring
          operationalScore: {
            $multiply: [
              { $ifNull: ['$avgBatteryLevel', 100] },
              { $divide: [{ $ifNull: ['$avgSignalStrength', 100] }, 100] },
              {
                $cond: {
                  if: { $gt: [{ $ifNull: ['$avgNetworkLatency', 0] }, 0] },
                  then: { $divide: [1000, { $add: ['$avgNetworkLatency', 1000] }] },
                  else: 1
                }
              }
            ]
          },

          // Trend analysis using linear regression
          trendSlope: {
            $let: {
              vars: {
                n: { $size: '$timeValuePairs' },
                sumX: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', '$$this.time'] }
                  }
                },
                sumY: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', '$$this.value'] }
                  }
                },
                sumXY: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', { $multiply: ['$$this.time', '$$this.value'] }] }
                  }
                },
                sumX2: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', { $multiply: ['$$this.time', '$$this.time'] }] }
                  }
                }
              },
              in: {
                $cond: {
                  if: {
                    $gt: [
                      { $subtract: [{ $multiply: ['$$n', '$$sumX2'] }, { $multiply: ['$$sumX', '$$sumX'] }] },
                      0
                    ]
                  },
                  then: {
                    $divide: [
                      { $subtract: [{ $multiply: ['$$n', '$$sumXY'] }, { $multiply: ['$$sumX', '$$sumY'] }] },
                      { $subtract: [{ $multiply: ['$$n', '$$sumX2'] }, { $multiply: ['$$sumX', '$$sumX'] }] }
                    ]
                  },
                  else: 0
                }
              }
            }
          }
        }
      },

      // Stage 5: Anomaly detection and alerting
      {
        $addFields: {
          // Anomaly flags based on statistical analysis
          hasHighVariance: { $gt: ['$coefficientOfVariation', 0.5] },
          hasDataGaps: { $lt: ['$dataCompleteness', 0.85] },
          hasLowQuality: { $lt: ['$qualityRatio', 0.9] },
          hasOperationalIssues: { $lt: ['$operationalScore', 50] },
          hasSignificantTrend: { $gt: [{ $abs: '$trendSlope' }, 0.1] },

          // Performance classification
          performanceCategory: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.95] },
                      { $gt: ['$dataCompleteness', 0.95] },
                      { $gt: ['$operationalScore', 80] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.90] },
                      { $gt: ['$dataCompleteness', 0.90] },
                      { $gt: ['$operationalScore', 60] }
                    ]
                  },
                  then: 'good'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.75] },
                      { $gt: ['$dataCompleteness', 0.75] }
                    ]
                  },
                  then: 'acceptable'
                }
              ],
              default: 'poor'
            }
          },

          // Alert priority calculation
          alertPriority: {
            $cond: {
              if: {
                $or: [
                  { $lt: ['$operationalScore', 25] },
                  { $lt: ['$dataCompleteness', 0.5] },
                  { $gt: [{ $abs: '$trendSlope' }, 1.0] }
                ]
              },
              then: 'critical',
              else: {
                $cond: {
                  if: {
                    $or: [
                      { $lt: ['$operationalScore', 50] },
                      { $lt: ['$qualityRatio', 0.8] },
                      { $gt: ['$coefficientOfVariation', 0.8] }
                    ]
                  },
                  then: 'warning',
                  else: 'normal'
                }
              }
            }
          }
        }
      },

      // Stage 6: Final projection with comprehensive metrics
      {
        $project: {
          _id: 1,
          deviceId: '$_id.deviceId',
          sensorType: '$_id.sensorType',
          hourBucket: '$_id.hourBucket',

          // Core statistics
          readingCount: 1,
          avgValue: { $round: ['$avgValue', 3] },
          minValue: { $round: ['$minValue', 3] },
          maxValue: { $round: ['$maxValue', 3] },
          stdDev: { $round: ['$stdDev', 3] },
          valueRange: { $round: ['$valueRange', 3] },
          coefficientOfVariation: { $round: ['$coefficientOfVariation', 3] },

          // Distribution metrics
          median: { $round: ['$median', 3] },
          p95: { $round: ['$p95', 3] },
          p99: { $round: ['$p99', 3] },

          // Quality and completeness
          qualityRatio: { $round: ['$qualityRatio', 3] },
          dataCompleteness: { $round: ['$dataCompleteness', 3] },

          // Operational metrics
          operationalScore: { $round: ['$operationalScore', 1] },
          avgBatteryLevel: { $round: ['$avgBatteryLevel', 1] },
          avgSignalStrength: { $round: ['$avgSignalStrength', 1] },
          avgNetworkLatency: { $round: ['$avgNetworkLatency', 1] },

          // Environmental context
          avgAmbientTemp: { $round: ['$avgAmbientTemp', 2] },
          avgAmbientHumidity: { $round: ['$avgAmbientHumidity', 2] },
          avgAmbientPressure: { $round: ['$avgAmbientPressure', 2] },

          // Trend analysis
          trendSlope: { $round: ['$trendSlope', 6] },
          timeSpread: { $round: ['$timeSpread', 2] },

          // Time range
          firstReading: 1,
          lastReading: 1,
          analysisHours: {
            $round: [
              { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 3600000] },
              2
            ]
          },

          // Classification and alerts
          performanceCategory: 1,
          alertPriority: 1,

          // Anomaly flags
          anomalies: {
            highVariance: '$hasHighVariance',
            dataGaps: '$hasDataGaps',
            lowQuality: '$hasLowQuality',
            operationalIssues: '$hasOperationalIssues',
            significantTrend: '$hasSignificantTrend'
          }
        }
      },

      // Stage 7: Sort by time bucket for temporal analysis
      {
        $sort: {
          sensorType: 1,
          hourBucket: 1
        }
      }
    ];

    // Execute comprehensive time-series analytics
    const analyticsStart = Date.now();
    const results = await sensorReadings.aggregate(pipeline, {
      allowDiskUse: true,
      hint: { 'metadata.deviceId': 1, 'timestamp': 1 }
    }).toArray();

    const analyticsTime = Date.now() - analyticsStart;

    console.log(`Time-series analytics completed in ${analyticsTime}ms for ${results.length} time buckets`);

    // Generate summary insights
    const insights = this.generateAnalyticsInsights(results, timeRange);

    return {
      deviceId: deviceId,
      analysisType: analysisType,
      timeRange: {
        start: startTime,
        end: endTime,
        hours: timeRange.hours
      },
      executionTime: analyticsTime,
      bucketCount: results.length,
      hourlyData: results,
      insights: insights
    };
  }

  generateAnalyticsInsights(analyticsResults, timeRange) {
    const insights = {
      summary: {},
      trends: {},
      quality: {},
      alerts: [],
      recommendations: []
    };

    if (analyticsResults.length === 0) {
      insights.alerts.push({
        type: 'no_data',
        severity: 'critical',
        message: 'No sensor data found for the specified time range and quality criteria'
      });
      return insights;
    }

    // Summary statistics
    const totalReadings = analyticsResults.reduce((sum, r) => sum + r.readingCount, 0);
    const avgQuality = analyticsResults.reduce((sum, r) => sum + r.qualityRatio, 0) / analyticsResults.length;
    const avgCompleteness = analyticsResults.reduce((sum, r) => sum + r.dataCompleteness, 0) / analyticsResults.length;
    const avgOperationalScore = analyticsResults.reduce((sum, r) => sum + r.operationalScore, 0) / analyticsResults.length;

    insights.summary = {
      totalReadings: totalReadings,
      avgReadingsPerHour: Math.round(totalReadings / timeRange.hours),
      avgQualityRatio: Math.round(avgQuality * 100) / 100,
      avgDataCompleteness: Math.round(avgCompleteness * 100) / 100,
      avgOperationalScore: Math.round(avgOperationalScore * 100) / 100,
      sensorTypes: [...new Set(analyticsResults.map(r => r.sensorType))],
      performanceDistribution: {
        excellent: analyticsResults.filter(r => r.performanceCategory === 'excellent').length,
        good: analyticsResults.filter(r => r.performanceCategory === 'good').length,
        acceptable: analyticsResults.filter(r => r.performanceCategory === 'acceptable').length,
        poor: analyticsResults.filter(r => r.performanceCategory === 'poor').length
      }
    };

    // Trend analysis
    const trendingUp = analyticsResults.filter(r => r.trendSlope > 0.05).length;
    const trendingDown = analyticsResults.filter(r => r.trendSlope < -0.05).length;
    const stable = analyticsResults.length - trendingUp - trendingDown;

    insights.trends = {
      trendingUp: trendingUp,
      trendingDown: trendingDown,
      stable: stable,
      strongestUpTrend: Math.max(...analyticsResults.map(r => r.trendSlope)),
      strongestDownTrend: Math.min(...analyticsResults.map(r => r.trendSlope)),
      mostVolatile: Math.max(...analyticsResults.map(r => r.coefficientOfVariation))
    };

    // Quality analysis
    const lowQualityBuckets = analyticsResults.filter(r => r.qualityRatio < 0.8);
    const dataGapBuckets = analyticsResults.filter(r => r.dataCompleteness < 0.8);

    insights.quality = {
      lowQualityBuckets: lowQualityBuckets.length,
      dataGapBuckets: dataGapBuckets.length,
      worstQuality: Math.min(...analyticsResults.map(r => r.qualityRatio)),
      bestQuality: Math.max(...analyticsResults.map(r => r.qualityRatio)),
      worstCompleteness: Math.min(...analyticsResults.map(r => r.dataCompleteness)),
      bestCompleteness: Math.max(...analyticsResults.map(r => r.dataCompleteness))
    };

    // Generate alerts based on analysis
    const criticalAlerts = analyticsResults.filter(r => r.alertPriority === 'critical');
    const warningAlerts = analyticsResults.filter(r => r.alertPriority === 'warning');

    criticalAlerts.forEach(result => {
      insights.alerts.push({
        type: 'critical_performance',
        severity: 'critical',
        sensorType: result.sensorType,
        hourBucket: result.hourBucket,
        message: `Critical performance issues detected: ${result.performanceCategory} performance with operational score ${result.operationalScore}`
      });
    });

    warningAlerts.forEach(result => {
      insights.alerts.push({
        type: 'performance_warning',
        severity: 'warning',
        sensorType: result.sensorType,
        hourBucket: result.hourBucket,
        message: `Performance warning: ${result.performanceCategory} performance with quality ratio ${result.qualityRatio}`
      });
    });

    // Generate recommendations
    if (avgQuality < 0.9) {
      insights.recommendations.push('Consider sensor calibration or replacement due to low quality scores');
    }

    if (avgCompleteness < 0.85) {
      insights.recommendations.push('Investigate data transmission issues causing data gaps');
    }

    if (avgOperationalScore < 60) {
      insights.recommendations.push('Review device operational status - low battery or connectivity issues detected');
    }

    if (insights.trends.trendingDown > insights.trends.trendingUp * 2) {
      insights.recommendations.push('Multiple sensors showing downward trends - investigate environmental factors');
    }

    return insights;
  }

  async performRealTimeAggregation(collectionName, windowSize = '5m') {
    console.log(`Performing real-time aggregation with ${windowSize} window...`);

    const collection = this.collections.get(collectionName);
    const windowMs = this.parseTimeWindow(windowSize);
    const currentTime = new Date();
    const windowStart = new Date(currentTime.getTime() - windowMs);

    const pipeline = [
      // Match recent data within the time window
      {
        $match: {
          timestamp: { $gte: windowStart, $lte: currentTime }
        }
      },

      // Add time bucketing for sub-window analysis
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute'
            }
          }
        }
      },

      // Group by metadata and time bucket
      {
        $group: {
          _id: {
            metaKey: '$metadata',
            timeBucket: '$timeBucket'
          },
          count: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          latestReading: { $max: '$timestamp' },
          values: { $push: '$value' }
        }
      },

      // Calculate real-time statistics
      {
        $addFields: {
          stdDev: { $stdDevPop: '$values' },
          variance: { $pow: [{ $stdDevPop: '$values' }, 2] },
          range: { $subtract: ['$maxValue', '$minValue'] },

          // Real-time anomaly detection
          isAnomalous: {
            $let: {
              vars: {
                mean: '$avgValue',
                std: { $stdDevPop: '$values' }
              },
              in: {
                $gt: [
                  {
                    $size: {
                      $filter: {
                        input: '$values',
                        cond: {
                          $gt: [
                            { $abs: { $subtract: ['$$this', '$$mean'] } },
                            { $multiply: ['$$std', 2] }
                          ]
                        }
                      }
                    }
                  },
                  { $multiply: [{ $size: '$values' }, 0.05] }  // More than 5% outliers
                ]
              }
            }
          }
        }
      },

      // Sort by latest readings first
      {
        $sort: { 'latestReading': -1 }
      },

      // Limit to prevent overwhelming results
      {
        $limit: 100
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      windowSize: windowSize,
      windowStart: windowStart,
      windowEnd: currentTime,
      aggregations: results,
      totalBuckets: results.length
    };
  }

  parseTimeWindow(windowString) {
    const match = windowString.match(/^(\d+)([smhd])$/);
    if (!match) return 5 * 60 * 1000; // Default 5 minutes

    const value = parseInt(match[1]);
    const unit = match[2];

    const multipliers = {
      's': 1000,
      'm': 60 * 1000,
      'h': 60 * 60 * 1000,
      'd': 24 * 60 * 60 * 1000
    };

    return value * multipliers[unit];
  }

  async optimizeTimeSeriesPerformance() {
    console.log('Optimizing time-series collection performance...');

    const optimizations = [];

    for (const [collectionName, collection] of this.collections) {
      console.log(`Optimizing collection: ${collectionName}`);

      // Get collection statistics
      const stats = await this.db.command({ collStats: collectionName });

      // Check for optimal bucketing configuration
      if (stats.timeseries) {
        const bucketInfo = {
          granularity: stats.timeseries.granularity,
          bucketCount: stats.timeseries.numBuckets,
          avgBucketSize: stats.size / (stats.timeseries.numBuckets || 1),
          compressionRatio: stats.timeseries.compressionRatio || 'N/A'
        };

        optimizations.push({
          collection: collectionName,
          type: 'bucketing_analysis',
          current: bucketInfo,
          recommendations: this.generateBucketingRecommendations(bucketInfo)
        });
      }

      // Analyze index usage
      const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();
      const indexRecommendations = this.analyzeIndexUsage(indexStats);

      optimizations.push({
        collection: collectionName,
        type: 'index_analysis',
        indexes: indexStats,
        recommendations: indexRecommendations
      });

      // Check for data retention optimization opportunities
      const oldestDocument = await collection.findOne({}, { sort: { timestamp: 1 } });
      const newestDocument = await collection.findOne({}, { sort: { timestamp: -1 } });

      if (oldestDocument && newestDocument) {
        const dataSpan = newestDocument.timestamp - oldestDocument.timestamp;
        const dataSpanDays = dataSpan / (1000 * 60 * 60 * 24);

        optimizations.push({
          collection: collectionName,
          type: 'retention_analysis',
          dataSpanDays: Math.round(dataSpanDays),
          oldestDocument: oldestDocument.timestamp,
          newestDocument: newestDocument.timestamp,
          recommendations: dataSpanDays > 365 ? 
            ['Consider implementing data archival strategy for data older than 1 year'] : []
        });
      }
    }

    return optimizations;
  }

  generateBucketingRecommendations(bucketInfo) {
    const recommendations = [];

    if (bucketInfo.avgBucketSize > 10 * 1024 * 1024) { // 10MB
      recommendations.push('Consider reducing granularity - buckets are very large');
    }

    if (bucketInfo.avgBucketSize < 64 * 1024) { // 64KB
      recommendations.push('Consider increasing granularity - buckets are too small for optimal compression');
    }

    if (bucketInfo.bucketCount > 1000000) {
      recommendations.push('High bucket count may impact query performance - review time-series collection design');
    }

    return recommendations;
  }

  analyzeIndexUsage(indexStats) {
    const recommendations = [];
    const lowUsageThreshold = 100;

    indexStats.forEach(stat => {
      if (stat.accesses && stat.accesses.ops < lowUsageThreshold) {
        recommendations.push(`Consider dropping low-usage index: ${stat.name} (${stat.accesses.ops} operations)`);
      }
    });

    return recommendations;
  }
}

// Benefits of MongoDB Time-Series Collections:
// - Automatic data bucketing and compression optimized for temporal data patterns
// - Built-in indexing strategies designed for time-range and metadata queries
// - Up to 90% storage space reduction compared to regular collections
// - Optimized aggregation pipelines with time-aware query planning
// - Native support for high-frequency data ingestion with minimal overhead
// - Automatic handling of out-of-order insertions common in IoT scenarios
// - Integration with MongoDB's change streams for real-time analytics
// - Support for complex metadata structures while maintaining query performance
// - Time-aware sharding strategies for horizontal scaling
// - Native compatibility with BI and analytics tools through standard MongoDB interfaces

module.exports = {
  TimeSeriesDataManager
};
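
To tie the pieces together, the following usage sketch drives the manager defined above end to end. The connection string, file path, device identifiers, and synthetic readings are assumptions for illustration only.

// Hypothetical usage sketch for TimeSeriesDataManager (defined above).
const { MongoClient } = require('mongodb');
const { TimeSeriesDataManager } = require('./timeSeriesDataManager');  // assumed file name

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const manager = new TimeSeriesDataManager(client.db('iot_platform'));
  await manager.initializeTimeSeriesCollections();

  // Ingest a small synthetic batch shaped like the fields ingestHighFrequencyData() expects
  const now = Date.now();
  const readings = Array.from({ length: 500 }, (_, i) => ({
    timestamp: new Date(now - i * 1000),
    value: 20 + Math.random() * 5,
    deviceId: 'device_001',
    sensorType: 'temperature',
    deviceGroup: 'warehouse_a',
    longitude: -74.0060,
    latitude: 40.7128,
    unit: 'celsius',
    qualityScore: 95,
    source: 'simulated'
  }));
  const ingestStats = await manager.ingestHighFrequencyData(readings);
  console.log('Ingestion throughput (docs/sec):', ingestStats.throughput);

  // Run the hourly analytics pipeline over the last 24 hours for one device
  const analytics = await manager.performTimeSeriesAnalytics('device_001', { hours: 24 });
  console.log('Buckets analyzed:', analytics.bucketCount);
  console.log('Alerts raised:', analytics.insights.alerts.length);

  await client.close();
}

main().catch(console.error);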

Understanding MongoDB Time-Series Collection Architecture

Advanced Time-Series Optimization Strategies

Implement sophisticated time-series patterns for maximum performance and storage efficiency:

// Advanced time-series optimization and real-time analytics patterns
class TimeSeriesOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.compressionStrategies = {
      zstd: { level: 9, ratio: 0.85 },
      snappy: { level: 1, ratio: 0.75 },
      lz4: { level: 1, ratio: 0.70 }
    };
  }

  async optimizeIngestionPipeline(deviceTypes) {
    console.log('Optimizing time-series ingestion pipeline for device types:', deviceTypes);

    const optimizations = {};

    for (const deviceType of deviceTypes) {
      // Analyze ingestion patterns for each device type
      const ingestionAnalysis = await this.analyzeIngestionPatterns(deviceType);

      // Determine optimal collection configuration
      const optimalConfig = this.calculateOptimalConfiguration(ingestionAnalysis);

      // Create optimized collection if needed
      const collectionName = `ts_${deviceType.toLowerCase().replace(/[^a-z0-9]/g, '_')}`;

      try {
        await this.db.createCollection(collectionName, {
          timeseries: {
            timeField: 'timestamp',
            metaField: 'device',
            granularity: optimalConfig.granularity
          },
          storageEngine: {
            wiredTiger: {
              configString: `block_compressor=${optimalConfig.compression}`
            }
          }
        });

        // Create optimal indexes for the device type
        await this.createOptimalIndexes(collectionName, ingestionAnalysis.queryPatterns);

        optimizations[deviceType] = {
          collection: collectionName,
          configuration: optimalConfig,
          expectedPerformance: {
            ingestionRate: optimalConfig.estimatedIngestionRate,
            compressionRatio: optimalConfig.estimatedCompressionRatio,
            queryPerformance: optimalConfig.estimatedQueryPerformance
          }
        };

      } catch (error) {
        console.warn(`Collection ${collectionName} already exists or creation failed:`, error.message);
      }
    }

    return optimizations;
  }

  async analyzeIngestionPatterns(deviceType) {
    // Simulate analysis of historical ingestion patterns
    const patterns = {
      temperature: {
        avgFrequency: 60, // seconds
        avgBatchSize: 1,
        dataVariability: 0.2,
        queryPatterns: ['recent_values', 'hourly_aggregates', 'anomaly_detection']
      },
      pressure: {
        avgFrequency: 30,
        avgBatchSize: 1,
        dataVariability: 0.1,
        queryPatterns: ['trend_analysis', 'threshold_monitoring']
      },
      vibration: {
        avgFrequency: 1, // High frequency
        avgBatchSize: 100,
        dataVariability: 0.8,
        queryPatterns: ['fft_analysis', 'peak_detection', 'real_time_monitoring']
      },
      gps: {
        avgFrequency: 10,
        avgBatchSize: 1,
        dataVariability: 0.5,
        queryPatterns: ['geospatial_queries', 'route_analysis', 'location_history']
      }
    };

    return patterns[deviceType] || patterns.temperature;
  }

  calculateOptimalConfiguration(ingestionAnalysis) {
    const { avgFrequency, avgBatchSize, dataVariability, queryPatterns } = ingestionAnalysis;

    // Determine optimal granularity based on frequency
    let granularity;
    if (avgFrequency <= 1) {
      granularity = 'seconds';
    } else if (avgFrequency <= 60) {
      granularity = 'minutes';
    } else {
      granularity = 'hours';
    }

    // Choose compression strategy based on data characteristics
    let compression;
    if (dataVariability < 0.3) {
      compression = 'zstd'; // High compression for low variability data
    } else if (dataVariability < 0.6) {
      compression = 'snappy'; // Balanced compression/speed
    } else {
      compression = 'lz4'; // Fast compression for high variability
    }

    // Estimate performance characteristics
    const estimatedIngestionRate = Math.floor((3600 / avgFrequency) * avgBatchSize);
    const compressionStrategy = this.compressionStrategies[compression];

    return {
      granularity,
      compression,
      estimatedIngestionRate,
      estimatedCompressionRatio: compressionStrategy.ratio,
      estimatedQueryPerformance: this.estimateQueryPerformance(queryPatterns, granularity),
      recommendedIndexes: this.recommendIndexes(queryPatterns)
    };
  }

  estimateQueryPerformance(queryPatterns, granularity) {
    const performanceScores = {
      recent_values: granularity === 'seconds' ? 95 : granularity === 'minutes' ? 90 : 80,
      hourly_aggregates: granularity === 'minutes' ? 95 : granularity === 'hours' ? 100 : 85,
      trend_analysis: granularity === 'minutes' ? 90 : granularity === 'hours' ? 95 : 75,
      anomaly_detection: granularity === 'seconds' ? 85 : granularity === 'minutes' ? 95 : 70,
      geospatial_queries: 85,
      real_time_monitoring: granularity === 'seconds' ? 100 : granularity === 'minutes' ? 80 : 60
    };

    const avgScore = queryPatterns.reduce((sum, pattern) => 
      sum + (performanceScores[pattern] || 75), 0) / queryPatterns.length;

    return Math.round(avgScore);
  }

  recommendIndexes(queryPatterns) {
    const indexRecommendations = {
      recent_values: [{ timestamp: -1 }],
      hourly_aggregates: [{ 'device.deviceId': 1, timestamp: 1 }],
      trend_analysis: [{ 'device.sensorType': 1, timestamp: 1 }],
      anomaly_detection: [{ 'device.deviceId': 1, 'device.sensorType': 1, timestamp: 1 }],
      geospatial_queries: [{ 'device.location': '2dsphere', timestamp: 1 }],
      real_time_monitoring: [{ timestamp: -1 }, { 'device.alertLevel': 1, timestamp: -1 }]
    };

    const recommendedIndexes = new Set();
    queryPatterns.forEach(pattern => {
      if (indexRecommendations[pattern]) {
        indexRecommendations[pattern].forEach(index => 
          recommendedIndexes.add(JSON.stringify(index))
        );
      }
    });

    return Array.from(recommendedIndexes).map(indexStr => JSON.parse(indexStr));
  }

  async createOptimalIndexes(collectionName, queryPatterns) {
    const collection = this.db.collection(collectionName);
    const recommendedIndexes = this.recommendIndexes(queryPatterns);

    for (const indexSpec of recommendedIndexes) {
      try {
        await collection.createIndex(indexSpec, { background: true });
        console.log(`Created index on ${collectionName}:`, indexSpec);
      } catch (error) {
        console.warn(`Index creation failed for ${collectionName}:`, error.message);
      }
    }
  }

  async implementRealTimeStreamProcessing(collectionName, processingRules) {
    console.log(`Implementing real-time stream processing for ${collectionName}`);

    const collection = this.db.collection(collectionName);

    // Create change stream for real-time processing
    const changeStream = collection.watch([], {
      fullDocument: 'updateLookup'
    });

    const processor = {
      db: this.db, // captured so alert storage and database actions work inside processor methods
      rules: processingRules,
      stats: {
        processed: 0,
        alerts: 0,
        errors: 0,
        startTime: new Date()
      },

      async processChange(change) {
        this.stats.processed++;

        try {
          if (change.operationType === 'insert') {
            const document = change.fullDocument;

            // Apply processing rules
            for (const rule of this.rules) {
              const result = await this.applyRule(rule, document);

              if (result.triggered) {
                await this.handleRuleTriggered(rule, document, result);
                this.stats.alerts++;
              }
            }
          }
        } catch (error) {
          console.error('Stream processing error:', error);
          this.stats.errors++;
        }
      },

      async applyRule(rule, document) {
        switch (rule.type) {
          case 'threshold':
            return {
              triggered: this.evaluateThreshold(document.value, rule.threshold, rule.operator),
              value: document.value,
              threshold: rule.threshold
            };

          case 'anomaly':
            return await this.detectAnomaly(document, rule.parameters);

          case 'trend':
            return await this.detectTrend(document, rule.parameters);

          default:
            return { triggered: false };
        }
      },

      evaluateThreshold(value, threshold, operator) {
        switch (operator) {
          case '>': return value > threshold;
          case '<': return value < threshold;
          case '>=': return value >= threshold;
          case '<=': return value <= threshold;
          case '==': return Math.abs(value - threshold) < 0.001;
          default: return false;
        }
      },

      async detectAnomaly(document, parameters) {
        // Simplified anomaly detection using recent historical data
        const recentData = await collection.find({
          'device.deviceId': document.device.deviceId,
          'device.sensorType': document.device.sensorType,
          timestamp: {
            $gte: new Date(Date.now() - parameters.windowMs),
            $lt: document.timestamp
          }
        }).limit(parameters.sampleSize).toArray();

        if (recentData.length < parameters.minSamples) {
          return { triggered: false, reason: 'insufficient_data' };
        }

        const values = recentData.map(d => d.value);
        const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
        const variance = values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length;
        const stdDev = Math.sqrt(variance);

        // Guard against zero standard deviation (constant recent values)
        const zScore = stdDev > 0 ? Math.abs(document.value - mean) / stdDev : 0;
        const isAnomalous = zScore > parameters.threshold;

        return {
          triggered: isAnomalous,
          zScore: zScore,
          mean: mean,
          stdDev: stdDev,
          value: document.value
        };
      },

      async detectTrend(document, parameters) {
        // Simplified trend detection using linear regression
        const trendData = await collection.find({
          'device.deviceId': document.device.deviceId,
          'device.sensorType': document.device.sensorType,
          timestamp: {
            $gte: new Date(Date.now() - parameters.windowMs)
          }
        }).sort({ timestamp: 1 }).toArray();

        if (trendData.length < parameters.minPoints) {
          return { triggered: false, reason: 'insufficient_data' };
        }

        // Calculate trend slope
        const n = trendData.length;
        const sumX = trendData.reduce((sum, d, i) => sum + i, 0);
        const sumY = trendData.reduce((sum, d) => sum + d.value, 0);
        const sumXY = trendData.reduce((sum, d, i) => sum + i * d.value, 0);
        const sumX2 = trendData.reduce((sum, d, i) => sum + i * i, 0);

        const slope = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
        const isSignificant = Math.abs(slope) > parameters.slopeThreshold;

        return {
          triggered: isSignificant,
          slope: slope,
          direction: slope > 0 ? 'increasing' : 'decreasing',
          dataPoints: n
        };
      },

      async handleRuleTriggered(rule, document, result) {
        console.log(`Rule triggered: ${rule.name}`, {
          device: document.device.deviceId,
          sensor: document.device.sensorType,
          value: document.value,
          timestamp: document.timestamp,
          result: result
        });

        // Store alert
        await this.db.collection('alerts').insertOne({
          ruleName: rule.name,
          ruleType: rule.type,
          deviceId: document.device.deviceId,
          sensorType: document.device.sensorType,
          value: document.value,
          timestamp: document.timestamp,
          triggerResult: result,
          severity: rule.severity || 'medium',
          createdAt: new Date()
        });

        // Execute actions if configured
        if (rule.actions) {
          for (const action of rule.actions) {
            await this.executeAction(action, document, result);
          }
        }
      },

      async executeAction(action, document, result) {
        switch (action.type) {
          case 'webhook':
            // Simulate webhook call
            console.log(`Webhook action: ${action.url}`, { document, result });
            break;

          case 'email':
            console.log(`Email action: ${action.recipient}`, { document, result });
            break;

          case 'database':
            await this.db.collection(action.collection).insertOne({
              ...action.document,
              sourceDocument: document,
              triggerResult: result,
              createdAt: new Date()
            });
            break;
        }
      },

      getStats() {
        const runtime = Date.now() - this.stats.startTime.getTime();
        return {
          ...this.stats,
          runtimeMs: runtime,
          processingRate: runtime > 0 ? this.stats.processed / (runtime / 1000) : 0,
          errorRate: this.stats.processed > 0 ? this.stats.errors / this.stats.processed : 0
        };
      }
    };

    // Set up change stream event handlers
    changeStream.on('change', async (change) => {
      await processor.processChange(change);
    });

    changeStream.on('error', (error) => {
      console.error('Change stream error:', error);
      processor.stats.errors++;
    });

    return {
      processor: processor,
      changeStream: changeStream,
      stop: () => changeStream.close()
    };
  }

  async performTimeSeriesBenchmark(collectionName, testConfig) {
    console.log(`Performing time-series benchmark on ${collectionName}`);

    const collection = this.db.collection(collectionName);
    const results = {
      ingestion: {},
      queries: {},
      aggregations: {}
    };

    // Benchmark high-frequency ingestion
    console.log('Benchmarking ingestion performance...');
    const ingestionStart = Date.now();
    const testData = this.generateBenchmarkData(testConfig.documentCount);

    const batchSize = testConfig.batchSize || 1000;
    let totalInserted = 0;

    for (let i = 0; i < testData.length; i += batchSize) {
      const batch = testData.slice(i, i + batchSize);

      try {
        const insertResult = await collection.insertMany(batch, { ordered: false });
        totalInserted += insertResult.insertedCount;
      } catch (error) {
        console.warn('Batch insertion error:', error.message);
        if (error.result && error.result.insertedCount) {
          totalInserted += error.result.insertedCount;
        }
      }
    }

    const ingestionTime = Date.now() - ingestionStart;
    results.ingestion = {
      documentsInserted: totalInserted,
      timeMs: ingestionTime,
      documentsPerSecond: Math.round(totalInserted / (ingestionTime / 1000)),
      avgBatchTime: Math.round(ingestionTime / Math.ceil(testData.length / batchSize))
    };

    // Benchmark time-range queries
    console.log('Benchmarking query performance...');
    const queryTests = [
      {
        name: 'recent_data',
        filter: { timestamp: { $gte: new Date(Date.now() - 3600000) } } // Last hour
      },
      {
        name: 'device_specific',
        filter: { 'device.deviceId': testData[0].device.deviceId }
      },
      {
        name: 'sensor_type_filter',
        filter: { 'device.sensorType': 'temperature' }
      },
      {
        name: 'complex_filter',
        filter: {
          'device.sensorType': 'temperature',
          value: { $gt: 20, $lt: 30 },
          timestamp: { $gte: new Date(Date.now() - 7200000) }
        }
      }
    ];

    results.queries = {};

    for (const queryTest of queryTests) {
      const queryStart = Date.now();
      const queryResults = await collection.find(queryTest.filter).limit(1000).toArray();
      const queryTime = Date.now() - queryStart;

      results.queries[queryTest.name] = {
        timeMs: queryTime,
        documentsReturned: queryResults.length,
        documentsPerSecond: Math.round(queryResults.length / (queryTime / 1000))
      };
    }

    // Benchmark aggregation performance
    console.log('Benchmarking aggregation performance...');
    const aggregationTests = [
      {
        name: 'hourly_averages',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 86400000) } } },
          {
            $group: {
              _id: {
                hour: { $dateToString: { format: '%Y-%m-%d-%H', date: '$timestamp' } },
                deviceId: '$device.deviceId',
                sensorType: '$device.sensorType'
              },
              avgValue: { $avg: '$value' },
              count: { $sum: 1 }
            }
          }
        ]
      },
      {
        name: 'device_statistics',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 86400000) } } },
          {
            $group: {
              _id: '$device.deviceId',
              sensors: { $addToSet: '$device.sensorType' },
              totalReadings: { $sum: 1 },
              avgValue: { $avg: '$value' },
              minValue: { $min: '$value' },
              maxValue: { $max: '$value' }
            }
          }
        ]
      },
      {
        name: 'time_series_bucketing',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 3600000) } } },
          {
            $bucket: {
              groupBy: '$value',
              boundaries: [0, 10, 20, 30, 40, 50, 100],
              default: 'other',
              output: {
                count: { $sum: 1 },
                // $avg ignores BSON Date values, so convert timestamps to epoch milliseconds first
                avgTimestamp: { $avg: { $toLong: '$timestamp' } }
              }
            }
          }
        ]
      }
    ];

    results.aggregations = {};

    for (const aggTest of aggregationTests) {
      const aggStart = Date.now();
      const aggResults = await collection.aggregate(aggTest.pipeline, { allowDiskUse: true }).toArray();
      const aggTime = Date.now() - aggStart;

      results.aggregations[aggTest.name] = {
        timeMs: aggTime,
        resultsReturned: aggResults.length
      };
    }

    return results;
  }

  generateBenchmarkData(count) {
    const deviceIds = Array.from({ length: 10 }, (_, i) => `device_${i.toString().padStart(3, '0')}`);
    const sensorTypes = ['temperature', 'humidity', 'pressure', 'vibration', 'light'];
    const baseTimestamp = Date.now() - (count * 1000); // Spread over time

    return Array.from({ length: count }, (_, i) => ({
      timestamp: new Date(baseTimestamp + i * 1000 + Math.random() * 1000),
      value: Math.random() * 100,
      device: {
        deviceId: deviceIds[Math.floor(Math.random() * deviceIds.length)],
        sensorType: sensorTypes[Math.floor(Math.random() * sensorTypes.length)],
        location: {
          type: 'Point',
          coordinates: [
            -74.0060 + (Math.random() - 0.5) * 0.1,
            40.7128 + (Math.random() - 0.5) * 0.1
          ]
        },
        batteryLevel: Math.random() * 100,
        signalStrength: Math.random() * 100
      }
    }));
  }
}
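
To make the moving parts above concrete, here is a hypothetical usage sketch. The class name TimeSeriesOptimizer, the connection string, database name, and rule values are placeholders rather than part of the original example; the method names, rule fields, and benchmark options follow the code shown above.

// Hypothetical usage sketch - class name, connection details, and rule values are placeholders
const { MongoClient } = require('mongodb');

async function runTimeSeriesWorkflow() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const optimizer = new TimeSeriesOptimizer(client.db('iot_platform'));

  // Attach real-time rules to an optimized collection (e.g. the ts_temperature collection
  // produced by the ingestion optimization step)
  const stream = await optimizer.implementRealTimeStreamProcessing('ts_temperature', [
    { name: 'high_temperature', type: 'threshold', threshold: 75, operator: '>', severity: 'high' },
    { name: 'temperature_anomaly', type: 'anomaly', severity: 'medium',
      parameters: { windowMs: 3600000, sampleSize: 500, minSamples: 30, threshold: 3 } }
  ]);

  // Run a small benchmark and inspect stream-processing statistics
  const benchmark = await optimizer.performTimeSeriesBenchmark('ts_temperature', {
    documentCount: 10000,
    batchSize: 1000
  });
  console.log('Benchmark:', benchmark.ingestion);
  console.log('Stream stats:', stream.processor.getStats());

  await stream.stop();
  await client.close();
}

runTimeSeriesWorkflow().catch(console.error);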

SQL-Style Time-Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB time-series collections and temporal operations:

-- QueryLeaf time-series operations with SQL-familiar syntax

-- Create time-series table with optimal configuration
CREATE TABLE sensor_readings (
  timestamp TIMESTAMP NOT NULL,
  value NUMERIC(15,6) NOT NULL,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  location GEOGRAPHY(POINT),
  quality_score INTEGER,
  metadata JSONB
) WITH (
  time_series = true,
  time_field = 'timestamp',
  meta_field = 'device_metadata',
  granularity = 'minutes',
  compression = 'zstd'
);

-- High-frequency sensor data insertion optimized for time-series
INSERT INTO sensor_readings (
  timestamp, value, device_id, sensor_type, location, quality_score, metadata
)
SELECT 
  NOW() - (generate_series * INTERVAL '1 second') as timestamp,
  RANDOM() * 100 as value,
  'device_' || LPAD((generate_series % 100)::text, 3, '0') as device_id,
  CASE (generate_series % 5)
    WHEN 0 THEN 'temperature'
    WHEN 1 THEN 'humidity'
    WHEN 2 THEN 'pressure'
    WHEN 3 THEN 'vibration'
    ELSE 'light'
  END as sensor_type,
  ST_Point(
    -74.0060 + (RANDOM() - 0.5) * 0.1,
    40.7128 + (RANDOM() - 0.5) * 0.1
  ) as location,
  (RANDOM() * 100)::integer as quality_score,
  JSON_BUILD_OBJECT(
    'firmware_version', '2.1.' || (generate_series % 10)::text,
    'battery_level', (RANDOM() * 100)::integer,
    'signal_strength', (RANDOM() * 100)::integer,
    'calibration_date', NOW() - (RANDOM() * 365 || ' days')::interval
  ) as metadata
FROM generate_series(1, 100000) as generate_series;

-- Time-series analytics with window functions and temporal aggregations
WITH time_buckets AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- MongoDB time-series optimized aggregations
    COUNT(*) as reading_count,
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV(value) as std_deviation,

    -- Percentile functions for distribution analysis
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) as p99,

    -- Quality and device health metrics (battery and signal come from the JSON metadata)
    AVG(quality_score) as avg_quality,
    AVG((metadata->>'battery_level')::numeric) as avg_battery,
    AVG((metadata->>'signal_strength')::numeric) as avg_signal,

    -- Time-series specific calculations
    COUNT(DISTINCT DATE_TRUNC('minute', timestamp)) as minutes_with_data,
    (COUNT(DISTINCT DATE_TRUNC('minute', timestamp)) / 60.0 * 100) as completeness_percent,

    -- Geospatial analytics
    ST_Centroid(ST_Collect(location)) as avg_location,
    ST_ConvexHull(ST_Collect(location)) as reading_area,

    -- Array aggregation for detailed analysis
    ARRAY_AGG(value ORDER BY timestamp) as value_sequence,
    ARRAY_AGG(timestamp ORDER BY timestamp) as timestamp_sequence

  FROM sensor_readings
  WHERE timestamp >= NOW() - INTERVAL '24 hours'
    AND quality_score > 70
  GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

trend_analysis AS (
  SELECT 
    tb.*,

    -- Time-series trend calculation: avg_value regressed on time (slope in value units per hour)
    REGR_SLOPE(
      avg_value,
      EXTRACT(EPOCH FROM hour_bucket) / 3600.0
    ) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as trend_slope,

    -- Moving averages for smoothing
    AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    ) as smoothed_avg,

    -- Volatility analysis
    STDDEV(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as volatility_6h,

    -- Change detection
    LAG(avg_value, 1) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket
    ) as prev_hour_avg,

    LAG(avg_value, 24) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket
    ) as same_hour_yesterday,

    -- Anomaly scoring based on historical patterns
    (avg_value - AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND 1 PRECEDING
    )) / NULLIF(STDDEV(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND 1 PRECEDING
    ), 0) as z_score

  FROM time_buckets tb
),

device_health_analysis AS (
  SELECT 
    ta.device_id,
    ta.sensor_type,
    ta.hour_bucket,
    ta.reading_count,
    ta.avg_value,
    ta.median,
    ta.p95,
    ta.completeness_percent,

    -- Trend classification
    CASE 
      WHEN ta.trend_slope > 0.1 THEN 'increasing'
      WHEN ta.trend_slope < -0.1 THEN 'decreasing'
      ELSE 'stable'
    END as trend_direction,

    -- Change analysis
    ROUND((ta.avg_value - ta.prev_hour_avg)::numeric, 3) as hour_over_hour_change,
    ROUND(((ta.avg_value - ta.prev_hour_avg) / NULLIF(ta.prev_hour_avg, 0) * 100)::numeric, 2) as hour_over_hour_pct,

    ROUND((ta.avg_value - ta.same_hour_yesterday)::numeric, 3) as day_over_day_change,
    ROUND(((ta.avg_value - ta.same_hour_yesterday) / NULLIF(ta.same_hour_yesterday, 0) * 100)::numeric, 2) as day_over_day_pct,

    -- Anomaly detection
    ROUND(ta.z_score::numeric, 3) as anomaly_score,
    CASE 
      WHEN ABS(ta.z_score) > 3 THEN 'critical'
      WHEN ABS(ta.z_score) > 2 THEN 'warning'
      ELSE 'normal'
    END as anomaly_level,

    -- Performance scoring
    CASE 
      WHEN ta.completeness_percent > 95 AND ta.avg_quality > 90 THEN 'excellent'
      WHEN ta.completeness_percent > 85 AND ta.avg_quality > 80 THEN 'good'
      WHEN ta.completeness_percent > 70 AND ta.avg_quality > 70 THEN 'acceptable'
      ELSE 'poor'
    END as data_quality,

    -- Operational health
    ROUND(ta.avg_battery::numeric, 1) as avg_battery_level,
    ROUND(ta.avg_signal::numeric, 1) as avg_signal_strength,

    CASE 
      WHEN ta.avg_battery > 80 AND ta.avg_signal > 80 THEN 'healthy'
      WHEN ta.avg_battery > 50 AND ta.avg_signal > 60 THEN 'degraded'
      ELSE 'critical'
    END as operational_status,

    -- Geographic analysis
    ST_X(ta.avg_location) as avg_longitude,
    ST_Y(ta.avg_location) as avg_latitude,
    ST_Area(ta.reading_area::geography) / 1000000 as coverage_area_km2

  FROM trend_analysis ta
),

alert_generation AS (
  SELECT 
    dha.*,

    -- Generate alerts based on multiple criteria
    CASE 
      WHEN dha.anomaly_level = 'critical' AND dha.operational_status = 'critical' THEN 'CRITICAL'
      WHEN dha.anomaly_level IN ('critical', 'warning') OR dha.operational_status = 'critical' THEN 'HIGH' 
      WHEN dha.data_quality = 'poor' OR dha.operational_status = 'degraded' THEN 'MEDIUM'
      WHEN ABS(dha.day_over_day_pct) > 50 THEN 'MEDIUM'
      ELSE 'LOW'
    END as alert_priority,

    -- Alert message generation
    CONCAT_WS('; ',
      CASE WHEN dha.anomaly_level = 'critical' THEN 'Anomaly detected (z-score: ' || dha.anomaly_score || ')' END,
      CASE WHEN dha.operational_status = 'critical' THEN 'Operational issues (battery: ' || dha.avg_battery_level || '%, signal: ' || dha.avg_signal_strength || '%)' END,
      CASE WHEN dha.data_quality = 'poor' THEN 'Poor data quality (' || dha.completeness_percent || '% completeness)' END,
      CASE WHEN ABS(dha.day_over_day_pct) > 50 THEN 'Significant day-over-day change: ' || dha.day_over_day_pct || '%' END
    ) as alert_message,

    -- Recommended actions
    ARRAY_REMOVE(ARRAY[
      CASE WHEN dha.avg_battery_level < 20 THEN 'Replace battery' END,
      CASE WHEN dha.avg_signal_strength < 30 THEN 'Check network connectivity' END,
      CASE WHEN dha.completeness_percent < 70 THEN 'Investigate data transmission issues' END,
      CASE WHEN ABS(dha.anomaly_score) > 3 THEN 'Verify sensor calibration' END,
      CASE WHEN dha.trend_direction != 'stable' THEN 'Monitor trend continuation' END
    ], NULL) as recommended_actions

  FROM device_health_analysis dha
)

SELECT 
  device_id,
  sensor_type,
  hour_bucket,
  avg_value,
  trend_direction,
  anomaly_level,
  data_quality,
  operational_status,
  alert_priority,
  alert_message,
  recommended_actions,

  -- Additional context for investigation
  JSON_BUILD_OBJECT(
    'statistics', JSON_BUILD_OBJECT(
      'median', median,
      'p95', p95,
      'completeness', completeness_percent
    ),
    'changes', JSON_BUILD_OBJECT(
      'hour_over_hour', hour_over_hour_pct,
      'day_over_day', day_over_day_pct
    ),
    'operational', JSON_BUILD_OBJECT(
      'battery_level', avg_battery_level,
      'signal_strength', avg_signal_strength
    ),
    'location', JSON_BUILD_OBJECT(
      'longitude', avg_longitude,
      'latitude', avg_latitude,
      'coverage_area_km2', coverage_area_km2
    )
  ) as analysis_context

FROM alert_generation
WHERE alert_priority IN ('CRITICAL', 'HIGH', 'MEDIUM')
ORDER BY 
  CASE alert_priority
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END,
  device_id, sensor_type, hour_bucket DESC;

-- Real-time streaming analytics with time windows
WITH real_time_metrics AS (
  SELECT 
    device_id,
    sensor_type,

    -- 5-minute rolling window aggregations
    AVG(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
    ) as avg_5m,

    COUNT(*) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
    ) as count_5m,

    -- 1-hour rolling window for trend detection
    AVG(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as avg_1h,

    STDDEV(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as stddev_1h,

    -- Rate of change detection
    (value - LAG(value, 10) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp
    )) / NULLIF(EXTRACT(EPOCH FROM (timestamp - LAG(timestamp, 10) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp
    ))), 0) as rate_of_change,

    -- Current values for comparison
    timestamp,
    value,
    quality_score,
    (metadata->>'battery_level')::numeric as battery_level

  FROM sensor_readings
  WHERE timestamp >= NOW() - INTERVAL '2 hours'
),

real_time_alerts AS (
  SELECT 
    *,

    -- Real-time anomaly detection
    CASE 
      WHEN ABS(value - avg_1h) > 3 * NULLIF(stddev_1h, 0) THEN 'ANOMALY'
      WHEN ABS(rate_of_change) > 10 THEN 'RAPID_CHANGE'  
      WHEN count_5m < 5 AND EXTRACT(EPOCH FROM (NOW() - timestamp)) < 300 THEN 'DATA_GAP'
      WHEN battery_level < 15 THEN 'LOW_BATTERY'
      WHEN quality_score < 60 THEN 'POOR_QUALITY'
      ELSE 'NORMAL'
    END as real_time_alert,

    -- Severity assessment
    CASE 
      WHEN ABS(value - avg_1h) > 5 * NULLIF(stddev_1h, 0) OR ABS(rate_of_change) > 50 THEN 'CRITICAL'
      WHEN ABS(value - avg_1h) > 3 * NULLIF(stddev_1h, 0) OR ABS(rate_of_change) > 20 THEN 'HIGH'
      WHEN battery_level < 15 OR quality_score < 40 THEN 'MEDIUM'
      ELSE 'LOW'
    END as alert_severity

  FROM real_time_metrics
  WHERE timestamp >= NOW() - INTERVAL '15 minutes'
)

SELECT 
  device_id,
  sensor_type,
  timestamp,
  value,
  real_time_alert,
  alert_severity,

  -- Context for immediate action
  ROUND(avg_5m::numeric, 3) as five_min_avg,
  ROUND(avg_1h::numeric, 3) as one_hour_avg,
  ROUND(rate_of_change::numeric, 3) as change_rate,
  count_5m as readings_last_5min,
  battery_level,
  quality_score,

  -- Time since alert
  EXTRACT(EPOCH FROM (NOW() - timestamp))::integer as seconds_ago

FROM real_time_alerts
WHERE real_time_alert != 'NORMAL' 
  AND alert_severity IN ('CRITICAL', 'HIGH', 'MEDIUM')
ORDER BY alert_severity DESC, timestamp DESC
LIMIT 100;

-- Time-series data retention and archival management
WITH retention_analysis AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('day', timestamp) as day_bucket,
    COUNT(*) as daily_readings,
    MIN(timestamp) as first_reading,
    MAX(timestamp) as last_reading,
    AVG(quality_score) as avg_daily_quality,

    -- Age-based classification
    CASE 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '30 days' THEN 'recent'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '90 days' THEN 'standard'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '365 days' THEN 'historical'
      ELSE 'archive'
    END as data_tier,

    -- Storage cost analysis
    COUNT(*) * 0.001 as estimated_storage_mb,
    EXTRACT(DAYS FROM (CURRENT_DATE - DATE_TRUNC('day', timestamp))) as days_old

  FROM sensor_readings
  GROUP BY device_id, sensor_type, DATE_TRUNC('day', timestamp)
)

SELECT 
  data_tier,
  COUNT(DISTINCT device_id) as unique_devices,
  COUNT(DISTINCT sensor_type) as sensor_types,
  SUM(daily_readings) as total_readings,
  ROUND(SUM(estimated_storage_mb)::numeric, 2) as total_storage_mb,
  ROUND(AVG(avg_daily_quality)::numeric, 1) as avg_quality_score,
  MIN(days_old) as newest_data_days,
  MAX(days_old) as oldest_data_days,

  -- Archival recommendations
  CASE 
    WHEN data_tier = 'archive' THEN 'Move to cold storage or delete low-quality data'
    WHEN data_tier = 'historical' THEN 'Consider compression or aggregation to daily summaries'
    WHEN data_tier = 'standard' THEN 'Maintain current storage with periodic cleanup'
    ELSE 'Keep in high-performance storage'
  END as storage_recommendation

FROM retention_analysis
GROUP BY data_tier
ORDER BY 
  CASE data_tier
    WHEN 'recent' THEN 1
    WHEN 'standard' THEN 2
    WHEN 'historical' THEN 3
    WHEN 'archive' THEN 4
  END;

-- QueryLeaf provides comprehensive time-series capabilities:
-- 1. Optimized time-series collection creation with automatic bucketing
-- 2. High-performance ingestion for streaming sensor and IoT data
-- 3. Advanced temporal aggregations with window functions and trend analysis
-- 4. Real-time anomaly detection and alerting systems
-- 5. Geospatial analytics integration for location-aware time-series data
-- 6. Comprehensive data quality monitoring and operational health tracking
-- 7. Intelligent data retention and archival management strategies
-- 8. SQL-familiar syntax for complex time-series analytics and reporting
-- 9. Integration with MongoDB's native time-series optimizations
-- 10. Familiar SQL patterns for temporal data analysis and visualization

Best Practices for Time-Series Implementation

Collection Design Strategy

Essential principles for optimal MongoDB time-series collection design (a minimal configuration sketch follows this list):

  1. Granularity Selection: Choose appropriate granularity based on data frequency and query patterns
  2. Metadata Organization: Structure metadata fields to enable efficient grouping and filtering
  3. Index Strategy: Create indexes that support temporal range queries and metadata filtering
  4. Compression Configuration: Select compression algorithms based on data characteristics
  5. Bucketing Optimization: Monitor bucket sizes and adjust granularity for optimal performance
  6. Storage Planning: Plan for data growth and implement retention policies
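
As a minimal sketch of these principles, the snippet below creates a time-series collection with an explicit granularity, a structured metadata field, a retention policy, and an index matching the dominant query shape. The database, collection, and field names are illustrative assumptions rather than part of a specific application schema.

// Minimal collection-design sketch (names are illustrative)
const { MongoClient } = require('mongodb');

async function createDesignedCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('iot_platform');

  // Granularity matches roughly one reading per minute; metadata groups readings per sensor
  await db.createCollection('machine_metrics', {
    timeseries: {
      timeField: 'timestamp',
      metaField: 'sensor',              // e.g. { deviceId, sensorType, site, firmwareVersion }
      granularity: 'minutes'
    },
    expireAfterSeconds: 60 * 60 * 24 * 90 // retention policy: expire raw readings after 90 days
  });

  // Support the dominant query shape: one device over a time range
  await db.collection('machine_metrics').createIndex({ 'sensor.deviceId': 1, timestamp: 1 });

  await client.close();
}

createDesignedCollection().catch(console.error);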

Performance and Scalability

Optimize MongoDB time-series collections for production workloads (an ingestion and aggregation sketch follows this list):

  1. Ingestion Optimization: Use batch insertions and optimal write concerns for high throughput
  2. Query Performance: Design aggregation pipelines that leverage time-series optimizations
  3. Real-time Analytics: Implement change streams for real-time processing and alerting
  4. Resource Management: Monitor memory usage and enable disk spilling for large aggregations
  5. Sharding Strategy: Plan horizontal scaling for very high-volume time-series data
  6. Monitoring Setup: Track collection performance, compression ratios, and query patterns
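
The sketch below applies the ingestion and query guidance from this list: unordered batch inserts with an explicit write concern, an early $match on the time field so the time-series optimizations can apply, and allowDiskUse for large group stages. The collection and field names follow the collection-design sketch above, and $dateTrunc assumes MongoDB 5.0 or newer.

// Ingestion and aggregation sketch (collection and field names are assumptions)
async function ingestAndAggregate(db, readings) {
  const coll = db.collection('machine_metrics');

  // 1. Ingestion: unordered 1,000-document batches with a relaxed write concern for throughput
  for (let i = 0; i < readings.length; i += 1000) {
    await coll.insertMany(readings.slice(i, i + 1000), {
      ordered: false,
      writeConcern: { w: 1 }
    });
  }

  // 2. Query: match on the time field first, then group per device and hour,
  //    allowing disk spilling for large intermediate results
  return coll.aggregate([
    { $match: { timestamp: { $gte: new Date(Date.now() - 24 * 3600 * 1000) } } },
    {
      $group: {
        _id: {
          deviceId: '$sensor.deviceId',
          hour: { $dateTrunc: { date: '$timestamp', unit: 'hour' } }
        },
        avgValue: { $avg: '$value' },
        readings: { $sum: 1 }
      }
    },
    { $sort: { '_id.hour': 1 } }
  ], { allowDiskUse: true }).toArray();
}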

Conclusion

MongoDB Time-Series Collections provide specialized optimization for temporal data that eliminates the performance and storage inefficiencies of traditional time-series approaches. The combination of automatic bucketing, intelligent compression, and time-aware indexing makes handling high-volume IoT and sensor data both efficient and scalable.

Key MongoDB Time-Series benefits include:

  • Automatic Optimization: Built-in bucketing and compression optimized for temporal data patterns
  • Storage Efficiency: Up to 90% storage reduction compared to regular document collections
  • Query Performance: Time-aware indexing and aggregation pipeline optimization
  • High-Throughput Ingestion: Optimized write patterns for streaming sensor data
  • Real-Time Analytics: Integration with change streams for real-time processing
  • Flexible Metadata: Support for complex device and sensor metadata structures

Whether you're building IoT platforms, sensor networks, financial trading systems, or real-time analytics applications, MongoDB Time-Series Collections with QueryLeaf's familiar SQL interface provide the foundation for high-performance temporal data management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB time-series operations while providing SQL-familiar temporal analytics, window functions, and time-based aggregations. Advanced time-series patterns, real-time alerting, and performance monitoring are seamlessly handled through familiar SQL constructs, making sophisticated temporal analytics both powerful and accessible to SQL-oriented development teams.

The integration of specialized time-series capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance temporal data management and familiar database interaction patterns, ensuring your time-series solutions remain both performant and maintainable as they scale to handle massive data volumes and real-time processing requirements.

MongoDB Aggregation Framework Optimization and Performance Tuning: Advanced Pipeline Design with SQL-Style Query Performance

Modern data analytics require sophisticated data processing pipelines that can handle complex transformations, aggregations, and analytics across large datasets efficiently. Traditional SQL approaches often struggle with complex nested data structures, multi-stage transformations, and the performance overhead of multiple query roundtrips needed for complex analytics workflows.

MongoDB's Aggregation Framework provides a powerful pipeline-based approach that enables complex data transformations and analytics in a single, optimized operation. Unlike traditional SQL aggregation that requires multiple queries or complex subqueries, MongoDB aggregations can perform sophisticated multi-stage processing with intelligent optimization and index utilization.
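
As a minimal illustration before the detailed examples below, a pipeline is an ordered list of stages in which each stage's output feeds the next, executed server-side in a single operation. The collection and field names in this sketch mirror the order analytics used later in this article, but the snippet itself is an illustrative assumption rather than part of the original example.

// Minimal aggregation pipeline sketch (run inside an async function with a connected `db`)
const topSpenders = await db.collection('orders').aggregate([
  { $match: { status: 'completed' } },                               // filter early so indexes apply
  { $group: { _id: '$userId', spend: { $sum: '$totalAmount' } } },   // aggregate per user
  { $sort: { spend: -1 } },                                          // rank by total spend
  { $limit: 10 }                                                     // return only what is needed
]).toArray();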

The Traditional Analytics Performance Challenge

Traditional approaches to complex data aggregation and analytics have significant performance and architectural limitations:

-- Traditional SQL approach - multiple queries and complex joins

-- PostgreSQL complex analytics query with performance challenges
WITH user_segments AS (
  SELECT 
    user_id,
    email,
    registration_date,
    subscription_tier,

    -- User activity aggregation (expensive subquery)
    (SELECT COUNT(*) FROM user_activities ua WHERE ua.user_id = u.user_id) as total_activities,
    (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.user_id) as total_orders,
    (SELECT COALESCE(SUM(o.total_amount), 0) FROM orders o WHERE o.user_id = u.user_id) as lifetime_value,

    -- Recent activity indicators (more expensive subqueries)
    (SELECT COUNT(*) FROM user_activities ua 
     WHERE ua.user_id = u.user_id 
       AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_activities,
    (SELECT COUNT(*) FROM orders o 
     WHERE o.user_id = u.user_id 
       AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_orders,

    -- Engagement scoring (complex calculation)
    CASE 
      WHEN (SELECT COUNT(*) FROM user_activities ua WHERE ua.user_id = u.user_id AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days') > 10 THEN 'high'
      WHEN (SELECT COUNT(*) FROM user_activities ua WHERE ua.user_id = u.user_id AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') > 5 THEN 'medium'
      ELSE 'low'
    END as engagement_level

  FROM users u
  WHERE u.status = 'active'
),

order_analytics AS (
  SELECT 
    o.user_id,
    COUNT(*) as order_count,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.created_at) as last_order_date,

    -- Product category analysis (expensive join)
    (SELECT string_agg(DISTINCT p.category, ',') 
     FROM order_items oi 
     JOIN products p ON oi.product_id = p.product_id 
     WHERE oi.order_id = o.order_id) as purchased_categories,

    -- Time-based patterns (complex calculations)
    EXTRACT(DOW FROM o.created_at) as order_day_of_week,
    EXTRACT(HOUR FROM o.created_at) as order_hour,

    -- Seasonality analysis
    CASE 
      WHEN EXTRACT(MONTH FROM o.created_at) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(MONTH FROM o.created_at) IN (3, 4, 5) THEN 'spring'
      WHEN EXTRACT(MONTH FROM o.created_at) IN (6, 7, 8) THEN 'summer'
      ELSE 'fall'
    END as season

  FROM orders o
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year'
  GROUP BY o.user_id, EXTRACT(DOW FROM o.created_at), EXTRACT(HOUR FROM o.created_at),
    CASE 
      WHEN EXTRACT(MONTH FROM o.created_at) IN (12, 1, 2) THEN 'winter'
      WHEN EXTRACT(MONTH FROM o.created_at) IN (3, 4, 5) THEN 'spring'  
      WHEN EXTRACT(MONTH FROM o.created_at) IN (6, 7, 8) THEN 'summer'
      ELSE 'fall'
    END
),

product_preferences AS (
  -- Complex product affinity analysis
  SELECT 
    o.user_id,
    p.category,
    COUNT(*) as category_purchases,
    SUM(oi.quantity * oi.unit_price) as category_spend,

    -- Preference scoring
    ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY COUNT(*) DESC) as category_rank,

    -- Purchase timing patterns
    AVG(EXTRACT(EPOCH FROM (o.created_at - LAG(o.created_at) OVER (PARTITION BY o.user_id, p.category ORDER BY o.created_at)))) / 86400 as avg_days_between_category_purchases

  FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p ON oi.product_id = p.product_id
  WHERE o.status = 'completed'
  GROUP BY o.user_id, p.category
),

final_analytics AS (
  SELECT 
    us.user_id,
    us.email,
    us.subscription_tier,
    us.total_activities,
    us.total_orders,
    us.lifetime_value,
    us.engagement_level,

    -- Order analytics
    COALESCE(oa.order_count, 0) as recent_order_count,
    COALESCE(oa.total_spent, 0) as recent_total_spent,
    COALESCE(oa.avg_order_value, 0) as recent_avg_order_value,

    -- Product preferences (expensive array aggregation)
    ARRAY(
      SELECT pp.category 
      FROM product_preferences pp 
      WHERE pp.user_id = us.user_id 
        AND pp.category_rank <= 3
      ORDER BY pp.category_rank
    ) as top_product_categories,

    -- Customer lifetime value prediction (complex calculation)
    CASE
      WHEN us.lifetime_value > 1000 AND us.recent_orders > 2 THEN us.lifetime_value * 1.2
      WHEN us.lifetime_value > 500 AND us.recent_activities > 10 THEN us.lifetime_value * 1.1
      ELSE us.lifetime_value
    END as predicted_ltv,

    -- Churn risk assessment
    CASE
      WHEN us.recent_activities = 0 AND oa.last_order_date < CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'high'
      WHEN us.recent_activities < 5 AND oa.last_order_date < CURRENT_TIMESTAMP - INTERVAL '45 days' THEN 'medium'
      ELSE 'low'
    END as churn_risk,

    -- Segmentation
    CASE
      WHEN us.lifetime_value > 1000 AND us.engagement_level = 'high' THEN 'vip'
      WHEN us.lifetime_value > 500 OR us.engagement_level = 'high' THEN 'loyal'
      WHEN us.total_orders > 0 THEN 'customer'
      ELSE 'prospect'
    END as user_segment

  FROM user_segments us
  LEFT JOIN order_analytics oa ON us.user_id = oa.user_id
)

SELECT *
FROM final_analytics
ORDER BY predicted_ltv DESC, engagement_level DESC;

-- Problems with traditional SQL aggregation:
-- 1. Multiple expensive subqueries for each user
-- 2. Complex joins across many tables with poor performance
-- 3. Difficult to optimize with multiple aggregation layers
-- 4. Limited support for complex nested data transformations
-- 5. Poor performance with large datasets due to multiple passes
-- 6. Complex window functions with high memory usage
-- 7. Difficulty handling semi-structured data efficiently
-- 8. Limited parallelization opportunities
-- 9. Complex query plans that are hard to optimize
-- 10. High resource usage for multi-stage analytics

-- MySQL approach (even more limited)
SELECT 
  u.user_id,
  u.email,
  u.subscription_tier,
  COUNT(DISTINCT ua.activity_id) as total_activities,
  COUNT(DISTINCT o.order_id) as total_orders,
  COALESCE(SUM(o.total_amount), 0) as lifetime_value,

  -- Limited aggregation capabilities
  CASE 
    WHEN COUNT(DISTINCT CASE WHEN ua.created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY) THEN ua.activity_id END) > 10 THEN 'high'
    WHEN COUNT(DISTINCT CASE WHEN ua.created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY) THEN ua.activity_id END) > 5 THEN 'medium'
    ELSE 'low'
  END as engagement_level,

  -- Basic JSON aggregation (limited functionality)
  JSON_ARRAYAGG(DISTINCT p.category) as purchased_categories

FROM users u
LEFT JOIN user_activities ua ON u.user_id = ua.user_id
LEFT JOIN orders o ON u.user_id = o.user_id AND o.status = 'completed'
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE u.status = 'active'
GROUP BY u.user_id, u.email, u.subscription_tier;

-- MySQL limitations:
-- - Very limited JSON and array processing capabilities
-- - Poor window function support in older versions
-- - Basic aggregation functions with limited customization
-- - No sophisticated data transformation capabilities
-- - Limited support for complex analytical queries
-- - Poor performance with large result sets
-- - Minimal support for nested data structures

MongoDB Aggregation Framework provides optimized, pipeline-based analytics:

// MongoDB Aggregation Framework - optimized pipeline-based analytics
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('analytics_platform');

// Advanced aggregation pipeline optimization strategies
class MongoAggregationOptimizer {
  constructor(db) {
    this.db = db;
    this.pipelineStats = new Map();
    this.indexRecommendations = [];
  }

  async optimizeUserAnalyticsPipeline() {
    console.log('Running optimized user analytics aggregation pipeline...');

    const users = this.db.collection('users');

    // Highly optimized aggregation pipeline
    const pipeline = [
      // Stage 1: Initial filtering - leverage indexes early
      {
        $match: {
          status: 'active',
          registrationDate: { 
            $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) 
          }
        }
      },

      // Stage 2: Early projection to reduce document size
      {
        $project: {
          _id: 1,
          email: 1,
          subscriptionTier: 1,
          registrationDate: 1,
          lastLoginAt: 1,
          preferences: 1
        }
      },

      // Stage 3: Lookup user activities with optimized pipeline
      {
        $lookup: {
          from: 'user_activities',
          let: { userId: '$_id' },
          pipeline: [
            {
              $match: {
                $expr: { $eq: ['$userId', '$$userId'] },
                createdAt: { 
                  $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) 
                }
              }
            },
            {
              $group: {
                _id: null,
                totalActivities: { $sum: 1 },
                recentActivities: {
                  $sum: {
                    $cond: {
                      if: { 
                        $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] 
                      },
                      then: 1,
                      else: 0
                    }
                  }
                },
                weeklyActivities: {
                  $sum: {
                    $cond: {
                      if: { 
                        $gte: ['$createdAt', new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)] 
                      },
                      then: 1,
                      else: 0
                    }
                  }
                },
                activityTypes: { $addToSet: '$activityType' },
                lastActivity: { $max: '$createdAt' },
                avgSessionDuration: { $avg: '$sessionDuration' }
              }
            }
          ],
          as: 'activityStats'
        }
      },

      // Stage 4: Lookup order data with aggregated calculations
      {
        $lookup: {
          from: 'orders',
          let: { userId: '$_id' },
          pipeline: [
            {
              $match: {
                $expr: { $eq: ['$userId', '$$userId'] },
                status: 'completed',
                createdAt: { 
                  $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) 
                }
              }
            },
            {
              $group: {
                _id: null,
                totalOrders: { $sum: 1 },
                lifetimeValue: { $sum: '$totalAmount' },
                avgOrderValue: { $avg: '$totalAmount' },
                lastOrderDate: { $max: '$createdAt' },
                recentOrders: {
                  $sum: {
                    $cond: {
                      if: { 
                        $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] 
                      },
                      then: 1,
                      else: 0
                    }
                  }
                },
                recentSpend: {
                  $sum: {
                    $cond: {
                      if: { 
                        $gte: ['$createdAt', new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)] 
                      },
                      then: '$totalAmount',
                      else: 0
                    }
                  }
                },
                orderDaysOfWeek: { $push: { $dayOfWeek: '$createdAt' } },
                orderHours: { $push: { $hour: '$createdAt' } },
                seasonality: {
                  $push: {
                    $switch: {
                      branches: [
                        { case: { $in: [{ $month: '$createdAt' }, [12, 1, 2]] }, then: 'winter' },
                        { case: { $in: [{ $month: '$createdAt' }, [3, 4, 5]] }, then: 'spring' },
                        { case: { $in: [{ $month: '$createdAt' }, [6, 7, 8]] }, then: 'summer' }
                      ],
                      default: 'fall'
                    }
                  }
                }
              }
            }
          ],
          as: 'orderStats'
        }
      },

      // Stage 5: Product preference analysis
      {
        $lookup: {
          from: 'orders',
          let: { userId: '$_id' },
          pipeline: [
            {
              $match: {
                $expr: { $eq: ['$userId', '$$userId'] },
                status: 'completed'
              }
            },
            {
              $unwind: '$items'
            },
            {
              $lookup: {
                from: 'products',
                localField: 'items.productId',
                foreignField: '_id',
                as: 'product'
              }
            },
            {
              $unwind: '$product'
            },
            {
              $group: {
                _id: '$product.category',
                categoryPurchases: { $sum: 1 },
                categorySpend: { $sum: '$items.totalPrice' },
                // Track the purchase time span per category; the average gap is derived below
                firstCategoryPurchase: { $min: '$createdAt' },
                lastCategoryPurchase: { $max: '$createdAt' }
              }
            },
            {
              $sort: { categoryPurchases: -1 }
            },
            {
              $limit: 5 // Top 5 categories only
            },
            {
              $group: {
                _id: null,
                topCategories: {
                  $push: {
                    category: '$_id',
                    purchases: '$categoryPurchases',
                    spend: '$categorySpend',
                    avgDaysBetween: {
                      // Average gap in days between purchases within the category
                      $cond: {
                        if: { $gt: ['$categoryPurchases', 1] },
                        then: {
                          $divide: [
                            { $subtract: ['$lastCategoryPurchase', '$firstCategoryPurchase'] },
                            { $multiply: [86400000, { $subtract: ['$categoryPurchases', 1] }] }
                          ]
                        },
                        else: null
                      }
                    }
                  }
                }
              }
            }
          ],
          as: 'productPreferences'
        }
      },

      // Stage 6: Flatten and calculate derived metrics
      {
        $addFields: {
          // Extract activity stats
          activityStats: { $arrayElemAt: ['$activityStats', 0] },
          orderStats: { $arrayElemAt: ['$orderStats', 0] },
          productPreferences: { $arrayElemAt: ['$productPreferences', 0] }
        }
      },

      // Stage 7: Advanced calculated fields and scoring
      {
        $addFields: {
          // Engagement scoring
          engagementScore: {
            $add: [
              { $multiply: [{ $ifNull: ['$activityStats.weeklyActivities', 0] }, 2] },
              { $multiply: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 1] },
              { $multiply: [{ $ifNull: ['$orderStats.recentOrders', 0] }, 5] }
            ]
          },

          // Engagement level classification
          engagementLevel: {
            $switch: {
              branches: [
                {
                  case: { $gt: [{ $ifNull: ['$activityStats.weeklyActivities', 0] }, 10] },
                  then: 'high'
                },
                {
                  case: { $gt: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 5] },
                  then: 'medium'
                }
              ],
              default: 'low'
            }
          },

          // Customer lifetime value prediction
          predictedLTV: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gt: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 1000] },
                      { $gt: [{ $ifNull: ['$orderStats.recentOrders', 0] }, 2] }
                    ]
                  },
                  then: { $multiply: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 1.2] }
                },
                {
                  case: {
                    $and: [
                      { $gt: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 500] },
                      { $gt: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 10] }
                    ]
                  },
                  then: { $multiply: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 1.1] }
                }
              ],
              default: { $ifNull: ['$orderStats.lifetimeValue', 0] }
            }
          },

          // Churn risk assessment
          churnRisk: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 0] },
                      {
                        $lt: [
                          { $ifNull: ['$orderStats.lastOrderDate', new Date(0)] },
                          new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                        ]
                      }
                    ]
                  },
                  then: 'high'
                },
                {
                  case: {
                    $and: [
                      { $lt: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 5] },
                      {
                        $lt: [
                          { $ifNull: ['$orderStats.lastOrderDate', new Date(0)] },
                          new Date(Date.now() - 45 * 24 * 60 * 60 * 1000)
                        ]
                      }
                    ]
                  },
                  then: 'medium'
                }
              ],
              default: 'low'
            }
          },

          // User segmentation
          userSegment: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gt: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 1000] },
                      // '$engagementLevel' is added in this same $addFields stage and is not
                      // visible here, so re-evaluate the underlying high-engagement condition
                      { $gt: [{ $ifNull: ['$activityStats.weeklyActivities', 0] }, 10] }
                    ]
                  },
                  then: 'vip'
                },
                {
                  case: {
                    $or: [
                      { $gt: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 500] },
                      // Re-evaluate high engagement (same-stage '$engagementLevel' is not visible here)
                      { $gt: [{ $ifNull: ['$activityStats.weeklyActivities', 0] }, 10] }
                    ]
                  },
                  then: 'loyal'
                },
                {
                  case: { $gt: [{ $ifNull: ['$orderStats.totalOrders', 0] }, 0] },
                  then: 'customer'
                }
              ],
              default: 'prospect'
            }
          },

          // Behavioral patterns
          behaviorPattern: {
            $let: {
              vars: {
                dayOfWeekMode: {
                  // Pick the most frequent order day (the mode) rather than the first array element
                  $reduce: {
                    input: {
                      $map: {
                        input: { $range: [1, 8] },
                        as: 'day',
                        in: {
                          day: '$$day',
                          count: {
                            $size: {
                              $filter: {
                                input: { $ifNull: ['$orderStats.orderDaysOfWeek', []] },
                                cond: { $eq: ['$$this', '$$day'] }
                              }
                            }
                          }
                        }
                      }
                    },
                    initialValue: { day: null, count: -1 },
                    in: {
                      $cond: {
                        if: { $gt: ['$$this.count', '$$value.count'] },
                        then: '$$this',
                        else: '$$value'
                      }
                    }
                  }
                }
              },
              in: {
                preferredOrderDay: '$$dayOfWeekMode.day',
                orderFrequency: {
                  $cond: {
                    if: { $gt: [{ $ifNull: ['$orderStats.totalOrders', 0] }, 1] },
                    then: {
                      // Estimated orders per year: total orders divided by account age in years
                      $divide: [
                        { $multiply: [{ $ifNull: ['$orderStats.totalOrders', 0] }, 365] },
                        {
                          $max: [
                            1,
                            {
                              $divide: [
                                {
                                  $subtract: [
                                    { $ifNull: ['$orderStats.lastOrderDate', new Date()] },
                                    '$registrationDate'
                                  ]
                                },
                                86400000
                              ]
                            }
                          ]
                        }
                      ]
                    },
                    else: 0
                  }
                }
              }
            }
          }
        }
      },

      // Stage 8: Final projection with optimized field selection
      {
        $project: {
          _id: 1,
          email: 1,
          subscriptionTier: 1,
          registrationDate: 1,

          // Activity metrics
          totalActivities: { $ifNull: ['$activityStats.totalActivities', 0] },
          recentActivities: { $ifNull: ['$activityStats.recentActivities', 0] },
          weeklyActivities: { $ifNull: ['$activityStats.weeklyActivities', 0] },
          activityTypes: { $ifNull: ['$activityStats.activityTypes', []] },
          lastActivity: '$activityStats.lastActivity',
          avgSessionDuration: { $round: [{ $ifNull: ['$activityStats.avgSessionDuration', 0] }, 2] },

          // Order metrics
          totalOrders: { $ifNull: ['$orderStats.totalOrders', 0] },
          lifetimeValue: { $round: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 2] },
          avgOrderValue: { $round: [{ $ifNull: ['$orderStats.avgOrderValue', 0] }, 2] },
          lastOrderDate: '$orderStats.lastOrderDate',
          recentOrders: { $ifNull: ['$orderStats.recentOrders', 0] },
          recentSpend: { $round: [{ $ifNull: ['$orderStats.recentSpend', 0] }, 2] },

          // Product preferences
          topProductCategories: { 
            $ifNull: ['$productPreferences.topCategories', []] 
          },

          // Calculated metrics
          engagementScore: { $round: ['$engagementScore', 0] },
          engagementLevel: 1,
          predictedLTV: { $round: ['$predictedLTV', 2] },
          churnRisk: 1,
          userSegment: 1,
          behaviorPattern: 1,

          // Performance indicators
          isHighValue: { $gte: [{ $ifNull: ['$orderStats.lifetimeValue', 0] }, 1000] },
          isRecentlyActive: { 
            $gte: [{ $ifNull: ['$activityStats.recentActivities', 0] }, 5] 
          },
          isAtRisk: { $eq: ['$churnRisk', 'high'] },

          // Days since last activity/order
          daysSinceLastActivity: {
            $cond: {
              if: { $ne: ['$activityStats.lastActivity', null] },
              then: {
                $divide: [
                  { $subtract: [new Date(), '$activityStats.lastActivity'] },
                  86400000
                ]
              },
              else: 999
            }
          },
          daysSinceLastOrder: {
            $cond: {
              if: { $ne: ['$orderStats.lastOrderDate', null] },
              then: {
                $divide: [
                  { $subtract: [new Date(), '$orderStats.lastOrderDate'] },
                  86400000
                ]
              },
              else: 999
            }
          }
        }
      },

      // Stage 9: Sorting for optimal performance
      {
        $sort: {
          predictedLTV: -1,
          engagementScore: -1,
          lastActivity: -1
        }
      },

      // Stage 10: Optional limit for performance
      {
        $limit: 10000
      }
    ];

    // Execute pipeline with performance tracking
    const startTime = Date.now();
    const results = await users.aggregate(pipeline).toArray();
    const executionTime = Date.now() - startTime;

    console.log(`Aggregation completed in ${executionTime}ms, ${results.length} results`);

    // Track pipeline performance
    this.pipelineStats.set('userAnalytics', {
      executionTime,
      resultCount: results.length,
      pipelineStages: pipeline.length,
      timestamp: new Date()
    });

    return results;
  }

  async optimizeProductAnalyticsPipeline() {
    console.log('Running optimized product analytics aggregation pipeline...');

    const orders = this.db.collection('orders');

    const pipeline = [
      // Stage 1: Filter completed orders from last year
      {
        $match: {
          status: 'completed',
          createdAt: { 
            $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) 
          }
        }
      },

      // Stage 2: Unwind order items for product-level analysis
      {
        $unwind: '$items'
      },

      // Stage 3: Lookup product details
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: '_id',
          as: 'product'
        }
      },

      // Stage 4: Unwind product array
      {
        $unwind: '$product'
      },

      // Stage 5: Add time-based fields for analysis
      {
        $addFields: {
          orderMonth: { $month: '$createdAt' },
          orderDayOfWeek: { $dayOfWeek: '$createdAt' },
          orderHour: { $hour: '$createdAt' },
          season: {
            $switch: {
              branches: [
                { case: { $in: [{ $month: '$createdAt' }, [12, 1, 2]] }, then: 'winter' },
                { case: { $in: [{ $month: '$createdAt' }, [3, 4, 5]] }, then: 'spring' },
                { case: { $in: [{ $month: '$createdAt' }, [6, 7, 8]] }, then: 'summer' }
              ],
              default: 'fall'
            }
          },
          revenue: '$items.totalPrice',
          profit: {
            $subtract: ['$items.totalPrice', { $multiply: ['$items.quantity', '$product.cost'] }]
          },
          profitMargin: {
            $cond: {
              if: { $gt: ['$items.totalPrice', 0] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$items.totalPrice', { $multiply: ['$items.quantity', '$product.cost'] }] },
                      '$items.totalPrice'
                    ]
                  },
                  100
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 6: Group by product for comprehensive analytics
      {
        $group: {
          _id: '$items.productId',
          productName: { $first: '$product.name' },
          category: { $first: '$product.category' },
          price: { $first: '$product.price' },
          cost: { $first: '$product.cost' },

          // Volume metrics
          totalSold: { $sum: '$items.quantity' },
          totalOrders: { $sum: 1 },
          uniqueCustomers: { $addToSet: '$userId' },

          // Revenue metrics
          totalRevenue: { $sum: '$revenue' },
          totalProfit: { $sum: '$profit' },
          avgOrderValue: { $avg: '$revenue' },
          avgProfitMargin: { $avg: '$profitMargin' },

          // Time-based patterns
          salesByMonth: {
            $push: {
              month: '$orderMonth',
              quantity: '$items.quantity',
              revenue: '$revenue'
            }
          },
          salesByDayOfWeek: {
            $push: {
              dayOfWeek: '$orderDayOfWeek',
              quantity: '$items.quantity'
            }
          },
          salesByHour: {
            $push: {
              hour: '$orderHour',
              quantity: '$items.quantity'
            }
          },
          salesBySeason: {
            $push: {
              season: '$season',
              quantity: '$items.quantity',
              revenue: '$revenue'
            }
          },

          // Performance indicators
          firstSale: { $min: '$createdAt' },
          lastSale: { $max: '$createdAt' },
          peakSaleMonth: {
            $max: {
              month: '$orderMonth',
              quantity: '$items.quantity'
            }
          }
        }
      },

      // Stage 7: Calculate advanced metrics
      {
        $addFields: {
          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          avgQuantityPerOrder: { $divide: ['$totalSold', '$totalOrders'] },
          revenuePerCustomer: { 
            $divide: ['$totalRevenue', { $size: '$uniqueCustomers' }] 
          },
          daysSinceLastSale: {
            $divide: [
              { $subtract: [new Date(), '$lastSale'] },
              86400000
            ]
          },
          productLifespanDays: {
            $divide: [
              { $subtract: ['$lastSale', '$firstSale'] },
              86400000
            ]
          },

          // Monthly sales distribution
          monthlySalesStats: {
            $let: {
              vars: {
                monthlyAgg: {
                  $reduce: {
                    input: { $range: [1, 13] },
                    initialValue: [],
                    in: {
                      $concatArrays: [
                        '$$value',
                        [{
                          month: '$$this',
                          totalQuantity: {
                            $sum: {
                              $map: {
                                input: {
                                  $filter: {
                                    input: '$salesByMonth',
                                    as: 'sale',
                                    // '$$this' here is the month number from the enclosing $reduce
                                    cond: { $eq: ['$$sale.month', '$$this'] }
                                  }
                                },
                                in: '$$this.quantity'
                              }
                            }
                          },
                          totalRevenue: {
                            $sum: {
                              $map: {
                                input: {
                                  $filter: {
                                    input: '$salesByMonth',
                                    as: 'sale',
                                    // '$$this' here is the month number from the enclosing $reduce
                                    cond: { $eq: ['$$sale.month', '$$this'] }
                                  }
                                },
                                in: '$$this.revenue'
                              }
                            }
                          }
                        }]
                      ]
                    }
                  }
                }
              },
              in: {
                bestMonth: {
                  $arrayElemAt: [
                    {
                      $filter: {
                        input: '$$monthlyAgg',
                        cond: {
                          $eq: [
                            '$$this.totalQuantity',
                            { $max: '$$monthlyAgg.totalQuantity' }
                          ]
                        }
                      }
                    },
                    0
                  ]
                },
                monthlyTrend: '$$monthlyAgg'
              }
            }
          }
        }
      },

      // Stage 8: Product performance classification
      {
        $addFields: {
          performanceCategory: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gt: ['$totalRevenue', 10000] },
                      { $gt: ['$avgProfitMargin', 20] },
                      { $gt: ['$uniqueCustomerCount', 100] }
                    ]
                  },
                  then: 'star'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalRevenue', 5000] },
                      { $gt: ['$avgProfitMargin', 10] }
                    ]
                  },
                  then: 'strong'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalRevenue', 1000] },
                      { $gt: ['$totalSold', 10] }
                    ]
                  },
                  then: 'moderate'
                },
                {
                  case: { $lt: ['$daysSinceLastSale', 30] },
                  then: 'active'
                }
              ],
              default: 'underperforming'
            }
          },

          inventoryStatus: {
            $switch: {
              branches: [
                { case: { $gt: ['$daysSinceLastSale', 90] }, then: 'stale' },
                { case: { $gt: ['$daysSinceLastSale', 30] }, then: 'slow_moving' },
                { case: { $lt: ['$daysSinceLastSale', 7] }, then: 'hot' }
              ],
              default: 'normal'
            }
          },

          // Demand predictability
          demandConsistency: {
            $let: {
              vars: {
                monthlyQuantities: '$monthlySalesStats.monthlyTrend.totalQuantity',
                avgMonthly: {
                  $avg: '$monthlySalesStats.monthlyTrend.totalQuantity'
                }
              },
              in: {
                $cond: {
                  if: { $gt: ['$$avgMonthly', 0] },
                  then: {
                    $divide: [
                      {
                        $stdDevPop: '$$monthlyQuantities'
                      },
                      '$$avgMonthly'
                    ]
                  },
                  else: 0
                }
              }
            }
          }
        }
      },

      // Stage 9: Final projection
      {
        $project: {
          productId: '$_id',
          productName: 1,
          category: 1,
          price: 1,
          cost: 1,

          // Sales metrics
          totalSold: 1,
          totalOrders: 1,
          uniqueCustomerCount: 1,
          avgQuantityPerOrder: { $round: ['$avgQuantityPerOrder', 2] },

          // Financial metrics
          totalRevenue: { $round: ['$totalRevenue', 2] },
          totalProfit: { $round: ['$totalProfit', 2] },
          avgOrderValue: { $round: ['$avgOrderValue', 2] },
          avgProfitMargin: { $round: ['$avgProfitMargin', 1] },
          revenuePerCustomer: { $round: ['$revenuePerCustomer', 2] },

          // Performance classification
          performanceCategory: 1,
          inventoryStatus: 1,
          demandConsistency: { $round: ['$demandConsistency', 3] },

          // Time-based insights
          daysSinceLastSale: { $round: ['$daysSinceLastSale', 0] },
          productLifespanDays: { $round: ['$productLifespanDays', 0] },
          bestSellingMonth: '$monthlySalesStats.bestMonth.month',
          bestMonthQuantity: '$monthlySalesStats.bestMonth.totalQuantity',
          bestMonthRevenue: { 
            $round: ['$monthlySalesStats.bestMonth.totalRevenue', 2] 
          },

          // Flags for business decisions
          isTopPerformer: { $eq: ['$performanceCategory', 'star'] },
          needsAttention: { $in: ['$performanceCategory', ['underperforming']] },
          isInventoryRisk: { $in: ['$inventoryStatus', ['stale', 'slow_moving']] },
          isHighDemand: { $eq: ['$inventoryStatus', 'hot'] },
          isPredictableDemand: { $lt: ['$demandConsistency', 0.5] }
        }
      },

      // Stage 10: Sort by business priority
      {
        $sort: {
          totalRevenue: -1,
          totalProfit: -1,
          uniqueCustomerCount: -1
        }
      }
    ];

    const startTime = Date.now();
    const results = await orders.aggregate(pipeline).toArray();
    const executionTime = Date.now() - startTime;

    console.log(`Product analytics completed in ${executionTime}ms, ${results.length} results`);

    this.pipelineStats.set('productAnalytics', {
      executionTime,
      resultCount: results.length,
      pipelineStages: pipeline.length,
      timestamp: new Date()
    });

    return results;
  }

  async analyzeAggregationPerformance(collection, pipeline, sampleSize = 1000) {
    console.log('Analyzing aggregation performance...');

    // Get the explain output for the pipeline (executionStats verbosity)
    const explainResult = await collection.aggregate(pipeline).explain('executionStats');

    // Run with different hints and options to compare performance
    const performanceTests = [];

    // Test 1: Default execution
    const test1Start = Date.now();
    const test1Results = await collection.aggregate(pipeline).limit(sampleSize).toArray();
    const test1Time = Date.now() - test1Start;

    performanceTests.push({
      name: 'default',
      executionTime: test1Time,
      resultCount: test1Results.length,
      avgTimePerResult: test1Time / test1Results.length
    });

    // Test 2: With allowDiskUse for large datasets
    const test2Start = Date.now();
    const test2Results = await collection.aggregate(pipeline, { 
      allowDiskUse: true 
    }).limit(sampleSize).toArray();
    const test2Time = Date.now() - test2Start;

    performanceTests.push({
      name: 'allowDiskUse',
      executionTime: test2Time,
      resultCount: test2Results.length,
      avgTimePerResult: test2Time / test2Results.length
    });

    // Test 3: With maxTimeMS limit
    try {
      const test3Start = Date.now();
      const test3Results = await collection.aggregate(pipeline, { 
        maxTimeMS: 30000 
      }).limit(sampleSize).toArray();
      const test3Time = Date.now() - test3Start;

      performanceTests.push({
        name: 'maxTimeMS_30s',
        executionTime: test3Time,
        resultCount: test3Results.length,
        avgTimePerResult: test3Time / test3Results.length
      });
    } catch (error) {
      performanceTests.push({
        name: 'maxTimeMS_30s',
        error: error.message,
        executionTime: 30000,
        resultCount: 0
      });
    }

    // Analyze pipeline stages
    const stageAnalysis = pipeline.map((stage, index) => {
      const stageType = Object.keys(stage)[0];
      return {
        stage: index + 1,
        type: stageType,
        complexity: this.analyzeStageComplexity(stage),
        indexUtilization: this.analyzeIndexUsage(stage),
        optimizationOpportunities: this.identifyOptimizations(stage)
      };
    });

    return {
      explainPlan: explainResult,
      performanceTests: performanceTests,
      stageAnalysis: stageAnalysis,
      recommendations: this.generateOptimizationRecommendations(performanceTests, stageAnalysis)
    };
  }

  analyzeStageComplexity(stage) {
    const stageType = Object.keys(stage)[0];
    const complexityScores = {
      '$match': 1,
      '$project': 2,
      '$addFields': 3,
      '$group': 5,
      '$lookup': 7,
      '$unwind': 3,
      '$sort': 4,
      '$limit': 1,
      '$skip': 1,
      '$facet': 8,
      '$bucket': 6,
      '$sortByCount': 4
    };

    return complexityScores[stageType] || 3;
  }

  analyzeIndexUsage(stage) {
    const stageType = Object.keys(stage)[0];

    if (stageType === '$match') {
      const matchFields = Object.keys(stage[stageType]);
      return {
        canUseIndex: true,
        indexFields: matchFields,
        recommendation: `Ensure compound index exists for fields: ${matchFields.join(', ')}`
      };
    } else if (stageType === '$sort') {
      const sortFields = Object.keys(stage[stageType]);
      return {
        canUseIndex: true,
        indexFields: sortFields,
        recommendation: `Create index with sort field order: ${sortFields.join(', ')}`
      };
    }

    return {
      canUseIndex: false,
      recommendation: 'Stage cannot directly utilize indexes'
    };
  }

  identifyOptimizations(stage) {
    const stageType = Object.keys(stage)[0];
    const optimizations = [];

    switch (stageType) {
      case '$match':
        optimizations.push('Place $match stages as early as possible in pipeline');
        optimizations.push('Use indexes for filter conditions');
        break;
      case '$project':
        optimizations.push('Project only necessary fields to reduce document size');
        optimizations.push('Place projection early to reduce pipeline data volume');
        break;
      case '$lookup':
        optimizations.push('Use pipeline in $lookup for better performance');
        optimizations.push('Ensure foreign collection has appropriate indexes');
        optimizations.push('Consider embedding documents instead of lookups if data size permits');
        break;
      case '$group':
        optimizations.push('Group operations may require memory - consider allowDiskUse');
        optimizations.push('Use $bucket or $bucketAuto for large groupings');
        break;
      case '$sort':
        optimizations.push('Use indexes for sorting when possible');
        optimizations.push('Limit sort data with early $match and $limit stages');
        break;
    }

    return optimizations;
  }

  generateOptimizationRecommendations(performanceTests, stageAnalysis) {
    const recommendations = [];

    // Performance analysis
    const fastest = performanceTests.reduce((prev, current) => 
      prev.executionTime < current.executionTime ? prev : current
    );

    if (fastest.name !== 'default') {
      recommendations.push(`Best performance achieved with ${fastest.name} option`);
    }

    // High complexity stages
    const highComplexityStages = stageAnalysis.filter(s => s.complexity >= 6);
    if (highComplexityStages.length > 0) {
      recommendations.push(`High complexity stages detected: ${highComplexityStages.map(s => s.type).join(', ')}`);
    }

    // Index recommendations
    const indexableStages = stageAnalysis.filter(s => s.indexUtilization.canUseIndex);
    if (indexableStages.length > 0) {
      recommendations.push(`Create indexes for stages: ${indexableStages.map(s => s.type).join(', ')}`);
    }

    // General optimization
    const totalComplexity = stageAnalysis.reduce((sum, s) => sum + s.complexity, 0);
    if (totalComplexity > 30) {
      recommendations.push('Consider breaking pipeline into smaller parts');
      recommendations.push('Use $limit early to reduce dataset size');
    }

    return recommendations;
  }

  async createOptimalIndexes(collection, aggregationPatterns) {
    console.log('Creating optimal indexes for aggregation patterns...');

    const indexRecommendations = [];

    for (const pattern of aggregationPatterns) {
      const { pipeline, frequency, avgExecutionTime } = pattern;

      // Analyze pipeline for index opportunities
      const matchStages = pipeline.filter(stage => stage.$match);
      const sortStages = pipeline.filter(stage => stage.$sort);
      const lookupStages = pipeline.filter(stage => stage.$lookup);

      // Create compound indexes for $match + $sort combinations
      for (const matchStage of matchStages) {
        const matchFields = Object.keys(matchStage.$match);

        for (const sortStage of sortStages) {
          const sortFields = Object.keys(sortStage.$sort);

          // Combine match and sort fields following ESR rule
          const indexSpec = {};

          // Equality fields first
          matchFields.forEach(field => {
            if (typeof matchStage.$match[field] !== 'object') {
              indexSpec[field] = 1;
            }
          });

          // Sort fields next
          sortFields.forEach(field => {
            if (!indexSpec[field]) {
              indexSpec[field] = sortStage.$sort[field];
            }
          });

          // Range fields last
          matchFields.forEach(field => {
            if (typeof matchStage.$match[field] === 'object' && !indexSpec[field]) {
              indexSpec[field] = 1;
            }
          });

          if (Object.keys(indexSpec).length > 1) {
            indexRecommendations.push({
              collection: collection.collectionName,
              indexSpec: indexSpec,
              reason: 'Compound index for $match + $sort optimization',
              frequency: frequency,
              priority: frequency * avgExecutionTime,
              estimatedBenefit: this.estimateIndexBenefit(indexSpec, pattern)
            });
          }
        }
      }

      // Create indexes for $lookup foreign collections
      for (const lookupStage of lookupStages) {
        const { from, foreignField } = lookupStage.$lookup;

        if (foreignField) {
          indexRecommendations.push({
            collection: from,
            indexSpec: { [foreignField]: 1 },
            reason: 'Index for $lookup foreign field',
            frequency: frequency,
            priority: frequency * avgExecutionTime * 0.8,
            estimatedBenefit: 'High - improves lookup performance significantly'
          });
        }
      }
    }

    // Sort by priority and create top indexes
    const topRecommendations = indexRecommendations
      .sort((a, b) => b.priority - a.priority)
      .slice(0, 10);

    for (const rec of topRecommendations) {
      try {
        const targetCollection = this.db.collection(rec.collection);
        const indexName = `idx_agg_${Object.keys(rec.indexSpec).join('_')}`;

        await targetCollection.createIndex(rec.indexSpec, {
          name: indexName,
          background: true
        });

        console.log(`Created index ${indexName} on ${rec.collection}`);

      } catch (error) {
        console.error(`Failed to create index for ${rec.collection}:`, error.message);
      }
    }

    return topRecommendations;
  }

  estimateIndexBenefit(indexSpec, pattern) {
    const fieldCount = Object.keys(indexSpec).length;
    const pipelineComplexity = pattern.pipeline.length;

    if (fieldCount >= 3 && pipelineComplexity >= 5) {
      return 'Very High - Complex compound index for multi-stage pipeline';
    } else if (fieldCount >= 2) {
      return 'High - Compound index provides significant benefit';
    } else {
      return 'Medium - Single field index provides moderate benefit';
    }
  }

  async getPipelinePerformanceMetrics() {
    const metrics = {
      totalPipelines: this.pipelineStats.size,
      pipelines: Array.from(this.pipelineStats.entries()).map(([name, stats]) => ({
        name: name,
        executionTime: stats.executionTime,
        resultCount: stats.resultCount,
        stageCount: stats.pipelineStages,
        throughput: Math.round(stats.resultCount / (stats.executionTime / 1000)),
        lastRun: stats.timestamp
      })),
      indexRecommendations: this.indexRecommendations,

      // Performance categories
      fastPipelines: Array.from(this.pipelineStats.entries())
        .filter(([_, stats]) => stats.executionTime < 1000),
      slowPipelines: Array.from(this.pipelineStats.entries())
        .filter(([_, stats]) => stats.executionTime > 5000),

      // Overall health
      avgExecutionTime: Array.from(this.pipelineStats.values())
        .reduce((sum, stats) => sum + stats.executionTime, 0) / this.pipelineStats.size || 0
    };

    return metrics;
  }
}

// Benefits of MongoDB Aggregation Framework:
// - Single-pass processing eliminates multiple query roundtrips
// - Intelligent pipeline optimization with automatic stage reordering
// - Native index utilization throughout the pipeline stages
// - Memory-efficient streaming processing for large datasets
// - Built-in parallelization across shards in distributed deployments
// - Rich expression language for complex transformations and calculations
// - Integration with MongoDB's query optimizer for optimal execution plans
// - Support for complex nested document operations and transformations
// - Automatic spill-to-disk capabilities for memory-intensive operations
// - Native support for advanced analytics patterns and statistical functions

module.exports = {
  MongoAggregationOptimizer
};
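
A hedged usage sketch for the optimizer exported above. The module path, database name, and the assumption that the constructor accepts a connected Db handle (as the methods suggest) are illustrative and not part of the original article:

const { MongoClient } = require('mongodb');
// Hypothetical module path for the class exported above
const { MongoAggregationOptimizer } = require('./mongo-aggregation-optimizer');

async function main() {
  const client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017');
  await client.connect();
  try {
    // Assumption: the constructor takes the Db handle used by the methods above
    const optimizer = new MongoAggregationOptimizer(client.db('analytics'));
    await optimizer.optimizeProductAnalyticsPipeline();
    console.log(await optimizer.getPipelinePerformanceMetrics());
  } finally {
    await client.close();
  }
}

main().catch(console.error);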

Understanding MongoDB Aggregation Performance Architecture

Advanced Pipeline Optimization Strategies

Implement sophisticated aggregation optimization techniques for maximum performance:

// Advanced aggregation optimization patterns
class AggregationPerformanceTuner {
  constructor(db) {
    this.db = db;
    this.performanceProfiles = new Map();
    this.optimizationRules = this.loadOptimizationRules();
  }

  async optimizePipelineOrder(pipeline) {
    console.log('Optimizing pipeline stage order for maximum performance...');

    // Analyze current pipeline
    const analysis = this.analyzePipelineStages(pipeline);

    // Apply optimization rules
    const optimizedPipeline = this.applyOptimizationRules(pipeline, analysis);

    // Estimate performance improvement
    const improvement = this.estimatePerformanceImprovement(pipeline, optimizedPipeline);

    return {
      originalPipeline: pipeline,
      optimizedPipeline: optimizedPipeline,
      optimizations: analysis.optimizations,
      estimatedImprovement: improvement
    };
  }

  analyzePipelineStages(pipeline) {
    const analysis = {
      stages: [],
      optimizations: [],
      indexOpportunities: [],
      memoryUsage: 0,
      diskUsage: false
    };

    pipeline.forEach((stage, index) => {
      const stageType = Object.keys(stage)[0];
      const stageAnalysis = {
        index: index,
        type: stageType,
        selectivity: this.calculateSelectivity(stage),
        memoryImpact: this.estimateMemoryUsage(stage),
        indexable: this.isIndexable(stage),
        earlyPlacement: this.canPlaceEarly(stage)
      };

      analysis.stages.push(stageAnalysis);

      // Track memory usage
      analysis.memoryUsage += stageAnalysis.memoryImpact;

      // Check for disk usage requirements
      if (stageType === '$group' || stageType === '$sort') {
        analysis.diskUsage = true;
      }

      // Identify optimization opportunities
      if (stageAnalysis.earlyPlacement && index > 2) {
        analysis.optimizations.push({
          type: 'move_early',
          stage: stageType,
          currentIndex: index,
          suggestedIndex: 0,
          reason: 'High selectivity stage should be placed early'
        });
      }

      if (stageAnalysis.indexable && !this.hasAppropriateIndex(stage)) {
        analysis.indexOpportunities.push({
          stage: stageType,
          indexSpec: this.suggestIndexSpec(stage),
          priority: stageAnalysis.selectivity * 10
        });
      }
    });

    return analysis;
  }

  applyOptimizationRules(pipeline, analysis) {
    let optimizedPipeline = [...pipeline];

    // Rule 1: Move high-selectivity $match stages to the beginning
    const matchStages = optimizedPipeline
      .map((stage, index) => ({ stage, index }))
      .filter(item => item.stage.$match)
      .sort((a, b) => {
        const selectivityA = this.calculateSelectivity(a.stage);
        const selectivityB = this.calculateSelectivity(b.stage);
        return selectivityA - selectivityB; // Most selective (lowest score) first
      });

    // Reorder: pull the $match stages out and reinsert them at the front,
    // most selective first, preserving the relative order of all other stages
    const otherStages = optimizedPipeline.filter(stage => !stage.$match);
    optimizedPipeline = [...matchStages.map(item => item.stage), ...otherStages];

    // Rule 2: Place $project stages early to reduce document size
    const projectIndex = optimizedPipeline.findIndex(stage => stage.$project);
    if (projectIndex > 2) {
      const projectStage = optimizedPipeline.splice(projectIndex, 1)[0];
      optimizedPipeline.splice(2, 0, projectStage);
    }

    // Rule 3: Move $limit stages as early as possible
    const limitIndex = optimizedPipeline.findIndex(stage => stage.$limit);
    if (limitIndex > -1) {
      const limitStage = optimizedPipeline[limitIndex];

      // Find appropriate position after filtering stages
      let insertPosition = 0;
      for (let i = 0; i < optimizedPipeline.length; i++) {
        const stageType = Object.keys(optimizedPipeline[i])[0];
        if (['$match', '$project'].includes(stageType)) {
          insertPosition = i + 1;
        } else {
          break;
        }
      }

      if (limitIndex !== insertPosition) {
        optimizedPipeline.splice(limitIndex, 1);
        optimizedPipeline.splice(insertPosition, 0, limitStage);
      }
    }

    // Rule 4: Combine adjacent $addFields stages
    optimizedPipeline = this.combineAdjacentAddFields(optimizedPipeline);

    // Rule 5: Push $match conditions into $lookup pipelines
    optimizedPipeline = this.optimizeLookupStages(optimizedPipeline);

    return optimizedPipeline;
  }

  calculateSelectivity(stage) {
    const stageType = Object.keys(stage)[0];

    switch (stageType) {
      case '$match':
        return this.calculateMatchSelectivity(stage.$match);
      case '$limit':
        return 0.1; // Very high selectivity
      case '$project':
        return 0.8; // Reduces document size
      case '$addFields':
        return 1.0; // No selectivity change
      case '$group':
        return 0.3; // Significant reduction typically
      case '$lookup':
        return 1.2; // May increase document size
      case '$unwind':
        return 1.5; // Increases document count
      case '$sort':
        return 1.0; // No selectivity change
      default:
        return 1.0;
    }
  }

  calculateMatchSelectivity(matchCondition) {
    let selectivity = 1.0;

    for (const [field, condition] of Object.entries(matchCondition)) {
      if (typeof condition === 'object') {
        // Range or complex conditions
        if (condition.$gte || condition.$lte || condition.$lt || condition.$gt) {
          selectivity *= 0.3; // Range queries are moderately selective
        } else if (condition.$in) {
          selectivity *= Math.min(0.5, condition.$in.length / 10);
        } else if (condition.$ne || condition.$nin) {
          selectivity *= 0.9; // Negative conditions are less selective
        } else if (condition.$exists) {
          selectivity *= condition.$exists ? 0.8 : 0.2;
        }
      } else {
        // Equality condition
        selectivity *= 0.1; // Equality is highly selective
      }
    }

    return Math.max(selectivity, 0.01); // Minimum selectivity
  }

  estimateMemoryUsage(stage) {
    const stageType = Object.keys(stage)[0];
    const memoryScores = {
      '$match': 10,
      '$project': 20,
      '$addFields': 30,
      '$group': 500,
      '$lookup': 200,
      '$unwind': 50,
      '$sort': 300,
      '$limit': 5,
      '$skip': 5,
      '$facet': 800,
      '$bucket': 400
    };

    return memoryScores[stageType] || 50;
  }

  isIndexable(stage) {
    const stageType = Object.keys(stage)[0];
    return ['$match', '$sort'].includes(stageType);
  }

  canPlaceEarly(stage) {
    const stageType = Object.keys(stage)[0];
    return ['$match', '$limit', '$project'].includes(stageType);
  }

  combineAdjacentAddFields(pipeline) {
    const optimized = [];
    let pendingAddFields = null;

    for (const stage of pipeline) {
      const stageType = Object.keys(stage)[0];

      if (stageType === '$addFields') {
        if (pendingAddFields) {
          // Merge with previous $addFields
          pendingAddFields.$addFields = {
            ...pendingAddFields.$addFields,
            ...stage.$addFields
          };
        } else {
          pendingAddFields = { ...stage };
        }
      } else {
        // Flush pending $addFields
        if (pendingAddFields) {
          optimized.push(pendingAddFields);
          pendingAddFields = null;
        }
        optimized.push(stage);
      }
    }

    // Flush any remaining $addFields
    if (pendingAddFields) {
      optimized.push(pendingAddFields);
    }

    return optimized;
  }

  optimizeLookupStages(pipeline) {
    return pipeline.map(stage => {
      if (stage.$lookup && !stage.$lookup.pipeline) {
        // Convert simple lookup to pipeline-based lookup for better performance
        const { from, localField, foreignField, as } = stage.$lookup;

        return {
          $lookup: {
            from: from,
            let: { localValue: `$${localField}` },
            pipeline: [
              {
                $match: {
                  $expr: { $eq: [`$${foreignField}`, '$$localValue'] }
                }
              }
            ],
            as: as
          }
        };
      }
      return stage;
    });
  }

  estimatePerformanceImprovement(originalPipeline, optimizedPipeline) {
    const originalScore = this.scorePipeline(originalPipeline);
    const optimizedScore = this.scorePipeline(optimizedPipeline);

    const improvement = (optimizedScore - originalScore) / originalScore * 100;

    return {
      originalScore: originalScore,
      optimizedScore: optimizedScore,
      improvementPercentage: Math.round(improvement),
      category: improvement > 50 ? 'Significant' :
                improvement > 20 ? 'Moderate' :
                improvement > 5 ? 'Minor' : 'Negligible'
    };
  }

  scorePipeline(pipeline) {
    let score = 100;
    let documentSizeMultiplier = 1;

    for (let i = 0; i < pipeline.length; i++) {
      const stage = pipeline[i];
      const stageType = Object.keys(stage)[0];

      // Penalties for poor stage ordering
      switch (stageType) {
        case '$match':
          if (i > 2) score -= 20; // Should be early
          break;
        case '$limit':
          if (i > 3) score -= 15; // Should be early
          break;
        case '$project':
          if (i > 1) score -= 10; // Should be early
          break;
        case '$sort':
          if (i === pipeline.length - 1) score += 5; // Good at end
          break;
        case '$group':
          score -= this.estimateMemoryUsage(stage) / 10;
          break;
        case '$lookup':
          score -= 20; // Expensive operation
          if (!stage.$lookup.pipeline) score -= 10; // No pipeline optimization
          break;
      }

      // Track document size changes
      const selectivity = this.calculateSelectivity(stage);
      documentSizeMultiplier *= selectivity;

      // Penalty for processing large documents through expensive stages
      if (documentSizeMultiplier > 1.5 && ['$group', '$lookup'].includes(stageType)) {
        score -= 25;
      }
    }

    return Math.max(score, 10);
  }

  loadOptimizationRules() {
    return [
      {
        name: 'early_filtering',
        description: 'Move high-selectivity $match stages early in pipeline',
        priority: 10
      },
      {
        name: 'index_utilization',
        description: 'Ensure indexable stages can use appropriate indexes',
        priority: 9
      },
      {
        name: 'document_size_reduction',
        description: 'Use $project early to reduce document size',
        priority: 8
      },
      {
        name: 'memory_optimization',
        description: 'Minimize memory usage in aggregation stages',
        priority: 7
      },
      {
        name: 'lookup_optimization',
        description: 'Optimize $lookup operations with pipelines',
        priority: 6
      }
    ];
  }
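
  // NOTE: analyzePipelineStages() above calls hasAppropriateIndex() and
  // suggestIndexSpec(), which are not shown elsewhere in this article. The
  // minimal placeholder implementations below are assumptions added so the
  // class runs end to end; swap in real index introspection (for example via
  // collection.indexes()) before relying on the recommendations.
  hasAppropriateIndex(stage) {
    // Without live index metadata available here, conservatively assume no
    // suitable index exists so every indexable stage surfaces as an opportunity
    return false;
  }

  suggestIndexSpec(stage) {
    const stageType = Object.keys(stage)[0];
    if (stageType === '$match') {
      // Suggest an ascending compound index over the matched fields
      return Object.fromEntries(Object.keys(stage.$match).map(field => [field, 1]));
    }
    if (stageType === '$sort') {
      // Mirror the requested sort directions
      return { ...stage.$sort };
    }
    return null;
  }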

  async benchmarkPipelineVariations(collection, basePipeline, variations = []) {
    console.log('Benchmarking pipeline variations...');

    const results = [];
    const testDataSize = 1000;

    // Test base pipeline
    const baseResult = await this.benchmarkSinglePipeline(
      collection, 
      basePipeline, 
      'original', 
      testDataSize
    );
    results.push(baseResult);

    // Test optimized version
    const optimizationResult = await this.optimizePipelineOrder(basePipeline);
    const optimizedResult = await this.benchmarkSinglePipeline(
      collection,
      optimizationResult.optimizedPipeline,
      'optimized',
      testDataSize
    );
    results.push(optimizedResult);

    // Test custom variations
    for (let i = 0; i < variations.length; i++) {
      const variationResult = await this.benchmarkSinglePipeline(
        collection,
        variations[i].pipeline,
        variations[i].name || `variation_${i + 1}`,
        testDataSize
      );
      results.push(variationResult);
    }

    // Analyze results
    const analysis = this.analyzePerformanceResults(results);

    return {
      results: results,
      analysis: analysis,
      recommendation: this.generatePerformanceRecommendation(results, analysis)
    };
  }

  async benchmarkSinglePipeline(collection, pipeline, name, limit) {
    const iterations = 3;
    const times = [];

    for (let i = 0; i < iterations; i++) {
      const startTime = Date.now();

      try {
        const results = await collection.aggregate([
          ...pipeline,
          { $limit: limit }
        ]).toArray();

        const endTime = Date.now();
        times.push({
          executionTime: endTime - startTime,
          resultCount: results.length,
          success: true
        });

      } catch (error) {
        times.push({
          executionTime: null,
          resultCount: 0,
          success: false,
          error: error.message
        });
      }
    }

    const successfulRuns = times.filter(t => t.success);
    const avgTime = successfulRuns.length > 0 ? 
      successfulRuns.reduce((sum, t) => sum + t.executionTime, 0) / successfulRuns.length : null;

    return {
      name: name,
      pipeline: pipeline,
      iterations: iterations,
      successfulRuns: successfulRuns.length,
      averageTime: avgTime,
      minTime: successfulRuns.length > 0 ? Math.min(...successfulRuns.map(t => t.executionTime)) : null,
      maxTime: successfulRuns.length > 0 ? Math.max(...successfulRuns.map(t => t.executionTime)) : null,
      resultCount: successfulRuns.length > 0 ? successfulRuns[0].resultCount : 0,
      errors: times.filter(t => !t.success).map(t => t.error)
    };
  }

  analyzePerformanceResults(results) {
    const analysis = {
      bestPerforming: null,
      worstPerforming: null,
      performanceGains: [],
      consistencyAnalysis: []
    };

    // Find best and worst performing
    const validResults = results.filter(r => r.averageTime !== null);
    if (validResults.length > 0) {
      analysis.bestPerforming = validResults.reduce((best, current) => 
        current.averageTime < best.averageTime ? current : best
      );

      analysis.worstPerforming = validResults.reduce((worst, current) => 
        current.averageTime > worst.averageTime ? current : worst
      );
    }

    // Calculate performance gains
    const baseline = results.find(r => r.name === 'original');
    if (baseline && baseline.averageTime) {
      results.forEach(result => {
        if (result.name !== 'original' && result.averageTime) {
          const improvementPercent = ((baseline.averageTime - result.averageTime) / baseline.averageTime) * 100;
          analysis.performanceGains.push({
            name: result.name,
            improvementPercent: Math.round(improvementPercent),
            absoluteImprovement: baseline.averageTime - result.averageTime
          });
        }
      });
    }

    // Consistency analysis
    results.forEach(result => {
      if (result.minTime && result.maxTime && result.averageTime) {
        const variance = result.maxTime - result.minTime;
        const consistency = variance / result.averageTime;

        analysis.consistencyAnalysis.push({
          name: result.name,
          variance: variance,
          consistencyScore: consistency,
          rating: consistency < 0.1 ? 'Excellent' :
                  consistency < 0.3 ? 'Good' :
                  consistency < 0.5 ? 'Fair' : 'Poor'
        });
      }
    });

    return analysis;
  }

  generatePerformanceRecommendation(results, analysis) {
    const recommendations = [];

    if (analysis.bestPerforming) {
      recommendations.push(`Best performance achieved with: ${analysis.bestPerforming.name} (${analysis.bestPerforming.averageTime}ms average)`);
    }

    const significantGains = analysis.performanceGains.filter(g => g.improvementPercent > 20);
    if (significantGains.length > 0) {
      recommendations.push(`Significant performance improvements found: ${significantGains.map(g => `${g.name} (+${g.improvementPercent}%)`).join(', ')}`);
    }

    const poorConsistency = analysis.consistencyAnalysis.filter(c => c.rating === 'Poor');
    if (poorConsistency.length > 0) {
      recommendations.push(`Poor consistency detected in: ${poorConsistency.map(c => c.name).join(', ')} - consider allowDiskUse or different approach`);
    }

    if (recommendations.length === 0) {
      recommendations.push('All pipeline variations perform similarly - current implementation is adequate');
    }

    return recommendations;
  }
}
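
To show how the tuner above might be wired up, here is a hedged usage sketch; the pipeline contents, the 'orders' collection name, and the function name are illustrative assumptions rather than part of the original article:

// Run inside an async context with a connected MongoClient
async function tunePipelineExample(db) {
  const tuner = new AggregationPerformanceTuner(db);

  const pipeline = [
    { $lookup: { from: 'products', localField: 'items.productId', foreignField: '_id', as: 'product' } },
    { $match: { status: 'completed' } },
    { $sort: { createdAt: -1 } },
    { $limit: 500 }
  ];

  const { optimizedPipeline, estimatedImprovement } = await tuner.optimizePipelineOrder(pipeline);
  console.log('Estimated improvement:', estimatedImprovement);

  const benchmark = await tuner.benchmarkPipelineVariations(db.collection('orders'), pipeline);
  console.log(benchmark.recommendation.join('\n'));
  return optimizedPipeline;
}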

SQL-Style Aggregation Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB aggregation operations:

-- QueryLeaf aggregation operations with SQL-familiar syntax

-- Complex user analytics with optimized aggregation
WITH user_activity_stats AS (
  SELECT 
    u.user_id,
    u.email,
    u.subscription_tier,
    u.registration_date,

    -- Activity metrics using MongoDB aggregation expressions
    COUNT(ua.activity_id) as total_activities,
    COUNT(ua.activity_id) FILTER (WHERE ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_activities,
    COUNT(ua.activity_id) FILTER (WHERE ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days') as weekly_activities,

    -- Engagement scoring with MongoDB operators
    ARRAY_AGG(DISTINCT ua.activity_type) as activity_types,
    MAX(ua.created_at) as last_activity,
    AVG(ua.session_duration) as avg_session_duration,

    -- Complex engagement calculation
    (COUNT(ua.activity_id) FILTER (WHERE ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days') * 2) +
    (COUNT(ua.activity_id) FILTER (WHERE ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') * 1) as engagement_score

  FROM users u
  LEFT JOIN user_activities ua ON u.user_id = ua.user_id 
    AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year'
  WHERE u.status = 'active'
    AND u.registration_date >= CURRENT_TIMESTAMP - INTERVAL '1 year'
  GROUP BY u.user_id, u.email, u.subscription_tier, u.registration_date
),

order_analytics AS (
  SELECT 
    o.user_id,
    COUNT(*) as total_orders,
    SUM(o.total_amount) as lifetime_value,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.created_at) as last_order_date,
    COUNT(*) FILTER (WHERE o.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_orders,
    SUM(o.total_amount) FILTER (WHERE o.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_spend,

    -- Time-based patterns using MongoDB date operators
    MODE() WITHIN GROUP (ORDER BY EXTRACT(DOW FROM o.created_at)) as preferred_order_day,
    ARRAY_AGG(
      CASE 
        WHEN EXTRACT(MONTH FROM o.created_at) IN (12, 1, 2) THEN 'winter'
        WHEN EXTRACT(MONTH FROM o.created_at) IN (3, 4, 5) THEN 'spring'
        WHEN EXTRACT(MONTH FROM o.created_at) IN (6, 7, 8) THEN 'summer'
        ELSE 'fall'
      END
    ) as seasonal_patterns

  FROM orders o
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year'
  GROUP BY o.user_id
),

product_preferences AS (
  -- Optimized product affinity analysis
  SELECT 
    o.user_id,
    -- Use MongoDB aggregation for complex transformations
    JSON_AGG(
      JSON_BUILD_OBJECT(
        'category', p.category,
        'purchases', COUNT(*),
        'spend', SUM(oi.quantity * oi.unit_price),
        'avg_days_between', AVG(
          EXTRACT(EPOCH FROM (o.created_at - LAG(o.created_at) OVER (
            PARTITION BY o.user_id, p.category 
            ORDER BY o.created_at
          ))) / 86400
        )
      )
      ORDER BY COUNT(*) DESC
      LIMIT 5
    ) as top_categories

  FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p ON oi.product_id = p.product_id
  WHERE o.status = 'completed'
  GROUP BY o.user_id
),

final_user_analytics AS (
  SELECT 
    uas.user_id,
    uas.email,
    uas.subscription_tier,
    uas.registration_date,

    -- Activity metrics
    uas.total_activities,
    uas.recent_activities,
    uas.weekly_activities,
    uas.activity_types,
    uas.last_activity,
    ROUND(uas.avg_session_duration::numeric, 2) as avg_session_duration,
    uas.engagement_score,

    -- Order metrics
    COALESCE(oa.total_orders, 0) as total_orders,
    COALESCE(oa.lifetime_value, 0) as lifetime_value,
    COALESCE(oa.avg_order_value, 0) as avg_order_value,
    oa.last_order_date,
    COALESCE(oa.recent_orders, 0) as recent_orders,
    COALESCE(oa.recent_spend, 0) as recent_spend,

    -- Product preferences
    pp.top_categories,

    -- Calculated fields using MongoDB-style conditional logic
    CASE 
      WHEN uas.weekly_activities > 10 THEN 'high'
      WHEN uas.recent_activities > 5 THEN 'medium'
      ELSE 'low'
    END as engagement_level,

    -- Predictive LTV using MongoDB conditional expressions
    CASE
      WHEN COALESCE(oa.lifetime_value, 0) > 1000 AND COALESCE(oa.recent_orders, 0) > 2 
        THEN COALESCE(oa.lifetime_value, 0) * 1.2
      WHEN COALESCE(oa.lifetime_value, 0) > 500 AND uas.recent_activities > 10 
        THEN COALESCE(oa.lifetime_value, 0) * 1.1
      ELSE COALESCE(oa.lifetime_value, 0)
    END as predicted_ltv,

    -- Churn risk assessment
    CASE
      WHEN uas.recent_activities = 0 AND oa.last_order_date < CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'high'
      WHEN uas.recent_activities < 5 AND oa.last_order_date < CURRENT_TIMESTAMP - INTERVAL '45 days' THEN 'medium'
      ELSE 'low'
    END as churn_risk,

    -- User segmentation
    CASE
      WHEN COALESCE(oa.lifetime_value, 0) > 1000 AND uas.engagement_score > 50 THEN 'vip'
      WHEN COALESCE(oa.lifetime_value, 0) > 500 OR uas.engagement_score > 30 THEN 'loyal'
      WHEN COALESCE(oa.total_orders, 0) > 0 THEN 'customer'
      ELSE 'prospect'
    END as user_segment,

    -- Behavioral patterns
    oa.preferred_order_day,
    CASE 
      WHEN COALESCE(oa.total_orders, 0) > 1 THEN
        365.0 / GREATEST(
          EXTRACT(EPOCH FROM (oa.last_order_date - uas.registration_date)) / 86400.0,
          1
        )
      ELSE 0
    END as order_frequency,

    -- Performance indicators
    COALESCE(oa.lifetime_value, 0) >= 1000 as is_high_value,
    uas.recent_activities >= 5 as is_recently_active,
    CASE
      WHEN uas.recent_activities = 0 AND oa.last_order_date < CURRENT_TIMESTAMP - INTERVAL '90 days' THEN true
      ELSE false
    END as is_at_risk,

    -- Time since last activity/order
    CASE 
      WHEN uas.last_activity IS NOT NULL THEN
        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - uas.last_activity)) / 86400
      ELSE 999
    END as days_since_last_activity,

    CASE 
      WHEN oa.last_order_date IS NOT NULL THEN
        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - oa.last_order_date)) / 86400
      ELSE 999
    END as days_since_last_order

  FROM user_activity_stats uas
  LEFT JOIN order_analytics oa ON uas.user_id = oa.user_id
  LEFT JOIN product_preferences pp ON uas.user_id = pp.user_id
)

SELECT *
FROM final_user_analytics
ORDER BY predicted_ltv DESC, engagement_score DESC, last_activity DESC
LIMIT 1000;

-- Advanced product performance analytics
WITH product_sales_analysis AS (
  SELECT 
    p.product_id,
    p.name as product_name,
    p.category,
    p.price,
    p.cost,

    -- Volume metrics using MongoDB aggregation
    SUM(oi.quantity) as total_sold,
    COUNT(DISTINCT o.order_id) as total_orders,
    COUNT(DISTINCT o.user_id) as unique_customers,
    AVG(oi.quantity) as avg_quantity_per_order,

    -- Revenue and profit calculations
    SUM(oi.quantity * oi.unit_price) as total_revenue,
    SUM(oi.quantity * (oi.unit_price - p.cost)) as total_profit,
    AVG(oi.quantity * oi.unit_price) as avg_order_value,
    AVG((oi.unit_price - p.cost) / oi.unit_price * 100) as avg_profit_margin,
    SUM(oi.quantity * oi.unit_price) / COUNT(DISTINCT o.user_id) as revenue_per_customer,

    -- Time-based analysis using MongoDB date functions
    MIN(o.created_at) as first_sale,
    MAX(o.created_at) as last_sale,
    EXTRACT(EPOCH FROM (MAX(o.created_at) - MIN(o.created_at))) / 86400 as product_lifespan_days,
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(o.created_at))) / 86400 as days_since_last_sale,

    -- Monthly sales pattern analysis
    JSON_OBJECT_AGG(
      EXTRACT(MONTH FROM o.created_at),
      JSON_BUILD_OBJECT(
        'quantity', SUM(oi.quantity),
        'revenue', SUM(oi.quantity * oi.unit_price)
      )
    ) as monthly_sales,

    -- Day of week patterns
    JSON_OBJECT_AGG(
      EXTRACT(DOW FROM o.created_at),
      SUM(oi.quantity)
    ) as dow_sales_pattern,

    -- Seasonal analysis
    JSON_OBJECT_AGG(
      CASE 
        WHEN EXTRACT(MONTH FROM o.created_at) IN (12, 1, 2) THEN 'winter'
        WHEN EXTRACT(MONTH FROM o.created_at) IN (3, 4, 5) THEN 'spring'
        WHEN EXTRACT(MONTH FROM o.created_at) IN (6, 7, 8) THEN 'summer'
        ELSE 'fall'
      END,
      JSON_BUILD_OBJECT(
        'quantity', SUM(oi.quantity),
        'revenue', SUM(oi.quantity * oi.unit_price)
      )
    ) as seasonal_performance

  FROM products p
  JOIN order_items oi ON p.product_id = oi.product_id
  JOIN orders o ON oi.order_id = o.order_id
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year'
  GROUP BY p.product_id, p.name, p.category, p.price, p.cost
),

product_performance_classification AS (
  SELECT *,
    -- Performance scoring using MongoDB-style conditional logic
    CASE 
      WHEN total_revenue > 10000 AND avg_profit_margin > 20 AND unique_customers > 100 THEN 'star'
      WHEN total_revenue > 5000 AND avg_profit_margin > 10 THEN 'strong'
      WHEN total_revenue > 1000 AND total_sold > 10 THEN 'moderate'
      WHEN days_since_last_sale < 30 THEN 'active'
      ELSE 'underperforming'
    END as performance_category,

    -- Inventory status
    CASE 
      WHEN days_since_last_sale > 90 THEN 'stale'
      WHEN days_since_last_sale > 30 THEN 'slow_moving'
      WHEN days_since_last_sale < 7 THEN 'hot'
      ELSE 'normal'
    END as inventory_status,

    -- Demand predictability using MongoDB expressions
    -- Calculate coefficient of variation for monthly sales
    (
      SELECT STDDEV(monthly_quantity) / AVG(monthly_quantity)
      FROM (
        SELECT (monthly_sales->>month_num)::numeric as monthly_quantity
        FROM generate_series(1, 12) as month_num
      ) monthly_data
      WHERE monthly_quantity > 0
    ) as demand_consistency,

    -- Best performing periods
    (
      SELECT month_num
      FROM (
        SELECT 
          month_num,
          (monthly_sales->>month_num)::numeric as quantity
        FROM generate_series(1, 12) as month_num
      ) monthly_rank
      ORDER BY quantity DESC NULLS LAST
      LIMIT 1
    ) as best_month,

    -- Performance flags
    total_revenue >= 10000 as is_top_performer,
    performance_category = 'underperforming' as needs_attention,
    inventory_status IN ('stale', 'slow_moving') as is_inventory_risk,
    inventory_status = 'hot' as is_high_demand,
    demand_consistency < 0.5 as is_predictable_demand

  FROM product_sales_analysis
)

SELECT 
  product_id,
  product_name,
  category,
  price,
  cost,

  -- Volume metrics
  total_sold,
  total_orders,
  unique_customers,
  ROUND(avg_quantity_per_order::numeric, 2) as avg_quantity_per_order,

  -- Financial metrics
  ROUND(total_revenue::numeric, 2) as total_revenue,
  ROUND(total_profit::numeric, 2) as total_profit,
  ROUND(avg_order_value::numeric, 2) as avg_order_value,
  ROUND(avg_profit_margin::numeric, 1) as avg_profit_margin_pct,
  ROUND(revenue_per_customer::numeric, 2) as revenue_per_customer,

  -- Performance classification
  performance_category,
  inventory_status,
  ROUND(demand_consistency::numeric, 3) as demand_consistency,

  -- Time-based insights
  ROUND(days_since_last_sale::numeric, 0) as days_since_last_sale,
  ROUND(product_lifespan_days::numeric, 0) as product_lifespan_days,
  best_month,

  -- Business flags
  is_top_performer,
  needs_attention,
  is_inventory_risk,
  is_high_demand,
  is_predictable_demand,

  -- Additional insights
  monthly_sales,
  seasonal_performance

FROM product_performance_classification
ORDER BY total_revenue DESC, total_profit DESC, unique_customers DESC
LIMIT 500;

-- Real-time aggregation with windowed analytics
SELECT 
  user_id,
  activity_type,
  DATE_TRUNC('hour', created_at) as hour_bucket,

  -- Window functions with MongoDB-style aggregations
  COUNT(*) as activities_this_hour,
  SUM(session_duration) as total_session_time,
  AVG(session_duration) as avg_session_duration,

  -- Moving averages over time windows
  AVG(COUNT(*)) OVER (
    PARTITION BY user_id, activity_type 
    ORDER BY DATE_TRUNC('hour', created_at)
    ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
  ) as avg_activities_24h,

  -- Rank activities within user sessions
  DENSE_RANK() OVER (
    PARTITION BY user_id, DATE_TRUNC('day', created_at)
    ORDER BY COUNT(*) DESC
  ) as daily_activity_rank,

  -- Calculate cumulative metrics
  SUM(COUNT(*)) OVER (
    PARTITION BY user_id 
    ORDER BY DATE_TRUNC('hour', created_at)
  ) as cumulative_activities,

  -- Detect anomalies using MongoDB statistical functions
  COUNT(*) > (
    AVG(COUNT(*)) OVER (
      PARTITION BY user_id, activity_type
      ORDER BY DATE_TRUNC('hour', created_at)
      ROWS BETWEEN 167 PRECEDING AND 1 PRECEDING
    ) + 2 * STDDEV(COUNT(*)) OVER (
      PARTITION BY user_id, activity_type
      ORDER BY DATE_TRUNC('hour', created_at)
      ROWS BETWEEN 167 PRECEDING AND 1 PRECEDING
    )
  ) as is_anomaly,

  -- Performance indicators
  CASE
    WHEN COUNT(*) > 100 THEN 'high_activity'
    WHEN COUNT(*) > 50 THEN 'moderate_activity'
    ELSE 'low_activity'
  END as activity_level

FROM user_activities
WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
GROUP BY user_id, activity_type, DATE_TRUNC('hour', created_at)
ORDER BY user_id, hour_bucket DESC;

-- QueryLeaf provides comprehensive aggregation optimization:
-- 1. SQL-familiar aggregation syntax with MongoDB performance benefits
-- 2. Automatic pipeline optimization and stage reordering
-- 3. Intelligent index utilization for aggregation stages
-- 4. Memory-efficient processing for large dataset analytics
-- 5. Advanced window functions and statistical operations
-- 6. Real-time aggregation with streaming analytics capabilities
-- 7. Integration with MongoDB's native aggregation optimizations
-- 8. Familiar SQL patterns for complex analytical queries
-- 9. Automatic spill-to-disk handling for memory-intensive operations
-- 10. Performance monitoring and optimization recommendations

Best Practices for Aggregation Optimization

Pipeline Design Strategy

Essential principles for optimal aggregation performance, with a minimal sketch after the list:

  1. Early Filtering: Place $match stages as early as possible to reduce dataset size
  2. Index Utilization: Design indexes that support aggregation stages effectively
  3. Memory Management: Monitor memory usage and use allowDiskUse when necessary
  4. Stage Ordering: Follow optimization rules for stage placement and combination
  5. Document Size: Use $project early to reduce document size through the pipeline
  6. Parallelization: Design pipelines that can leverage MongoDB's parallel processing
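
As a quick illustration of these principles, the sketch below filters and projects before grouping and opts into disk use. The 'orders' collection, field names, and the `db` handle are illustrative assumptions:

// Assumes `db` is a connected Db handle; run inside an async function
const pipeline = [
  // Early filtering on indexed fields shrinks the working set first
  { $match: { status: 'completed', createdAt: { $gte: new Date('2024-01-01') } } },
  // Early projection keeps only the fields later stages need
  { $project: { userId: 1, totalAmount: 1 } },
  // The memory-heavy stage now runs against the reduced documents
  { $group: { _id: '$userId', spend: { $sum: '$totalAmount' } } },
  { $sort: { spend: -1 } },
  { $limit: 100 }
];

// allowDiskUse lets large groupings spill to disk instead of failing
const topSpenders = await db.collection('orders')
  .aggregate(pipeline, { allowDiskUse: true })
  .toArray();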

Performance and Scalability

Optimize aggregations for production workloads; an explain example follows this list:

  1. Pipeline Optimization: Use MongoDB's explain functionality to understand execution plans
  2. Resource Planning: Plan memory and CPU resources for aggregation processing
  3. Sharding Strategy: Design aggregations that work efficiently across sharded clusters
  4. Caching Strategy: Implement appropriate caching for frequently-run aggregations
  5. Monitoring Setup: Track aggregation performance and resource usage
  6. Testing Strategy: Benchmark different pipeline approaches with realistic data volumes
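
For the first item, a brief sketch of pulling an aggregation execution plan with the Node.js driver; the collection name, pipeline, and `db` handle are illustrative assumptions:

// Assumes `db` is a connected Db handle; run inside an async function
const plan = await db.collection('orders')
  .aggregate([
    { $match: { status: 'completed' } },
    { $group: { _id: '$userId', spend: { $sum: '$totalAmount' } } }
  ])
  .explain('executionStats');

// Aggregation explain output typically nests per-stage details under `stages`,
// or under queryPlanner/executionStats when the pipeline collapses to a single query
console.log(JSON.stringify(plan.stages || plan, null, 2));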

Conclusion

MongoDB's Aggregation Framework provides sophisticated data processing capabilities that eliminate the performance limitations and complexity of traditional SQL analytics approaches. The combination of pipeline-based processing, intelligent optimization, and native index utilization makes building high-performance analytics both practical and efficient.

Key Aggregation Framework benefits include:

  • Single-Pass Processing: Eliminates multiple query roundtrips for complex analytics
  • Intelligent Optimization: Automatic pipeline optimization and stage reordering
  • Native Index Integration: Comprehensive index utilization throughout pipeline stages
  • Memory-Efficient Processing: Streaming processing with automatic spill-to-disk capabilities
  • Parallel Execution: Built-in parallelization across distributed deployments
  • Rich Expression Language: Comprehensive transformation and analytical capabilities

Whether you're building business intelligence dashboards, real-time analytics platforms, data science workflows, or any application requiring sophisticated data processing, MongoDB's Aggregation Framework with QueryLeaf's familiar SQL interface provides the foundation for high-performance analytics solutions.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation operations while providing SQL-familiar analytics syntax, pipeline optimization, and performance monitoring. Advanced aggregation patterns, index optimization, and performance tuning are seamlessly handled through familiar SQL constructs, making sophisticated analytics both powerful and accessible to SQL-oriented development teams.

The integration of advanced aggregation capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both complex analytical processing and familiar database interaction patterns, ensuring your analytics solutions remain both performant and maintainable as they scale and evolve.

MongoDB Transactions and ACID Compliance: Building Reliable Distributed Systems with SQL-Style Transaction Management

Modern distributed applications require robust data consistency guarantees and transaction support to ensure business-critical operations maintain data integrity across complex workflows. Traditional NoSQL databases often sacrifice ACID properties for scalability, forcing developers to implement complex application-level consistency mechanisms that are error-prone and difficult to maintain.

MongoDB Multi-Document Transactions provide full ACID compliance across multiple documents and collections, enabling developers to build reliable distributed systems with the same consistency guarantees as traditional relational databases while maintaining MongoDB's horizontal scalability and flexible document model. Unlike eventual consistency models that require complex conflict resolution, MongoDB transactions ensure immediate consistency with familiar commit/rollback semantics.
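
Before diving into the comparison below, here is a minimal sketch of those commit/rollback semantics using the Node.js driver; the bank-transfer collections and field names are illustrative, and the full order-processing workflow appears later in this article.

// Minimal sketch of MongoDB transaction commit/rollback semantics.
// The 'bank' database, 'accounts' collection, and balance field are assumed for illustration.
const { MongoClient } = require('mongodb');

async function transferFunds(client, fromId, toId, amount) {
  const session = client.startSession();
  try {
    // withTransaction commits if the callback resolves, aborts (rolls back) if it throws,
    // and retries transient errors automatically.
    await session.withTransaction(async () => {
      const accounts = client.db('bank').collection('accounts');

      const debit = await accounts.updateOne(
        { _id: fromId, balance: { $gte: amount } },
        { $inc: { balance: -amount } },
        { session }
      );
      if (debit.modifiedCount === 0) {
        throw new Error('Insufficient funds'); // aborts the whole transaction
      }

      await accounts.updateOne(
        { _id: toId },
        { $inc: { balance: amount } },
        { session }
      );
    }, { readConcern: { level: 'local' }, writeConcern: { w: 'majority' } });
  } finally {
    await session.endSession();
  }
}
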

The Traditional Distributed Consistency Challenge

Conventional approaches to maintaining consistency in distributed systems have significant limitations for modern applications:

-- Traditional relational approach - limited scalability and flexibility

-- PostgreSQL distributed transaction with complex state management
BEGIN;

-- Order creation with inventory checks
WITH inventory_check AS (
  SELECT 
    product_id,
    available_quantity,
    reserved_quantity,
    CASE 
      WHEN available_quantity >= 5 THEN true 
      ELSE false 
    END as sufficient_inventory
  FROM inventory 
  WHERE product_id = 'prod_12345'
  FOR UPDATE
),
order_validation AS (
  SELECT 
    user_id,
    account_balance,
    credit_limit,
    account_status,
    CASE 
      WHEN account_status = 'active' AND (account_balance + credit_limit) >= 299.99 THEN true
      ELSE false
    END as payment_valid
  FROM user_accounts 
  WHERE user_id = 'user_67890'
  FOR UPDATE
)
INSERT INTO orders (
  order_id,
  user_id, 
  product_id,
  quantity,
  total_amount,
  order_status,
  created_at
)
SELECT 
  'order_' || nextval('order_seq'),
  'user_67890',
  'prod_12345', 
  5,
  299.99,
  CASE 
    WHEN ic.sufficient_inventory AND ov.payment_valid THEN 'confirmed'
    ELSE 'failed'
  END,
  CURRENT_TIMESTAMP
FROM inventory_check ic, order_validation ov;

-- Update inventory with complex validation
UPDATE inventory 
SET 
  available_quantity = available_quantity - 5,
  reserved_quantity = reserved_quantity + 5,
  updated_at = CURRENT_TIMESTAMP
WHERE product_id = 'prod_12345' 
  AND available_quantity >= 5;

-- Update user account balance
UPDATE user_accounts 
SET 
  account_balance = account_balance - 299.99,
  last_transaction = CURRENT_TIMESTAMP
WHERE user_id = 'user_67890' 
  AND account_status = 'active'
  AND (account_balance + credit_limit) >= 299.99;

-- Create order items with foreign key constraints
INSERT INTO order_items (
  order_id,
  product_id,
  quantity,
  unit_price,
  line_total
)
SELECT 
  o.order_id,
  'prod_12345',
  5,
  59.99,
  299.95
FROM orders o 
WHERE o.user_id = 'user_67890' 
  AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';

-- Create audit trail
INSERT INTO transaction_audit (
  transaction_id,
  transaction_type,
  user_id,
  order_id,
  amount,
  status,
  created_at
)
SELECT 
  txid_current(),
  'order_creation',
  'user_67890',
  o.order_id,
  299.99,
  o.order_status,
  CURRENT_TIMESTAMP
FROM orders o 
WHERE o.user_id = 'user_67890' 
  AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';

-- Complex validation before commit
DO $$
DECLARE
  order_count INTEGER;
  inventory_count INTEGER;
  balance_valid BOOLEAN;
BEGIN
  -- Verify order was created
  SELECT COUNT(*) INTO order_count
  FROM orders 
  WHERE user_id = 'user_67890' 
    AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';

  -- Verify inventory was updated
  SELECT COUNT(*) INTO inventory_count
  FROM inventory 
  WHERE product_id = 'prod_12345' 
    AND reserved_quantity >= 5;

  -- Verify account balance
  SELECT (account_balance >= 0) INTO balance_valid
  FROM user_accounts 
  WHERE user_id = 'user_67890';

  IF order_count = 0 OR inventory_count = 0 OR NOT balance_valid THEN
    RAISE EXCEPTION 'Transaction validation failed';
  END IF;
END
$$;

COMMIT;

-- Problems with traditional distributed transactions:
-- 1. Complex multi-table validation and rollback logic
-- 2. Poor performance with long-running transactions and locks
-- 3. Difficulty scaling across multiple database instances
-- 4. Limited flexibility with rigid relational schema constraints
-- 5. Complex error handling and partial failure scenarios
-- 6. Manual coordination of distributed transaction state
-- 7. Poor integration with modern microservices architectures
-- 8. Limited support for document-based data structures
-- 9. Complex deadlock detection and resolution
-- 10. High operational overhead for distributed consistency

-- MySQL distributed transactions (even more limitations)
START TRANSACTION;

-- Basic order processing with limited validation
INSERT INTO mysql_orders (
  user_id, 
  product_id,
  quantity,
  amount,
  status,
  created_at
) VALUES (
  'user_67890',
  'prod_12345', 
  5,
  299.99,
  'pending',
  NOW()
);

-- Update inventory without proper validation
UPDATE mysql_inventory 
SET quantity = quantity - 5 
WHERE product_id = 'prod_12345' 
  AND quantity >= 5;

-- Update account balance
UPDATE mysql_accounts 
SET balance = balance - 299.99
WHERE user_id = 'user_67890' 
  AND balance >= 299.99;

-- Check if all updates succeeded
SELECT 
  (SELECT COUNT(*) FROM mysql_orders WHERE user_id = 'user_67890' AND created_at >= DATE_SUB(NOW(), INTERVAL 1 MINUTE)) as order_created,
  (SELECT quantity FROM mysql_inventory WHERE product_id = 'prod_12345') as remaining_inventory,
  (SELECT balance FROM mysql_accounts WHERE user_id = 'user_67890') as remaining_balance;

COMMIT;

-- MySQL limitations:
-- - Limited JSON support for complex document structures  
-- - Basic transaction isolation levels
-- - Poor support for distributed transactions
-- - Limited cross-table validation capabilities
-- - Simple error handling and rollback mechanisms
-- - No native support for document relationships
-- - Minimal support for complex business logic in transactions

MongoDB Multi-Document Transactions provide comprehensive ACID compliance:

// MongoDB Multi-Document Transactions - full ACID compliance with document flexibility
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_platform');

// Advanced transaction processing with complex business logic
class TransactionManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      orders: db.collection('orders'),
      inventory: db.collection('inventory'),
      users: db.collection('users'),
      payments: db.collection('payments'),
      auditLog: db.collection('audit_log'),
      loyalty: db.collection('loyalty_program'),
      promotions: db.collection('promotions'),
      shipping: db.collection('shipping_addresses')
    };
    this.transactionOptions = {
      readPreference: 'primary',
      readConcern: { level: 'local' },
      writeConcern: { w: 'majority', j: true }
    };
  }

  async processComplexOrder(orderData) {
    const session = client.startSession();

    try {
      // Start multi-document transaction with full ACID properties
      const result = await session.withTransaction(async () => {

        // Step 1: Validate and reserve inventory
        const inventoryResult = await this.validateAndReserveInventory(
          orderData.items, session
        );

        if (!inventoryResult.success) {
          throw new Error(`Insufficient inventory: ${inventoryResult.message}`);
        }

        // Step 2: Validate user account and payment method
        const userValidation = await this.validateUserAccount(
          orderData.userId, orderData.totalAmount, session
        );

        if (!userValidation.success) {
          throw new Error(`Payment validation failed: ${userValidation.message}`);
        }

        // Step 3: Apply promotions and calculate final pricing
        const pricingResult = await this.calculateFinalPricing(
          orderData, userValidation.user, session
        );

        // Step 4: Create order with complete transaction context
        const order = await this.createOrder({
          ...orderData,
          ...pricingResult,
          inventoryReservations: inventoryResult.reservations,
          userId: orderData.userId
        }, session);

        // Step 5: Process payment transaction
        const paymentResult = await this.processPaymentTransaction(
          order, userValidation.user.paymentMethods, session
        );

        if (!paymentResult.success) {
          throw new Error(`Payment processing failed: ${paymentResult.message}`);
        }

        // Step 6: Update user loyalty points
        await this.updateLoyaltyProgram(
          orderData.userId, pricingResult.finalAmount, session
        );

        // Step 7: Create shipping record
        await this.createShippingRecord(order, session);

        // Step 8: Create comprehensive audit trail
        await this.createTransactionAuditTrail({
          orderId: order._id,
          userId: orderData.userId,
          amount: pricingResult.finalAmount,
          inventoryChanges: inventoryResult.changes,
          paymentId: paymentResult.paymentId,
          timestamp: new Date()
        }, session);

        return {
          success: true,
          orderId: order._id,
          paymentId: paymentResult.paymentId,
          finalAmount: pricingResult.finalAmount,
          loyaltyPointsEarned: pricingResult.loyaltyPoints
        };

      }, this.transactionOptions);

      console.log('Complex order transaction completed successfully:', result);
      return result;

    } catch (error) {
      console.error('Transaction failed, automatic rollback initiated:', error);
      throw error;
    } finally {
      await session.endSession();
    }
  }

  async validateAndReserveInventory(items, session) {
    console.log('Validating and reserving inventory for items:', items);

    const reservations = [];
    const changes = [];

    for (const item of items) {
      // Read current inventory state within transaction
      const inventoryDoc = await this.collections.inventory.findOne(
        { productId: item.productId },
        { session }
      );

      if (!inventoryDoc) {
        return {
          success: false,
          message: `Product not found: ${item.productId}`
        };
      }

      // Validate availability including existing reservations
      const availableQuantity = inventoryDoc.quantity - inventoryDoc.reservedQuantity;

      if (availableQuantity < item.quantity) {
        return {
          success: false,
          message: `Insufficient stock for ${item.productId}. Available: ${availableQuantity}, Requested: ${item.quantity}`
        };
      }

      // Reserve inventory within transaction
      const updateResult = await this.collections.inventory.updateOne(
        {
          productId: item.productId,
          quantity: { $gte: inventoryDoc.reservedQuantity + item.quantity }
        },
        {
          $inc: { reservedQuantity: item.quantity },
          $push: {
            reservationHistory: {
              reservationId: new ObjectId(),
              quantity: item.quantity,
              timestamp: new Date(),
              type: 'order_reservation'
            }
          },
          $set: { lastUpdated: new Date() }
        },
        { session }
      );

      if (updateResult.modifiedCount === 0) {
        return {
          success: false,
          message: `Failed to reserve inventory for ${item.productId}`
        };
      }

      reservations.push({
        productId: item.productId,
        quantityReserved: item.quantity,
        previousAvailable: availableQuantity
      });

      changes.push({
        productId: item.productId,
        action: 'reserved',
        quantity: item.quantity,
        newReservedQuantity: inventoryDoc.reservedQuantity + item.quantity
      });
    }

    return {
      success: true,
      reservations: reservations,
      changes: changes
    };
  }

  async validateUserAccount(userId, totalAmount, session) {
    console.log(`Validating user account: ${userId} for amount: ${totalAmount}`);

    // Fetch user data within transaction
    const user = await this.collections.users.findOne(
      { _id: userId },
      { session }
    );

    if (!user) {
      return {
        success: false,
        message: 'User account not found'
      };
    }

    // Validate account status
    if (user.accountStatus !== 'active') {
      return {
        success: false,
        message: `Account is ${user.accountStatus} - cannot process orders`
      };
    }

    // Validate payment methods
    if (!user.paymentMethods || user.paymentMethods.length === 0) {
      return {
        success: false,
        message: 'No valid payment methods on file'
      };
    }

    // Check credit limits and available balance
    const totalAvailableCredit = user.accountBalance + 
      user.paymentMethods.reduce((sum, pm) => sum + (pm.creditLimit || 0), 0);

    if (totalAvailableCredit < totalAmount) {
      return {
        success: false,
        message: `Insufficient funds. Available: ${totalAvailableCredit}, Required: ${totalAmount}`
      };
    }

    // Check for fraud indicators
    if (user.riskScore && user.riskScore > 0.8) {
      return {
        success: false,
        message: 'Transaction blocked due to high risk score'
      };
    }

    return {
      success: true,
      user: user,
      availableCredit: totalAvailableCredit
    };
  }

  async calculateFinalPricing(orderData, user, session) {
    console.log('Calculating final pricing with promotions and discounts');

    let totalAmount = orderData.subtotal;
    let discountAmount = 0;
    let loyaltyPoints = 0;
    const appliedPromotions = [];

    // Check for applicable promotions within transaction
    const activePromotions = await this.collections.promotions.find(
      {
        active: true,
        startDate: { $lte: new Date() },
        endDate: { $gte: new Date() },
        $or: [
          { applicableUsers: orderData.userId },
          { applicableUserTiers: user.loyaltyTier },
          { globalPromotion: true }
        ]
      },
      { session }
    ).toArray();

    // Apply best available promotion
    for (const promotion of activePromotions) {
      if (this.isPromotionApplicable(promotion, orderData, user)) {
        const promotionDiscount = this.calculatePromotionDiscount(promotion, totalAmount);

        if (promotionDiscount > discountAmount) {
          discountAmount = promotionDiscount;
          appliedPromotions.push({
            promotionId: promotion._id,
            promotionName: promotion.name,
            discountAmount: promotionDiscount,
            discountType: promotion.discountType
          });
        }
      }
    }

    // Calculate loyalty points earned
    const loyaltyMultiplier = user.loyaltyTier === 'gold' ? 1.5 : 
                            user.loyaltyTier === 'silver' ? 1.2 : 1.0;
    loyaltyPoints = Math.floor((totalAmount - discountAmount) * 0.01 * loyaltyMultiplier);

    // Calculate taxes and final amount
    const taxRate = orderData.shippingAddress?.taxRate || 0.08;
    const subtotalAfterDiscount = totalAmount - discountAmount;
    const taxAmount = subtotalAfterDiscount * taxRate;
    const finalAmount = subtotalAfterDiscount + taxAmount + (orderData.shippingCost || 0);

    return {
      originalAmount: totalAmount,
      discountAmount: discountAmount,
      taxAmount: taxAmount,
      shippingCost: orderData.shippingCost || 0,
      finalAmount: finalAmount,
      loyaltyPoints: loyaltyPoints,
      appliedPromotions: appliedPromotions
    };
  }

  async createOrder(orderData, session) {
    console.log('Creating order with full transaction context');

    const order = {
      _id: new ObjectId(),
      userId: orderData.userId,
      orderNumber: `ORD-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`,
      status: 'confirmed',

      // Order items with detailed information
      items: orderData.items.map(item => ({
        productId: item.productId,
        productName: item.productName,
        quantity: item.quantity,
        unitPrice: item.unitPrice,
        lineTotal: item.quantity * item.unitPrice,

        // Product snapshot for historical accuracy
        productSnapshot: {
          name: item.productName,
          description: item.description,
          category: item.category,
          sku: item.sku
        }
      })),

      // Pricing breakdown
      pricing: {
        subtotal: orderData.originalAmount,
        discountAmount: orderData.discountAmount,
        taxAmount: orderData.taxAmount,
        shippingCost: orderData.shippingCost,
        finalAmount: orderData.finalAmount
      },

      // Applied promotions
      promotions: orderData.appliedPromotions || [],

      // Customer information
      customer: {
        userId: orderData.userId,
        email: orderData.customerEmail,
        loyaltyTier: orderData.customerLoyaltyTier
      },

      // Shipping information
      shipping: {
        address: orderData.shippingAddress,
        method: orderData.shippingMethod,
        estimatedDelivery: orderData.estimatedDelivery,
        cost: orderData.shippingCost
      },

      // Order lifecycle
      lifecycle: {
        createdAt: new Date(),
        confirmedAt: new Date(),
        estimatedFulfillmentDate: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24 hours
        status: 'confirmed'
      },

      // Transaction metadata
      transaction: {
        sessionId: session.id,
        ipAddress: orderData.ipAddress,
        userAgent: orderData.userAgent,
        referrer: orderData.referrer
      },

      // Inventory reservations
      inventoryReservations: orderData.inventoryReservations
    };

    const insertResult = await this.collections.orders.insertOne(order, { session });

    if (!insertResult.acknowledged) {
      throw new Error('Failed to create order');
    }

    return order;
  }

  async processPaymentTransaction(order, paymentMethods, session) {
    console.log(`Processing payment for order: ${order._id}`);

    // Select best payment method
    const primaryPaymentMethod = paymentMethods.find(pm => pm.primary) || paymentMethods[0];

    if (!primaryPaymentMethod) {
      return {
        success: false,
        message: 'No valid payment method available'
      };
    }

    // Create payment record within transaction
    const payment = {
      _id: new ObjectId(),
      orderId: order._id,
      userId: order.userId,
      amount: order.pricing.finalAmount,
      currency: 'USD',

      paymentMethod: {
        type: primaryPaymentMethod.type,
        maskedNumber: primaryPaymentMethod.maskedNumber,
        provider: primaryPaymentMethod.provider
      },

      status: 'completed', // Simulated successful payment

      transactionDetails: {
        authorizationCode: `AUTH_${Date.now()}`,
        transactionId: `TXN_${Math.random().toString(36).substr(2, 16)}`,
        processedAt: new Date(),
        processingFee: order.pricing.finalAmount * 0.029, // 2.9% processing fee

        // Risk assessment
        riskScore: Math.random() * 0.3, // Simulated low risk
        fraudChecks: {
          addressVerification: 'pass',
          cvvVerification: 'pass',
          velocityCheck: 'pass'
        }
      },

      // Gateway information
      gateway: {
        provider: 'stripe',
        gatewayTransactionId: `pi_${Math.random().toString(36).substr(2, 24)}`,
        gatewayFee: order.pricing.finalAmount * 0.029 + 0.30
      },

      createdAt: new Date(),
      updatedAt: new Date()
    };

    const paymentResult = await this.collections.payments.insertOne(payment, { session });

    if (!paymentResult.acknowledged) {
      return {
        success: false,
        message: 'Payment processing failed'
      };
    }

    // Update user account balance if using account credit
    if (primaryPaymentMethod.type === 'account_balance') {
      await this.collections.users.updateOne(
        { _id: order.userId },
        {
          $inc: { accountBalance: -order.pricing.finalAmount },
          $push: {
            transactionHistory: {
              type: 'debit',
              amount: order.pricing.finalAmount,
              description: `Order payment: ${order.orderNumber}`,
              timestamp: new Date()
            }
          }
        },
        { session }
      );
    }

    return {
      success: true,
      paymentId: payment._id,
      transactionId: payment.transactionDetails.transactionId,
      amount: payment.amount
    };
  }

  async updateLoyaltyProgram(userId, orderAmount, session) {
    console.log(`Updating loyalty program for user: ${userId}`);

    // Award loyalty points (1 point per currency unit of the order amount)
    const pointsEarned = Math.floor(orderAmount);

    // Update loyalty program within transaction
    const loyaltyUpdate = await this.collections.loyalty.updateOne(
      { userId: userId },
      {
        $inc: { 
          totalPoints: pointsEarned,
          lifetimePoints: pointsEarned,
          totalSpend: orderAmount
        },
        $push: {
          pointsHistory: {
            type: 'earned',
            points: pointsEarned,
            description: 'Order purchase',
            timestamp: new Date()
          }
        },
        $set: { lastUpdated: new Date() }
      },
      { upsert: true, session }
    );

    // Check for tier upgrades
    const loyaltyAccount = await this.collections.loyalty.findOne(
      { userId: userId },
      { session }
    );

    if (loyaltyAccount) {
      const newTier = this.calculateLoyaltyTier(loyaltyAccount.totalSpend, loyaltyAccount.totalPoints);

      if (newTier !== loyaltyAccount.currentTier) {
        await this.collections.loyalty.updateOne(
          { userId: userId },
          {
            $set: { 
              currentTier: newTier,
              tierUpgradedAt: new Date()
            },
            $push: {
              tierHistory: {
                previousTier: loyaltyAccount.currentTier,
                newTier: newTier,
                upgradedAt: new Date()
              }
            }
          },
          { session }
        );

        // Update user's tier in main user document
        await this.collections.users.updateOne(
          { _id: userId },
          { $set: { loyaltyTier: newTier } },
          { session }
        );
      }
    }

    return {
      pointsEarned: pointsEarned,
      // Recompute the tier from the post-update totals so callers see any upgrade
      newTier: loyaltyAccount
        ? this.calculateLoyaltyTier(loyaltyAccount.totalSpend, loyaltyAccount.totalPoints)
        : null
    };
  }

  async createShippingRecord(order, session) {
    console.log(`Creating shipping record for order: ${order._id}`);

    const shippingRecord = {
      _id: new ObjectId(),
      orderId: order._id,
      userId: order.userId,

      shippingAddress: order.shipping.address,
      shippingMethod: order.shipping.method,

      status: 'pending',
      trackingNumber: null, // Will be assigned when shipped

      estimatedDelivery: order.shipping.estimatedDelivery,
      actualDelivery: null,

      carrier: this.selectShippingCarrier(order.shipping.method),

      shippingCost: order.shipping.cost,

      items: order.items.map(item => ({
        productId: item.productId,
        quantity: item.quantity,
        weight: item.estimatedWeight || 1, // Default weight
        dimensions: item.dimensions
      })),

      lifecycle: {
        createdAt: new Date(),
        status: 'pending',
        statusHistory: [{
          status: 'pending',
          timestamp: new Date(),
          note: 'Shipping record created'
        }]
      }
    };

    await this.collections.shipping.insertOne(shippingRecord, { session });
    return shippingRecord;
  }

  async createTransactionAuditTrail(auditData, session) {
    console.log('Creating comprehensive audit trail');

    const auditEntry = {
      _id: new ObjectId(),

      // Transaction identification
      transactionId: auditData.sessionId || new ObjectId(),
      transactionType: 'order_creation',

      // Entity information
      orderId: auditData.orderId,
      userId: auditData.userId,
      paymentId: auditData.paymentId,

      // Transaction details
      amount: auditData.amount,
      currency: 'USD',

      // Changes made
      changes: {
        orderCreated: {
          orderId: auditData.orderId,
          status: 'confirmed',
          timestamp: auditData.timestamp
        },
        inventoryChanges: auditData.inventoryChanges,
        paymentProcessed: {
          paymentId: auditData.paymentId,
          amount: auditData.amount,
          status: 'completed'
        },
        loyaltyUpdated: true
      },

      // Compliance and security
      compliance: {
        dataRetentionPeriod: 7 * 365 * 24 * 60 * 60 * 1000, // 7 years
        encryptionRequired: true,
        auditLevel: 'full'
      },

      // System metadata
      system: {
        applicationVersion: process.env.APP_VERSION || '1.0.0',
        nodeId: process.env.NODE_ID || 'node-1',
        environment: process.env.NODE_ENV || 'development'
      },

      timestamp: auditData.timestamp,
      createdAt: new Date()
    };

    await this.collections.auditLog.insertOne(auditEntry, { session });
    return auditEntry;
  }

  // Helper methods
  isPromotionApplicable(promotion, orderData, user) {
    // Implement promotion applicability logic
    if (promotion.minOrderAmount && orderData.subtotal < promotion.minOrderAmount) {
      return false;
    }

    if (promotion.applicableUserTiers && !promotion.applicableUserTiers.includes(user.loyaltyTier)) {
      return false;
    }

    if (promotion.maxUsesPerUser) {
      // Check usage count (would need to query promotion usage history)
      return true; // Simplified for example
    }

    return true;
  }

  calculatePromotionDiscount(promotion, orderAmount) {
    switch (promotion.discountType) {
      case 'percentage':
        return orderAmount * (promotion.discountValue / 100);
      case 'fixed_amount':
        return Math.min(promotion.discountValue, orderAmount);
      default:
        return 0;
    }
  }

  calculateLoyaltyTier(totalSpend, totalPoints) {
    if (totalSpend >= 10000) return 'platinum';
    if (totalSpend >= 5000) return 'gold';
    if (totalSpend >= 1000) return 'silver';
    return 'bronze';
  }

  selectShippingCarrier(shippingMethod) {
    const carrierMap = {
      'standard': 'USPS',
      'expedited': 'FedEx',
      'overnight': 'UPS',
      'two_day': 'FedEx'
    };
    return carrierMap[shippingMethod] || 'USPS';
  }

  // Advanced transaction patterns
  async processBulkTransactions(transactions) {
    console.log(`Processing ${transactions.length} bulk transactions`);

    const session = client.startSession();
    const results = [];

    try {
      // Caution: processComplexOrder starts its own session, so each order commits in its
      // own transaction; this outer withTransaction does not make the whole batch atomic.
      await session.withTransaction(async () => {
        for (const transactionData of transactions) {
          try {
            const result = await this.processComplexOrder(transactionData);
            results.push({
              success: true,
              orderId: result.orderId,
              data: result
            });
          } catch (error) {
            results.push({
              success: false,
              error: error.message,
              transactionData: transactionData
            });

            // Decide whether to continue or abort entire batch
            if (error.critical) {
              throw error; // Abort entire batch
            }
          }
        }
      });

    } catch (error) {
      console.error('Bulk transaction failed:', error);
      throw error;
    } finally {
      await session.endSession();
    }

    return results;
  }

  async processCompensatingTransaction(originalOrderId, compensationType) {
    console.log(`Processing compensating transaction for order: ${originalOrderId}`);

    const session = client.startSession();

    try {
      return await session.withTransaction(async () => {

        // Fetch original order
        const originalOrder = await this.collections.orders.findOne(
          { _id: originalOrderId },
          { session }
        );

        if (!originalOrder) {
          throw new Error('Original order not found');
        }

        switch (compensationType) {
          case 'full_refund':
            return await this.processFullRefund(originalOrder, session);
          case 'partial_refund':
            return await this.processPartialRefund(originalOrder, session);
          case 'order_cancellation':
            return await this.processOrderCancellation(originalOrder, session);
          default:
            throw new Error(`Unknown compensation type: ${compensationType}`);
        }
      });

    } finally {
      await session.endSession();
    }
  }

  async processFullRefund(originalOrder, session) {
    console.log(`Processing full refund for order: ${originalOrder._id}`);

    // Release inventory reservations
    for (const item of originalOrder.items) {
      await this.collections.inventory.updateOne(
        { productId: item.productId },
        {
          $inc: { reservedQuantity: -item.quantity },
          $push: {
            reservationHistory: {
              reservationId: new ObjectId(),
              quantity: -item.quantity,
              timestamp: new Date(),
              type: 'refund_release'
            }
          }
        },
        { session }
      );
    }

    // Process refund payment
    const refundPayment = {
      _id: new ObjectId(),
      originalOrderId: originalOrder._id,
      originalPaymentId: originalOrder.paymentId,
      userId: originalOrder.userId,
      amount: originalOrder.pricing.finalAmount,
      currency: 'USD',
      type: 'refund',
      status: 'completed',
      processedAt: new Date(),
      createdAt: new Date()
    };

    await this.collections.payments.insertOne(refundPayment, { session });

    // Update order status
    await this.collections.orders.updateOne(
      { _id: originalOrder._id },
      {
        $set: {
          status: 'refunded',
          'lifecycle.refundedAt': new Date(),
          'lifecycle.status': 'refunded'
        },
        $push: {
          'lifecycle.statusHistory': {
            status: 'refunded',
            timestamp: new Date(),
            note: 'Full refund processed'
          }
        }
      },
      { session }
    );

    // Update user account balance
    await this.collections.users.updateOne(
      { _id: originalOrder.userId },
      {
        $inc: { accountBalance: originalOrder.pricing.finalAmount },
        $push: {
          transactionHistory: {
            type: 'credit',
            amount: originalOrder.pricing.finalAmount,
            description: `Refund for order: ${originalOrder.orderNumber}`,
            timestamp: new Date()
          }
        }
      },
      { session }
    );

    // Create audit trail
    await this.createTransactionAuditTrail({
      orderId: originalOrder._id,
      userId: originalOrder.userId,
      amount: originalOrder.pricing.finalAmount,
      type: 'full_refund',
      timestamp: new Date()
    }, session);

    return {
      success: true,
      refundId: refundPayment._id,
      amount: originalOrder.pricing.finalAmount
    };
  }
}

// Benefits of MongoDB Multi-Document Transactions:
// - Full ACID compliance across multiple documents and collections
// - Automatic rollback on failure with consistent data state
// - Session-based transaction isolation with configurable read/write concerns
// - Support for complex business logic within transaction boundaries
// - Seamless integration with MongoDB's document model and flexible schemas
// - Distributed transaction support across replica sets and sharded clusters  
// - Rich error handling and transaction state management
// - Integration with MongoDB's change streams for real-time transaction monitoring
// - Optimistic concurrency control with automatic retry mechanisms
// - Native support for document relationships and embedded data structures

module.exports = {
  TransactionManager
};
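
The benefits list above mentions change streams for real-time transaction monitoring. A minimal sketch of that idea follows; the orders collection and the handler are illustrative, and change streams require a replica set or sharded cluster.

// Minimal sketch: observe committed order writes in real time with a change stream.
// Writes made inside a transaction only become visible after commit, so the stream
// reflects the transaction's all-or-nothing visibility.
async function watchCommittedOrders(db) {
  const changeStream = db.collection('orders').watch(
    [{ $match: { operationType: { $in: ['insert', 'update'] } } }],
    { fullDocument: 'updateLookup' }
  );

  changeStream.on('change', (event) => {
    console.log('Committed change on orders:', event.operationType, event.documentKey);
  });

  return changeStream; // caller is responsible for calling close() when done
}
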

Understanding MongoDB Transaction Architecture

Advanced Transaction Patterns and Isolation Levels

Implement sophisticated transaction patterns for different business scenarios:

// Advanced transaction patterns and isolation management
class AdvancedTransactionPatterns {
  constructor(db) {
    this.db = db;
    this.isolationLevels = {
      readUncommitted: { level: 'available' },
      readCommitted: { level: 'local' },
      repeatableRead: { level: 'majority' },
      // Note: 'linearizable' read concern is not supported inside multi-document
      // transactions; 'snapshot' is the strongest read concern a transaction can use.
      serializable: { level: 'linearizable' }
    };
  }

  async demonstrateIsolationLevels() {
    console.log('Demonstrating MongoDB transaction isolation levels...');

    // Read Committed isolation (MongoDB default)
    const readCommittedSession = client.startSession();
    try {
      await readCommittedSession.withTransaction(async () => {

        // Reads only committed data
        const userData = await this.db.collection('users').findOne(
          { _id: 'user123' },
          { 
            session: readCommittedSession,
            readConcern: this.isolationLevels.readCommitted
          }
        );

        // Updates are isolated from other transactions
        await this.db.collection('users').updateOne(
          { _id: 'user123' },
          { $set: { lastActivity: new Date() } },
          { session: readCommittedSession }
        );

      }, {
        readConcern: this.isolationLevels.readCommitted,
        writeConcern: { w: 'majority', j: true }
      });
    } finally {
      await readCommittedSession.endSession();
    }

    // Snapshot isolation for consistent reads
    const snapshotSession = client.startSession();
    try {
      await snapshotSession.withTransaction(async () => {

        // All reads within transaction see consistent snapshot
        const orders = await this.db.collection('orders').find(
          { userId: 'user123' },
          { session: snapshotSession }
        ).toArray();

        const inventory = await this.db.collection('inventory').find(
          { productId: { $in: orders.map(o => o.productId) } },
          { session: snapshotSession }
        ).toArray();

        // Both reads see data from same point in time
        console.log(`Found ${orders.length} orders and ${inventory.length} inventory items`);

      }, {
        readConcern: { level: 'snapshot' },
        writeConcern: { w: 'majority', j: true }
      });
    } finally {
      await snapshotSession.endSession();
    }
  }

  async implementSagaPattern(sagaSteps) {
    // Saga pattern for distributed transaction coordination
    console.log('Implementing Saga pattern for distributed transactions...');

    const sagaId = new ObjectId();
    const saga = {
      _id: sagaId,
      status: 'started',
      steps: sagaSteps,
      currentStep: 0,
      compensations: [],
      createdAt: new Date()
    };

    // Create saga record
    await this.db.collection('sagas').insertOne(saga);

    try {
      for (let i = 0; i < sagaSteps.length; i++) {
        const step = sagaSteps[i];
        saga.currentStep = i; // keep the in-memory saga in sync so the catch block compensates from the failed step

        console.log(`Executing saga step ${i + 1}/${sagaSteps.length}: ${step.name}`);

        const session = client.startSession();
        try {
          await session.withTransaction(async () => {

            // Execute step within transaction
            const stepResult = await this.executeSagaStep(step, session);

            // Update saga progress
            await this.db.collection('sagas').updateOne(
              { _id: sagaId },
              {
                $set: {
                  currentStep: i + 1,
                  status: i === sagaSteps.length - 1 ? 'completed' : 'in_progress',
                  lastUpdated: new Date()
                },
                $push: {
                  stepResults: {
                    stepIndex: i,
                    stepName: step.name,
                    result: stepResult,
                    completedAt: new Date()
                  }
                }
              },
              { session }
            );

          });
        } finally {
          await session.endSession();
        }
      }

      console.log(`Saga ${sagaId} completed successfully`);
      return { success: true, sagaId };

    } catch (error) {
      console.error(`Saga ${sagaId} failed at step ${saga.currentStep}:`, error);

      // Execute compensating transactions
      await this.compensateSaga(sagaId, saga.currentStep);

      throw error;
    }
  }

  async compensateSaga(sagaId, failedStepIndex) {
    console.log(`Compensating saga ${sagaId} from step ${failedStepIndex}`);

    const saga = await this.db.collection('sagas').findOne({ _id: sagaId });

    // Execute compensations in reverse order
    for (let i = failedStepIndex - 1; i >= 0; i--) {
      const step = saga.steps[i];

      if (step.compensation) {
        console.log(`Executing compensation for step ${i + 1}: ${step.compensation.name}`);

        const session = client.startSession();
        try {
          await session.withTransaction(async () => {
            await this.executeCompensation(step.compensation, session);

            await this.db.collection('sagas').updateOne(
              { _id: sagaId },
              {
                $push: {
                  compensationsExecuted: {
                    stepIndex: i,
                    compensationName: step.compensation.name,
                    executedAt: new Date()
                  }
                }
              },
              { session }
            );
          });
        } finally {
          await session.endSession();
        }
      }
    }

    // Mark saga as compensated
    await this.db.collection('sagas').updateOne(
      { _id: sagaId },
      {
        $set: {
          status: 'compensated',
          compensatedAt: new Date()
        }
      }
    );
  }

  async implementOptimisticLocking() {
    // Optimistic locking pattern for concurrent updates
    console.log('Implementing optimistic locking pattern...');

    const session = client.startSession();
    const maxRetries = 3;
    let retryCount = 0;

    while (retryCount < maxRetries) {
      try {
        await session.withTransaction(async () => {

          // Read document with current version
          const document = await this.db.collection('accounts').findOne(
            { _id: 'account123' },
            { session }
          );

          if (!document) {
            throw new Error('Account not found');
          }

          // Simulate business logic processing time
          await new Promise(resolve => setTimeout(resolve, 100));

          // Update with version check
          const updateResult = await this.db.collection('accounts').updateOne(
            { 
              _id: 'account123',
              version: document.version  // Optimistic lock check
            },
            {
              $set: { 
                balance: document.balance - 100,
                lastUpdated: new Date()
              },
              $inc: { version: 1 }  // Increment version
            },
            { session }
          );

          if (updateResult.modifiedCount === 0) {
            throw new Error('Optimistic lock conflict - document was modified by another transaction');
          }

          console.log('Optimistic lock update successful');

        });

        break; // Success - exit retry loop

      } catch (error) {
        retryCount++;

        if (error.message.includes('Optimistic lock conflict') && retryCount < maxRetries) {
          console.log(`Optimistic lock conflict, retrying (${retryCount}/${maxRetries})...`);

          // Exponential backoff before retry
          await new Promise(resolve => setTimeout(resolve, Math.pow(2, retryCount) * 100));

        } else {
          console.error('Optimistic locking failed:', error);
          throw error;
        }
      }
    }

    await session.endSession();
  }

  async implementDistributedLocking() {
    // Distributed locking for coordinating access across instances
    console.log('Implementing distributed locking pattern...');

    const lockId = 'global-lock-' + new ObjectId();
    const lockTimeout = 30000; // 30 seconds
    const acquireTimeout = 5000; // 5 seconds to acquire

    const session = client.startSession();

    try {
      // Attempt to acquire distributed lock
      const lockAcquired = await this.acquireDistributedLock(
        lockId, lockTimeout, acquireTimeout, session
      );

      if (!lockAcquired) {
        throw new Error('Failed to acquire distributed lock');
      }

      console.log(`Distributed lock acquired: ${lockId}`);

      // Perform critical section operations within transaction
      await session.withTransaction(async () => {

        // Critical operations that require distributed coordination
        await this.performCriticalOperations(session);

        // Refresh lock if needed for long operations
        await this.refreshDistributedLock(lockId, lockTimeout, session);

      });

    } finally {
      // Always release the lock
      await this.releaseDistributedLock(lockId, session);
      await session.endSession();
    }
  }

  async acquireDistributedLock(lockId, timeout, acquireTimeout, session) {
    const expiration = new Date(Date.now() + timeout);
    const acquireDeadline = Date.now() + acquireTimeout;

    while (Date.now() < acquireDeadline) {
      try {
        const result = await this.db.collection('distributed_locks').insertOne(
          {
            _id: lockId,
            owner: process.env.NODE_ID || 'unknown',
            acquiredAt: new Date(),
            expiresAt: expiration
          },
          { session }
        );

        if (result.acknowledged) {
          return true; // Lock acquired
        }

      } catch (error) {
        if (error.code === 11000) { // Duplicate key error - lock exists

          // Check if lock is expired and can be claimed
          const existingLock = await this.db.collection('distributed_locks').findOne(
            { _id: lockId },
            { session }
          );

          if (existingLock && existingLock.expiresAt < new Date()) {
            // Lock is expired, try to claim it
            const claimResult = await this.db.collection('distributed_locks').replaceOne(
              { 
                _id: lockId, 
                expiresAt: existingLock.expiresAt 
              },
              {
                _id: lockId,
                owner: process.env.NODE_ID || 'unknown',
                acquiredAt: new Date(),
                expiresAt: expiration
              },
              { session }
            );

            if (claimResult.modifiedCount > 0) {
              return true; // Successfully claimed expired lock
            }
          }

          // Lock is held by someone else, wait and retry
          await new Promise(resolve => setTimeout(resolve, 50));

        } else {
          throw error;
        }
      }
    }

    return false; // Failed to acquire lock within timeout
  }

  async releaseDistributedLock(lockId, session) {
    await this.db.collection('distributed_locks').deleteOne(
      { 
        _id: lockId,
        owner: process.env.NODE_ID || 'unknown'
      },
      { session }
    );

    console.log(`Distributed lock released: ${lockId}`);
  }

  async implementTransactionRetryLogic() {
    // Advanced retry logic for transaction conflicts
    console.log('Implementing advanced transaction retry logic...');

    const retryConfig = {
      maxRetries: 5,
      initialDelay: 100,
      maxDelay: 2000,
      backoffMultiplier: 2,
      jitterRange: 0.1
    };

    let attempt = 0;

    while (attempt < retryConfig.maxRetries) {
      const session = client.startSession();

      try {
        const result = await session.withTransaction(async () => {

          // Simulate transaction work that might conflict
          const account = await this.db.collection('accounts').findOne(
            { _id: 'account123' },
            { session }
          );

          if (!account) {
            throw new Error('Account not found');
          }

          // Business logic that might conflict with other transactions
          const newBalance = account.balance - 50;

          if (newBalance < 0) {
            throw new Error('Insufficient funds');
          }

          await this.db.collection('accounts').updateOne(
            { _id: 'account123' },
            { 
              $set: { 
                balance: newBalance,
                lastUpdated: new Date() 
              }
            },
            { session }
          );

          return { success: true, newBalance };

        }, {
          readConcern: { level: 'majority' },
          writeConcern: { w: 'majority', j: true },
          maxCommitTimeMS: 30000
        });

        console.log('Transaction succeeded:', result);
        return result;

      } catch (error) {
        attempt++;

        // Check if error is retryable
        const isRetryable = this.isTransactionRetryable(error);

        if (isRetryable && attempt < retryConfig.maxRetries) {
          // Calculate retry delay with exponential backoff and jitter
          const baseDelay = Math.min(
            retryConfig.initialDelay * Math.pow(retryConfig.backoffMultiplier, attempt - 1),
            retryConfig.maxDelay
          );

          const jitter = baseDelay * retryConfig.jitterRange * (Math.random() - 0.5);
          const delay = baseDelay + jitter;

          console.log(`Transaction failed (attempt ${attempt}), retrying in ${delay}ms:`, error.message);

          await new Promise(resolve => setTimeout(resolve, delay));

        } else {
          console.error('Transaction failed after all retries:', error);
          throw error;
        }
      } finally {
        await session.endSession();
      }
    }
  }

  isTransactionRetryable(error) {
    // Determine if transaction error is retryable
    const retryableErrors = [
      'WriteConflict',
      'TransientTransactionError',
      'UnknownTransactionCommitResult',
      'LockTimeout',
      'TemporarilyUnavailable'
    ];

    return retryableErrors.some(retryableError =>
      error.message.includes(retryableError) ||
      error.code === 112 || // WriteConflict
      error.code === 50 ||  // ExceededTimeLimit
      (typeof error.hasErrorLabel === 'function' &&
        (error.hasErrorLabel('TransientTransactionError') ||
         error.hasErrorLabel('UnknownTransactionCommitResult')))
    );
  }

  async performTransactionPerformanceTesting() {
    console.log('Performing transaction performance testing...');

    const testConfig = {
      concurrentTransactions: 10,
      transactionsPerThread: 100,
      documentCount: 1000
    };

    // Setup test data
    await this.setupPerformanceTestData(testConfig.documentCount);

    const startTime = Date.now();
    const promises = [];

    // Launch concurrent transaction threads
    for (let i = 0; i < testConfig.concurrentTransactions; i++) {
      const promise = this.runTransactionThread(i, testConfig.transactionsPerThread);
      promises.push(promise);
    }

    // Wait for all threads to complete
    const results = await Promise.allSettled(promises);
    const endTime = Date.now();

    // Analyze results
    const successful = results.filter(r => r.status === 'fulfilled').length;
    const failed = results.filter(r => r.status === 'rejected').length;
    const totalTransactions = testConfig.concurrentTransactions * testConfig.transactionsPerThread;
    const throughput = totalTransactions / ((endTime - startTime) / 1000);

    console.log('Transaction Performance Results:');
    console.log(`- Total transactions: ${totalTransactions}`);
    console.log(`- Successful threads: ${successful}/${testConfig.concurrentTransactions}`);
    console.log(`- Failed threads: ${failed}`);
    console.log(`- Total time: ${endTime - startTime}ms`);
    console.log(`- Throughput: ${throughput.toFixed(2)} transactions/second`);

    return {
      totalTransactions,
      successful,
      failed,
      duration: endTime - startTime,
      throughput
    };
  }

  async runTransactionThread(threadId, transactionCount) {
    console.log(`Starting transaction thread ${threadId} with ${transactionCount} transactions`);

    for (let i = 0; i < transactionCount; i++) {
      const session = client.startSession();

      try {
        await session.withTransaction(async () => {

          // Simulate realistic transaction workload
          const fromAccount = `account_${threadId}_${Math.floor(Math.random() * 10)}`;
          const toAccount = `account_${(threadId + 1) % 10}_${Math.floor(Math.random() * 10)}`;
          const amount = Math.floor(Math.random() * 100) + 1;

          // Transfer funds between accounts
          const fromDoc = await this.db.collection('test_accounts').findOne(
            { _id: fromAccount },
            { session }
          );

          if (fromDoc && fromDoc.balance >= amount) {
            await this.db.collection('test_accounts').updateOne(
              { _id: fromAccount },
              { $inc: { balance: -amount } },
              { session }
            );

            await this.db.collection('test_accounts').updateOne(
              { _id: toAccount },
              { $inc: { balance: amount } },
              { upsert: true, session }
            );

            // Create transaction record
            await this.db.collection('test_transactions').insertOne(
              {
                fromAccount,
                toAccount,
                amount,
                timestamp: new Date(),
                threadId,
                transactionIndex: i
              },
              { session }
            );
          }

        });

      } catch (error) {
        console.error(`Transaction ${i} in thread ${threadId} failed:`, error.message);
      } finally {
        await session.endSession();
      }
    }

    console.log(`Thread ${threadId} completed`);
  }

  async setupPerformanceTestData(documentCount) {
    console.log(`Setting up ${documentCount} test accounts...`);

    // Clear existing test data
    await this.db.collection('test_accounts').deleteMany({});
    await this.db.collection('test_transactions').deleteMany({});

    // Create test accounts
    const accounts = [];
    for (let i = 0; i < documentCount; i++) {
      accounts.push({
        _id: `account_${Math.floor(i / 100)}_${i % 100}`,
        balance: Math.floor(Math.random() * 1000) + 100,
        createdAt: new Date()
      });
    }

    await this.db.collection('test_accounts').insertMany(accounts);

    console.log('Test data setup completed');
  }
}

SQL-Style Transaction Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB transaction management:

-- QueryLeaf transaction operations with SQL-familiar syntax

-- Begin transaction with isolation level
BEGIN TRANSACTION 
WITH (
  isolation_level = 'read_committed',
  read_concern = 'majority',
  write_concern = 'majority',
  max_timeout = '30s'
);

-- Complex multi-collection transaction
WITH order_validation AS (
  -- Validate inventory availability
  SELECT 
    product_id,
    available_quantity,
    reserved_quantity,
    CASE 
      WHEN available_quantity >= 5 THEN true 
      ELSE false 
    END as inventory_available
  FROM inventory 
  WHERE product_id = 'prod_12345'
),
payment_validation AS (
  -- Validate user payment capability
  SELECT 
    user_id,
    account_balance,
    credit_limit,
    account_status,
    CASE 
      WHEN account_status = 'active' AND (account_balance + credit_limit) >= 299.99 THEN true
      ELSE false
    END as payment_valid
  FROM users 
  WHERE user_id = 'user_67890'
),
promotion_calculation AS (
  -- Calculate applicable promotions
  SELECT 
    promotion_id,
    discount_type,
    discount_value,
    CASE discount_type
      WHEN 'percentage' THEN 299.99 * (discount_value / 100.0)
      WHEN 'fixed' THEN LEAST(discount_value, 299.99)
      ELSE 0
    END as discount_amount
  FROM promotions 
  WHERE active = true 
    AND start_date <= CURRENT_TIMESTAMP 
    AND end_date >= CURRENT_TIMESTAMP
    AND (global_promotion = true OR 'user_67890' = ANY(applicable_users))
  ORDER BY discount_amount DESC
  LIMIT 1
)

-- Create order within transaction
INSERT INTO orders (
  order_id,
  user_id,
  order_number,
  status,

  -- Order items as nested documents
  items,

  -- Pricing breakdown
  pricing,

  -- Customer information  
  customer,

  -- Shipping details
  shipping,

  -- Lifecycle tracking
  lifecycle,

  created_at
)
SELECT 
  gen_random_uuid() as order_id,
  'user_67890' as user_id,
  'ORD-' || EXTRACT(EPOCH FROM NOW())::bigint || '-' || SUBSTRING(MD5(RANDOM()::text), 1, 9) as order_number,
  'confirmed' as status,

  -- Items array with product details
  JSON_BUILD_ARRAY(
    JSON_BUILD_OBJECT(
      'product_id', 'prod_12345',
      'product_name', 'Premium Widget',
      'quantity', 5,
      'unit_price', 59.99,
      'line_total', 299.95,
      'product_snapshot', JSON_BUILD_OBJECT(
        'name', 'Premium Widget',
        'category', 'electronics',
        'sku', 'WID-12345'
      )
    )
  ) as items,

  -- Pricing structure
  JSON_BUILD_OBJECT(
    'subtotal', 299.99,
    'discount_amount', COALESCE(pc.discount_amount, 0),
    'tax_amount', (299.99 - COALESCE(pc.discount_amount, 0)) * 0.08,
    'shipping_cost', 15.99,
    'final_amount', (299.99 - COALESCE(pc.discount_amount, 0)) * 1.08 + 15.99
  ) as pricing,

  -- Customer data
  JSON_BUILD_OBJECT(
    'user_id', 'user_67890',
    'email', 'customer@example.com',
    'loyalty_tier', 'gold'
  ) as customer,

  -- Shipping information
  JSON_BUILD_OBJECT(
    'address', JSON_BUILD_OBJECT(
      'street', '123 Main St',
      'city', 'Anytown',
      'state', 'CA',
      'zip', '12345'
    ),
    'method', 'standard',
    'estimated_delivery', CURRENT_TIMESTAMP + INTERVAL '5 days',
    'cost', 15.99
  ) as shipping,

  -- Lifecycle tracking
  JSON_BUILD_OBJECT(
    'created_at', CURRENT_TIMESTAMP,
    'confirmed_at', CURRENT_TIMESTAMP,
    'status', 'confirmed',
    'estimated_fulfillment', CURRENT_TIMESTAMP + INTERVAL '1 day'
  ) as lifecycle,

  CURRENT_TIMESTAMP as created_at

FROM order_validation ov
CROSS JOIN payment_validation pv  
LEFT JOIN promotion_calculation pc ON true
WHERE ov.inventory_available = true 
  AND pv.payment_valid = true;

-- Update inventory within same transaction
UPDATE inventory 
SET 
  reserved_quantity = reserved_quantity + 5,
  reservation_history = ARRAY_APPEND(
    reservation_history,
    JSON_BUILD_OBJECT(
      'reservation_id', gen_random_uuid(),
      'quantity', 5,
      'timestamp', CURRENT_TIMESTAMP,
      'type', 'order_reservation'
    )
  ),
  last_updated = CURRENT_TIMESTAMP
WHERE product_id = 'prod_12345' 
  AND (quantity - reserved_quantity) >= 5;

-- Process payment within transaction
INSERT INTO payments (
  payment_id,
  order_id,
  user_id,
  amount,
  currency,
  payment_method,
  status,
  transaction_details,
  gateway,
  created_at
)
SELECT 
  gen_random_uuid() as payment_id,
  o.order_id,
  o.user_id,
  (o.pricing->>'final_amount')::numeric as amount,
  'USD' as currency,

  -- Payment method details
  JSON_BUILD_OBJECT(
    'type', 'card',
    'masked_number', '****1234',
    'provider', 'visa'
  ) as payment_method,

  'completed' as status,

  -- Transaction details
  JSON_BUILD_OBJECT(
    'authorization_code', 'AUTH_' || EXTRACT(EPOCH FROM NOW())::bigint,
    'transaction_id', 'TXN_' || SUBSTRING(MD5(RANDOM()::text), 1, 16),
    'processed_at', CURRENT_TIMESTAMP,
    'processing_fee', (o.pricing->>'final_amount')::numeric * 0.029,
    'risk_score', RANDOM() * 0.3,
    'fraud_checks', JSON_BUILD_OBJECT(
      'address_verification', 'pass',
      'cvv_verification', 'pass', 
      'velocity_check', 'pass'
    )
  ) as transaction_details,

  -- Gateway information
  JSON_BUILD_OBJECT(
    'provider', 'stripe',
    'gateway_transaction_id', 'pi_' || SUBSTRING(MD5(RANDOM()::text), 1, 24),
    'gateway_fee', (o.pricing->>'final_amount')::numeric * 0.029 + 0.30
  ) as gateway,

  CURRENT_TIMESTAMP as created_at

FROM orders o 
WHERE o.user_id = 'user_67890' 
  AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';

-- Update user loyalty program
UPDATE loyalty_program 
SET 
  total_points = total_points + FLOOR((SELECT pricing->>'final_amount' FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute')::numeric),
  lifetime_points = lifetime_points + FLOOR((SELECT pricing->>'final_amount' FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute')::numeric),
  total_spend = total_spend + (SELECT pricing->>'final_amount' FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute')::numeric,

  points_history = ARRAY_APPEND(
    points_history,
    JSON_BUILD_OBJECT(
      'type', 'earned',
      'points', FLOOR((SELECT pricing->>'final_amount' FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute')::numeric),
      'description', 'Order purchase',
      'timestamp', CURRENT_TIMESTAMP
    )
  ),

  last_updated = CURRENT_TIMESTAMP
WHERE user_id = 'user_67890';

-- Create comprehensive audit trail
INSERT INTO audit_log (
  audit_id,
  transaction_id,
  transaction_type,
  entities_affected,
  changes_made,
  user_id,
  amount,
  compliance,
  timestamp
)
SELECT 
  gen_random_uuid() as audit_id,
  txid_current() as transaction_id,
  'order_creation' as transaction_type,

  -- Entities affected by transaction
  JSON_BUILD_OBJECT(
    'order_id', o.order_id,
    'payment_id', p.payment_id,
    'user_id', o.user_id,
    'product_ids', JSON_BUILD_ARRAY('prod_12345')
  ) as entities_affected,

  -- Detailed changes made
  JSON_BUILD_OBJECT(
    'order_created', JSON_BUILD_OBJECT(
      'order_id', o.order_id,
      'status', 'confirmed',
      'amount', (o.pricing->>'final_amount')::numeric
    ),
    'inventory_reserved', JSON_BUILD_OBJECT(
      'product_id', 'prod_12345',
      'quantity_reserved', 5
    ),
    'payment_processed', JSON_BUILD_OBJECT(
      'payment_id', p.payment_id,
      'amount', p.amount,
      'status', 'completed'
    ),
    'loyalty_updated', JSON_BUILD_OBJECT(
      'points_earned', FLOOR(p.amount),
      'total_spend_increase', p.amount
    )
  ) as changes_made,

  o.user_id,
  (o.pricing->>'final_amount')::numeric as amount,

  -- Compliance information
  JSON_BUILD_OBJECT(
    'retention_period', 2557, -- 7 years in days
    'encryption_required', true,
    'audit_level', 'full'
  ) as compliance,

  CURRENT_TIMESTAMP as timestamp

FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.user_id = 'user_67890' 
  AND o.created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';

-- Transaction validation before commit
SELECT 
  -- Verify order creation
  (SELECT COUNT(*) FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as orders_created,

  -- Verify payment processing
  (SELECT COUNT(*) FROM payments WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute') as payments_processed,

  -- Verify inventory reservation
  (SELECT reserved_quantity FROM inventory WHERE product_id = 'prod_12345') as inventory_reserved,

  -- Verify loyalty update
  (SELECT total_points FROM loyalty_program WHERE user_id = 'user_67890') as loyalty_points,

  -- Overall validation
  CASE 
    WHEN (SELECT COUNT(*) FROM orders WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute') = 1
     AND (SELECT COUNT(*) FROM payments WHERE user_id = 'user_67890' AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute') = 1
     AND (SELECT reserved_quantity FROM inventory WHERE product_id = 'prod_12345') >= 5
    THEN 'TRANSACTION_VALID'
    ELSE 'TRANSACTION_INVALID'
  END as validation_result;

-- Conditional commit based on validation
COMMIT TRANSACTION
WHERE validation_result = 'TRANSACTION_VALID';

-- Automatic rollback if validation fails
-- ROLLBACK TRANSACTION IF validation_result = 'TRANSACTION_INVALID';

-- Advanced transaction patterns with QueryLeaf

-- Nested transaction with savepoints
BEGIN TRANSACTION;

  -- Create savepoint for partial rollback
  SAVEPOINT order_creation;

  -- Create initial order
  INSERT INTO orders (order_id, user_id, status, created_at)
  VALUES (gen_random_uuid(), 'user_123', 'pending', CURRENT_TIMESTAMP);

  -- Create savepoint before inventory updates
  SAVEPOINT inventory_updates;

  -- Update inventory (might fail)
  UPDATE inventory 
  SET reserved_quantity = reserved_quantity + 10
  WHERE product_id = 'prod_456' AND quantity >= reserved_quantity + 10;

  -- Check if inventory update succeeded
  SELECT 
    CASE 
      WHEN ROW_COUNT() = 0 THEN 'INSUFFICIENT_INVENTORY'
      ELSE 'INVENTORY_UPDATED'
    END as inventory_status;

  -- Conditional rollback to savepoint
  ROLLBACK TO SAVEPOINT inventory_updates 
  WHERE inventory_status = 'INSUFFICIENT_INVENTORY';

  -- Alternative inventory handling
  UPDATE orders 
  SET status = 'backordered',
      backorder_reason = 'Insufficient inventory'
  WHERE order_id IN (
    SELECT order_id FROM orders 
    WHERE user_id = 'user_123' 
      AND created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute'
  )
  AND inventory_status = 'INSUFFICIENT_INVENTORY';

COMMIT TRANSACTION;

-- Distributed transaction across collections
BEGIN DISTRIBUTED_TRANSACTION 
WITH (
  collections = ['orders', 'inventory', 'payments', 'audit_log'],
  coordinator = 'two_phase_commit',
  timeout = '60s'
);

  -- Phase 1: Prepare all operations
  PREPARE TRANSACTION 'order_tx_001' ON orders, inventory, payments, audit_log;

  -- Phase 2: Commit if all participants are ready
  COMMIT PREPARED 'order_tx_001';

-- Transaction with retry logic
BEGIN TRANSACTION 
WITH (
  retry_attempts = 3,
  retry_delay = '100ms',
  exponential_backoff = true,
  max_delay = '2s'
);

  -- Operations that might conflict with concurrent transactions
  UPDATE accounts 
  SET balance = balance - 100,
      version = version + 1,
      last_updated = CURRENT_TIMESTAMP
  WHERE account_id = 'acc_789' 
    AND balance >= 100
    AND version = (
      SELECT version FROM accounts WHERE account_id = 'acc_789'
    ); -- Optimistic locking

COMMIT TRANSACTION 
WITH (
  on_conflict = 'retry',
  conflict_resolution = 'last_writer_wins'
);

-- Real-time transaction monitoring
WITH transaction_metrics AS (
  SELECT 
    DATE_TRUNC('minute', created_at) as time_bucket,
    COUNT(*) as total_transactions,
    COUNT(*) FILTER (WHERE status = 'completed') as successful_transactions,
    COUNT(*) FILTER (WHERE status = 'failed') as failed_transactions,
    COUNT(*) FILTER (WHERE status = 'rolled_back') as rolled_back_transactions,

    -- Performance metrics
    AVG(EXTRACT(EPOCH FROM (completed_at - created_at))) as avg_duration_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at))) as p95_duration,
    MAX(EXTRACT(EPOCH FROM (completed_at - created_at))) as max_duration,

    -- Error analysis
    array_agg(DISTINCT error_code) FILTER (WHERE status = 'failed') as error_codes,
    array_agg(DISTINCT error_message) FILTER (WHERE status = 'failed') as error_messages,

    -- Lock analysis
    AVG(lock_wait_time_ms) as avg_lock_wait_time,
    COUNT(*) FILTER (WHERE lock_timeout = true) as lock_timeouts,

    -- Resource usage
    AVG(documents_read) as avg_docs_read,
    AVG(documents_written) as avg_docs_written,
    SUM(bytes_transferred) / (1024 * 1024) as total_mb_transferred

  FROM transaction_log
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY DATE_TRUNC('minute', created_at)
),

transaction_health AS (
  SELECT 
    time_bucket,
    total_transactions,
    successful_transactions,
    failed_transactions,
    rolled_back_transactions,

    -- Success rate
    ROUND((successful_transactions::numeric / NULLIF(total_transactions, 0)) * 100, 1) as success_rate_percent,

    -- Performance assessment
    ROUND(avg_duration_seconds, 3) as avg_duration_sec,
    ROUND(p95_duration, 3) as p95_duration_sec,
    ROUND(max_duration, 3) as max_duration_sec,

    -- Performance status
    CASE 
      WHEN avg_duration_seconds > 30 THEN 'SLOW'
      WHEN avg_duration_seconds > 10 THEN 'DEGRADED'
      WHEN p95_duration > 60 THEN 'INCONSISTENT'
      ELSE 'NORMAL'
    END as performance_status,

    -- Error analysis
    CASE 
      WHEN (failed_transactions + rolled_back_transactions)::numeric / NULLIF(total_transactions, 0) > 0.1 THEN 'HIGH_ERROR_RATE'
      WHEN (failed_transactions + rolled_back_transactions)::numeric / NULLIF(total_transactions, 0) > 0.05 THEN 'ELEVATED_ERRORS'
      ELSE 'NORMAL_ERROR_RATE'
    END as error_status,

    error_codes,
    error_messages,

    -- Lock performance
    ROUND(avg_lock_wait_time, 1) as avg_lock_wait_ms,
    lock_timeouts,

    -- Resource efficiency
    ROUND(avg_docs_read, 1) as avg_docs_read,
    ROUND(avg_docs_written, 1) as avg_docs_written,
    ROUND(total_mb_transferred, 2) as mb_transferred

  FROM transaction_metrics
)

SELECT 
  time_bucket,
  total_transactions,
  success_rate_percent,
  performance_status,
  error_status,
  avg_duration_sec,
  p95_duration_sec,

  -- Alerts and recommendations
  CASE 
    WHEN performance_status = 'SLOW' THEN 'Transaction performance is degraded - investigate slow operations'
    WHEN performance_status = 'INCONSISTENT' THEN 'Inconsistent transaction performance - check for lock contention'
    WHEN error_status = 'HIGH_ERROR_RATE' THEN 'High transaction error rate - review application logic and retry mechanisms'
    WHEN lock_timeouts > total_transactions * 0.1 THEN 'Frequent lock timeouts - consider optimistic locking or shorter transactions'
    ELSE 'Transaction performance within normal parameters'
  END as recommendation,

  -- Detailed metrics for investigation
  error_codes,
  avg_lock_wait_ms,
  lock_timeouts,
  mb_transferred

FROM transaction_health
WHERE performance_status != 'NORMAL' OR error_status != 'NORMAL_ERROR_RATE'
ORDER BY time_bucket DESC;

-- Transaction isolation level testing
SELECT 
  isolation_level,
  transaction_id,
  operation_type,
  collection_name,

  -- Read phenomena detection
  CASE 
    WHEN EXISTS(
      SELECT 1 FROM transaction_operations o2 
      WHERE o2.transaction_id != t.transaction_id 
        AND o2.document_id = t.document_id
        AND o2.timestamp BETWEEN t.start_timestamp AND t.end_timestamp
        AND o2.operation_type = 'UPDATE'
    ) THEN 'DIRTY_READ_POSSIBLE'

    WHEN EXISTS(
      SELECT 1 FROM transaction_operations o2
      WHERE o2.transaction_id = t.transaction_id
        AND o2.document_id = t.document_id  
        AND o2.operation_type = 'READ'
        AND o2.timestamp < t.timestamp
        AND o2.value != t.value
    ) THEN 'NON_REPEATABLE_READ'

    ELSE 'CONSISTENT_READ'
  END as read_consistency_status,

  -- Lock analysis
  lock_type,
  lock_duration_ms,
  lock_conflicts,

  -- Performance impact
  operation_duration_ms,
  documents_affected,

  -- Concurrency metrics
  concurrent_transactions,
  wait_time_ms

FROM transaction_operations t
WHERE operation_timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
ORDER BY transaction_id, operation_timestamp;

-- QueryLeaf provides comprehensive transaction capabilities:
-- 1. SQL-familiar transaction syntax with BEGIN/COMMIT/ROLLBACK
-- 2. Advanced isolation level control and read/write concern specification
-- 3. Nested transactions with savepoint support for partial rollback
-- 4. Distributed transaction coordination across multiple collections
-- 5. Automatic retry logic with exponential backoff for conflict resolution
-- 6. Real-time transaction performance monitoring and health assessment
-- 7. Optimistic locking patterns with version-based conflict detection
-- 8. Complex multi-collection operations with full ACID guarantees
-- 9. Integration with MongoDB's native transaction optimizations
-- 10. Familiar SQL patterns for complex business logic within transactions

Best Practices for MongoDB Transaction Implementation

Transaction Design Guidelines

Essential principles for optimal MongoDB transaction design (a minimal code sketch follows the list):

  1. Transaction Scope: Keep transactions as small and focused as possible to minimize lock contention
  2. Read/Write Patterns: Design transactions to minimize conflicts through strategic ordering of operations
  3. Retry Logic: Implement robust retry mechanisms for transient transaction failures
  4. Timeout Configuration: Set appropriate timeouts based on expected transaction duration
  5. Isolation Levels: Choose appropriate isolation levels based on consistency requirements
  6. Error Handling: Design comprehensive error handling with meaningful business-level responses
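
The retry, timeout, and isolation points above map directly onto the Node.js driver's withTransaction helper. The sketch below is a minimal illustration, not the article's full order workflow: the shop database, the collection names, and the order document shape are assumptions.

// Minimal sketch: a small, focused transaction with explicit read/write
// concerns, a bounded commit time, and driver-managed retries
const { MongoClient } = require('mongodb');

async function placeOrder(client, orderDoc) {
  const session = client.startSession();
  try {
    // withTransaction retries the callback on TransientTransactionError and
    // retries the commit on UnknownTransactionCommitResult
    await session.withTransaction(async () => {
      const db = client.db('shop');

      await db.collection('orders').insertOne(orderDoc, { session });

      const res = await db.collection('inventory').updateOne(
        { productId: orderDoc.productId, available: { $gte: orderDoc.quantity } },
        { $inc: { available: -orderDoc.quantity, reserved: orderDoc.quantity } },
        { session }
      );

      // Throwing inside the callback aborts the transaction, rolling back
      // the order insert along with any other writes in this session
      if (res.modifiedCount !== 1) {
        throw new Error('insufficient inventory');
      }
    }, {
      readConcern: { level: 'snapshot' },   // consistent snapshot reads
      writeConcern: { w: 'majority' },      // durable, majority-acknowledged commit
      maxCommitTimeMS: 5000                 // bound commit latency
    });
  } finally {
    await session.endSession();
  }
}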

Performance and Scalability

Optimize MongoDB transactions for production workloads (a monitoring sketch follows the list):

  1. Lock Minimization: Structure operations to minimize lock duration and scope
  2. Index Strategy: Ensure proper indexing to support transaction query patterns
  3. Connection Management: Use appropriate connection pooling for transaction workloads
  4. Monitoring Setup: Implement comprehensive transaction performance monitoring
  5. Resource Planning: Plan memory and CPU resources for transaction processing overhead
  6. Testing Strategy: Implement thorough testing for concurrent transaction scenarios
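
For the monitoring point above, a lightweight starting point is to sample the server-level transaction counters exposed by the serverStatus command. The helper below is an illustrative sketch (the wrapper name is ours) reporting a few of the documented counters.

// Illustrative sketch: sample MongoDB's transaction counters from serverStatus
const { MongoClient } = require('mongodb');

async function sampleTransactionMetrics(client) {
  const status = await client.db('admin').command({ serverStatus: 1 });
  const tx = status.transactions || {};

  return {
    currentActive: tx.currentActive,   // transactions currently executing
    currentOpen: tx.currentOpen,       // all open transactions (active + inactive)
    totalStarted: tx.totalStarted,
    totalCommitted: tx.totalCommitted,
    totalAborted: tx.totalAborted
  };
}

Polling these counters periodically and tracking the commit-to-abort ratio gives an early signal of contention before it surfaces as application latency.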

Conclusion

MongoDB Multi-Document Transactions provide full ACID guarantees without the complexity and limitations of traditional distributed consistency approaches, while preserving the flexibility and scalability of MongoDB's document model. The ability to perform complex multi-collection operations with guaranteed consistency makes building reliable distributed systems considerably more straightforward.

Key MongoDB Transaction benefits include:

  • Full ACID Compliance: Complete atomicity, consistency, isolation, and durability across multiple documents
  • Flexible Document Operations: Support for complex document structures and relationships within transactions
  • Distributed Consistency: Seamless operation across replica sets and sharded clusters
  • Automatic Rollback: Comprehensive rollback capabilities on failure with consistent state restoration
  • Performance Optimization: Intelligent locking and concurrency control for optimal throughput
  • Familiar Patterns: SQL-style transaction semantics with commit/rollback operations

Whether you're building e-commerce platforms, financial systems, inventory management applications, or any system requiring strong consistency guarantees, MongoDB Transactions with QueryLeaf's familiar SQL interface provide the foundation for reliable distributed applications. This combination enables sophisticated transaction processing while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB transaction operations while providing SQL-familiar transaction control, isolation level management, and consistency guarantees. Advanced transaction patterns, retry logic, and performance monitoring are seamlessly handled through familiar SQL syntax, making robust distributed systems both powerful and accessible to SQL-oriented development teams.

The integration of native ACID transaction capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both strong consistency and familiar database interaction patterns, ensuring your distributed systems remain both reliable and maintainable as they scale and evolve.

MongoDB Compound Indexes and Multi-Field Query Optimization: Advanced Indexing Strategies with SQL-Style Query Performance

Modern applications require sophisticated query patterns that filter, sort, and aggregate data across multiple fields simultaneously, demanding carefully optimized indexing strategies for optimal performance. Traditional database approaches often struggle with efficient multi-field query support, requiring complex index planning, manual query optimization, and extensive performance tuning to achieve acceptable response times.

MongoDB Compound Indexes provide advanced multi-field indexing capabilities that enable efficient querying across multiple dimensions with automatic query optimization, intelligent index selection, and sophisticated query planning. Unlike simple single-field indexes, compound indexes support complex query patterns including range queries, equality matches, and sorting operations across multiple fields with optimal performance characteristics.
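
As a brief preview of that behavior (the collection and field names below are assumptions for illustration), a single compound index can serve equality, sort, and range predicates in one index scan:

// Minimal illustration of the equality-sort-range ordering in one compound index
const { MongoClient } = require('mongodb');

async function esrExample(db) {
  const events = db.collection('events');

  // Equality fields first, then the sort field, then the range field
  await events.createIndex({ tenantId: 1, status: 1, createdAt: -1, score: 1 });

  // Equality on tenantId and status, sort on createdAt, range on score:
  // all satisfied by the index above without an in-memory sort
  return events
    .find({ tenantId: 'acme', status: 'active', score: { $gte: 50 } })
    .sort({ createdAt: -1 })
    .limit(20)
    .toArray();
}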

The Traditional Multi-Field Query Challenge

Conventional approaches to multi-field indexing and query optimization have significant limitations for modern applications:

-- Traditional relational multi-field indexing - limited and complex

-- PostgreSQL approach with multiple single indexes
CREATE TABLE user_activities (
    activity_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    application_id VARCHAR(100) NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    priority INTEGER DEFAULT 5,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP,

    -- User context
    session_id VARCHAR(100),
    ip_address INET,
    user_agent TEXT,

    -- Activity data
    activity_data JSONB,
    metadata JSONB,

    -- Performance tracking
    execution_time_ms INTEGER,
    error_count INTEGER DEFAULT 0,
    retry_count INTEGER DEFAULT 0,

    -- Categorization
    category VARCHAR(100),
    subcategory VARCHAR(100),
    tags TEXT[],

    -- Geographic data
    country_code CHAR(2),
    region VARCHAR(100),
    city VARCHAR(100)
);

-- Multiple single-field indexes (inefficient for compound queries)
CREATE INDEX idx_user_activities_user_id ON user_activities (user_id);
CREATE INDEX idx_user_activities_app_id ON user_activities (application_id);
CREATE INDEX idx_user_activities_type ON user_activities (activity_type);
CREATE INDEX idx_user_activities_status ON user_activities (status);
CREATE INDEX idx_user_activities_created ON user_activities (created_at);
CREATE INDEX idx_user_activities_priority ON user_activities (priority);

-- Attempt at compound indexes (order matters significantly)
CREATE INDEX idx_user_app_status ON user_activities (user_id, application_id, status);
CREATE INDEX idx_app_type_created ON user_activities (application_id, activity_type, created_at);
CREATE INDEX idx_status_priority_created ON user_activities (status, priority, created_at);

-- Complex multi-field query with suboptimal performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT 
    ua.activity_id,
    ua.user_id,
    ua.application_id,
    ua.activity_type,
    ua.status,
    ua.priority,
    ua.created_at,
    ua.execution_time_ms,
    ua.activity_data,

    -- Derived metrics
    CASE 
        WHEN ua.completed_at IS NOT NULL THEN 
            EXTRACT(EPOCH FROM (ua.completed_at - ua.created_at)) * 1000
        ELSE NULL 
    END as total_duration_ms,

    -- Window functions for ranking
    ROW_NUMBER() OVER (
        PARTITION BY ua.user_id, ua.application_id 
        ORDER BY ua.priority DESC, ua.created_at DESC
    ) as user_app_rank,

    -- Activity scoring
    CASE
        WHEN ua.error_count = 0 AND ua.status = 'completed' THEN 100
        WHEN ua.error_count = 0 AND ua.status = 'in_progress' THEN 75
        WHEN ua.error_count > 0 AND ua.retry_count <= 3 THEN 50
        ELSE 25
    END as activity_score

FROM user_activities ua
WHERE 
    -- Multi-field filtering (challenging for optimizer)
    ua.user_id IN (12345, 23456, 34567, 45678)
    AND ua.application_id IN ('web_app', 'mobile_app', 'api_service')
    AND ua.activity_type IN ('login', 'purchase', 'api_call', 'data_export')
    AND ua.status IN ('completed', 'in_progress', 'failed')
    AND ua.priority >= 3
    AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND ua.created_at <= CURRENT_TIMESTAMP - INTERVAL '1 hour'

    -- Geographic filtering
    AND ua.country_code IN ('US', 'CA', 'GB', 'DE')
    AND ua.region IS NOT NULL

    -- Performance filtering
    AND (ua.execution_time_ms IS NULL OR ua.execution_time_ms < 10000)
    AND ua.error_count <= 5

    -- Category filtering
    AND ua.category IN ('user_interaction', 'system_process', 'data_operation')

    -- JSON data filtering (expensive)
    AND ua.activity_data->>'source' IN ('web', 'mobile', 'api')
    AND COALESCE((ua.activity_data->>'amount')::numeric, 0) > 10

ORDER BY 
    ua.priority DESC,
    ua.created_at DESC,
    ua.user_id ASC
LIMIT 50;

-- Problems with traditional compound indexing:
-- 1. Index order critically affects query performance
-- 2. Limited flexibility for varying query patterns
-- 3. Index intersection overhead for multiple conditions
-- 4. Complex query planning with unpredictable performance
-- 5. Maintenance overhead with multiple specialized indexes
-- 6. Poor support for mixed equality and range conditions
-- 7. Difficulty optimizing for sorting requirements
-- 8. Limited support for JSON/document field indexing

-- Query performance analysis
WITH index_usage AS (
    SELECT 
        schemaname,
        tablename,
        indexname,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch,

        -- Index effectiveness metrics
        CASE 
            WHEN idx_scan > 0 THEN idx_tup_read::numeric / idx_scan 
            ELSE 0 
        END as avg_tuples_per_scan,

        CASE 
            WHEN idx_tup_read > 0 THEN idx_tup_fetch::numeric / idx_tup_read * 100
            ELSE 0 
        END as fetch_ratio_percent

    FROM pg_stat_user_indexes
    WHERE tablename = 'user_activities'
),
table_performance AS (
    SELECT 
        schemaname,
        tablename,
        seq_scan,
        seq_tup_read,
        idx_scan,
        idx_tup_fetch,
        n_tup_ins,
        n_tup_upd,
        n_tup_del,

        -- Table scan ratios
        CASE 
            WHEN (seq_scan + idx_scan) > 0 
            THEN seq_scan::numeric / (seq_scan + idx_scan) * 100
            ELSE 0 
        END as seq_scan_ratio_percent

    FROM pg_stat_user_tables
    WHERE tablename = 'user_activities'
)
SELECT 
    -- Index usage analysis
    iu.indexname,
    iu.idx_scan as index_scans,
    ROUND(iu.avg_tuples_per_scan, 2) as avg_tuples_per_scan,
    ROUND(iu.fetch_ratio_percent, 1) as fetch_efficiency_pct,

    -- Index effectiveness assessment
    CASE
        WHEN iu.idx_scan = 0 THEN 'unused'
        WHEN iu.avg_tuples_per_scan > 100 THEN 'inefficient'
        WHEN iu.fetch_ratio_percent < 50 THEN 'poor_selectivity'
        ELSE 'effective'
    END as index_status,

    -- Table-level performance
    tp.seq_scan as table_scans,
    ROUND(tp.seq_scan_ratio_percent, 1) as seq_scan_pct,

    -- Recommendations
    CASE 
        WHEN iu.idx_scan = 0 THEN 'Consider dropping unused index'
        WHEN iu.avg_tuples_per_scan > 100 THEN 'Improve index selectivity or reorder fields'
        WHEN tp.seq_scan_ratio_percent > 20 THEN 'Add missing indexes for common queries'
        ELSE 'Index performing within acceptable parameters'
    END as recommendation

FROM index_usage iu
CROSS JOIN table_performance tp
ORDER BY iu.idx_scan DESC, iu.avg_tuples_per_scan DESC;

-- MySQL compound indexing (more limited capabilities)
CREATE TABLE mysql_activities (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    app_id VARCHAR(100) NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    priority INT DEFAULT 5,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    activity_data JSON,

    -- Compound indexes (limited optimization capabilities)
    INDEX idx_user_app_status (user_id, app_id, status),
    INDEX idx_app_type_created (app_id, activity_type, created_at),
    INDEX idx_status_priority (status, priority)
);

-- Basic multi-field query in MySQL
SELECT 
    user_id,
    app_id,
    activity_type,
    status,
    priority,
    created_at,
    JSON_EXTRACT(activity_data, '$.source') as source
FROM mysql_activities
WHERE user_id IN (12345, 23456)
  AND app_id = 'web_app'
  AND status = 'completed'
  AND priority >= 3
  AND created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
ORDER BY priority DESC, created_at DESC
LIMIT 50;

-- MySQL limitations for compound indexing:
-- - Limited query optimization capabilities
-- - Poor JSON field indexing support
-- - Restrictive index intersection algorithms
-- - Basic query planning with limited statistics
-- - Limited support for complex sorting requirements
-- - Poor performance with large result sets
-- - Minimal support for index-only scans

MongoDB Compound Indexes provide comprehensive multi-field optimization:

// MongoDB Compound Indexes - advanced multi-field query optimization
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('optimization_platform');

// Create collection with comprehensive compound index strategy
const setupAdvancedIndexing = async () => {
  const userActivities = db.collection('user_activities');

  // 1. Primary compound index for user-centric queries
  await userActivities.createIndex(
    {
      userId: 1,
      applicationId: 1,
      status: 1,
      createdAt: -1
    },
    {
      name: 'idx_user_app_status_time',
      background: true
    }
  );

  // 2. Application-centric compound index
  await userActivities.createIndex(
    {
      applicationId: 1,
      activityType: 1,
      priority: -1,
      createdAt: -1
    },
    {
      name: 'idx_app_type_priority_time',
      background: true
    }
  );

  // 3. Status and performance monitoring index
  await userActivities.createIndex(
    {
      status: 1,
      priority: -1,
      executionTimeMs: 1,
      createdAt: -1
    },
    {
      name: 'idx_status_priority_performance',
      background: true
    }
  );

  // 4. Geographic and categorization index
  await userActivities.createIndex(
    {
      countryCode: 1,
      region: 1,
      category: 1,
      subcategory: 1,
      createdAt: -1
    },
    {
      name: 'idx_geo_category_time',
      background: true
    }
  );

  // 5. Advanced compound index with embedded document fields
  await userActivities.createIndex(
    {
      'metadata.source': 1,
      activityType: 1,
      'activityData.amount': -1,
      createdAt: -1
    },
    {
      name: 'idx_source_type_amount_time',
      background: true,
      partialFilterExpression: {
        'metadata.source': { $exists: true },
        'activityData.amount': { $exists: true, $gt: 0 }
      }
    }
  );

  // 6. Text search compound index
  await userActivities.createIndex(
    {
      userId: 1,
      applicationId: 1,
      activityType: 1,
      title: 'text',
      description: 'text',
      'metadata.keywords': 'text'
    },
    {
      name: 'idx_user_app_type_text',
      background: true,
      weights: {
        title: 10,
        description: 5,
        'metadata.keywords': 3
      }
    }
  );

  // 7. Sparse index for optional fields
  await userActivities.createIndex(
    {
      completedAt: -1,
      userId: 1,
      'performance.totalDuration': -1
    },
    {
      name: 'idx_completed_user_duration',
      sparse: true,
      background: true
    }
  );

  // 8. TTL index for automatic data cleanup
  await userActivities.createIndex(
    {
      createdAt: 1
    },
    {
      name: 'idx_ttl_cleanup',
      expireAfterSeconds: 60 * 60 * 24 * 90, // 90 days
      background: true
    }
  );

  console.log('Advanced compound indexes created successfully');
};

// High-performance multi-field query examples
const performAdvancedQueries = async () => {
  const userActivities = db.collection('user_activities');

  // Query 1: User activity dashboard with compound index optimization
  const userDashboard = await userActivities.aggregate([
    // Stage 1: Efficient filtering using compound index
    {
      $match: {
        userId: { $in: [12345, 23456, 34567, 45678] },
        applicationId: { $in: ['web_app', 'mobile_app', 'api_service'] },
        status: { $in: ['completed', 'in_progress', 'failed'] },
        createdAt: {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000),
          $lte: new Date(Date.now() - 60 * 60 * 1000)
        }
      }
    },

    // Stage 2: Additional filtering leveraging partial indexes
    {
      $match: {
        priority: { $gte: 3 },
        countryCode: { $in: ['US', 'CA', 'GB', 'DE'] },
        region: { $exists: true },
        $or: [
          { executionTimeMs: null },
          { executionTimeMs: { $lt: 10000 } }
        ],
        errorCount: { $lte: 5 },
        category: { $in: ['user_interaction', 'system_process', 'data_operation'] },
        'metadata.source': { $in: ['web', 'mobile', 'api'] },
        'activityData.amount': { $gt: 10 }
      }
    },

    // Stage 3: Add computed fields
    {
      $addFields: {
        totalDurationMs: {
          $cond: {
            if: { $ne: ['$completedAt', null] },
            then: { $subtract: ['$completedAt', '$createdAt'] },
            else: null
          }
        },

        activityScore: {
          $switch: {
            branches: [
              {
                case: { 
                  $and: [
                    { $eq: ['$errorCount', 0] },
                    { $eq: ['$status', 'completed'] }
                  ]
                },
                then: 100
              },
              {
                case: { 
                  $and: [
                    { $eq: ['$errorCount', 0] },
                    { $eq: ['$status', 'in_progress'] }
                  ]
                },
                then: 75
              },
              {
                case: { 
                  $and: [
                    { $gt: ['$errorCount', 0] },
                    { $lte: ['$retryCount', 3] }
                  ]
                },
                then: 50
              }
            ],
            default: 25
          }
        }
      }
    },

    // Stage 4: Window functions for ranking
    {
      $setWindowFields: {
        partitionBy: { userId: '$userId', applicationId: '$applicationId' },
        sortBy: { priority: -1, createdAt: -1 },
        output: {
          userAppRank: {
            $denseRank: {}
          },

          // Rolling statistics
          rollingAvgDuration: {
            $avg: '$executionTimeMs',
            window: {
              documents: [-4, 0] // Last 5 activities
            }
          }
        }
      }
    },

    // Stage 5: Final sorting leveraging compound indexes
    {
      $sort: {
        priority: -1,
        createdAt: -1,
        userId: 1
      }
    },

    // Stage 6: Limit results
    {
      $limit: 50
    },

    // Stage 7: Project final structure
    {
      $project: {
        activityId: '$_id',
        userId: 1,
        applicationId: 1,
        activityType: 1,
        status: 1,
        priority: 1,
        createdAt: 1,
        executionTimeMs: 1,
        activityData: 1,
        totalDurationMs: 1,
        userAppRank: 1,
        activityScore: 1,
        rollingAvgDuration: { $round: ['$rollingAvgDuration', 2] },

        // Performance indicators
        isHighPriority: { $gte: ['$priority', 8] },
        isRecentActivity: { 
          $gte: ['$createdAt', new Date(Date.now() - 24 * 60 * 60 * 1000)]
        },
        hasPerformanceIssue: { $gt: ['$executionTimeMs', 5000] }
      }
    }
  ]).toArray();

  console.log('User dashboard query completed:', userDashboard.length, 'results');

  // Query 2: Application performance analysis with optimized grouping
  const appPerformanceAnalysis = await userActivities.aggregate([
    {
      $match: {
        createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
        executionTimeMs: { $exists: true }
      }
    },

    // Group by application and activity type
    {
      $group: {
        _id: {
          applicationId: '$applicationId',
          activityType: '$activityType',
          status: '$status'
        },

        // Volume metrics
        totalActivities: { $sum: 1 },
        uniqueUsers: { $addToSet: '$userId' },

        // Performance metrics
        avgExecutionTime: { $avg: '$executionTimeMs' },
        minExecutionTime: { $min: '$executionTimeMs' },
        maxExecutionTime: { $max: '$executionTimeMs' },
        p95ExecutionTime: { 
          $percentile: { 
            input: '$executionTimeMs', 
            p: [0.95], 
            method: 'approximate' 
          } 
        },

        // Error metrics
        errorCount: { $sum: '$errorCount' },
        retryCount: { $sum: '$retryCount' },

        // Success metrics
        successCount: {
          $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] }
        },

        // Time distribution
        activitiesByHour: {
          $push: { $hour: '$createdAt' }
        },

        // Priority distribution
        avgPriority: { $avg: '$priority' },
        maxPriority: { $max: '$priority' }
      }
    },

    // Calculate derived metrics
    {
      $addFields: {
        uniqueUserCount: { $size: '$uniqueUsers' },
        successRate: {
          $multiply: [
            { $divide: ['$successCount', '$totalActivities'] },
            100
          ]
        },
        errorRate: {
          $multiply: [
            { $divide: ['$errorCount', '$totalActivities'] },
            100
          ]
        },

        // Performance classification
        performanceCategory: {
          $switch: {
            branches: [
              {
                case: { $lt: ['$avgExecutionTime', 1000] },
                then: 'fast'
              },
              {
                case: { $lt: ['$avgExecutionTime', 5000] },
                then: 'moderate'
              },
              {
                case: { $lt: ['$avgExecutionTime', 10000] },
                then: 'slow'
              }
            ],
            default: 'critical'
          }
        }
      }
    },

    // Sort the worst performers first (the numeric metrics give a reliable
    // ordering; sorting the performanceCategory string descending would not
    // put 'critical' first)
    {
      $sort: {
        avgExecutionTime: -1,
        errorRate: -1
      }
    }
  ]).toArray();

  console.log('Application performance analysis completed:', appPerformanceAnalysis.length, 'results');

  // Query 3: Advanced text search with compound index
  const textSearchResults = await userActivities.aggregate([
    {
      $match: {
        // Prefix fields of a compound text index require equality matches,
        // so a single userId is used here rather than an $in list
        userId: 12345,
        applicationId: 'web_app',
        activityType: 'search_query',
        $text: {
          $search: 'performance optimization mongodb',
          $caseSensitive: false,
          $diacriticSensitive: false
        }
      }
    },

    {
      $addFields: {
        textScore: { $meta: 'textScore' },
        relevanceScore: {
          $multiply: [
            { $meta: 'textScore' },
            {
              $switch: {
                branches: [
                  { case: { $eq: ['$priority', 10] }, then: 1.5 },
                  { case: { $gte: ['$priority', 8] }, then: 1.2 },
                  { case: { $gte: ['$priority', 5] }, then: 1.0 }
                ],
                default: 0.8
              }
            }
          ]
        }
      }
    },

    {
      $sort: {
        relevanceScore: -1,
        createdAt: -1
      }
    },

    {
      $limit: 20
    }
  ]).toArray();

  console.log('Text search results:', textSearchResults.length, 'matches');

  return {
    userDashboard,
    appPerformanceAnalysis,
    textSearchResults
  };
};

// Index performance analysis and optimization
const analyzeIndexPerformance = async () => {
  const userActivities = db.collection('user_activities');

  // Get index statistics
  const indexStats = await userActivities.aggregate([
    { $indexStats: {} }
  ]).toArray();

  // Analyze query execution plans
  const explainPlan = await userActivities.find({
    userId: { $in: [12345, 23456] },
    applicationId: 'web_app',
    status: 'completed',
    createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
  }).explain('executionStats');

  // Index usage recommendations
  const indexRecommendations = indexStats.map(index => {
    const usage = index.accesses;

    // accesses.since is when the server started tracking this index, so
    // normalize the operation count by the elapsed time to get a usage rate
    const elapsedDays = Math.max((Date.now() - usage.since.getTime()) / 86400000, 1);
    const opsPerDay = usage.ops / elapsedDays;

    return {
      indexName: index.name,
      keyPattern: index.key,
      usage: usage,
      opsPerDay: Math.round(opsPerDay * 100) / 100,
      recommendation: opsPerDay < 1 ? 'Consider dropping - low usage' :
                      opsPerDay < 10 ? 'Monitor usage patterns' :
                      opsPerDay < 100 ? 'Optimize query patterns' :
                      'Performing well'

      // Note: $indexStats does not report index size; use the collStats
      // command if storage impact needs to be assessed
    };
  });

  console.log('Index Performance Analysis:');
  console.log(JSON.stringify(indexRecommendations, null, 2));

  return {
    indexStats,
    explainPlan,
    indexRecommendations
  };
};

// Advanced compound index patterns for specific use cases
const setupSpecializedIndexes = async () => {
  const userActivities = db.collection('user_activities');

  // 1. Multikey index for array fields
  await userActivities.createIndex(
    {
      tags: 1,
      category: 1,
      createdAt: -1
    },
    {
      name: 'idx_tags_category_time',
      background: true
    }
  );

  // 2. Compound index with hashed sharding key
  await userActivities.createIndex(
    {
      userId: 'hashed',
      createdAt: -1,
      applicationId: 1
    },
    {
      name: 'idx_user_hash_time_app',
      background: true
    }
  );

  // 3. Compound wildcard index for dynamic schemas (MongoDB 7.0+)
  // Note: the wildcardProjection option is only valid when the wildcard key
  // is '$**', so the path-scoped 'metadata.$**' term is used without it here
  await userActivities.createIndex(
    {
      'metadata.$**': 1,
      activityType: 1
    },
    {
      name: 'idx_metadata_wildcard_type',
      background: true
    }
  );

  // 4. Compound 2dsphere index for geospatial queries
  await userActivities.createIndex(
    {
      'location.coordinates': '2dsphere',
      activityType: 1,
      createdAt: -1
    },
    {
      name: 'idx_geo_type_time',
      background: true
    }
  );

  // 5. Compound partial index for conditional optimization
  await userActivities.createIndex(
    {
      status: 1,
      'performance.executionTimeMs': -1,
      userId: 1
    },
    {
      name: 'idx_status_performance_user_partial',
      background: true,
      partialFilterExpression: {
        status: { $in: ['failed', 'timeout'] },
        'performance.executionTimeMs': { $gt: 5000 }
      }
    }
  );

  console.log('Specialized compound indexes created');
};

// Benefits of MongoDB Compound Indexes:
// - Efficient multi-field query optimization with automatic index selection
// - Support for complex query patterns including range and equality conditions
// - Intelligent query planning with cost-based optimization
// - Index intersection capabilities for optimal query performance
// - Support for sorting and filtering in a single index scan
// - Flexible index ordering to match query patterns
// - Integration with aggregation pipeline optimization
// - Advanced index types including text, geospatial, and wildcard
// - Partial and sparse indexing for memory efficiency
// - Background index building for zero-downtime optimization

module.exports = {
  setupAdvancedIndexing,
  performAdvancedQueries,
  analyzeIndexPerformance,
  setupSpecializedIndexes
};
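
A natural follow-up to the analysis helpers above is confirming, from the explain output itself, that a query is actually served by one of the compound indexes. The check below is a sketch (the function name and sample predicate are ours); it reads executionStats and scans the winning plan for an index scan stage.

// Sketch: verify index usage and scan efficiency from explain('executionStats')
async function verifyIndexUse(collection) {
  const plan = await collection
    .find({ userId: 12345, applicationId: 'web_app', status: 'completed' })
    .sort({ createdAt: -1 })
    .explain('executionStats');

  const stats = plan.executionStats;

  return {
    totalKeysExamined: stats.totalKeysExamined,
    totalDocsExamined: stats.totalDocsExamined,
    nReturned: stats.nReturned,
    // A selective query keeps keysExamined close to nReturned and shows an
    // IXSCAN (not COLLSCAN) stage in the winning plan
    usesIndex: JSON.stringify(plan.queryPlanner.winningPlan).includes('IXSCAN')
  };
}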

Understanding MongoDB Compound Index Architecture

Advanced Compound Index Design Patterns

Implement sophisticated compound indexing strategies for different query scenarios:

// Advanced compound indexing design patterns
class CompoundIndexOptimizer {
  constructor(db) {
    this.db = db;
    this.indexAnalytics = new Map();
    this.queryPatterns = new Map();
  }

  async analyzeQueryPatterns(collection, sampleSize = 10000) {
    console.log(`Analyzing query patterns for ${collection.collectionName}...`);

    // Capture query patterns from operations
    const operations = await this.db.admin().command({
      currentOp: 1,
      $all: true,
      ns: { $regex: collection.collectionName }
    });

    // Analyze existing queries from profiler data
    const profilerData = await this.db.collection('system.profile')
      .find({
        ns: `${this.db.databaseName}.${collection.collectionName}`,
        op: { $in: ['query', 'find', 'aggregate'] }
      })
      .sort({ ts: -1 })
      .limit(sampleSize)
      .toArray();

    // Extract query patterns
    const queryPatterns = this.extractQueryPatterns(profilerData);

    console.log(`Found ${queryPatterns.length} unique query patterns`);
    return queryPatterns;
  }

  extractQueryPatterns(profilerData) {
    const patterns = new Map();

    profilerData.forEach(op => {
      if (op.command && op.command.filter) {
        const filterFields = Object.keys(op.command.filter);
        const sortFields = op.command.sort ? Object.keys(op.command.sort) : [];

        const patternKey = JSON.stringify({
          filter: filterFields.sort(),
          sort: sortFields
        });

        if (!patterns.has(patternKey)) {
          patterns.set(patternKey, {
            filterFields,
            sortFields,
            frequency: 0,
            avgExecutionTime: 0,
            totalExecutionTime: 0
          });
        }

        const pattern = patterns.get(patternKey);
        pattern.frequency++;
        pattern.totalExecutionTime += op.millis || 0;
        pattern.avgExecutionTime = pattern.totalExecutionTime / pattern.frequency;
      }
    });

    return Array.from(patterns.values());
  }

  async generateOptimalIndexes(collection, queryPatterns) {
    console.log('Generating optimal compound indexes...');

    const indexRecommendations = [];

    // Sort patterns by frequency and performance impact
    const sortedPatterns = queryPatterns.sort((a, b) => 
      (b.frequency * b.avgExecutionTime) - (a.frequency * a.avgExecutionTime)
    );

    for (const pattern of sortedPatterns.slice(0, 10)) { // Top 10 patterns
      const indexSpec = this.designCompoundIndex(pattern);

      if (indexSpec && indexSpec.fields.length > 0) {
        indexRecommendations.push({
          pattern: pattern,
          indexSpec: indexSpec,
          estimatedBenefit: pattern.frequency * pattern.avgExecutionTime,
          priority: this.calculateIndexPriority(pattern)
        });
      }
    }

    return indexRecommendations;
  }

  designCompoundIndex(queryPattern) {
    const { filterFields, sortFields } = queryPattern;

    // ESR rule: Equality, Sort, Range
    const equalityFields = [];
    const rangeFields = [];

    // Analyze field types (would need actual query analysis)
    filterFields.forEach(field => {
      // This is simplified - in practice, analyze actual query operators
      if (this.isEqualityField(field)) {
        equalityFields.push(field);
      } else {
        rangeFields.push(field);
      }
    });

    // Construct compound index following ESR rule
    const indexFields = [
      ...equalityFields,
      ...sortFields.filter(field => !equalityFields.includes(field)),
      ...rangeFields.filter(field => 
        !equalityFields.includes(field) && !sortFields.includes(field)
      )
    ];

    return {
      fields: indexFields,
      spec: this.buildIndexSpec(indexFields, sortFields),
      rule: 'ESR (Equality, Sort, Range)',
      rationale: this.explainIndexDesign(equalityFields, sortFields, rangeFields)
    };
  }

  buildIndexSpec(indexFields, sortFields) {
    const spec = {};

    indexFields.forEach(field => {
      // Determine sort order based on usage pattern
      if (sortFields.includes(field)) {
        // Use descending for time-based fields, ascending for others
        spec[field] = field.includes('time') || field.includes('date') || 
                     field.includes('created') || field.includes('updated') ? -1 : 1;
      } else {
        spec[field] = 1; // Default ascending for filtering
      }
    });

    return spec;
  }

  isEqualityField(field) {
    // Heuristic to determine if field is typically used for equality
    const equalityHints = ['id', 'status', 'type', 'category', 'code'];
    return equalityHints.some(hint => field.toLowerCase().includes(hint));
  }

  explainIndexDesign(equalityFields, sortFields, rangeFields) {
    return {
      equalityFields: equalityFields,
      sortFields: sortFields,
      rangeFields: rangeFields,
      reasoning: [
        'Equality fields placed first for maximum selectivity',
        'Sort fields positioned to enable index-based sorting',
        'Range fields placed last to minimize index scan overhead'
      ]
    };
  }

  calculateIndexPriority(pattern) {
    const frequencyWeight = 0.4;
    const performanceWeight = 0.6;

    const normalizedFrequency = Math.min(pattern.frequency / 100, 1);
    const normalizedPerformance = Math.min(pattern.avgExecutionTime / 1000, 1);

    return (normalizedFrequency * frequencyWeight) + 
           (normalizedPerformance * performanceWeight);
  }

  async implementIndexRecommendations(collection, recommendations) {
    console.log(`Implementing ${recommendations.length} index recommendations...`);

    const results = [];

    for (const rec of recommendations) {
      try {
        const indexName = `idx_optimized_${rec.pattern.filterFields.join('_')}`;

        await collection.createIndex(rec.indexSpec.spec, {
          name: indexName,
          background: true
        });

        results.push({
          indexName: indexName,
          spec: rec.indexSpec.spec,
          status: 'created',
          estimatedBenefit: rec.estimatedBenefit,
          priority: rec.priority
        });

        console.log(`Created index: ${indexName}`);

      } catch (error) {
        results.push({
          indexName: `idx_failed_${rec.pattern.filterFields.join('_')}`,
          spec: rec.indexSpec.spec,
          status: 'failed',
          error: error.message
        });

        console.error(`Failed to create index:`, error.message);
      }
    }

    return results;
  }

  async monitorIndexEffectiveness(collection, duration = 24 * 60 * 60 * 1000) {
    console.log('Starting index effectiveness monitoring...');

    const initialStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    // Wait for monitoring period
    await new Promise(resolve => setTimeout(resolve, duration));

    const finalStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    // Compare statistics
    const effectiveness = this.compareIndexStats(initialStats, finalStats, duration);

    return effectiveness;
  }

  compareIndexStats(initialStats, finalStats, durationMs) {
    const effectiveness = [];

    finalStats.forEach(finalStat => {
      const initialStat = initialStats.find(stat => stat.name === finalStat.name);

      if (initialStat) {
        const opsChange = finalStat.accesses.ops - initialStat.accesses.ops;
        // Normalize by the monitoring window; accesses.since only changes when
        // stats tracking restarts, so it cannot measure this interval
        const opsPerHour = durationMs > 0 ? (opsChange / durationMs) * 3600000 : 0;

        effectiveness.push({
          indexName: finalStat.name,
          keyPattern: finalStat.key,
          operationsChange: opsChange,
          operationsPerHour: Math.round(opsPerHour),
          effectiveness: this.assessEffectiveness(opsPerHour),
          recommendation: this.getEffectivenessRecommendation(opsPerHour)
        });
      }
    });

    return effectiveness;
  }

  assessEffectiveness(opsPerHour) {
    if (opsPerHour < 0.1) return 'unused';
    if (opsPerHour < 1) return 'low';
    if (opsPerHour < 10) return 'moderate';
    if (opsPerHour < 100) return 'high';
    return 'critical';
  }

  getEffectivenessRecommendation(opsPerHour) {
    if (opsPerHour < 0.1) return 'Consider dropping this index';
    if (opsPerHour < 1) return 'Monitor usage patterns';
    if (opsPerHour < 10) return 'Index is providing moderate benefit';
    return 'Index is highly effective';
  }

  async performCompoundIndexBenchmark(collection, testQueries) {
    console.log('Running compound index benchmark...');

    const benchmarkResults = [];

    for (const query of testQueries) {
      console.log(`Testing query: ${JSON.stringify(query.filter)}`);

      // Benchmark without hint (let MongoDB choose)
      const autoResult = await this.benchmarkQuery(collection, query, null);

      // Benchmark with different index hints
      const hintResults = [];
      const indexes = await collection.indexes();

      for (const index of indexes) {
        if (Object.keys(index.key).length > 1) { // Compound indexes only
          const hintResult = await this.benchmarkQuery(collection, query, index.key);
          hintResults.push({
            indexHint: index.key,
            indexName: index.name,
            ...hintResult
          });
        }
      }

      benchmarkResults.push({
        query: query,
        automatic: autoResult,
        withHints: hintResults.sort((a, b) => a.executionTime - b.executionTime)
      });
    }

    return benchmarkResults;
  }

  async benchmarkQuery(collection, query, indexHint, iterations = 5) {
    const times = [];

    for (let i = 0; i < iterations; i++) {
      const startTime = Date.now();

      let cursor = collection.find(query.filter);

      if (indexHint) {
        cursor = cursor.hint(indexHint);
      }

      if (query.sort) {
        cursor = cursor.sort(query.sort);
      }

      if (query.limit) {
        cursor = cursor.limit(query.limit);
      }

      const results = await cursor.toArray();
      const endTime = Date.now();

      times.push({
        executionTime: endTime - startTime,
        resultCount: results.length
      });
    }

    const avgTime = times.reduce((sum, t) => sum + t.executionTime, 0) / times.length;
    const minTime = Math.min(...times.map(t => t.executionTime));
    const maxTime = Math.max(...times.map(t => t.executionTime));

    return {
      averageExecutionTime: Math.round(avgTime),
      minExecutionTime: minTime,
      maxExecutionTime: maxTime,
      resultCount: times[0].resultCount,
      consistency: maxTime - minTime
    };
  }

  async optimizeExistingIndexes(collection) {
    console.log('Analyzing existing indexes for optimization opportunities...');

    const indexes = await collection.indexes();
    const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    const optimizations = [];

    // Identify unused indexes
    const unusedIndexes = indexStats.filter(stat => 
      stat.accesses.ops === 0 && stat.name !== '_id_'
    );

    // Identify overlapping indexes
    const overlappingIndexes = this.findOverlappingIndexes(indexes);

    // Identify missing indexes based on query patterns
    const queryPatterns = await this.analyzeQueryPatterns(collection);
    const missingIndexes = this.identifyMissingIndexes(indexes, queryPatterns);

    optimizations.push({
      type: 'unused_indexes',
      count: unusedIndexes.length,
      indexes: unusedIndexes.map(idx => idx.name),
      recommendation: 'Consider dropping these indexes to save storage and maintenance overhead'
    });

    optimizations.push({
      type: 'overlapping_indexes',
      count: overlappingIndexes.length,
      indexes: overlappingIndexes,
      recommendation: 'Consolidate overlapping indexes to improve efficiency'
    });

    optimizations.push({
      type: 'missing_indexes',
      count: missingIndexes.length,
      recommendations: missingIndexes,
      recommendation: 'Create these indexes to improve query performance'
    });

    return optimizations;
  }

  findOverlappingIndexes(indexes) {
    const overlapping = [];

    for (let i = 0; i < indexes.length; i++) {
      for (let j = i + 1; j < indexes.length; j++) {
        const idx1 = indexes[i];
        const idx2 = indexes[j];

        if (this.areIndexesOverlapping(idx1.key, idx2.key)) {
          overlapping.push({
            index1: idx1.name,
            index2: idx2.name,
            keys1: idx1.key,
            keys2: idx2.key,
            overlapType: this.getOverlapType(idx1.key, idx2.key)
          });
        }
      }
    }

    return overlapping;
  }

  areIndexesOverlapping(keys1, keys2) {
    const fields1 = Object.keys(keys1);
    const fields2 = Object.keys(keys2);

    // Check if one index is a prefix of another
    return this.isPrefix(fields1, fields2) || this.isPrefix(fields2, fields1);
  }

  isPrefix(fields1, fields2) {
    if (fields1.length > fields2.length) return false;

    for (let i = 0; i < fields1.length; i++) {
      if (fields1[i] !== fields2[i]) return false;
    }

    return true;
  }

  getOverlapType(keys1, keys2) {
    const fields1 = Object.keys(keys1);
    const fields2 = Object.keys(keys2);

    if (this.isPrefix(fields1, fields2)) {
      return `${fields1.join(',')} is prefix of ${fields2.join(',')}`;
    } else if (this.isPrefix(fields2, fields1)) {
      return `${fields2.join(',')} is prefix of ${fields1.join(',')}`;
    }

    return 'partial_overlap';
  }

  identifyMissingIndexes(existingIndexes, queryPatterns) {
    const missing = [];
    const existingSpecs = existingIndexes.map(idx => JSON.stringify(idx.key));

    queryPatterns.forEach(pattern => {
      const recommendedIndex = this.designCompoundIndex(pattern);
      const specStr = JSON.stringify(recommendedIndex.spec);

      if (!existingSpecs.includes(specStr) && recommendedIndex.fields.length > 0) {
        missing.push({
          pattern: pattern,
          recommendedIndex: recommendedIndex,
          priority: this.calculateIndexPriority(pattern)
        });
      }
    });

    return missing.sort((a, b) => b.priority - a.priority);
  }
}
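
A hypothetical end-to-end use of the optimizer class above might look like the following; it assumes the database profiler is enabled so that system.profile contains recent query samples to analyze.

// Hypothetical driver code for the CompoundIndexOptimizer sketched above
async function runOptimizationPass(db) {
  const optimizer = new CompoundIndexOptimizer(db);
  const activities = db.collection('user_activities');

  // 1. Mine recent query shapes from the profiler
  const patterns = await optimizer.analyzeQueryPatterns(activities);

  // 2. Translate the most expensive patterns into ESR-ordered index candidates
  const recommendations = await optimizer.generateOptimalIndexes(activities, patterns);

  // 3. Build the recommended indexes in the background
  const created = await optimizer.implementIndexRecommendations(activities, recommendations);

  console.log('Index recommendations applied:', created);
  return created;
}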

SQL-Style Compound Index Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB compound index management:

-- QueryLeaf compound index operations with SQL-familiar syntax

-- Create comprehensive compound indexes
CREATE COMPOUND INDEX idx_user_app_status_time ON user_activities (
  user_id ASC,
  application_id ASC, 
  status ASC,
  created_at DESC
) WITH (
  background = true,
  unique = false
);

CREATE COMPOUND INDEX idx_app_type_priority_performance ON user_activities (
  application_id ASC,
  activity_type ASC,
  priority DESC,
  execution_time_ms ASC,
  created_at DESC
) WITH (
  background = true,
  partial_filter = 'execution_time_ms IS NOT NULL AND priority >= 5'
);

-- Create compound text search index
CREATE COMPOUND INDEX idx_user_app_text_search ON user_activities (
  user_id ASC,
  application_id ASC,
  activity_type ASC,
  title TEXT,
  description TEXT,
  keywords TEXT
) WITH (
  weights = JSON_BUILD_OBJECT('title', 10, 'description', 5, 'keywords', 3),
  background = true
);

-- Optimized multi-field queries leveraging compound indexes
WITH user_activity_analysis AS (
  SELECT 
    user_id,
    application_id,
    activity_type,
    status,
    priority,
    created_at,
    execution_time_ms,
    error_count,
    retry_count,
    activity_data,

    -- Performance categorization
    CASE 
      WHEN execution_time_ms IS NULL THEN 'no_data'
      WHEN execution_time_ms < 1000 THEN 'fast'
      WHEN execution_time_ms < 5000 THEN 'moderate' 
      WHEN execution_time_ms < 10000 THEN 'slow'
      ELSE 'critical'
    END as performance_category,

    -- Activity scoring
    CASE
      WHEN error_count = 0 AND status = 'completed' THEN 100
      WHEN error_count = 0 AND status = 'in_progress' THEN 75
      WHEN error_count > 0 AND retry_count <= 3 THEN 50
      ELSE 25
    END as activity_score,

    -- Time-based metrics
    EXTRACT(hour FROM created_at) as activity_hour,
    DATE_TRUNC('day', created_at) as activity_date,

    -- User context
    activity_data->>'source' as source_system,
    CAST(activity_data->>'amount' AS NUMERIC) as transaction_amount,
    activity_data->>'category' as data_category

  FROM user_activities
  WHERE 
    -- Multi-field filtering optimized by compound index
    user_id IN (12345, 23456, 34567, 45678)
    AND application_id IN ('web_app', 'mobile_app', 'api_service')
    AND status IN ('completed', 'in_progress', 'failed')
    AND created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND created_at <= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND priority >= 3
    AND (execution_time_ms IS NULL OR execution_time_ms < 30000)
    AND error_count <= 5
),

performance_metrics AS (
  SELECT 
    user_id,
    application_id,
    activity_type,

    -- Volume metrics
    COUNT(*) as total_activities,
    COUNT(DISTINCT DATE_TRUNC('day', created_at)) as active_days,
    COUNT(DISTINCT activity_hour) as active_hours,

    -- Performance distribution
    COUNT(*) FILTER (WHERE performance_category = 'fast') as fast_activities,
    COUNT(*) FILTER (WHERE performance_category = 'moderate') as moderate_activities,
    COUNT(*) FILTER (WHERE performance_category = 'slow') as slow_activities,
    COUNT(*) FILTER (WHERE performance_category = 'critical') as critical_activities,

    -- Execution time statistics
    AVG(execution_time_ms) as avg_execution_time,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY execution_time_ms) as median_execution_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY execution_time_ms) as p99_execution_time,
    MIN(execution_time_ms) as min_execution_time,
    MAX(execution_time_ms) as max_execution_time,
    STDDEV_POP(execution_time_ms) as execution_time_stddev,

    -- Status distribution
    COUNT(*) FILTER (WHERE status = 'completed') as completed_count,
    COUNT(*) FILTER (WHERE status = 'failed') as failed_count,
    COUNT(*) FILTER (WHERE status = 'in_progress') as in_progress_count,

    -- Error and retry analysis
    SUM(error_count) as total_errors,
    SUM(retry_count) as total_retries,
    AVG(error_count) as avg_error_rate,
    MAX(error_count) as max_errors_per_activity,

    -- Quality metrics
    AVG(activity_score) as avg_activity_score,
    MIN(activity_score) as min_activity_score,
    MAX(activity_score) as max_activity_score,

    -- Transaction analysis
    AVG(transaction_amount) FILTER (WHERE transaction_amount > 0) as avg_transaction_amount,
    SUM(transaction_amount) FILTER (WHERE transaction_amount > 0) as total_transaction_amount,
    COUNT(*) FILTER (WHERE transaction_amount > 100) as high_value_transactions,

    -- Activity timing patterns
    mode() WITHIN GROUP (ORDER BY activity_hour) as most_active_hour,
    COUNT(DISTINCT source_system) as unique_source_systems,

    -- Recent activity indicators
    MAX(created_at) as last_activity_time,
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours') as recent_24h_activities,
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as recent_1h_activities

  FROM user_activity_analysis
  GROUP BY user_id, application_id, activity_type
),

ranked_performance AS (
  SELECT *,
    -- Performance rankings
    ROW_NUMBER() OVER (
      PARTITION BY application_id 
      ORDER BY avg_execution_time DESC
    ) as slowest_rank,

    ROW_NUMBER() OVER (
      PARTITION BY application_id
      ORDER BY total_errors DESC
    ) as error_rank,

    ROW_NUMBER() OVER (
      PARTITION BY application_id
      ORDER BY total_activities DESC
    ) as volume_rank,

    -- Efficiency scoring
    CASE 
      WHEN avg_execution_time IS NULL THEN 0
      WHEN avg_execution_time > 0 THEN 
        (completed_count::numeric / total_activities) / (avg_execution_time / 1000.0) * 1000
      ELSE 0
    END as efficiency_score,

    -- Performance categorization
    CASE
      WHEN p95_execution_time > 10000 THEN 'critical'
      WHEN p95_execution_time > 5000 THEN 'poor'
      WHEN p95_execution_time > 2000 THEN 'moderate'
      WHEN p95_execution_time > 1000 THEN 'good'
      ELSE 'excellent'
    END as performance_grade,

    -- Error rate classification
    CASE 
      WHEN total_activities > 0 THEN
        CASE
          WHEN (total_errors::numeric / total_activities) > 0.1 THEN 'high_error'
          WHEN (total_errors::numeric / total_activities) > 0.05 THEN 'moderate_error'
          WHEN (total_errors::numeric / total_activities) > 0.01 THEN 'low_error'
          ELSE 'minimal_error'
        END
      ELSE 'no_data'
    END as error_grade

  FROM performance_metrics
),

final_analysis AS (
  SELECT 
    user_id,
    application_id,
    activity_type,
    total_activities,
    active_days,

    -- Performance summary
    ROUND(avg_execution_time::numeric, 2) as avg_execution_time_ms,
    ROUND(median_execution_time::numeric, 2) as median_execution_time_ms,
    ROUND(p95_execution_time::numeric, 2) as p95_execution_time_ms,
    ROUND(p99_execution_time::numeric, 2) as p99_execution_time_ms,
    performance_grade,

    -- Success metrics
    ROUND((completed_count::numeric / total_activities) * 100, 1) as success_rate_pct,
    ROUND((failed_count::numeric / total_activities) * 100, 1) as failure_rate_pct,
    error_grade,

    -- Volume and efficiency
    volume_rank,
    ROUND(efficiency_score::numeric, 2) as efficiency_score,

    -- Financial metrics
    ROUND(total_transaction_amount::numeric, 2) as total_transaction_value,
    high_value_transactions,

    -- Activity patterns
    most_active_hour,
    recent_24h_activities,
    recent_1h_activities,

    -- Rankings and alerts
    slowest_rank,
    error_rank,

    CASE 
      WHEN performance_grade = 'critical' OR error_grade = 'high_error' THEN 'immediate_attention'
      WHEN performance_grade = 'poor' OR error_grade = 'moderate_error' THEN 'needs_optimization'
      WHEN slowest_rank <= 3 OR error_rank <= 3 THEN 'monitor_closely'
      ELSE 'performing_normally'
    END as alert_level,

    -- Recommendations
    CASE 
      WHEN performance_grade = 'critical' THEN 'Investigate performance bottlenecks immediately'
      WHEN error_grade = 'high_error' THEN 'Review error patterns and implement fixes'
      WHEN efficiency_score < 50 THEN 'Optimize processing efficiency'
      WHEN recent_1h_activities = 0 AND recent_24h_activities > 0 THEN 'Monitor for potential issues'
      ELSE 'Continue normal monitoring'
    END as recommendation

  FROM ranked_performance
)
SELECT *
FROM final_analysis
ORDER BY 
  CASE alert_level
    WHEN 'immediate_attention' THEN 1
    WHEN 'needs_optimization' THEN 2
    WHEN 'monitor_closely' THEN 3
    ELSE 4
  END,
  performance_grade DESC,
  total_activities DESC;

-- Advanced compound index analysis and optimization
WITH index_performance AS (
  SELECT 
    index_name,
    key_pattern,
    index_size_mb,

    -- Usage statistics
    total_operations,
    operations_per_day,
    avg_operations_per_query,

    -- Performance impact
    index_hit_ratio,
    avg_query_time_with_index,
    avg_query_time_without_index,
    performance_improvement_pct,

    -- Maintenance overhead
    build_time_minutes,
    storage_overhead_pct,
    update_overhead_ms,

    -- Effectiveness scoring
    (operations_per_day * performance_improvement_pct * index_hit_ratio) / 
    (index_size_mb * update_overhead_ms) as effectiveness_score

  FROM INDEX_PERFORMANCE_STATS()
  WHERE index_type = 'compound'
),

index_recommendations AS (
  SELECT 
    index_name,
    key_pattern,
    operations_per_day,
    ROUND(effectiveness_score::numeric, 4) as effectiveness_score,

    -- Performance classification
    CASE 
      WHEN effectiveness_score > 1000 THEN 'highly_effective'
      WHEN effectiveness_score > 100 THEN 'effective'
      WHEN effectiveness_score > 10 THEN 'moderately_effective' 
      WHEN effectiveness_score > 1 THEN 'minimally_effective'
      ELSE 'ineffective'
    END as effectiveness_category,

    -- Optimization recommendations
    CASE
      WHEN operations_per_day < 1 AND index_size_mb > 100 THEN 'Consider dropping - low usage, high storage cost'
      WHEN effectiveness_score < 1 THEN 'Review index design and query patterns'
      WHEN performance_improvement_pct < 10 THEN 'Minimal performance benefit - evaluate necessity'
      WHEN index_hit_ratio < 0.5 THEN 'Poor selectivity - consider reordering fields'
      WHEN update_overhead_ms > 100 THEN 'High maintenance cost - optimize for write workload'
      ELSE 'Index performing within acceptable parameters'
    END as recommendation,

    -- Priority for attention
    CASE
      WHEN effectiveness_score < 0.1 THEN 'high_priority'
      WHEN effectiveness_score < 1 THEN 'medium_priority'
      ELSE 'low_priority'
    END as optimization_priority,

    -- Storage and performance details
    ROUND(index_size_mb::numeric, 2) as size_mb,
    ROUND(performance_improvement_pct::numeric, 1) as performance_gain_pct,
    ROUND(index_hit_ratio::numeric, 3) as selectivity_ratio,
    build_time_minutes

  FROM index_performance
)
SELECT 
  index_name,
  key_pattern,
  effectiveness_category,
  effectiveness_score,
  operations_per_day,
  performance_gain_pct,
  selectivity_ratio,
  size_mb,
  optimization_priority,
  recommendation

FROM index_recommendations
ORDER BY 
  CASE optimization_priority
    WHEN 'high_priority' THEN 1
    WHEN 'medium_priority' THEN 2
    ELSE 3
  END,
  effectiveness_score DESC;

-- Query execution plan analysis for compound indexes
EXPLAIN (ANALYZE true, VERBOSE true)
SELECT 
  user_id,
  application_id,
  activity_type,
  status,
  priority,
  execution_time_ms,
  created_at
FROM user_activities
WHERE user_id IN (12345, 23456, 34567)
  AND application_id = 'web_app'
  AND status IN ('completed', 'failed')
  AND priority >= 5
  AND created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY priority DESC, created_at DESC
LIMIT 100;

-- Index intersection analysis
WITH query_analysis AS (
  SELECT 
    query_pattern,
    execution_count,
    avg_execution_time_ms,
    index_used,
    index_intersection_count,

    -- Index effectiveness
    rows_examined,
    rows_returned, 
    CASE 
      WHEN rows_examined > 0 THEN rows_returned::numeric / rows_examined
      ELSE 0
    END as index_selectivity,

    -- Performance indicators
    CASE
      WHEN avg_execution_time_ms > 5000 THEN 'slow'
      WHEN avg_execution_time_ms > 1000 THEN 'moderate'
      ELSE 'fast'
    END as performance_category

  FROM QUERY_EXECUTION_STATS()
  WHERE query_type = 'multi_field'
    AND time_period >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)
SELECT 
  query_pattern,
  execution_count,
  ROUND(avg_execution_time_ms::numeric, 2) as avg_time_ms,
  performance_category,
  index_used,
  index_intersection_count,
  ROUND(index_selectivity::numeric, 4) as selectivity,

  -- Optimization opportunities
  CASE 
    WHEN index_selectivity < 0.1 THEN 'Poor index selectivity - consider compound index'
    WHEN index_intersection_count > 2 THEN 'Multiple index intersection - create compound index'
    WHEN performance_category = 'slow' THEN 'Performance issue - review indexing strategy'
    ELSE 'Acceptable performance'
  END as optimization_opportunity,

  rows_examined,
  rows_returned

FROM query_analysis
WHERE execution_count > 10  -- Focus on frequently executed queries
ORDER BY avg_execution_time_ms DESC, execution_count DESC;

-- QueryLeaf provides comprehensive compound indexing capabilities:
-- 1. SQL-familiar compound index creation with advanced options
-- 2. Multi-field query optimization with automatic index selection  
-- 3. Performance analysis and index effectiveness monitoring
-- 4. Query execution plan analysis with detailed statistics
-- 5. Index intersection detection and optimization recommendations
-- 6. Background index building for zero-downtime optimization
-- 7. Partial and sparse indexing for memory and storage efficiency
-- 8. Text search integration with compound field indexing
-- 9. Integration with MongoDB's query planner and optimization
-- 10. Familiar SQL syntax for complex multi-dimensional queries

Best Practices for Compound Index Implementation

Index Design Strategy

Essential principles for optimal compound index design:

  1. ESR Rule: Follow Equality, Sort, Range field ordering for maximum effectiveness (a short sketch applying this rule follows the list)
  2. Query Pattern Analysis: Analyze actual query patterns before designing indexes
  3. Cardinality Optimization: Place high-cardinality fields first for better selectivity
  4. Sort Integration: Design indexes that support both filtering and sorting requirements
  5. Prefix Optimization: Ensure indexes support multiple query patterns through prefixes
  6. Maintenance Balance: Balance query performance with index maintenance overhead
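
As a concrete illustration of the ESR rule, the sketch below builds a compound index for a hypothetical query that filters on user_id and status (equality), sorts by created_at, and applies a range condition on execution_time_ms. The connection string, database, collection, and field names are illustrative assumptions rather than part of a specific schema above.

// Minimal ESR-rule sketch (assumed names; Node.js MongoDB driver)
const { MongoClient } = require('mongodb');

async function createEsrIndex(uri = 'mongodb://localhost:27017') {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const activities = client.db('app').collection('user_activities');

    // Query shape: equality on user_id + status, sort on created_at, range on execution_time_ms
    // ESR ordering: Equality fields first, Sort fields next, Range fields last
    await activities.createIndex(
      { user_id: 1, status: 1, created_at: -1, execution_time_ms: 1 },
      { name: 'idx_user_status_created_exec' }
    );

    // The matching query can filter, sort, and range-scan from this single index
    return await activities
      .find({ user_id: 12345, status: 'completed', execution_time_ms: { $lt: 5000 } })
      .sort({ created_at: -1 })
      .limit(100)
      .toArray();
  } finally {
    await client.close();
  }
}

Running the query above through explain('executionStats') should show an IXSCAN on the new index with no separate in-memory SORT stage, which is the practical payoff of ESR ordering.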

Performance and Scalability

Optimize compound indexes for production workloads:

  1. Index Intersection: Understand when MongoDB uses multiple indexes vs. compound indexes
  2. Memory Utilization: Monitor index memory usage and working set requirements
  3. Write Performance: Balance read optimization with write performance impact
  4. Partial Indexes: Use partial indexes to reduce storage and maintenance overhead (see the sketch after this list)
  5. Index Statistics: Regularly analyze index usage patterns and effectiveness
  6. Background Building: Use background index creation for zero-downtime deployments
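
The sketch below shows two of these practices in code: a partial compound index that only covers the documents a hot query path actually touches, and a quick $indexStats check for spotting unused indexes. Collection and field names are illustrative assumptions.

// Partial index and usage-check sketch (assumed names; Node.js MongoDB driver)
async function reviewIndexes(db) {
  const activities = db.collection('user_activities');

  // Partial compound index: indexing only high-priority documents with timing data
  // reduces storage and write overhead compared to indexing every document
  await activities.createIndex(
    { application_id: 1, priority: -1, created_at: -1 },
    {
      name: 'idx_app_priority_created_partial',
      partialFilterExpression: { priority: { $gte: 5 }, execution_time_ms: { $exists: true } }
    }
  );

  // $indexStats reports per-index access counts since the last server restart,
  // a starting point for identifying indexes that are rarely or never used
  const usage = await activities.aggregate([{ $indexStats: {} }]).toArray();
  usage.forEach(stat => {
    console.log(`${stat.name}: ${stat.accesses.ops} operations since ${stat.accesses.since}`);
  });
  return usage;
}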

Conclusion

MongoDB Compound Indexes provide sophisticated multi-field query optimization that removes much of the complexity and many of the limitations of traditional relational indexing approaches. The integration of intelligent query planning, automatic index selection, and flexible field ordering makes building high-performance multi-dimensional queries both powerful and efficient.

Key Compound Index benefits include:

  • Advanced Query Optimization: Intelligent index selection and query path optimization
  • Multi-Field Efficiency: Single index supporting complex filtering, sorting, and range queries
  • Flexible Design Patterns: Support for various query patterns through strategic field ordering
  • Performance Monitoring: Comprehensive index usage analytics and optimization recommendations
  • Scalable Architecture: Efficient performance across large datasets and high-concurrency workloads
  • Developer Familiarity: SQL-style compound index creation and management patterns

Whether you're building analytics platforms, real-time dashboards, e-commerce applications, or any system requiring complex multi-field queries, MongoDB Compound Indexes with QueryLeaf's familiar SQL interface provide the foundation for optimal query performance. This combination enables sophisticated indexing strategies while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB compound index operations while providing SQL-familiar index creation, query optimization, and performance analysis. Advanced indexing strategies, query planning, and index effectiveness monitoring are seamlessly handled through familiar SQL patterns, making sophisticated database optimization both powerful and accessible.

The integration of advanced compound indexing capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both complex multi-field query performance and familiar database interaction patterns, ensuring your optimization strategies remain both effective and maintainable as they scale and evolve.

MongoDB Change Streams and Event-Driven Architecture: Building Reactive Applications with SQL-Style Event Processing

Modern applications increasingly require real-time responsiveness and event-driven architectures that can react instantly to data changes across distributed systems. Traditional polling-based approaches for change detection introduce significant latency, resource overhead, and scaling challenges that make building responsive applications complex and inefficient.

MongoDB Change Streams provide native event streaming capabilities that enable applications to watch for data changes in real-time, triggering immediate reactions without polling overhead. Unlike traditional database triggers or external change data capture systems, MongoDB Change Streams offer a unified, scalable approach to event-driven architecture that works seamlessly across replica sets and sharded clusters.
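
Before examining what traditional systems require, here is a minimal sketch of the core API: watching a single collection for inserted documents and reacting to each event as it arrives. The connection string, collection, and field names are illustrative assumptions.

// Minimal change stream sketch (assumed names; Node.js MongoDB driver)
const { MongoClient } = require('mongodb');

async function watchOrders(uri = 'mongodb://localhost:27017') {
  // Note: change streams require a replica set or sharded cluster deployment
  const client = new MongoClient(uri);
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Only surface newly inserted orders; fullDocument is included for inserts by default
  const changeStream = orders.watch([
    { $match: { operationType: 'insert' } }
  ]);

  changeStream.on('change', (event) => {
    console.log('New order:', event.fullDocument._id, event.fullDocument.total);
  });

  // The stream tracks a resume token internally, so processing can resume after transient failures
  return { client, changeStream };
}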

The Traditional Change Detection Challenge

Traditional approaches to detecting and reacting to data changes have significant architectural and performance limitations:

-- Traditional polling approach - inefficient and high-latency

-- PostgreSQL polling-based change detection
CREATE TABLE user_activities (
    activity_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

-- Polling query runs every few seconds
SELECT 
    activity_id,
    user_id,
    activity_type,
    activity_data,
    created_at
FROM user_activities 
WHERE processed = FALSE 
ORDER BY created_at ASC 
LIMIT 100;

-- Mark as processed after handling
UPDATE user_activities 
SET processed = TRUE, updated_at = CURRENT_TIMESTAMP
WHERE activity_id IN (1, 2, 3, ...);

-- Problems with polling approach:
-- 1. High latency - changes only detected on poll intervals
-- 2. Resource waste - constant querying even when no changes
-- 3. Scaling issues - increased polling frequency impacts performance
-- 4. Race conditions - multiple consumers competing for same records
-- 5. Complex state management - tracking processed vs unprocessed
-- 6. Poor real-time experience - delays in reaction to changes

-- Database trigger approach (limited and complex)
CREATE OR REPLACE FUNCTION notify_activity_change()
RETURNS TRIGGER AS $$
BEGIN
    PERFORM pg_notify('activity_changes', 
        json_build_object(
            'activity_id', NEW.activity_id,
            'user_id', NEW.user_id,
            'activity_type', NEW.activity_type,
            'operation', TG_OP
        )::text
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activity_change_trigger
AFTER INSERT OR UPDATE OR DELETE ON user_activities
FOR EACH ROW EXECUTE FUNCTION notify_activity_change();

-- Trigger limitations:
-- - Limited to single database instance
-- - No ordering guarantees across tables
-- - Difficult error handling and retry logic
-- - Complex setup for distributed systems
-- - No built-in filtering or transformation
-- - Poor integration with modern event architectures

-- MySQL limitations (even more restrictive)
CREATE TABLE change_log (
    id INT AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(100),
    record_id VARCHAR(100), 
    operation VARCHAR(10),
    change_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Basic trigger for change tracking
DELIMITER $$
CREATE TRIGGER user_change_tracker
AFTER INSERT ON users
FOR EACH ROW
BEGIN
    INSERT INTO change_log (table_name, record_id, operation, change_data)
    VALUES ('users', NEW.id, 'INSERT', JSON_OBJECT('user_id', NEW.id));
END$$
DELIMITER ;

-- MySQL trigger limitations:
-- - Very limited JSON functionality
-- - No advanced event routing capabilities
-- - Poor performance with high-volume changes
-- - Complex maintenance and debugging
-- - No distributed system support

MongoDB Change Streams provide comprehensive event-driven capabilities:

// MongoDB Change Streams - native event-driven architecture
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('event_driven_platform');

// Advanced Change Stream implementation for event-driven architecture
class EventDrivenMongoDBPlatform {
  constructor(db) {
    this.db = db;
    this.changeStreams = new Map();
    this.eventHandlers = new Map();
    this.metrics = {
      eventsProcessed: 0,
      lastEvent: null,
      errorCount: 0
    };
  }

  async setupEventDrivenCollections() {
    // Create collections for different event types
    const collections = {
      userActivities: this.db.collection('user_activities'),
      orderEvents: this.db.collection('order_events'),
      inventoryChanges: this.db.collection('inventory_changes'),
      systemEvents: this.db.collection('system_events'),
      auditLog: this.db.collection('audit_log')
    };

    // Create indexes for optimal change stream performance
    for (const [name, collection] of Object.entries(collections)) {
      await collection.createIndex({ userId: 1, timestamp: -1 });
      await collection.createIndex({ eventType: 1, status: 1 });
      await collection.createIndex({ createdAt: -1 });
    }

    return collections;
  }

  async startChangeStreamWatchers() {
    console.log('Starting change stream watchers...');

    // 1. Watch all changes across entire database
    await this.watchDatabaseChanges();

    // 2. Watch specific collection changes with filtering
    await this.watchUserActivityChanges();

    // 3. Watch order processing pipeline
    await this.watchOrderEvents();

    // 4. Watch inventory for real-time stock updates
    await this.watchInventoryChanges();

    console.log('All change stream watchers started');
  }

  async watchDatabaseChanges() {
    console.log('Setting up database-level change stream...');

    const changeStream = this.db.watch(
      [
        // Pipeline to filter and transform events
        {
          $match: {
            // Only watch insert, update, delete operations
            operationType: { $in: ['insert', 'update', 'delete', 'replace'] },

            // Exclude system collections and temporary data
            'ns.coll': { 
              $not: { $regex: '^(system\\.|temp_)' }
            }
          }
        },
        {
          $addFields: {
            // Add event metadata
            eventId: '$_id._data', // change event _id is the resume token document; _data holds its string form
            eventTimestamp: '$clusterTime',
            database: '$ns.db',
            collection: '$ns.coll',

            // Create standardized event structure
            eventData: {
              $switch: {
                branches: [
                  {
                    case: { $eq: ['$operationType', 'insert'] },
                    then: {
                      operation: 'created',
                      document: '$fullDocument'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'update'] },
                    then: {
                      operation: 'updated', 
                      documentKey: '$documentKey',
                      updatedFields: '$updateDescription.updatedFields',
                      removedFields: '$updateDescription.removedFields'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'delete'] },
                    then: {
                      operation: 'deleted',
                      documentKey: '$documentKey'
                    }
                  }
                ],
                default: {
                  operation: '$operationType',
                  documentKey: '$documentKey'
                }
              }
            }
          }
        }
      ],
      {
        fullDocument: 'updateLookup', // Include full document for updates
        fullDocumentBeforeChange: 'whenAvailable' // Include before state
      }
    );

    this.changeStreams.set('database', changeStream);

    // Handle database-level events
    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleDatabaseEvent(changeEvent);
        this.updateMetrics('database', changeEvent);
      } catch (error) {
        console.error('Error handling database event:', error);
        this.metrics.errorCount++;
      }
    });

    changeStream.on('error', (error) => {
      console.error('Database change stream error:', error);
      this.handleChangeStreamError('database', error);
    });
  }

  async watchUserActivityChanges() {
    console.log('Setting up user activity change stream...');

    const userActivities = this.db.collection('user_activities');

    const changeStream = userActivities.watch(
      [
        {
          $match: {
            operationType: { $in: ['insert', 'update'] },

            // Only watch for significant user activities
            $or: [
              { 'fullDocument.activityType': 'login' },
              { 'fullDocument.activityType': 'purchase' },
              { 'fullDocument.activityType': 'subscription_change' },
              { 'fullDocument.status': 'completed' },
              { 'updateDescription.updatedFields.status': 'completed' }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('userActivities', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleUserActivityEvent(changeEvent);

        // Trigger downstream events based on activity type
        await this.triggerDownstreamEvents('user_activity', changeEvent);

      } catch (error) {
        console.error('Error handling user activity event:', error);
        await this.logEventError('user_activities', changeEvent, error);
      }
    });
  }

  async watchOrderEvents() {
    console.log('Setting up order events change stream...');

    const orderEvents = this.db.collection('order_events');

    const changeStream = orderEvents.watch(
      [
        {
          $match: {
            operationType: 'insert',

            // Order lifecycle events
            'fullDocument.eventType': {
              $in: ['order_created', 'payment_processed', 'order_shipped', 
                   'order_delivered', 'order_cancelled', 'refund_processed']
            }
          }
        },
        {
          $addFields: {
            // Enrich with order context
            orderStage: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 'pending' },
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 'confirmed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_shipped'] }, then: 'in_transit' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_delivered'] }, then: 'completed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 'cancelled' }
                ],
                default: 'unknown'
              }
            },

            // Priority for event processing
            processingPriority: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 2 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'refund_processed'] }, then: 1 }
                ],
                default: 3
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    this.changeStreams.set('orderEvents', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Route to appropriate order processing handler
        await this.processOrderEventChange(changeEvent);

        // Update order state machine
        await this.updateOrderStateMachine(changeEvent);

        // Trigger business logic workflows
        await this.triggerOrderWorkflows(changeEvent);

      } catch (error) {
        console.error('Error processing order event:', error);
        await this.handleOrderEventError(changeEvent, error);
      }
    });
  }

  async watchInventoryChanges() {
    console.log('Setting up inventory change stream...');

    const inventoryChanges = this.db.collection('inventory_changes');

    const changeStream = inventoryChanges.watch(
      [
        {
          $match: {
            $or: [
              // Stock level changes
              { 
                operationType: 'update',
                'updateDescription.updatedFields.stockLevel': { $exists: true }
              },
              // New inventory items
              {
                operationType: 'insert',
                'fullDocument.itemType': 'product'
              },
              // Inventory alerts
              {
                operationType: 'insert',
                'fullDocument.alertType': { $in: ['low_stock', 'out_of_stock', 'restock'] }
              }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('inventoryChanges', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Real-time inventory updates
        await this.handleInventoryChange(changeEvent);

        // Check for low stock alerts
        await this.checkInventoryAlerts(changeEvent);

        // Update product availability in real-time
        await this.updateProductAvailability(changeEvent);

        // Notify relevant systems (pricing, recommendations, etc.)
        await this.notifyInventorySubscribers(changeEvent);

      } catch (error) {
        console.error('Error handling inventory change:', error);
        await this.logInventoryError(changeEvent, error);
      }
    });
  }

  async handleDatabaseEvent(changeEvent) {
    const { database, collection, eventData, operationType } = changeEvent;

    console.log(`Database Event: ${operationType} in ${database}.${collection}`);

    // Global event logging
    await this.logGlobalEvent({
      eventId: changeEvent.eventId,
      timestamp: new Date(changeEvent.clusterTime.getHighBits() * 1000), // clusterTime is a BSON Timestamp; high bits are seconds since epoch
      database: database,
      collection: collection,
      operation: operationType,
      eventData: eventData
    });

    // Route to collection-specific handlers
    await this.routeCollectionEvent(collection, changeEvent);

    // Update global metrics and monitoring
    await this.updateGlobalMetrics(changeEvent);
  }

  async handleUserActivityEvent(changeEvent) {
    const { fullDocument, operationType } = changeEvent;
    const activity = fullDocument;

    console.log(`User Activity: ${activity.activityType} for user ${activity.userId}`);

    // Real-time user analytics
    if (activity.activityType === 'login') {
      await this.updateUserSession(activity);
      await this.trackUserLocation(activity);
    }

    // Purchase events
    if (activity.activityType === 'purchase') {
      await this.processRealtimePurchase(activity);
      await this.updateRecommendations(activity.userId);
      await this.triggerLoyaltyUpdates(activity);
    }

    // Subscription changes
    if (activity.activityType === 'subscription_change') {
      await this.processSubscriptionChange(activity);
      await this.updateBilling(activity);
    }

    // Create reactive events for downstream systems
    await this.publishUserEvent(activity, operationType);
  }

  async processOrderEventChange(changeEvent) {
    const { fullDocument: orderEvent } = changeEvent;

    console.log(`Order Event: ${orderEvent.eventType} for order ${orderEvent.orderId}`);

    switch (orderEvent.eventType) {
      case 'order_created':
        await this.processNewOrder(orderEvent);
        break;

      case 'payment_processed':
        await this.confirmOrderPayment(orderEvent);
        await this.triggerFulfillment(orderEvent);
        break;

      case 'order_shipped':
        await this.updateShippingTracking(orderEvent);
        await this.notifyCustomer(orderEvent);
        break;

      case 'order_delivered':
        await this.completeOrder(orderEvent);
        await this.triggerPostDeliveryWorkflow(orderEvent);
        break;

      case 'order_cancelled':
        await this.processCancellation(orderEvent);
        await this.handleRefund(orderEvent);
        break;
    }

    // Update order analytics in real-time
    await this.updateOrderAnalytics(orderEvent);
  }

  async handleInventoryChange(changeEvent) {
    const { fullDocument: inventory, operationType } = changeEvent;

    console.log(`Inventory Change: ${operationType} for item ${inventory.itemId}`);

    // Real-time stock updates
    if (changeEvent.updateDescription?.updatedFields?.stockLevel !== undefined) {
      const newStock = changeEvent.fullDocument.stockLevel;
      const previousStock = changeEvent.fullDocumentBeforeChange?.stockLevel || 0;

      await this.handleStockLevelChange({
        itemId: inventory.itemId,
        previousStock: previousStock,
        newStock: newStock,
        changeAmount: newStock - previousStock
      });
    }

    // Product availability updates
    await this.updateProductCatalog(inventory);

    // Pricing adjustments based on stock levels
    await this.updateDynamicPricing(inventory);
  }

  async triggerDownstreamEvents(eventType, changeEvent) {
    // Message queue integration for external systems
    const event = {
      eventId: require('crypto').randomUUID(), // simple unique event id (replaces an undefined helper)
      eventType: eventType,
      timestamp: new Date(),
      source: 'mongodb-change-stream',
      data: changeEvent,
      version: '1.0'
    };

    // Publish to different channels based on event type
    await this.publishToEventBus(event);
    await this.updateEventSourcing(event);
    await this.triggerWebhooks(event);
  }

  async publishToEventBus(event) {
    // Integration with message queues (Kafka, RabbitMQ, etc.)
    console.log(`Publishing event ${event.eventId} to event bus`);

    // Route to appropriate topics/queues
    const routingKey = `${event.eventType}.${event.data.operationType}`;

    // Simulate message queue publishing
    // await messageQueue.publish(routingKey, event);
  }

  async setupResumeTokenPersistence() {
    // Persist resume tokens for fault tolerance
    const resumeTokens = this.db.collection('change_stream_resume_tokens');

    // Save resume tokens periodically
    setInterval(async () => {
      for (const [streamName, changeStream] of this.changeStreams.entries()) {
        try {
          const resumeToken = changeStream.resumeToken;
          if (resumeToken) {
            await resumeTokens.updateOne(
              { streamName: streamName },
              {
                $set: {
                  resumeToken: resumeToken,
                  lastUpdated: new Date()
                }
              },
              { upsert: true }
            );
          }
        } catch (error) {
          console.error(`Error saving resume token for ${streamName}:`, error);
        }
      }
    }, 10000); // Every 10 seconds
  }

  async handleChangeStreamError(streamName, error) {
    console.error(`Change stream ${streamName} encountered error:`, error);

    // Implement retry logic with exponential backoff
    setTimeout(async () => {
      try {
        console.log(`Attempting to restart change stream: ${streamName}`);

        // Load last known resume token
        const resumeTokenDoc = await this.db.collection('change_stream_resume_tokens')
          .findOne({ streamName: streamName });

        // Restart stream from last known position
        if (resumeTokenDoc?.resumeToken) {
          // Restart with resume token
          await this.restartChangeStream(streamName, resumeTokenDoc.resumeToken);
        } else {
          // Restart from current time
          await this.restartChangeStream(streamName);
        }

      } catch (retryError) {
        console.error(`Failed to restart change stream ${streamName}:`, retryError);
        // Implement exponential backoff retry
      }
    }, 5000); // Initial 5-second delay
  }

  async getChangeStreamMetrics() {
    return {
      activeStreams: this.changeStreams.size,
      eventsProcessed: this.metrics.eventsProcessed,
      lastEventTime: this.metrics.lastEvent,
      errorCount: this.metrics.errorCount,

      streamHealth: Array.from(this.changeStreams.entries()).map(([name, stream]) => ({
        name: name,
        isActive: !stream.closed,
        hasResumeToken: !!stream.resumeToken
      }))
    };
  }

  updateMetrics(streamName, changeEvent) {
    this.metrics.eventsProcessed++;
    this.metrics.lastEvent = new Date();

    console.log(`Processed event from ${streamName}: ${changeEvent.operationType}`);
  }

  async shutdown() {
    console.log('Shutting down change streams...');

    // Close all change streams gracefully
    for (const [name, changeStream] of this.changeStreams.entries()) {
      try {
        await changeStream.close();
        console.log(`Closed change stream: ${name}`);
      } catch (error) {
        console.error(`Error closing change stream ${name}:`, error);
      }
    }

    this.changeStreams.clear();
    console.log('All change streams closed');
  }
}

// Usage example
const startEventDrivenPlatform = async () => {
  try {
    const platform = new EventDrivenMongoDBPlatform(db);

    // Setup collections and indexes
    await platform.setupEventDrivenCollections();

    // Start change stream watchers
    await platform.startChangeStreamWatchers();

    // Setup fault tolerance
    await platform.setupResumeTokenPersistence();

    // Monitor platform health
    setInterval(async () => {
      const metrics = await platform.getChangeStreamMetrics();
      console.log('Platform Metrics:', metrics);
    }, 30000); // Every 30 seconds

    console.log('Event-driven platform started successfully');
    return platform;

  } catch (error) {
    console.error('Error starting event-driven platform:', error);
    throw error;
  }
};

// Benefits of MongoDB Change Streams:
// - Real-time event processing without polling overhead
// - Ordered, durable event streams with resume token support  
// - Cluster-wide change detection across replica sets and shards
// - Rich filtering and transformation capabilities through aggregation pipelines
// - Built-in fault tolerance and automatic failover
// - Integration with MongoDB's ACID transactions
// - Scalable event-driven architecture foundation
// - Native integration with MongoDB ecosystem and tools

module.exports = {
  EventDrivenMongoDBPlatform,
  startEventDrivenPlatform
};

Understanding MongoDB Change Streams Architecture

Advanced Change Stream Patterns

Implement sophisticated change stream patterns for different event-driven scenarios:

// Advanced change stream patterns and event processing
class AdvancedChangeStreamPatterns {
  constructor(db) {
    this.db = db;
    this.eventProcessors = new Map();
    this.eventStore = db.collection('event_store');
    this.eventProjections = db.collection('event_projections');
  }

  async setupEventSourcingPattern() {
    // Event sourcing with change streams
    console.log('Setting up event sourcing pattern...');

    const aggregateCollections = [
      'user_aggregates',
      'order_aggregates', 
      'inventory_aggregates',
      'payment_aggregates'
    ];

    for (const collectionName of aggregateCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'replace'] }
            }
          },
          {
            $addFields: {
              // Create event sourcing envelope
              eventEnvelope: {
                eventId: '$_id._data', // resume token string for this change event
                eventType: '$operationType',
                aggregateId: '$documentKey._id',
                aggregateType: collectionName,
                eventVersion: { $ifNull: ['$fullDocument.version', 1] },
                eventData: '$fullDocument',
                eventMetadata: {
                  timestamp: '$clusterTime',
                  source: 'change-stream',
                  causationId: '$fullDocument.causationId',
                  correlationId: '$fullDocument.correlationId'
                }
              }
            }
          }
        ],
        {
          fullDocument: 'updateLookup',
          fullDocumentBeforeChange: 'whenAvailable'
        }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processEventSourcingEvent(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_eventsourcing`, changeStream);
    }
  }

  async processEventSourcingEvent(changeEvent) {
    const { eventEnvelope } = changeEvent;

    // Store event in event store
    await this.eventStore.insertOne({
      ...eventEnvelope,
      storedAt: new Date(),
      processedBy: [],
      projectionStatus: 'pending'
    });

    // Update read model projections
    await this.updateProjections(eventEnvelope);

    // Trigger sagas and process managers
    await this.triggerSagas(eventEnvelope);
  }

  async setupCQRSPattern() {
    // Command Query Responsibility Segregation with change streams
    console.log('Setting up CQRS pattern...');

    const commandCollections = ['commands', 'command_results'];

    for (const collectionName of commandCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: 'insert',
              'fullDocument.status': { $ne: 'processed' }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCommand(changeEvent.fullDocument);
      });

      this.eventProcessors.set(`${collectionName}_cqrs`, changeStream);
    }
  }

  async setupSagaOrchestration() {
    // Saga pattern for distributed transaction coordination
    console.log('Setting up saga orchestration...');

    const sagaCollection = this.db.collection('sagas');

    const changeStream = sagaCollection.watch(
      [
        {
          $match: {
            $or: [
              { operationType: 'insert' },
              { 
                operationType: 'update',
                'updateDescription.updatedFields.status': { $exists: true }
              }
            ]
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    changeStream.on('change', async (changeEvent) => {
      await this.processSagaEvent(changeEvent);
    });

    this.eventProcessors.set('saga_orchestration', changeStream);
  }

  async processSagaEvent(changeEvent) {
    const saga = changeEvent.fullDocument;
    const { sagaId, status, currentStep, steps } = saga;

    console.log(`Processing saga ${sagaId}: ${status} at step ${currentStep}`);

    switch (status) {
      case 'started':
        await this.executeSagaStep(saga, 0);
        break;

      case 'step_completed':
        if (currentStep + 1 < steps.length) {
          await this.executeSagaStep(saga, currentStep + 1);
        } else {
          await this.completeSaga(sagaId);
        }
        break;

      case 'step_failed':
        await this.compensateSaga(saga, currentStep);
        break;

      case 'compensating':
        if (currentStep > 0) {
          await this.executeCompensation(saga, currentStep - 1);
        } else {
          await this.failSaga(sagaId);
        }
        break;
    }
  }

  async setupStreamProcessing() {
    // Stream processing with windowed aggregations
    console.log('Setting up stream processing...');

    const eventStream = this.db.collection('events');

    const changeStream = eventStream.watch(
      [
        {
          $match: {
            operationType: 'insert',
            'fullDocument.eventType': { $in: ['user_activity', 'transaction', 'system_event'] }
          }
        },
        {
          $addFields: {
            processingWindow: {
              $dateTrunc: {
                date: '$fullDocument.timestamp',
                unit: 'minute',
                binSize: 5 // 5-minute windows
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    let windowBuffer = new Map();

    changeStream.on('change', async (changeEvent) => {
      await this.processStreamEvent(changeEvent, windowBuffer);
    });

    // Process window aggregations every minute
    setInterval(async () => {
      await this.processWindowedAggregations(windowBuffer);
    }, 60000);

    this.eventProcessors.set('stream_processing', changeStream);
  }

  async processStreamEvent(changeEvent, windowBuffer) {
    const event = changeEvent.fullDocument;
    const window = changeEvent.processingWindow;
    const windowKey = window.toISOString();

    if (!windowBuffer.has(windowKey)) {
      windowBuffer.set(windowKey, {
        window: window,
        events: [],
        aggregations: {
          count: 0,
          userActivities: 0,
          transactions: 0,
          systemEvents: 0,
          totalValue: 0
        }
      });
    }

    const windowData = windowBuffer.get(windowKey);
    windowData.events.push(event);
    windowData.aggregations.count++;

    // Type-specific aggregations
    switch (event.eventType) {
      case 'user_activity':
        windowData.aggregations.userActivities++;
        break;
      case 'transaction':
        windowData.aggregations.transactions++;
        windowData.aggregations.totalValue += event.amount || 0;
        break;
      case 'system_event':
        windowData.aggregations.systemEvents++;
        break;
    }

    // Real-time alerting for anomalies
    if (windowData.aggregations.count > 1000) {
      await this.triggerVolumeAlert(windowKey, windowData);
    }
  }

  async setupMultiCollectionCoordination() {
    // Coordinate changes across multiple collections
    console.log('Setting up multi-collection coordination...');

    const coordinationConfig = [
      {
        collections: ['users', 'user_preferences', 'user_activities'],
        coordinator: 'userProfileCoordinator'
      },
      {
        collections: ['orders', 'order_items', 'payments', 'shipping'],
        coordinator: 'orderProcessingCoordinator' 
      },
      {
        collections: ['products', 'inventory', 'pricing', 'reviews'],
        coordinator: 'productManagementCoordinator'
      }
    ];

    for (const config of coordinationConfig) {
      await this.setupCollectionCoordinator(config);
    }
  }

  async setupCollectionCoordinator(config) {
    const { collections, coordinator } = config;

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'delete'] }
            }
          },
          {
            $addFields: {
              coordinationContext: {
                coordinator: coordinator,
                sourceCollection: collectionName,
                relatedCollections: collections.filter(c => c !== collectionName)
              }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCoordinatedChange(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_${coordinator}`, changeStream);
    }
  }

  async processCoordinatedChange(changeEvent) {
    const { coordinationContext, fullDocument, operationType } = changeEvent;
    const { coordinator, sourceCollection, relatedCollections } = coordinationContext;

    console.log(`Coordinated change in ${sourceCollection} via ${coordinator}`);

    // Execute coordination logic based on coordinator type
    switch (coordinator) {
      case 'userProfileCoordinator':
        await this.coordinateUserProfileChanges(changeEvent);
        break;

      case 'orderProcessingCoordinator':
        await this.coordinateOrderProcessing(changeEvent);
        break;

      case 'productManagementCoordinator':
        await this.coordinateProductManagement(changeEvent);
        break;
    }
  }

  async coordinateUserProfileChanges(changeEvent) {
    const { fullDocument, operationType, ns } = changeEvent;
    const sourceCollection = ns.coll;

    if (sourceCollection === 'users' && operationType === 'update') {
      // User profile updated - sync preferences and activities
      await this.syncUserPreferences(fullDocument._id);
      await this.updateUserActivityContext(fullDocument._id);
    }

    if (sourceCollection === 'user_activities' && operationType === 'insert') {
      // New activity - update user profile analytics
      await this.updateUserAnalytics(fullDocument.userId, fullDocument);
    }
  }

  async setupChangeStreamHealthMonitoring() {
    // Health monitoring and metrics collection
    console.log('Setting up change stream health monitoring...');

    const healthMetrics = {
      totalStreams: 0,
      activeStreams: 0,
      eventsProcessed: 0,
      errorCount: 0,
      lastProcessedEvent: null,
      streamLatency: new Map()
    };

    // Monitor each change stream
    for (const [streamName, changeStream] of this.eventProcessors.entries()) {
      healthMetrics.totalStreams++;

      if (!changeStream.closed) {
        healthMetrics.activeStreams++;
      }

      // Monitor stream latency
      const originalEmit = changeStream.emit;
      changeStream.emit = function(event, ...args) {
        if (event === 'change') {
          const latency = Date.now() - args[0].clusterTime.getHighBits() * 1000; // clusterTime is a BSON Timestamp (seconds in high bits)
          healthMetrics.streamLatency.set(streamName, latency);
          healthMetrics.lastProcessedEvent = new Date();
          healthMetrics.eventsProcessed++;
        }
        return originalEmit.call(this, event, ...args);
      };

      // Monitor errors
      changeStream.on('error', (error) => {
        healthMetrics.errorCount++;
        console.error(`Stream ${streamName} error:`, error);
      });
    }

    // Periodic health reporting
    setInterval(() => {
      this.reportHealthMetrics(healthMetrics);
    }, 30000); // Every 30 seconds

    return healthMetrics;
  }

  reportHealthMetrics(metrics) {
    const avgLatency = Array.from(metrics.streamLatency.values())
      .reduce((sum, latency) => sum + latency, 0) / metrics.streamLatency.size || 0;

    console.log('Change Stream Health Report:', {
      totalStreams: metrics.totalStreams,
      activeStreams: metrics.activeStreams,
      eventsProcessed: metrics.eventsProcessed,
      errorCount: metrics.errorCount,
      averageLatency: Math.round(avgLatency) + 'ms',
      lastActivity: metrics.lastProcessedEvent
    });
  }

  async shutdown() {
    console.log('Shutting down advanced change stream patterns...');

    for (const [name, processor] of this.eventProcessors.entries()) {
      try {
        await processor.close();
        console.log(`Closed processor: ${name}`);
      } catch (error) {
        console.error(`Error closing processor ${name}:`, error);
      }
    }

    this.eventProcessors.clear();
  }
}

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Change Stream operations:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream watchers with SQL-style syntax
CREATE CHANGE_STREAM user_activity_watcher ON user_activities
WITH (
  operations = ['insert', 'update'],
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable'
)
FILTER (
  activity_type IN ('login', 'purchase', 'subscription_change')
  OR status = 'completed'
);

-- Advanced change stream with aggregation pipeline
CREATE CHANGE_STREAM order_processing_watcher ON order_events
WITH (
  operations = ['insert'],
  full_document = 'updateLookup'
)
PIPELINE (
  FILTER (
    event_type IN ('order_created', 'payment_processed', 'order_shipped', 'order_delivered')
  ),
  ADD_FIELDS (
    order_stage = CASE 
      WHEN event_type = 'order_created' THEN 'pending'
      WHEN event_type = 'payment_processed' THEN 'confirmed'
      WHEN event_type = 'order_shipped' THEN 'in_transit'
      WHEN event_type = 'order_delivered' THEN 'completed'
      ELSE 'unknown'
    END,
    processing_priority = CASE
      WHEN event_type = 'payment_processed' THEN 1
      WHEN event_type = 'order_created' THEN 2
      ELSE 3
    END
  )
);

-- Database-level change stream monitoring
CREATE CHANGE_STREAM database_monitor ON DATABASE
WITH (
  operations = ['insert', 'update', 'delete'],
  full_document = 'updateLookup'
)
FILTER (
  -- Exclude system collections
  ns.coll NOT LIKE 'system.%'
  AND ns.coll NOT LIKE 'temp_%'
)
PIPELINE (
  ADD_FIELDS (
    event_id = CAST(_id AS VARCHAR),
    event_timestamp = cluster_time,
    database_name = ns.db,
    collection_name = ns.coll,
    event_data = CASE operation_type
      WHEN 'insert' THEN JSON_BUILD_OBJECT('operation', 'created', 'document', full_document)
      WHEN 'update' THEN JSON_BUILD_OBJECT(
        'operation', 'updated',
        'document_key', document_key,
        'updated_fields', update_description.updated_fields,
        'removed_fields', update_description.removed_fields
      )
      WHEN 'delete' THEN JSON_BUILD_OBJECT('operation', 'deleted', 'document_key', document_key)
      ELSE JSON_BUILD_OBJECT('operation', operation_type, 'document_key', document_key)
    END
  )
);

-- Event-driven reactive queries
WITH CHANGE_STREAM inventory_changes AS (
  SELECT 
    document_key._id as item_id,
    full_document.item_name,
    full_document.stock_level,
    full_document_before_change.stock_level as previous_stock_level,
    operation_type,
    cluster_time as event_time,

    -- Calculate stock change
    full_document.stock_level - COALESCE(full_document_before_change.stock_level, 0) as stock_change

  FROM CHANGE_STREAM ON inventory 
  WHERE operation_type IN ('insert', 'update')
    AND (full_document.stock_level != full_document_before_change.stock_level OR operation_type = 'insert')
),
stock_alerts AS (
  SELECT *,
    CASE 
      WHEN stock_level = 0 THEN 'OUT_OF_STOCK'
      WHEN stock_level <= 10 THEN 'LOW_STOCK' 
      WHEN stock_change > 0 AND previous_stock_level = 0 THEN 'RESTOCKED'
      ELSE 'NORMAL'
    END as alert_type,

    CASE
      WHEN stock_level = 0 THEN 'critical'
      WHEN stock_level <= 10 THEN 'warning'
      WHEN stock_change > 100 THEN 'info'
      ELSE 'normal'
    END as alert_severity

  FROM inventory_changes
)
SELECT 
  item_id,
  item_name,
  stock_level,
  previous_stock_level,
  stock_change,
  alert_type,
  alert_severity,
  event_time,

  -- Generate alert message
  CASE alert_type
    WHEN 'OUT_OF_STOCK' THEN CONCAT('Item ', item_name, ' is now out of stock')
    WHEN 'LOW_STOCK' THEN CONCAT('Item ', item_name, ' is running low (', stock_level, ' remaining)')
    WHEN 'RESTOCKED' THEN CONCAT('Item ', item_name, ' has been restocked (', stock_level, ' units)')
    ELSE CONCAT('Stock updated for ', item_name, ': ', stock_change, ' units')
  END as alert_message

FROM stock_alerts
WHERE alert_type != 'NORMAL'
ORDER BY alert_severity DESC, event_time DESC;

-- Real-time user activity aggregation
WITH CHANGE_STREAM user_events AS (
  SELECT 
    full_document.user_id,
    full_document.activity_type,
    full_document.session_id,
    full_document.timestamp,
    full_document.metadata,
    cluster_time as event_time

  FROM CHANGE_STREAM ON user_activities
  WHERE operation_type = 'insert'
    AND full_document.activity_type IN ('page_view', 'click', 'purchase', 'login')
),
session_aggregations AS (
  SELECT 
    user_id,
    session_id,
    TIME_WINDOW('5 minutes', event_time) as time_window,

    -- Activity counts
    COUNT(*) as total_activities,
    COUNT(*) FILTER (WHERE activity_type = 'page_view') as page_views,
    COUNT(*) FILTER (WHERE activity_type = 'click') as clicks, 
    COUNT(*) FILTER (WHERE activity_type = 'purchase') as purchases,

    -- Session metrics
    MIN(timestamp) as session_start,
    MAX(timestamp) as session_end,
    MAX(timestamp) - MIN(timestamp) as session_duration,

    -- Engagement scoring
    COUNT(DISTINCT metadata.page_url) as unique_pages_visited,
    AVG(EXTRACT(EPOCH FROM (LEAD(timestamp) OVER (ORDER BY timestamp) - timestamp))) as avg_time_between_activities

  FROM user_events
  GROUP BY user_id, session_id, TIME_WINDOW('5 minutes', event_time)
),
user_behavior_insights AS (
  SELECT *,
    -- Engagement level
    CASE 
      WHEN session_duration > INTERVAL '30 minutes' AND clicks > 20 THEN 'highly_engaged'
      WHEN session_duration > INTERVAL '10 minutes' AND clicks > 5 THEN 'engaged'
      WHEN session_duration > INTERVAL '2 minutes' THEN 'browsing'
      ELSE 'quick_visit'
    END as engagement_level,

    -- Conversion indicators
    purchases > 0 as converted_session,
    clicks::numeric / GREATEST(page_views, 1) as click_through_rate,

    -- Behavioral patterns
    CASE 
      WHEN unique_pages_visited > 10 THEN 'explorer'
      WHEN avg_time_between_activities > 60 THEN 'reader'
      WHEN clicks > page_views * 2 THEN 'active_clicker'
      ELSE 'standard'
    END as behavior_pattern

  FROM session_aggregations
)
SELECT 
  user_id,
  session_id,
  time_window,
  total_activities,
  page_views,
  clicks,
  purchases,
  session_duration,
  engagement_level,
  behavior_pattern,
  converted_session,
  ROUND(click_through_rate, 3) as ctr,

  -- Real-time recommendations
  CASE behavior_pattern
    WHEN 'explorer' THEN 'Show product recommendations based on browsed categories'
    WHEN 'reader' THEN 'Provide detailed product information and reviews'
    WHEN 'active_clicker' THEN 'Present clear call-to-action buttons and offers'
    ELSE 'Standard personalization approach'
  END as recommendation_strategy

FROM user_behavior_insights
WHERE engagement_level IN ('engaged', 'highly_engaged')
ORDER BY session_start DESC;

-- Event sourcing with change streams
CREATE EVENT_STORE aggregate_events AS
SELECT 
  CAST(cluster_time AS VARCHAR) as event_id,
  operation_type as event_type,
  document_key._id as aggregate_id,
  ns.coll as aggregate_type,
  COALESCE(full_document.version, 1) as event_version,
  full_document as event_data,

  -- Event metadata
  JSON_BUILD_OBJECT(
    'timestamp', cluster_time,
    'source', 'change-stream',
    'causation_id', full_document.causation_id,
    'correlation_id', full_document.correlation_id,
    'user_id', full_document.user_id
  ) as event_metadata

FROM CHANGE_STREAM ON DATABASE
WHERE operation_type IN ('insert', 'update', 'replace')
  AND ns.coll LIKE '%_aggregates'
ORDER BY cluster_time ASC;

-- CQRS read model projections
CREATE MATERIALIZED VIEW user_profile_projection AS
WITH user_events AS (
  SELECT *
  FROM aggregate_events
  WHERE aggregate_type = 'user_aggregates'
    AND event_type IN ('insert', 'update')
  ORDER BY event_version ASC
),
profile_changes AS (
  SELECT 
    aggregate_id as user_id,
    event_data.email,
    event_data.first_name,
    event_data.last_name,
    event_data.preferences,
    event_data.subscription_status,
    event_data.total_orders,
    event_data.lifetime_value,
    event_metadata.timestamp as last_updated,

    -- Calculate derived fields
    ROW_NUMBER() OVER (PARTITION BY aggregate_id ORDER BY event_version DESC) as rn

  FROM user_events
)
SELECT 
  user_id,
  email,
  CONCAT(first_name, ' ', last_name) as full_name,
  preferences,
  subscription_status,
  total_orders,
  lifetime_value,
  last_updated,

  -- User segments
  CASE 
    WHEN lifetime_value > 1000 THEN 'premium'
    WHEN total_orders > 10 THEN 'loyal'
    WHEN total_orders > 0 THEN 'customer'
    ELSE 'prospect'
  END as user_segment,

  -- Activity status
  CASE 
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'active'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'recent'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'inactive'
    ELSE 'dormant'
  END as activity_status

FROM profile_changes
WHERE rn = 1; -- Latest version only

-- Saga orchestration monitoring
WITH CHANGE_STREAM saga_events AS (
  SELECT 
    full_document.saga_id,
    full_document.saga_type,
    full_document.status,
    full_document.current_step,
    full_document.steps,
    full_document.started_at,
    full_document.completed_at,
    cluster_time as event_time,
    operation_type

  FROM CHANGE_STREAM ON sagas
  WHERE operation_type IN ('insert', 'update')
),
saga_monitoring AS (
  SELECT 
    saga_id,
    saga_type,
    status,
    current_step,
    ARRAY_LENGTH(steps, 1) as total_steps,
    started_at,
    completed_at,
    event_time,

    -- Progress calculation
    CASE 
      WHEN status = 'completed' THEN 100.0
      WHEN status = 'failed' THEN 0.0
      WHEN ARRAY_LENGTH(steps, 1) > 0 THEN (current_step::numeric / ARRAY_LENGTH(steps, 1)) * 100.0
      ELSE 0.0
    END as progress_percentage,

    -- Duration tracking
    CASE 
      WHEN completed_at IS NOT NULL THEN completed_at - started_at
      ELSE CURRENT_TIMESTAMP - started_at
    END as duration,

    -- Status classification
    CASE status
      WHEN 'completed' THEN 'success'
      WHEN 'failed' THEN 'error'
      WHEN 'compensating' THEN 'warning'
      WHEN 'started' THEN 'in_progress'
      ELSE 'unknown'
    END as status_category

  FROM saga_events
),
saga_health AS (
  SELECT 
    saga_type,
    status_category,
    COUNT(*) as saga_count,
    AVG(progress_percentage) as avg_progress,
    AVG(EXTRACT(EPOCH FROM duration)) as avg_duration_seconds,

    -- Performance metrics
    COUNT(*) FILTER (WHERE status = 'completed') as success_count,
    COUNT(*) FILTER (WHERE status = 'failed') as failure_count,
    COUNT(*) FILTER (WHERE duration > INTERVAL '5 minutes') as slow_saga_count

  FROM saga_monitoring
  WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY saga_type, status_category
)
SELECT 
  saga_type,
  status_category,
  saga_count,
  ROUND(avg_progress, 1) as avg_progress_pct,
  ROUND(avg_duration_seconds, 2) as avg_duration_sec,
  success_count,
  failure_count,
  slow_saga_count,

  -- Health indicators
  CASE 
    WHEN failure_count > success_count THEN 'unhealthy'
    WHEN slow_saga_count > saga_count * 0.5 THEN 'degraded'
    ELSE 'healthy'
  END as health_status,

  -- Success rate
  CASE 
    WHEN (success_count + failure_count) > 0 
    THEN ROUND((success_count::numeric / (success_count + failure_count)) * 100, 1)
    ELSE 0.0
  END as success_rate_pct

FROM saga_health
ORDER BY saga_type, status_category;

-- Resume token management for fault tolerance
CREATE TABLE change_stream_resume_tokens (
  stream_name VARCHAR(100) PRIMARY KEY,
  resume_token DOCUMENT NOT NULL,
  last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  stream_config DOCUMENT,

  -- Health tracking
  last_event_time TIMESTAMP,
  error_count INTEGER DEFAULT 0,
  restart_count INTEGER DEFAULT 0
);

-- Monitoring and alerting for change streams
WITH stream_health AS (
  SELECT 
    stream_name,
    resume_token,
    last_updated,
    last_event_time,
    error_count,
    restart_count,

    -- Health calculation
    CURRENT_TIMESTAMP - last_event_time as time_since_last_event,
    CURRENT_TIMESTAMP - last_updated as time_since_update,

    CASE 
      WHEN last_event_time IS NULL THEN 'never_active'
      WHEN CURRENT_TIMESTAMP - last_event_time > INTERVAL '5 minutes' THEN 'stalled'
      WHEN error_count > 5 THEN 'error_prone'
      WHEN restart_count > 3 THEN 'unstable'
      ELSE 'healthy'
    END as health_status

  FROM change_stream_resume_tokens
)
SELECT 
  stream_name,
  health_status,
  EXTRACT(EPOCH FROM time_since_last_event) as seconds_since_last_event,
  error_count,
  restart_count,

  -- Alert conditions
  CASE health_status
    WHEN 'never_active' THEN 'Stream has never processed events - check configuration'
    WHEN 'stalled' THEN 'Stream has not processed events recently - investigate connectivity'
    WHEN 'error_prone' THEN 'High error rate - review error logs and handlers'
    WHEN 'unstable' THEN 'Frequent restarts - check resource limits and stability'
    ELSE 'Stream operating normally'
  END as alert_message,

  CASE health_status
    WHEN 'never_active' THEN 'critical'
    WHEN 'stalled' THEN 'warning'  
    WHEN 'error_prone' THEN 'warning'
    WHEN 'unstable' THEN 'info'
    ELSE 'normal'
  END as alert_severity

FROM stream_health
WHERE health_status != 'healthy'
ORDER BY 
  CASE health_status
    WHEN 'never_active' THEN 1
    WHEN 'stalled' THEN 2
    WHEN 'error_prone' THEN 3
    WHEN 'unstable' THEN 4
    ELSE 5
  END;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and management syntax
-- 2. Real-time event processing with filtering and transformation
-- 3. Event-driven architecture patterns (CQRS, Event Sourcing, Sagas)
-- 4. Advanced stream processing with windowed aggregations
-- 5. Fault tolerance with resume token management
-- 6. Health monitoring and alerting for change streams
-- 7. Integration with MongoDB's native change stream optimizations
-- 8. Reactive query patterns for real-time analytics
-- 9. Multi-collection coordination and event correlation
-- 10. Familiar SQL syntax for complex event-driven applications

Best Practices for Change Stream Implementation

Event-Driven Architecture Design

Essential patterns for building robust event-driven systems:

  1. Event Schema Design: Create consistent event schemas with proper versioning and backward compatibility
  2. Resume Token Management: Implement reliable resume token persistence for fault tolerance (see the sketch after this list)
  3. Error Handling: Design comprehensive error handling with retry logic and dead letter queues
  4. Ordering Guarantees: Understand MongoDB's ordering guarantees and design accordingly
  5. Filtering Optimization: Use aggregation pipelines to filter events at the database level
  6. Resource Management: Monitor memory usage and connection limits for change streams
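
The sketch below illustrates items 2 and 5 in Node.js. It is a minimal example under assumed names (an 'orders' collection and a 'change_stream_resume_tokens' token store), not a complete implementation: the aggregation pipeline filters events on the server, and the resume token is persisted only after each event has been processed.

// Minimal sketch: filtered change stream with resume-token persistence
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');

async function watchOrdersWithResume() {
  await client.connect();
  const db = client.db('app');
  const tokenStore = db.collection('change_stream_resume_tokens');

  // Resume from the last persisted token, if one exists
  const saved = await tokenStore.findOne({ stream_name: 'orders' });

  const changeStream = db.collection('orders').watch(
    [
      // Filter at the server so only relevant events cross the wire
      { $match: { operationType: { $in: ['insert', 'update'] } } }
    ],
    {
      fullDocument: 'updateLookup',
      ...(saved ? { resumeAfter: saved.resume_token } : {})
    }
  );

  for await (const event of changeStream) {
    await handleEvent(event); // application-specific processing (stubbed below)

    // Persist the token only after the event has been processed successfully
    await tokenStore.updateOne(
      { stream_name: 'orders' },
      { $set: { resume_token: event._id, last_updated: new Date() } },
      { upsert: true }
    );
  }
}

async function handleEvent(event) {
  console.log(`${event.operationType} on ${event.ns.db}.${event.ns.coll}`);
}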

Performance and Scalability

Optimize change streams for high-performance event processing:

  1. Connection Pooling: Use appropriate connection pooling for change stream connections
  2. Batch Processing: Process events in batches where possible to improve throughput (see the sketch after this list)
  3. Parallel Processing: Design for parallel event processing while maintaining ordering
  4. Resource Limits: Set appropriate limits on change stream cursors and connections
  5. Monitoring: Implement comprehensive monitoring for stream health and performance
  6. Graceful Degradation: Design fallback mechanisms for change stream failures
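
As a rough illustration of item 2, the helper below (hypothetical names and defaults) buffers change events and flushes them to a downstream handler either when the batch fills or when a timer fires, trading a small amount of latency for higher throughput:

// Minimal sketch: batched processing of change stream events
async function processInBatches(changeStream, flushFn, { batchSize = 500, flushMs = 1000 } = {}) {
  let batch = [];

  // Time-based flush so small batches are not held indefinitely
  const timer = setInterval(async () => {
    if (batch.length > 0) {
      await flushFn(batch.splice(0));
    }
  }, flushMs);

  try {
    for await (const event of changeStream) {
      batch.push(event);

      // Size-based flush for high-volume periods
      if (batch.length >= batchSize) {
        await flushFn(batch.splice(0));
      }
    }
  } finally {
    clearInterval(timer);
    if (batch.length > 0) {
      await flushFn(batch.splice(0));
    }
  }
}

// Example usage (assumed collections):
// await processInBatches(
//   db.collection('orders').watch(),
//   (events) => db.collection('order_events_raw').insertMany(events, { ordered: false })
// );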

Conclusion

MongoDB Change Streams provide native event-driven architecture capabilities that eliminate the complexity and limitations of traditional polling and trigger-based approaches. The ability to react to data changes in real time through ordered, resumable event streams makes responsive, scalable applications significantly simpler to build and operate.

Key Change Streams benefits include:

  • Real-Time Reactivity: Instant response to data changes without polling overhead
  • Ordered Event Processing: Guaranteed ordering within shards with resume token support
  • Scalable Architecture: Works seamlessly across replica sets and sharded clusters
  • Rich Filtering: Aggregation pipeline support for sophisticated event filtering and transformation
  • Fault Tolerance: Built-in resume capabilities and error handling for production reliability
  • Ecosystem Integration: Native integration with MongoDB's ACID transactions and tooling

Whether you're building microservices architectures, real-time dashboards, event sourcing systems, or any application requiring immediate response to data changes, MongoDB Change Streams with QueryLeaf's familiar SQL interface provides the foundation for modern event-driven applications.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB Change Streams while providing SQL-familiar event processing syntax, change detection patterns, and reactive query capabilities. Advanced event-driven architecture patterns including CQRS, Event Sourcing, and Sagas are elegantly handled through familiar SQL constructs, making sophisticated reactive applications both powerful and accessible to SQL-oriented development teams.

The combination of native change stream capabilities with SQL-style event processing makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven solutions remain both effective and maintainable as they evolve and scale.

MongoDB Capped Collections and Circular Buffers: High-Performance Logging and Event Storage with SQL-Style Data Management

High-performance applications generate massive volumes of log data, events, and operational metrics that require specialized storage patterns optimized for write-heavy workloads, automatic size management, and chronological data access. Traditional database approaches for logging and event storage struggle with write performance bottlenecks, complex rotation mechanisms, and inefficient space utilization when dealing with continuous data streams.

MongoDB Capped Collections provide purpose-built capabilities for circular buffer patterns, offering fixed-size collections with automatic document rotation, natural insertion-order preservation, and optimized write performance. Unlike traditional logging solutions that require complex partitioning schemes or external rotation tools, capped collections automatically manage storage limits while maintaining chronological access patterns essential for debugging, monitoring, and real-time analytics.
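
Conceptually, a capped collection behaves like a fixed-size ring buffer: once the configured size (or document count) limit is reached, the oldest documents are overwritten as new ones arrive, and reads return documents in insertion order unless another sort is requested. A short mongosh-style sketch with illustrative sizes:

// Create a small capped collection: ~1MB, at most 1000 documents
db.createCollection('recent_events', { capped: true, size: 1024 * 1024, max: 1000 });

// Writes beyond the limit silently evict the oldest documents
for (let i = 0; i < 1500; i++) {
  db.recent_events.insertOne({ seq: i, at: new Date() });
}

db.recent_events.countDocuments();                        // 1000 - the first 500 were rotated out
db.recent_events.find().limit(3);                         // natural (insertion) order: seq 500, 501, 502
db.recent_events.find().sort({ $natural: -1 }).limit(3);  // newest first: seq 1499, 1498, 1497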

The Traditional Logging Storage Challenge

Conventional approaches to high-volume logging and event storage have significant limitations for modern applications:

-- Traditional relational logging approach - complex and performance-limited

-- PostgreSQL log storage with manual partitioning and rotation
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    service_name VARCHAR(100) NOT NULL,
    instance_id VARCHAR(100),
    log_level VARCHAR(20) NOT NULL,
    message TEXT NOT NULL,

    -- Structured log data
    request_id VARCHAR(100),
    user_id BIGINT,
    session_id VARCHAR(100),
    trace_id VARCHAR(100),
    span_id VARCHAR(100),

    -- Context information  
    source_file VARCHAR(255),
    source_line INTEGER,
    function_name VARCHAR(255),
    thread_id INTEGER,

    -- Metadata
    hostname VARCHAR(255),
    environment VARCHAR(50),
    version VARCHAR(50),

    -- Log data
    log_data JSONB,
    error_stack TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    -- Partitioning key
    partition_date DATE GENERATED ALWAYS AS (created_at::date) STORED

) PARTITION BY RANGE (partition_date);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE application_logs_2024_01 PARTITION OF application_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE application_logs_2024_02 PARTITION OF application_logs  
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE application_logs_2024_03 PARTITION OF application_logs
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... manual partition creation continues

-- Indexes for log queries (high overhead on writes)
CREATE INDEX idx_logs_app_service_time ON application_logs (application_name, service_name, created_at);
CREATE INDEX idx_logs_level_time ON application_logs (log_level, created_at);
CREATE INDEX idx_logs_request_id ON application_logs (request_id) WHERE request_id IS NOT NULL;
CREATE INDEX idx_logs_user_id_time ON application_logs (user_id, created_at) WHERE user_id IS NOT NULL;
CREATE INDEX idx_logs_trace_id ON application_logs (trace_id) WHERE trace_id IS NOT NULL;

-- Complex log rotation and cleanup procedure
CREATE OR REPLACE FUNCTION cleanup_old_log_partitions()
RETURNS void AS $$
DECLARE
    partition_name TEXT;
    cutoff_date DATE;
BEGIN
    -- Calculate cutoff date (e.g., 90 days retention)
    cutoff_date := CURRENT_DATE - INTERVAL '90 days';

    -- Find and drop old partitions
    FOR partition_name IN 
        SELECT schemaname||'.'||tablename 
        FROM pg_tables 
        WHERE tablename ~ '^application_logs_\d{4}_\d{2}$'
        AND tablename < 'application_logs_' || to_char(cutoff_date, 'YYYY_MM')
    LOOP
        EXECUTE 'DROP TABLE IF EXISTS ' || partition_name || ' CASCADE';
        RAISE NOTICE 'Dropped old partition: %', partition_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Schedule cleanup job (requires external scheduler)
-- SELECT cron.schedule('cleanup-logs', '0 2 * * 0', 'SELECT cleanup_old_log_partitions();');

-- Complex log analysis query with performance issues
WITH recent_logs AS (
    SELECT 
        application_name,
        service_name,
        log_level,
        message,
        request_id,
        user_id,
        trace_id,
        log_data,
        created_at,

        -- Row number for chronological ordering
        ROW_NUMBER() OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at DESC
        ) as rn,

        -- Lag for time between log entries
        LAG(created_at) OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at
        ) as prev_log_time

    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
      AND log_level IN ('ERROR', 'WARN', 'INFO')
),
error_analysis AS (
    SELECT 
        application_name,
        service_name,
        COUNT(*) as total_logs,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as error_count,
        COUNT(*) FILTER (WHERE log_level = 'WARN') as warning_count,
        COUNT(*) FILTER (WHERE log_level = 'INFO') as info_count,

        -- Error patterns
        array_agg(DISTINCT message) FILTER (WHERE log_level = 'ERROR') as error_messages,
        COUNT(DISTINCT request_id) as unique_requests,
        COUNT(DISTINCT user_id) as affected_users,

        -- Timing analysis
        AVG(EXTRACT(EPOCH FROM (created_at - prev_log_time))) as avg_log_interval,

        -- Recent errors for immediate attention
        array_agg(
            json_build_object(
                'message', message,
                'created_at', created_at,
                'trace_id', trace_id,
                'request_id', request_id
            ) ORDER BY created_at DESC
        ) FILTER (WHERE log_level = 'ERROR' AND rn <= 10) as recent_errors

    FROM recent_logs
    GROUP BY application_name, service_name
),
log_volume_trends AS (
    SELECT 
        application_name,
        service_name,
        DATE_TRUNC('minute', created_at) as minute_bucket,
        COUNT(*) as logs_per_minute,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as errors_per_minute
    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
    GROUP BY application_name, service_name, DATE_TRUNC('minute', created_at)
)
SELECT 
    ea.application_name,
    ea.service_name,
    ea.total_logs,
    ea.error_count,
    ea.warning_count,
    ea.info_count,
    ROUND((ea.error_count::numeric / ea.total_logs) * 100, 2) as error_rate_percent,
    ea.unique_requests,
    ea.affected_users,
    ROUND(ea.avg_log_interval::numeric, 3) as avg_seconds_between_logs,

    -- Volume trend analysis
    (
        SELECT AVG(logs_per_minute)
        FROM log_volume_trends lvt 
        WHERE lvt.application_name = ea.application_name 
          AND lvt.service_name = ea.service_name
    ) as avg_logs_per_minute,

    (
        SELECT MAX(logs_per_minute)
        FROM log_volume_trends lvt
        WHERE lvt.application_name = ea.application_name
          AND lvt.service_name = ea.service_name  
    ) as peak_logs_per_minute,

    -- Top error messages (first three error strings)
    (
        SELECT string_agg(error_msg, '; ')
        FROM (
            SELECT unnest(ea.error_messages) AS error_msg
            LIMIT 3
        ) top_errors
    ) as top_error_messages,

    ea.recent_errors

FROM error_analysis ea
ORDER BY ea.error_count DESC, ea.total_logs DESC;

-- Problems with traditional logging approach:
-- 1. Complex partition management and maintenance overhead
-- 2. Write performance degradation with increasing indexes
-- 3. Manual log rotation and cleanup procedures
-- 4. Storage space management challenges
-- 5. Query performance issues across multiple partitions
-- 6. Complex chronological ordering requirements
-- 7. High operational overhead for high-volume logging
-- 8. Scalability limitations with increasing log volumes
-- 9. Backup and restore complexity with partitioned tables
-- 10. Limited flexibility for varying log data structures

-- MySQL logging limitations (even more restrictive)
CREATE TABLE mysql_logs (
    id BIGINT AUTO_INCREMENT NOT NULL,
    app_name VARCHAR(100),
    level VARCHAR(20),
    message TEXT,
    log_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- MySQL partitioning limitations: the partition key must be part of the primary key
    PRIMARY KEY (id, created_at),
    INDEX idx_time_level (created_at, level),
    INDEX idx_app_time (app_name, created_at)
) 
-- Basic range partitioning (limited functionality)
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p2024_q1 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION p2024_q2 VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01')),
    PARTITION p2024_q3 VALUES LESS THAN (UNIX_TIMESTAMP('2024-10-01')),
    PARTITION p2024_q4 VALUES LESS THAN (UNIX_TIMESTAMP('2025-01-01'))
);

-- Basic log query in MySQL (limited analytical capabilities)
SELECT 
    app_name,
    level,
    COUNT(*) as log_count,
    MAX(created_at) as latest_log
FROM mysql_logs
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
  AND level IN ('ERROR', 'WARN')
GROUP BY app_name, level
ORDER BY log_count DESC
LIMIT 20;

-- MySQL limitations:
-- - Limited JSON functionality compared to PostgreSQL
-- - Basic partitioning capabilities only  
-- - Poor performance with high-volume inserts
-- - Limited analytical query capabilities
-- - No advanced window functions
-- - Complex maintenance procedures
-- - Storage engine limitations for write-heavy workloads

MongoDB Capped Collections provide optimized circular buffer capabilities:

// MongoDB Capped Collections - purpose-built for high-performance logging
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('logging_platform');

// Create capped collections for different log types and performance requirements
const createOptimizedCappedCollections = async () => {
  try {
    // High-volume application logs - 1GB circular buffer
    await db.createCollection('application_logs', {
      capped: true,
      size: 1024 * 1024 * 1024, // 1GB maximum size
      max: 10000000 // Maximum 10 million documents (optional limit)
    });

    // Error logs - smaller, longer retention
    await db.createCollection('error_logs', {
      capped: true,
      size: 256 * 1024 * 1024, // 256MB maximum size
      max: 1000000 // Maximum 1 million error documents
    });

    // Access logs - high throughput, shorter retention
    await db.createCollection('access_logs', {
      capped: true,
      size: 2 * 1024 * 1024 * 1024, // 2GB maximum size
      // No max document limit for maximum throughput
    });

    // Performance metrics - structured time-series data
    await db.createCollection('performance_metrics', {
      capped: true,
      size: 512 * 1024 * 1024, // 512MB maximum size
      max: 5000000 // Maximum 5 million metric points
    });

    // Audit trail - compliance and security logs
    await db.createCollection('audit_logs', {
      capped: true,
      size: 128 * 1024 * 1024, // 128MB maximum size
      max: 500000 // Maximum 500k audit events
    });

    console.log('Capped collections created successfully');

    // Create indexes for common query patterns (minimal overhead)
    await createOptimalIndexes();

    return {
      applicationLogs: db.collection('application_logs'),
      errorLogs: db.collection('error_logs'),
      accessLogs: db.collection('access_logs'),
      performanceMetrics: db.collection('performance_metrics'),
      auditLogs: db.collection('audit_logs')
    };

  } catch (error) {
    console.error('Error creating capped collections:', error);
    throw error;
  }
};

async function createOptimalIndexes() {
  // Minimal indexes for capped collections to maintain write performance
  // Note: Capped collections maintain insertion order automatically

  // Application logs - service and level queries
  await db.collection('application_logs').createIndex({ 
    'service': 1, 
    'level': 1 
  });

  // Error logs - application and timestamp queries
  await db.collection('error_logs').createIndex({ 
    'application': 1, 
    'timestamp': -1 
  });

  // Access logs - endpoint performance analysis
  await db.collection('access_logs').createIndex({ 
    'endpoint': 1, 
    'status_code': 1 
  });

  // Performance metrics - metric type and timestamp
  await db.collection('performance_metrics').createIndex({ 
    'metric_type': 1, 
    'instance_id': 1 
  });

  // Audit logs - user and action queries
  await db.collection('audit_logs').createIndex({ 
    'user_id': 1, 
    'action': 1 
  });

  console.log('Optimal indexes created for capped collections');
}

// High-performance log ingestion with batch processing
const logIngestionSystem = {
  collections: null,
  buffers: new Map(),
  batchSizes: {
    application_logs: 1000,
    error_logs: 100,
    access_logs: 2000,
    performance_metrics: 500,
    audit_logs: 50
  },
  flushIntervals: new Map(),

  async initialize() {
    this.collections = await createOptimizedCappedCollections();

    // Start batch flush timers for each collection
    for (const [collectionName, batchSize] of Object.entries(this.batchSizes)) {
      this.buffers.set(collectionName, []);

      // Flush timer based on expected volume
      const flushInterval = collectionName === 'access_logs' ? 1000 : // 1 second
                           collectionName === 'application_logs' ? 2000 : // 2 seconds
                           5000; // 5 seconds for others

      const intervalId = setInterval(
        () => this.flushBuffer(collectionName), 
        flushInterval
      );

      this.flushIntervals.set(collectionName, intervalId);
    }

    console.log('Log ingestion system initialized');
  },

  async logApplicationEvent(logEntry) {
    // Structured application log entry
    const document = {
      timestamp: new Date(),
      application: logEntry.application || 'unknown',
      service: logEntry.service || 'unknown',
      instance: logEntry.instance || process.env.HOSTNAME || 'unknown',
      level: logEntry.level || 'INFO',
      message: logEntry.message,

      // Request context
      request: {
        id: logEntry.requestId,
        method: logEntry.method,
        endpoint: logEntry.endpoint,
        user_id: logEntry.userId,
        session_id: logEntry.sessionId,
        ip_address: logEntry.ipAddress
      },

      // Trace context
      trace: {
        trace_id: logEntry.traceId,
        span_id: logEntry.spanId,
        parent_span_id: logEntry.parentSpanId,
        flags: logEntry.traceFlags
      },

      // Source information
      source: {
        file: logEntry.sourceFile,
        line: logEntry.sourceLine,
        function: logEntry.functionName,
        thread: logEntry.threadId
      },

      // Environment context
      environment: {
        name: logEntry.environment || process.env.NODE_ENV || 'development',
        version: logEntry.version || process.env.APP_VERSION || '1.0.0',
        build: logEntry.build || process.env.BUILD_ID,
        commit: logEntry.commit || process.env.GIT_COMMIT
      },

      // Structured data
      data: logEntry.data || {},

      // Performance metrics
      metrics: {
        duration_ms: logEntry.duration,
        memory_mb: logEntry.memoryUsage,
        cpu_percent: logEntry.cpuUsage
      },

      // Error context (if applicable)
      error: logEntry.error ? {
        name: logEntry.error.name,
        message: logEntry.error.message,
        stack: logEntry.error.stack,
        code: logEntry.error.code,
        details: logEntry.error.details
      } : null
    };

    await this.bufferDocument('application_logs', document);
  },

  async logAccessEvent(accessEntry) {
    // HTTP access log optimized for high throughput
    const document = {
      timestamp: new Date(),

      // Request details
      method: accessEntry.method,
      endpoint: accessEntry.endpoint,
      path: accessEntry.path,
      query_string: accessEntry.queryString,

      // Response details
      status_code: accessEntry.statusCode,
      response_size: accessEntry.responseSize,
      content_type: accessEntry.contentType,

      // Timing information
      duration_ms: accessEntry.duration,
      queue_time_ms: accessEntry.queueTime,
      process_time_ms: accessEntry.processTime,

      // Client information
      client: {
        ip: accessEntry.clientIp,
        user_agent: accessEntry.userAgent,
        referer: accessEntry.referer,
        user_id: accessEntry.userId,
        session_id: accessEntry.sessionId
      },

      // Geographic data (if available)
      geo: accessEntry.geo ? {
        country: accessEntry.geo.country,
        region: accessEntry.geo.region,
        city: accessEntry.geo.city,
        coordinates: accessEntry.geo.coordinates
      } : null,

      // Application context
      application: accessEntry.application,
      service: accessEntry.service,
      instance: accessEntry.instance || process.env.HOSTNAME,
      version: accessEntry.version,

      // Cache information
      cache: {
        hit: accessEntry.cacheHit,
        key: accessEntry.cacheKey,
        ttl: accessEntry.cacheTTL
      },

      // Load balancing and routing
      routing: {
        backend: accessEntry.backend,
        upstream_time: accessEntry.upstreamTime,
        retry_count: accessEntry.retryCount
      }
    };

    await this.bufferDocument('access_logs', document);
  },

  async logPerformanceMetric(metricEntry) {
    // System and application performance metrics
    const document = {
      timestamp: new Date(),

      metric_type: metricEntry.type, // 'cpu', 'memory', 'disk', 'network', 'application'
      metric_name: metricEntry.name,
      value: metricEntry.value,
      unit: metricEntry.unit,

      // Instance information
      instance_id: metricEntry.instanceId || process.env.HOSTNAME,
      application: metricEntry.application,
      service: metricEntry.service,

      // Dimensional metadata
      dimensions: metricEntry.dimensions || {},

      // Aggregation information
      aggregation: {
        type: metricEntry.aggregationType, // 'gauge', 'counter', 'histogram', 'summary'
        interval_seconds: metricEntry.intervalSeconds,
        sample_count: metricEntry.sampleCount
      },

      // Statistical data (for histograms/summaries)
      statistics: metricEntry.statistics ? {
        min: metricEntry.statistics.min,
        max: metricEntry.statistics.max,
        mean: metricEntry.statistics.mean,
        median: metricEntry.statistics.median,
        p95: metricEntry.statistics.p95,
        p99: metricEntry.statistics.p99,
        std_dev: metricEntry.statistics.stdDev
      } : null,

      // Alerts and thresholds
      alerts: {
        warning_threshold: metricEntry.warningThreshold,
        critical_threshold: metricEntry.criticalThreshold,
        is_anomaly: metricEntry.isAnomaly,
        anomaly_score: metricEntry.anomalyScore
      }
    };

    await this.bufferDocument('performance_metrics', document);
  },

  async logAuditEvent(auditEntry) {
    // Security and compliance audit logging
    const document = {
      timestamp: new Date(),

      // Event classification
      event_type: auditEntry.eventType, // 'authentication', 'authorization', 'data_access', 'configuration'
      event_category: auditEntry.category, // 'security', 'compliance', 'operational'
      severity: auditEntry.severity || 'INFO',

      // Actor information
      actor: {
        user_id: auditEntry.userId,
        username: auditEntry.username,
        email: auditEntry.email,
        roles: auditEntry.roles || [],
        groups: auditEntry.groups || [],
        is_service_account: auditEntry.isServiceAccount || false,
        authentication_method: auditEntry.authMethod
      },

      // Target resource
      target: {
        resource_type: auditEntry.resourceType,
        resource_id: auditEntry.resourceId,
        resource_name: auditEntry.resourceName,
        owner: auditEntry.resourceOwner,
        classification: auditEntry.dataClassification
      },

      // Action details
      action: {
        type: auditEntry.action, // 'create', 'read', 'update', 'delete', 'login', 'logout'
        description: auditEntry.description,
        result: auditEntry.result, // 'success', 'failure', 'partial'
        reason: auditEntry.reason
      },

      // Request context
      request: {
        id: auditEntry.requestId,
        source_ip: auditEntry.sourceIp,
        user_agent: auditEntry.userAgent,
        session_id: auditEntry.sessionId,
        api_key: auditEntry.apiKey ? 'REDACTED' : null
      },

      // Data changes (for modification events)
      changes: auditEntry.changes ? {
        before: auditEntry.changes.before,
        after: auditEntry.changes.after,
        fields_changed: auditEntry.changes.fieldsChanged || []
      } : null,

      // Compliance and regulatory
      compliance: {
        regulation: auditEntry.regulation, // 'GDPR', 'SOX', 'HIPAA', 'PCI-DSS'
        retention_period: auditEntry.retentionPeriod,
        encryption_required: auditEntry.encryptionRequired || false
      },

      // Application context
      application: auditEntry.application,
      service: auditEntry.service,
      environment: auditEntry.environment
    };

    await this.bufferDocument('audit_logs', document);
  },

  async bufferDocument(collectionName, document) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer) {
      console.error(`Unknown collection: ${collectionName}`);
      return;
    }

    buffer.push(document);

    // Flush buffer if it reaches batch size
    if (buffer.length >= this.batchSizes[collectionName]) {
      await this.flushBuffer(collectionName);
    }
  },

  async flushBuffer(collectionName) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer || buffer.length === 0) {
      return;
    }

    // Move buffer contents to local array and clear buffer
    const documents = buffer.splice(0);

    try {
      const collection = this.collections[this.getCollectionProperty(collectionName)];
      if (!collection) {
        console.error(`Collection not found: ${collectionName}`);
        return;
      }

      // High-performance batch insert
      const result = await collection.insertMany(documents, {
        ordered: false, // Allow parallel inserts
        writeConcern: { w: 1, j: false } // Optimize for speed
      });

      if (result.insertedCount !== documents.length) {
        console.warn(`Partial insert: ${result.insertedCount}/${documents.length} documents inserted to ${collectionName}`);
      }

    } catch (error) {
      console.error(`Error flushing buffer for ${collectionName}:`, error);

      // Re-add documents to buffer for retry (optional)
      if (error.code !== 11000) { // Not a duplicate key error
        buffer.unshift(...documents);
      }
    }
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs',
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  },

  async shutdown() {
    console.log('Shutting down log ingestion system...');

    // Clear all flush intervals
    for (const intervalId of this.flushIntervals.values()) {
      clearInterval(intervalId);
    }

    // Flush all remaining buffers
    const flushPromises = [];
    for (const collectionName of this.buffers.keys()) {
      flushPromises.push(this.flushBuffer(collectionName));
    }

    await Promise.all(flushPromises);

    console.log('Log ingestion system shutdown complete');
  }
};

// Advanced log analysis and monitoring
const logAnalysisEngine = {
  collections: null,

  async initialize(collections) {
    this.collections = collections;
  },

  async analyzeRecentErrors(timeRangeMinutes = 60) {
    console.log(`Analyzing errors from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const errorAnalysis = await this.collections.applicationLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime },
          level: { $in: ['ERROR', 'FATAL'] }
        }
      },

      // Group by error patterns
      {
        $group: {
          _id: {
            application: '$application',
            service: '$service',
            errorMessage: {
              $substr: ['$message', 0, 100] // Truncate for grouping
            }
          },

          count: { $sum: 1 },
          firstOccurrence: { $min: '$timestamp' },
          lastOccurrence: { $max: '$timestamp' },
          affectedInstances: { $addToSet: '$instance' },
          affectedUsers: { $addToSet: '$request.user_id' },

          // Sample error details
          sampleErrors: {
            $push: {
              timestamp: '$timestamp',
              message: '$message',
              request_id: '$request.id',
              trace_id: '$trace.trace_id',
              stack: '$error.stack'
            }
          }
        }
      },

      // Calculate error characteristics
      {
        $addFields: {
          duration: {
            $divide: [
              { $subtract: ['$lastOccurrence', '$firstOccurrence'] },
              1000 // Convert to seconds
            ]
          },
          errorRate: {
            $divide: ['$count', timeRangeMinutes] // Errors per minute
          },
          instanceCount: { $size: '$affectedInstances' },
          userCount: { $size: '$affectedUsers' },

          // Take only recent sample errors
          recentSamples: { $slice: ['$sampleErrors', -5] }
        }
      },

      // Sort by error frequency and recency
      {
        $sort: {
          count: -1,
          lastOccurrence: -1
        }
      },

      {
        $limit: 50 // Top 50 error patterns
      },

      // Format for analysis output
      {
        $project: {
          application: '$_id.application',
          service: '$_id.service',
          errorPattern: '$_id.errorMessage',
          count: 1,
          errorRate: { $round: ['$errorRate', 2] },
          duration: { $round: ['$duration', 1] },
          firstOccurrence: 1,
          lastOccurrence: 1,
          instanceCount: 1,
          userCount: 1,
          affectedInstances: 1,
          recentSamples: 1,

          // Severity assessment
          severity: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$errorRate', 10] }, // > 10 errors/minute
                  then: 'CRITICAL'
                },
                {
                  case: { $gt: ['$errorRate', 5] }, // > 5 errors/minute
                  then: 'HIGH'
                },
                {
                  case: { $gt: ['$errorRate', 1] }, // > 1 error/minute
                  then: 'MEDIUM'
                }
              ],
              default: 'LOW'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Found ${errorAnalysis.length} error patterns`);
    return errorAnalysis;
  },

  async analyzeAccessPatterns(timeRangeMinutes = 30) {
    console.log(`Analyzing access patterns from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const accessAnalysis = await this.collections.accessLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Group by endpoint and status
      {
        $group: {
          _id: {
            endpoint: '$endpoint',
            method: '$method',
            statusClass: {
              $switch: {
                branches: [
                  { case: { $lt: ['$status_code', 300] }, then: '2xx' },
                  { case: { $lt: ['$status_code', 400] }, then: '3xx' },
                  { case: { $lt: ['$status_code', 500] }, then: '4xx' },
                  { case: { $gte: ['$status_code', 500] }, then: '5xx' }
                ],
                default: 'unknown'
              }
            }
          },

          requestCount: { $sum: 1 },
          avgDuration: { $avg: '$duration_ms' },
          minDuration: { $min: '$duration_ms' },
          maxDuration: { $max: '$duration_ms' },

          // Percentile approximations
          durations: { $push: '$duration_ms' },

          totalResponseSize: { $sum: '$response_size' },
          uniqueClients: { $addToSet: '$client.ip' },
          uniqueUsers: { $addToSet: '$client.user_id' },

          // Error details for non-2xx responses
          errorSamples: {
            $push: {
              $cond: [
                { $gte: ['$status_code', 400] },
                {
                  timestamp: '$timestamp',
                  status: '$status_code',
                  client_ip: '$client.ip',
                  user_id: '$client.user_id',
                  duration: '$duration_ms'
                },
                null
              ]
            }
          }
        }
      },

      // Calculate additional metrics
      {
        $addFields: {
          requestsPerMinute: { $divide: ['$requestCount', timeRangeMinutes] },
          avgResponseSize: { $divide: ['$totalResponseSize', '$requestCount'] },
          uniqueClientCount: { $size: '$uniqueClients' },
          uniqueUserCount: { $size: '$uniqueUsers' },

          // Filter out null error samples
          errorSamples: {
            $filter: {
              input: '$errorSamples',
              cond: { $ne: ['$$this', null] }
            }
          },

          // Approximate percentiles (simplified)
          p95Duration: {
            $let: {
              vars: {
                sortedDurations: {
                  $sortArray: {
                    input: '$durations',
                    sortBy: 1
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  '$$sortedDurations',
                  { $floor: { $multiply: [{ $size: '$$sortedDurations' }, 0.95] } }
                ]
              }
            }
          }
        }
      },

      // Sort by request volume
      {
        $sort: {
          requestCount: -1
        }
      },

      {
        $limit: 100 // Top 100 endpoints
      },

      // Format output
      {
        $project: {
          endpoint: '$_id.endpoint',
          method: '$_id.method',
          statusClass: '$_id.statusClass',
          requestCount: 1,
          requestsPerMinute: { $round: ['$requestsPerMinute', 2] },
          avgDuration: { $round: ['$avgDuration', 1] },
          minDuration: 1,
          maxDuration: 1,
          p95Duration: { $round: ['$p95Duration', 1] },
          avgResponseSize: { $round: ['$avgResponseSize', 0] },
          uniqueClientCount: 1,
          uniqueUserCount: 1,
          errorSamples: { $slice: ['$errorSamples', 5] }, // Recent 5 errors

          // Performance assessment
          performanceStatus: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$avgDuration', 5000] }, // > 5 seconds
                  then: 'SLOW'
                },
                {
                  case: { $gt: ['$avgDuration', 2000] }, // > 2 seconds
                  then: 'WARNING'
                }
              ],
              default: 'NORMAL'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Analyzed ${accessAnalysis.length} endpoint patterns`);
    return accessAnalysis;
  },

  async generatePerformanceReport(timeRangeMinutes = 60) {
    console.log(`Generating performance report for last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const performanceReport = await this.collections.performanceMetrics.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Group by metric type and instance
      {
        $group: {
          _id: {
            metricType: '$metric_type',
            metricName: '$metric_name',
            instanceId: '$instance_id'
          },

          sampleCount: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          latestValue: { $last: '$value' },

          // Time series data for trending
          timeSeries: {
            $push: {
              timestamp: '$timestamp',
              value: '$value'
            }
          },

          // Alert information (missing thresholds should not count as alerts)
          alertCount: {
            $sum: {
              $cond: [
                {
                  $or: [
                    { $gte: ['$value', { $ifNull: ['$alerts.critical_threshold', Infinity] }] },
                    { $gte: ['$value', { $ifNull: ['$alerts.warning_threshold', Infinity] }] }
                  ]
                },
                1,
                0
              ]
            }
          }
        }
      },

      // Calculate trend and status
      {
        $addFields: {
          // Simple trend calculation (comparing first and last values)
          trend: {
            $let: {
              vars: {
                firstValue: { $arrayElemAt: ['$timeSeries', 0] },
                lastValue: { $arrayElemAt: ['$timeSeries', -1] }
              },
              in: {
                $cond: [
                  { $gt: ['$$lastValue.value', '$$firstValue.value'] },
                  'INCREASING',
                  {
                    $cond: [
                      { $lt: ['$$lastValue.value', '$$firstValue.value'] },
                      'DECREASING',
                      'STABLE'
                    ]
                  }
                ]
              }
            }
          },

          // Alert status
          alertStatus: {
            $cond: [
              { $gt: ['$alertCount', 0] },
              'ALERTS_TRIGGERED',
              'NORMAL'
            ]
          }
        }
      },

      // Group by metric type for summary
      {
        $group: {
          _id: '$_id.metricType',

          metrics: {
            $push: {
              name: '$_id.metricName',
              instance: '$_id.instanceId',
              sampleCount: '$sampleCount',
              avgValue: '$avgValue',
              minValue: '$minValue',
              maxValue: '$maxValue',
              latestValue: '$latestValue',
              trend: '$trend',
              alertStatus: '$alertStatus',
              alertCount: '$alertCount'
            }
          },

          totalSamples: { $sum: '$sampleCount' },
          instanceCount: { $addToSet: '$_id.instanceId' },
          totalAlerts: { $sum: '$alertCount' }
        }
      },

      {
        $addFields: {
          instanceCount: { $size: '$instanceCount' }
        }
      },

      {
        $sort: { _id: 1 }
      }
    ]).toArray();

    console.log(`Performance report generated for ${performanceReport.length} metric types`);
    return performanceReport;
  },

  async getTailLogs(collectionName, limit = 100) {
    // Get most recent logs (natural order in capped collections)
    const collection = this.collections[this.getCollectionProperty(collectionName)];
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Capped collections maintain insertion order, so we can use natural order
    const logs = await collection.find()
      .sort({ $natural: -1 }) // Reverse natural order (most recent first)
      .limit(limit)
      .toArray();

    return logs.reverse(); // Return in chronological order (oldest first)
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs', 
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  }
};

// Benefits of MongoDB Capped Collections:
// - Automatic size management with guaranteed space limits
// - Natural insertion order preservation without indexes
// - Optimized write performance for high-throughput logging
// - Circular buffer behavior with automatic old document removal
// - No fragmentation or maintenance overhead
// - Tailable cursors for real-time log streaming
// - Atomic document rotation without application logic
// - Consistent performance regardless of collection size
// - Integration with MongoDB ecosystem and tools
// - Built-in clustering and replication support

module.exports = {
  createOptimizedCappedCollections,
  logIngestionSystem,
  logAnalysisEngine
};
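
A minimal usage sketch, assuming the module above is saved as ./logging.js, shows how the ingestion and analysis pieces fit together:

// Hypothetical wiring of the ingestion and analysis components above
const { logIngestionSystem, logAnalysisEngine } = require('./logging');

async function main() {
  await logIngestionSystem.initialize();
  await logAnalysisEngine.initialize(logIngestionSystem.collections);

  // Buffered write that will land in the application_logs capped collection
  await logIngestionSystem.logApplicationEvent({
    application: 'checkout',
    service: 'payments',
    level: 'ERROR',
    message: 'Payment gateway timeout',
    requestId: 'req-123',
    error: new Error('ETIMEDOUT')
  });

  // Aggregate error patterns from the last 60 minutes
  const errorPatterns = await logAnalysisEngine.analyzeRecentErrors(60);
  console.log(`Error patterns in the last hour: ${errorPatterns.length}`);

  // Flush any remaining buffers and stop the flush timers
  await logIngestionSystem.shutdown();
}

main().catch(console.error);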

Understanding MongoDB Capped Collections Architecture

Advanced Capped Collection Management and Patterns

Implement sophisticated capped collection strategies for different logging scenarios:

// Advanced capped collection management system
class CappedCollectionManager {
  constructor(db, options = {}) {
    this.db = db;
    this.options = {
      // Default configurations
      defaultSize: 100 * 1024 * 1024, // 100MB
      retentionPeriods: {
        application_logs: 7 * 24 * 60 * 60 * 1000, // 7 days
        error_logs: 30 * 24 * 60 * 60 * 1000, // 30 days  
        access_logs: 24 * 60 * 60 * 1000, // 24 hours
        audit_logs: 365 * 24 * 60 * 60 * 1000 // 1 year
      },
      ...options
    };

    this.collections = new Map();
    this.tails = new Map();
    this.statistics = new Map();
  }

  async createCappedCollectionHierarchy() {
    // Create hierarchical capped collections for different log levels and retention

    // Critical logs - smallest size, longest retention
    await this.createTieredCollection('critical_logs', {
      size: 50 * 1024 * 1024, // 50MB
      max: 100000,
      retention: 'critical'
    });

    // Error logs - medium size and retention  
    await this.createTieredCollection('error_logs', {
      size: 200 * 1024 * 1024, // 200MB
      max: 500000,
      retention: 'error'
    });

    // Warning logs - larger size, medium retention
    await this.createTieredCollection('warning_logs', {
      size: 300 * 1024 * 1024, // 300MB  
      max: 1000000,
      retention: 'warning'
    });

    // Info logs - large size, shorter retention
    await this.createTieredCollection('info_logs', {
      size: 500 * 1024 * 1024, // 500MB
      max: 2000000, 
      retention: 'info'
    });

    // Debug logs - largest size, shortest retention
    await this.createTieredCollection('debug_logs', {
      size: 1024 * 1024 * 1024, // 1GB
      max: 5000000,
      retention: 'debug'
    });

    // Specialized collections
    await this.createSpecializedCollections();

    console.log('Capped collection hierarchy created');
  }

  async createTieredCollection(name, config) {
    try {
      const collection = await this.db.createCollection(name, {
        capped: true,
        size: config.size,
        max: config.max
      });

      this.collections.set(name, collection);

      // Initialize statistics tracking
      this.statistics.set(name, {
        documentsInserted: 0,
        totalSize: 0,
        lastInsert: null,
        insertRate: 0,
        retentionType: config.retention
      });

      console.log(`Created capped collection: ${name} (${config.size} bytes, max ${config.max} docs)`);

    } catch (error) {
      if (error.code === 48) { // Collection already exists
        console.log(`Capped collection ${name} already exists`);
        const collection = this.db.collection(name);
        this.collections.set(name, collection);
      } else {
        throw error;
      }
    }
  }

  async createSpecializedCollections() {
    // Real-time metrics collection
    await this.createTieredCollection('realtime_metrics', {
      size: 100 * 1024 * 1024, // 100MB
      max: 1000000,
      retention: 'realtime'
    });

    // Security events collection
    await this.createTieredCollection('security_events', {
      size: 50 * 1024 * 1024, // 50MB
      max: 200000,
      retention: 'security'
    });

    // Business events collection  
    await this.createTieredCollection('business_events', {
      size: 200 * 1024 * 1024, // 200MB
      max: 1000000,
      retention: 'business'
    });

    // System health collection
    await this.createTieredCollection('system_health', {
      size: 150 * 1024 * 1024, // 150MB
      max: 500000,
      retention: 'system'
    });

    // Create minimal indexes for specialized queries
    await this.createSpecializedIndexes();
  }

  async createSpecializedIndexes() {
    // Minimal indexes to maintain write performance

    // Real-time metrics - by type and timestamp
    await this.collections.get('realtime_metrics').createIndex({
      metric_type: 1,
      timestamp: -1
    });

    // Security events - by severity and event type
    await this.collections.get('security_events').createIndex({
      severity: 1,
      event_type: 1
    });

    // Business events - by event category
    await this.collections.get('business_events').createIndex({
      category: 1,
      user_id: 1
    });

    // System health - by component and status
    await this.collections.get('system_health').createIndex({
      component: 1,
      status: 1
    });
  }

  async insertWithRouting(logLevel, document) {
    // Route documents to appropriate capped collection based on level
    const routingMap = {
      FATAL: 'critical_logs',
      ERROR: 'error_logs', 
      WARN: 'warning_logs',
      INFO: 'info_logs',
      DEBUG: 'debug_logs',
      TRACE: 'debug_logs'
    };

    const collectionName = routingMap[logLevel] || 'info_logs';
    const collection = this.collections.get(collectionName);

    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Add routing metadata
    const enrichedDocument = {
      ...document,
      _routed_to: collectionName,
      _inserted_at: new Date()
    };

    try {
      const result = await collection.insertOne(enrichedDocument);

      // Update statistics
      this.updateInsertionStatistics(collectionName, enrichedDocument);

      return result;
    } catch (error) {
      console.error(`Error inserting to ${collectionName}:`, error);
      throw error;
    }
  }

  updateInsertionStatistics(collectionName, document) {
    const stats = this.statistics.get(collectionName);
    if (!stats) return;

    stats.documentsInserted++;
    stats.totalSize += this.estimateDocumentSize(document);
    stats.lastInsert = new Date();

    // Calculate insertion rate (documents per second)
    if (stats.documentsInserted > 1) {
      const timeSpan = stats.lastInsert - stats.firstInsert || 1;
      stats.insertRate = (stats.documentsInserted / (timeSpan / 1000)).toFixed(2);
    } else {
      stats.firstInsert = stats.lastInsert;
    }
  }

  estimateDocumentSize(document) {
    // Rough byte estimate: JSON string length with a 2x safety factor
    return JSON.stringify(document).length * 2;
  }

  async setupTailableStreams() {
    // Set up tailable cursors for real-time log streaming
    console.log('Setting up tailable cursors for real-time streaming...');

    for (const [collectionName, collection] of this.collections.entries()) {
      const tail = collection.find().addCursorFlag('tailable', true)
                             .addCursorFlag('awaitData', true);

      this.tails.set(collectionName, tail);

      // Start async processing of tailable cursor
      this.processTailableStream(collectionName, tail);
    }
  }

  async processTailableStream(collectionName, cursor) {
    console.log(`Starting tailable stream for: ${collectionName}`);

    try {
      for await (const document of cursor) {
        // Process real-time log document
        await this.processRealtimeLog(collectionName, document);
      }
    } catch (error) {
      console.error(`Tailable stream error for ${collectionName}:`, error);

      // Attempt to restart the stream
      setTimeout(() => {
        this.restartTailableStream(collectionName);
      }, 5000);
    }
  }
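
  async restartTailableStream(collectionName) {
    // One possible restart strategy: re-open a tailable, awaitData cursor the
    // same way setupTailableStreams() does. A production implementation would
    // remember the last processed _id and filter with { _id: { $gt: lastSeenId } }
    // to avoid reprocessing documents still present in the capped collection.
    const collection = this.collections.get(collectionName);
    if (!collection) {
      console.error(`Cannot restart stream, collection not found: ${collectionName}`);
      return;
    }

    const tail = collection.find()
      .addCursorFlag('tailable', true)
      .addCursorFlag('awaitData', true);

    this.tails.set(collectionName, tail);
    this.processTailableStream(collectionName, tail);
  }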

  async processRealtimeLog(collectionName, document) {
    // Real-time processing of log entries
    const stats = this.statistics.get(collectionName);

    // Update real-time statistics
    if (stats) {
      stats.documentsInserted++;
      stats.lastInsert = new Date();
    }

    // Trigger alerts for critical conditions
    if (collectionName === 'critical_logs' || collectionName === 'error_logs') {
      await this.checkForAlertConditions(document);
    }

    // Real-time analytics
    if (collectionName === 'realtime_metrics') {
      await this.updateRealtimeMetrics(document);
    }

    // Security monitoring
    if (collectionName === 'security_events') {
      await this.analyzeSecurityEvent(document);
    }

    // Emit to external systems (WebSocket, message queues, etc.)
    this.emitRealtimeEvent(collectionName, document);
  }

  async checkForAlertConditions(document) {
    // Implement alert logic for critical conditions
    const alertConditions = [
      // High error rate
      document.level === 'ERROR' && document.error_count > 10,

      // Security incidents
      document.category === 'security' && document.severity === 'high',

      // System failures
      document.component === 'database' && document.status === 'down',

      // Performance degradation
      document.metric_type === 'response_time' && document.value > 10000
    ];

    if (alertConditions.some(condition => condition)) {
      await this.triggerAlert({
        type: 'critical_condition',
        document: document,
        timestamp: new Date()
      });
    }
  }

  async triggerAlert(alert) {
    console.log('ALERT TRIGGERED:', JSON.stringify(alert, null, 2));

    // Store alert in dedicated collection
    const alertsCollection = this.db.collection('alerts');
    await alertsCollection.insertOne({
      ...alert,
      _id: new ObjectId(),
      acknowledged: false,
      created_at: new Date()
    });

    // Send external notifications (email, Slack, PagerDuty, etc.)
    // Implementation depends on notification system
  }

  emitRealtimeEvent(collectionName, document) {
    // Emit to WebSocket connections, message queues, etc.
    console.log(`Real-time event: ${collectionName}`, {
      id: document._id,
      timestamp: document._inserted_at || document.timestamp,
      level: document.level,
      message: typeof document.message === 'string' && document.message.length > 100
        ? `${document.message.substring(0, 100)}...`
        : document.message
    });
  }

  async getCollectionStatistics(collectionName) {
    const collection = this.collections.get(collectionName);
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Get MongoDB collection statistics (Db.command in the Node.js driver)
    const stats = await this.db.command({ collStats: collectionName });
    const customStats = this.statistics.get(collectionName);

    return {
      // MongoDB statistics
      size: stats.size,
      count: stats.count,
      avgObjSize: stats.avgObjSize,
      storageSize: stats.storageSize,
      capped: stats.capped,
      max: stats.max,
      maxSize: stats.maxSize,

      // Custom statistics
      insertRate: customStats?.insertRate || 0,
      lastInsert: customStats?.lastInsert,
      retentionType: customStats?.retentionType,

      // Calculated metrics
      utilizationPercent: ((stats.size / stats.maxSize) * 100).toFixed(2),
      documentsPerMB: Math.round(stats.count / (stats.size / 1024 / 1024)),

      // Health assessment
      healthStatus: this.assessCollectionHealth(stats, customStats)
    };
  }

  assessCollectionHealth(mongoStats, customStats) {
    const utilizationPercent = (mongoStats.size / mongoStats.maxSize) * 100;
    const timeSinceLastInsert = customStats?.lastInsert ? 
      Date.now() - customStats.lastInsert.getTime() : Infinity;

    if (utilizationPercent > 95) {
      return 'NEAR_CAPACITY';
    } else if (timeSinceLastInsert > 300000) { // 5 minutes
      return 'INACTIVE';
    } else if (customStats?.insertRate > 1000) {
      return 'HIGH_VOLUME';
    } else {
      return 'HEALTHY';
    }
  }

  async performMaintenance() {
    console.log('Performing capped collection maintenance...');

    const maintenanceReport = {
      timestamp: new Date(),
      collections: {},
      recommendations: []
    };

    for (const collectionName of this.collections.keys()) {
      const stats = await this.getCollectionStatistics(collectionName);
      maintenanceReport.collections[collectionName] = stats;

      // Generate recommendations based on statistics
      if (stats.healthStatus === 'NEAR_CAPACITY') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'SIZE_WARNING',
          message: `Collection ${collectionName} is at ${stats.utilizationPercent}% capacity`
        });
      }

      if (stats.healthStatus === 'INACTIVE') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'INACTIVE_WARNING',
          message: `Collection ${collectionName} has not received data recently`
        });
      }

      if (stats.insertRate > 1000) {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'HIGH_VOLUME',
          message: `Collection ${collectionName} has high insertion rate: ${stats.insertRate}/sec`
        });
      }
    }

    console.log('Maintenance report generated:', maintenanceReport);
    return maintenanceReport;
  }

  async shutdown() {
    console.log('Shutting down capped collection manager...');

    // Close all tailable cursors
    for (const [collectionName, cursor] of this.tails.entries()) {
      try {
        await cursor.close();
        console.log(`Closed tailable cursor for: ${collectionName}`);
      } catch (error) {
        console.error(`Error closing cursor for ${collectionName}:`, error);
      }
    }

    this.tails.clear();
    this.collections.clear();
    this.statistics.clear();

    console.log('Capped collection manager shutdown complete');
  }
}

// Real-time log aggregation and analysis
class RealtimeLogAggregator {
  constructor(cappedManager) {
    this.cappedManager = cappedManager;
    this.aggregationWindows = new Map();
    this.alertThresholds = {
      errorRate: 0.05, // 5% error rate
      responseTime: 5000, // 5 seconds
      memoryUsage: 0.85, // 85% memory usage
      cpuUsage: 0.90 // 90% CPU usage
    };
  }

  async startRealtimeAggregation() {
    console.log('Starting real-time log aggregation...');

    // Set up sliding window aggregations
    this.startSlidingWindow('error_rate', 300000); // 5-minute window
    this.startSlidingWindow('response_time', 60000); // 1-minute window
    this.startSlidingWindow('throughput', 60000); // 1-minute window
    this.startSlidingWindow('resource_usage', 120000); // 2-minute window

    console.log('Real-time aggregation started');
  }

  startSlidingWindow(metricType, windowSizeMs) {
    const windowData = {
      data: [],
      windowSize: windowSizeMs,
      lastCleanup: Date.now()
    };

    this.aggregationWindows.set(metricType, windowData);

    // Start cleanup interval
    setInterval(() => {
      this.cleanupWindow(metricType);
    }, windowSizeMs / 10); // Cleanup every 1/10th of window size
  }

  cleanupWindow(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    const cutoffTime = Date.now() - window.windowSize;
    window.data = window.data.filter(entry => entry.timestamp > cutoffTime);
    window.lastCleanup = Date.now();
  }

  addDataPoint(metricType, value, metadata = {}) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    window.data.push({
      timestamp: Date.now(),
      value: value,
      metadata: metadata
    });

    // Check for alerts
    this.checkAggregationAlerts(metricType);
  }

  checkAggregationAlerts(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) return;

    const recentData = window.data.slice(-10); // Last 10 data points
    const avgValue = recentData.reduce((sum, point) => sum + point.value, 0) / recentData.length;

    let alertTriggered = false;
    let alertMessage = '';

    switch (metricType) {
      case 'error_rate':
        if (avgValue > this.alertThresholds.errorRate) {
          alertTriggered = true;
          alertMessage = `High error rate: ${(avgValue * 100).toFixed(2)}%`;
        }
        break;

      case 'response_time':
        if (avgValue > this.alertThresholds.responseTime) {
          alertTriggered = true;
          alertMessage = `High response time: ${avgValue.toFixed(0)}ms`;
        }
        break;

      case 'resource_usage':
        const memoryAlert = recentData.some(p => p.metadata.memory > this.alertThresholds.memoryUsage);
        const cpuAlert = recentData.some(p => p.metadata.cpu > this.alertThresholds.cpuUsage);

        if (memoryAlert || cpuAlert) {
          alertTriggered = true;
          alertMessage = `High resource usage: Memory ${memoryAlert ? 'HIGH' : 'OK'}, CPU ${cpuAlert ? 'HIGH' : 'OK'}`;
        }
        break;
    }

    if (alertTriggered) {
      // triggerAlert is async; this method is synchronous, so handle rejections explicitly
      this.cappedManager.triggerAlert({
        type: 'aggregation_alert',
        metricType: metricType,
        message: alertMessage,
        value: avgValue,
        threshold: this.alertThresholds[metricType] || 'N/A',
        recentData: recentData.slice(-3) // Last 3 data points
      }).catch(error => console.error('Failed to record aggregation alert:', error));
    }
  }

  getWindowSummary(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) {
      return { metricType, dataPoints: 0, summary: null };
    }

    const values = window.data.map(point => point.value);
    const sortedValues = [...values].sort((a, b) => a - b);

    return {
      metricType: metricType,
      dataPoints: window.data.length,
      windowSizeMs: window.windowSize,
      summary: {
        min: Math.min(...values),
        max: Math.max(...values),
        avg: values.reduce((sum, val) => sum + val, 0) / values.length,
        median: sortedValues[Math.floor(sortedValues.length / 2)],
        p95: sortedValues[Math.floor(sortedValues.length * 0.95)],
        p99: sortedValues[Math.floor(sortedValues.length * 0.99)]
      },
      trend: this.calculateTrend(window.data),
      lastUpdate: window.data[window.data.length - 1].timestamp
    };
  }

  calculateTrend(dataPoints) {
    if (dataPoints.length < 2) return 'INSUFFICIENT_DATA';

    const firstHalf = dataPoints.slice(0, Math.floor(dataPoints.length / 2));
    const secondHalf = dataPoints.slice(Math.floor(dataPoints.length / 2));

    const firstHalfAvg = firstHalf.reduce((sum, p) => sum + p.value, 0) / firstHalf.length;
    const secondHalfAvg = secondHalf.reduce((sum, p) => sum + p.value, 0) / secondHalf.length;

    const change = (secondHalfAvg - firstHalfAvg) / firstHalfAvg;

    if (Math.abs(change) < 0.05) return 'STABLE'; // Less than 5% change
    return change > 0 ? 'INCREASING' : 'DECREASING';
  }

  getAllWindowSummaries() {
    const summaries = {};
    for (const metricType of this.aggregationWindows.keys()) {
      summaries[metricType] = this.getWindowSummary(metricType);
    }
    return summaries;
  }
}
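
To tie these pieces together, the following is a minimal wiring sketch. It assumes the capped collection manager class shown earlier is named CappedCollectionManager, that its constructor accepts a connected Db handle, and that the capped collections themselves have already been created; those names and the example values are illustrative assumptions, not part of the original implementation.

const { MongoClient } = require('mongodb');

async function runLoggingPipeline() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  // Hypothetical constructor signature - adapt to the actual manager class
  const manager = new CappedCollectionManager(client.db('logging'));
  const aggregator = new RealtimeLogAggregator(manager);

  // Start real-time streaming and sliding-window aggregation
  await manager.setupTailableStreams();
  await aggregator.startRealtimeAggregation();

  // Route an application log entry to the appropriate capped collection
  await manager.insertWithRouting('ERROR', {
    timestamp: new Date(),
    service: 'payment-service',
    level: 'ERROR',
    message: 'Payment gateway timeout',
    error_count: 12
  });

  // Feed aggregation windows from application instrumentation (illustrative values)
  aggregator.addDataPoint('response_time', 4200, { endpoint: '/api/payments' });
  aggregator.addDataPoint('error_rate', 0.08);

  console.log(aggregator.getAllWindowSummaries());
  console.log(await manager.performMaintenance());

  await manager.shutdown();
  await client.close();
}

runLoggingPipeline().catch(console.error);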

SQL-Style Capped Collection Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Capped Collection management and querying:

-- QueryLeaf capped collection operations with SQL-familiar syntax

-- Create capped collections with size and document limits
CREATE CAPPED COLLECTION application_logs 
WITH (
  size = '1GB',
  max_documents = 10000000,
  auto_rotate = true
);

CREATE CAPPED COLLECTION error_logs 
WITH (
  size = '256MB', 
  max_documents = 1000000
);

CREATE CAPPED COLLECTION access_logs
WITH (
  size = '2GB'
  -- No document limit for maximum throughput
);

-- High-performance log insertion
INSERT INTO application_logs 
VALUES (
  CURRENT_TIMESTAMP,
  'user-service',
  'payment-processor', 
  'prod-instance-01',
  'ERROR',
  'Payment processing failed for transaction tx_12345',

  -- Structured request context
  ROW(
    'req_98765',
    'POST',
    '/api/payments/process',
    'user_54321',
    'sess_abcdef',
    '192.168.1.100'
  ) AS request_context,

  -- Trace information
  ROW(
    'trace_xyz789',
    'span_456',
    'span_123',
    1
  ) AS trace_info,

  -- Error details
  ROW(
    'PaymentValidationError',
    'Invalid payment method: expired_card',
    'PaymentProcessor.validateCard() line 245',
    'PM001'
  ) AS error_details,

  -- Additional data
  JSON_BUILD_OBJECT(
    'transaction_id', 'tx_12345',
    'user_id', 'user_54321', 
    'payment_amount', 299.99,
    'payment_method', 'card_****1234',
    'merchant_id', 'merchant_789'
  ) AS log_data
);

-- Real-time log tailing (most recent entries first)
SELECT 
  timestamp,
  service,
  level,
  message,
  request_context.request_id,
  request_context.user_id,
  trace_info.trace_id,
  error_details.error_code,
  log_data
FROM application_logs
ORDER BY $natural DESC  -- Natural order in capped collections
LIMIT 100;

-- Log analysis with time-based aggregation
WITH recent_logs AS (
  SELECT 
    service,
    level,
    timestamp,
    message,
    request_context.user_id,
    error_details.error_code,

    -- Time bucketing for analysis
    DATE_TRUNC('minute', timestamp) as minute_bucket,
    DATE_TRUNC('hour', timestamp) as hour_bucket
  FROM application_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
),

error_summary AS (
  SELECT 
    service,
    hour_bucket,
    level,
    COUNT(*) as log_count,
    COUNT(DISTINCT request_context.user_id) as affected_users,
    COUNT(DISTINCT error_details.error_code) as unique_errors,

    -- Error patterns
    mode() WITHIN GROUP (ORDER BY error_details.error_code) as most_common_error,
    array_agg(DISTINCT error_details.error_code) as error_codes,

    -- Sample messages for investigation
    (array_agg(
      json_build_object(
        'timestamp', timestamp,
        'message', SUBSTRING(message, 1, 100),
        'user_id', request_context.user_id,
        'error_code', error_details.error_code
      ) ORDER BY timestamp DESC
    ))[1:5] as recent_samples

  FROM recent_logs
  WHERE level IN ('ERROR', 'FATAL')
  GROUP BY service, hour_bucket, level
),

service_health AS (
  SELECT 
    service,
    hour_bucket,

    -- Overall metrics computed across all log levels (error_summary only
    -- contains ERROR/FATAL rows, so warning counts and error rates must
    -- be derived from the unfiltered recent_logs CTE)
    COUNT(*) as total_logs,
    COUNT(*) FILTER (WHERE level = 'ERROR') as error_count,
    COUNT(*) FILTER (WHERE level = 'WARN') as warning_count,
    COUNT(DISTINCT request_context.user_id) 
      FILTER (WHERE level IN ('ERROR', 'FATAL')) as total_affected_users,

    -- Error rate calculation
    (COUNT(*) FILTER (WHERE level = 'ERROR')::numeric / COUNT(*)) * 100 as error_rate_percent,

    -- Service status assessment
    CASE 
      WHEN COUNT(*) FILTER (WHERE level = 'ERROR') > 100 THEN 'CRITICAL'
      WHEN (COUNT(*) FILTER (WHERE level = 'ERROR')::numeric / NULLIF(COUNT(*), 0)) > 0.05 THEN 'DEGRADED'
      WHEN COUNT(*) FILTER (WHERE level = 'WARN') > 50 THEN 'WARNING'
      ELSE 'HEALTHY'
    END as service_status

  FROM recent_logs
  GROUP BY service, hour_bucket
)

SELECT 
  sh.service,
  sh.hour_bucket,
  sh.total_logs,
  sh.error_count,
  sh.warning_count,
  ROUND(sh.error_rate_percent, 2) as error_rate_pct,
  sh.total_affected_users,
  sh.service_status,

  -- Top error details
  es.most_common_error,
  es.unique_errors,
  es.error_codes,
  es.recent_samples,

  -- Trend analysis
  LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as prev_hour_errors,

  sh.error_count - LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as error_count_change

FROM service_health sh
LEFT JOIN error_summary es ON (
  sh.service = es.service AND 
  sh.hour_bucket = es.hour_bucket AND 
  es.level = 'ERROR'
)
WHERE sh.service_status != 'HEALTHY'
ORDER BY sh.service_status DESC, sh.error_rate_percent DESC, sh.hour_bucket DESC;

-- Access log analysis for performance monitoring
WITH access_metrics AS (
  SELECT 
    endpoint,
    method,
    DATE_TRUNC('minute', timestamp) as minute_bucket,

    -- Request metrics
    COUNT(*) as request_count,
    AVG(duration_ms) as avg_duration,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as median_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99_duration,
    MIN(duration_ms) as min_duration,
    MAX(duration_ms) as max_duration,

    -- Status code distribution
    COUNT(*) FILTER (WHERE status_code < 300) as success_count,
    COUNT(*) FILTER (WHERE status_code >= 300 AND status_code < 400) as redirect_count,
    COUNT(*) FILTER (WHERE status_code >= 400 AND status_code < 500) as client_error_count,
    COUNT(*) FILTER (WHERE status_code >= 500) as server_error_count,

    -- Data transfer metrics
    AVG(response_size) as avg_response_size,
    SUM(response_size) as total_response_size,

    -- Client metrics
    COUNT(DISTINCT client.ip) as unique_clients,
    COUNT(DISTINCT client.user_id) as unique_users

  FROM access_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
  GROUP BY endpoint, method, minute_bucket
),

performance_analysis AS (
  SELECT 
    endpoint,
    method,

    -- Aggregated performance metrics
    SUM(request_count) as total_requests,
    AVG(avg_duration) as overall_avg_duration,
    MAX(p95_duration) as max_p95_duration,
    MAX(p99_duration) as max_p99_duration,

    -- Error rates
    (SUM(client_error_count + server_error_count)::numeric / SUM(request_count)) * 100 as error_rate_percent,
    SUM(server_error_count) as total_server_errors,

    -- Throughput metrics
    AVG(request_count) as avg_requests_per_minute,
    MAX(request_count) as peak_requests_per_minute,

    -- Data transfer
    AVG(avg_response_size) as avg_response_size,
    SUM(total_response_size) / (1024 * 1024) as total_mb_transferred,

    -- Client diversity
    AVG(unique_clients) as avg_unique_clients,
    AVG(unique_users) as avg_unique_users,

    -- Performance assessment
    CASE 
      WHEN AVG(avg_duration) > 5000 THEN 'SLOW'
      WHEN AVG(avg_duration) > 2000 THEN 'DEGRADED' 
      WHEN MAX(p95_duration) > 10000 THEN 'INCONSISTENT'
      ELSE 'NORMAL'
    END as performance_status,

    -- Time series data for trending
    array_agg(
      json_build_object(
        'minute', minute_bucket,
        'requests', request_count,
        'avg_duration', avg_duration,
        'p95_duration', p95_duration,
        'error_rate', (client_error_count + server_error_count)::numeric / request_count * 100
      ) ORDER BY minute_bucket
    ) as time_series_data

  FROM access_metrics
  GROUP BY endpoint, method
),

endpoint_ranking AS (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY total_requests DESC) as request_rank,
    ROW_NUMBER() OVER (ORDER BY error_rate_percent DESC) as error_rank,
    ROW_NUMBER() OVER (ORDER BY overall_avg_duration DESC) as duration_rank
  FROM performance_analysis
)

SELECT 
  endpoint,
  method,
  total_requests,
  ROUND(overall_avg_duration, 1) as avg_duration_ms,
  ROUND(max_p95_duration, 1) as max_p95_ms,
  ROUND(max_p99_duration, 1) as max_p99_ms,
  ROUND(error_rate_percent, 2) as error_rate_pct,
  total_server_errors,
  ROUND(avg_requests_per_minute, 1) as avg_rpm,
  peak_requests_per_minute as peak_rpm,
  ROUND(total_mb_transferred, 1) as total_mb,
  performance_status,

  -- Rankings
  request_rank,
  error_rank, 
  duration_rank,

  -- Alerts and recommendations
  CASE 
    WHEN performance_status = 'SLOW' THEN 'Optimize endpoint performance - average response time exceeds 5 seconds'
    WHEN performance_status = 'DEGRADED' THEN 'Monitor endpoint performance - response times elevated'
    WHEN performance_status = 'INCONSISTENT' THEN 'Investigate performance spikes - P95 latency exceeds 10 seconds'
    WHEN error_rate_percent > 5 THEN 'High error rate detected - investigate client and server errors'
    WHEN total_server_errors > 100 THEN 'Significant server errors detected - check application health'
    ELSE 'Performance within normal parameters'
  END as recommendation,

  time_series_data

FROM endpoint_ranking
WHERE (
  performance_status != 'NORMAL' OR 
  error_rate_percent > 1 OR 
  request_rank <= 20
)
ORDER BY 
  CASE performance_status
    WHEN 'SLOW' THEN 1
    WHEN 'DEGRADED' THEN 2
    WHEN 'INCONSISTENT' THEN 3
    ELSE 4
  END,
  error_rate_percent DESC,
  total_requests DESC;

-- Real-time metrics aggregation from capped collections
CREATE VIEW real_time_metrics AS
WITH metric_windows AS (
  SELECT 
    metric_type,
    metric_name,
    instance_id,

    -- Current values
    LAST_VALUE(value ORDER BY timestamp) as current_value,
    FIRST_VALUE(value ORDER BY timestamp) as first_value,

    -- Statistical aggregations
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV_POP(value) as stddev_value,
    COUNT(*) as sample_count,

    -- Trend calculation
    CASE 
      WHEN COUNT(*) >= 2 THEN
        (LAST_VALUE(value ORDER BY timestamp) - FIRST_VALUE(value ORDER BY timestamp)) / 
        NULLIF(FIRST_VALUE(value ORDER BY timestamp), 0) * 100
      ELSE 0
    END as trend_percent,

    -- Alert thresholds
    MAX(alerts.warning_threshold) as warning_threshold,
    MAX(alerts.critical_threshold) as critical_threshold,

    -- Time range
    MIN(timestamp) as window_start,
    MAX(timestamp) as window_end

  FROM performance_metrics
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
  GROUP BY metric_type, metric_name, instance_id
)

SELECT 
  metric_type,
  metric_name,
  instance_id,
  current_value,
  ROUND(avg_value::numeric, 2) as avg_value,
  min_value,
  max_value,
  ROUND(stddev_value::numeric, 2) as stddev,
  sample_count,
  ROUND(trend_percent::numeric, 1) as trend_pct,

  -- Alert status
  CASE 
    WHEN critical_threshold IS NOT NULL AND current_value >= critical_threshold THEN 'CRITICAL'
    WHEN warning_threshold IS NOT NULL AND current_value >= warning_threshold THEN 'WARNING'
    ELSE 'NORMAL'
  END as alert_status,

  warning_threshold,
  critical_threshold,
  window_start,
  window_end,

  -- Performance assessment
  CASE metric_type
    WHEN 'cpu_percent' THEN 
      CASE WHEN current_value > 90 THEN 'HIGH' 
           WHEN current_value > 70 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    WHEN 'memory_percent' THEN
      CASE WHEN current_value > 85 THEN 'HIGH'
           WHEN current_value > 70 THEN 'ELEVATED' 
           ELSE 'NORMAL' END
    WHEN 'response_time_ms' THEN
      CASE WHEN current_value > 5000 THEN 'SLOW'
           WHEN current_value > 2000 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    ELSE 'NORMAL'
  END as performance_status

FROM metric_windows
ORDER BY 
  CASE alert_status
    WHEN 'CRITICAL' THEN 1
    WHEN 'WARNING' THEN 2
    ELSE 3
  END,
  metric_type,
  metric_name;

-- Capped collection maintenance and monitoring
SELECT 
  collection_name,
  is_capped,
  max_size_bytes / (1024 * 1024) as max_size_mb,
  current_size_bytes / (1024 * 1024) as current_size_mb,
  document_count,
  max_documents,

  -- Utilization metrics
  ROUND((current_size_bytes::numeric / max_size_bytes) * 100, 1) as size_utilization_pct,
  ROUND((document_count::numeric / NULLIF(max_documents, 0)) * 100, 1) as document_utilization_pct,

  -- Health assessment
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 'NEAR_CAPACITY'
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.80 THEN 'HIGH_UTILIZATION'
    WHEN document_count = 0 THEN 'EMPTY'
    ELSE 'HEALTHY'
  END as health_status,

  -- Performance metrics
  avg_document_size_bytes,
  ROUND(avg_document_size_bytes / 1024.0, 1) as avg_document_size_kb,

  -- Recommendations
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 
      'Consider increasing collection size or reducing retention period'
    WHEN document_count = 0 THEN 
      'Collection is empty - verify data ingestion is working'
    WHEN avg_document_size_bytes > 16384 THEN 
      'Large average document size - consider data optimization'
    ELSE 'Collection operating within normal parameters'
  END as recommendation

FROM CAPPED_COLLECTION_STATS()
WHERE is_capped = true
ORDER BY size_utilization_pct DESC;

-- QueryLeaf provides comprehensive capped collection capabilities:
-- 1. SQL-familiar capped collection creation and management
-- 2. High-performance log insertion with structured data support
-- 3. Real-time log tailing and streaming with natural ordering
-- 4. Advanced log analysis with time-based aggregations
-- 5. Access pattern analysis for performance monitoring
-- 6. Real-time metrics aggregation and alerting
-- 7. Capped collection health monitoring and maintenance
-- 8. Integration with MongoDB's circular buffer optimizations
-- 9. Automatic size management without manual intervention
-- 10. Familiar SQL patterns for log analysis and troubleshooting

Best Practices for Capped Collection Implementation

Design Guidelines

Essential practices for optimal capped collection configuration:

  1. Size Planning: Calculate appropriate collection sizes based on expected data volume and retention requirements (a sizing sketch follows this list)
  2. Index Strategy: Use minimal indexes to maintain write performance while supporting essential queries
  3. Document Structure: Design documents for optimal compression and query performance
  4. Retention Alignment: Align capped collection sizes with business retention and compliance requirements
  5. Monitoring Setup: Implement continuous monitoring of collection utilization and performance
  6. Alert Configuration: Set up alerts for capacity utilization and performance degradation
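
Guideline 1 usually starts with a simple capacity calculation before any collection is created. The sketch below is one way to approach it; the ingest rate, average document size, and retention window are illustrative assumptions that should be replaced with measured values from your own workload.

// Back-of-the-envelope sizing for a capped log collection (assumed inputs)
function cappedSizeBytes({ eventsPerSecond, avgDocumentBytes, retentionHours, headroom = 1.25 }) {
  const documents = eventsPerSecond * 3600 * retentionHours;
  // Headroom covers per-document storage overhead and bursts above the average rate
  return Math.ceil(documents * avgDocumentBytes * headroom);
}

async function createSizedLogCollection(db) {
  const sizeBytes = cappedSizeBytes({
    eventsPerSecond: 500,    // assumed peak ingest rate
    avgDocumentBytes: 1024,  // assumed average log document size
    retentionHours: 24       // desired in-collection retention window
  });

  // Capped collection creation through the official Node.js driver
  await db.createCollection('application_logs', {
    capped: true,
    size: sizeBytes,              // hard size limit in bytes
    max: 500 * 3600 * 24          // optional document-count cap matching the assumptions above
  });

  return sizeBytes;
}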

Performance and Scalability

Optimize capped collections for high-throughput logging scenarios:

  1. Write Performance: Minimize indexes and use batch insertion for maximum throughput (see the batched-write and tailing sketch after this list)
  2. Tailable Cursors: Leverage tailable cursors for real-time log streaming and processing
  3. Collection Sizing: Balance collection size with query performance and storage efficiency
  4. Replica Set Configuration: Optimize replica set settings for write-heavy workloads
  5. Hardware Considerations: Use fast storage and adequate memory for optimal performance
  6. Network Optimization: Configure network settings for high-volume log ingestion
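
For guidelines 1 and 2, the sketch below shows batched, unordered inserts and a tailable cursor that only streams documents written after it starts. The collection handle, write concern, and timestamp field are assumptions to adapt to your deployment.

// Unordered bulk insert maximizes throughput; w:1 relaxes durability for lower latency
async function batchInsertLogs(collection, entries) {
  return collection.insertMany(entries, { ordered: false, writeConcern: { w: 1 } });
}

// Tailable cursor on a capped collection that skips documents already present
async function tailFromNow(collection, onDocument) {
  const startedAt = new Date();
  const cursor = collection.find(
    { timestamp: { $gt: startedAt } },        // assumes documents carry a timestamp field
    { tailable: true, awaitData: true }
  );

  for await (const doc of cursor) {
    await onDocument(doc);
  }
}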

Conclusion

MongoDB Capped Collections provide purpose-built capabilities for high-performance logging and circular buffer patterns that eliminate the complexity and overhead of traditional database approaches while delivering consistent performance and automatic space management. The natural ordering preservation and optimized write characteristics make capped collections ideal for log processing, event storage, and real-time data applications.

Key Capped Collection benefits include:

  • Automatic Size Management: Fixed-size collections with automatic document rotation
  • Write-Optimized Performance: Optimized for high-throughput, sequential write operations
  • Natural Ordering: Insertion order preservation without additional indexing overhead
  • Circular Buffer Behavior: Automatic old document removal when size limits are reached
  • Real-Time Streaming: Tailable cursor support for live log streaming and processing
  • Operational Simplicity: No manual maintenance or complex rotation procedures required

Whether you're building logging systems, event processors, real-time analytics platforms, or any application requiring circular buffer patterns, MongoDB Capped Collections combined with QueryLeaf's familiar SQL interface provide the foundation for high-performance data storage. This combination enables you to implement sophisticated logging capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Capped Collection operations while providing SQL-familiar collection creation, log analysis, and real-time querying syntax. Advanced circular buffer management, performance monitoring, and maintenance operations are seamlessly handled through familiar SQL patterns, making high-performance logging both powerful and accessible.

The integration of native capped collection capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance logging and familiar database interaction patterns, ensuring your logging solutions remain both effective and maintainable as they scale and evolve.