

MongoDB Compound Indexes and Multi-Field Query Optimization: Advanced Indexing Strategies with SQL-Style Query Performance

Modern applications rely on query patterns that filter, sort, and aggregate data across multiple fields simultaneously, which demands carefully planned indexing strategies. Traditional database approaches often struggle to support multi-field queries efficiently, requiring complex index planning, manual query tuning, and extensive performance work to achieve acceptable response times.

MongoDB Compound Indexes provide multi-field indexing that enables efficient querying across multiple dimensions, with the query planner automatically selecting a suitable index for each query shape. Unlike single-field indexes, compound indexes can serve complex query patterns that combine equality matches, range predicates, and sort orders across multiple fields, often in a single index scan.
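
As a minimal sketch (the orders collection, field names, and date threshold below are hypothetical), a single compound index can satisfy equality filters, a range predicate, and a sort in one index scan when the key order matches the query shape:

// Minimal sketch: one compound index serving equality + range + sort
// (collection and field names here are illustrative, not from this article)
const { MongoClient } = require('mongodb');

async function demoCompoundIndex() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Equality fields first, then the sort/range field
  await orders.createIndex({ customerId: 1, status: 1, createdAt: -1 });

  // This query can be satisfied by a single scan of the index above
  const recentOrders = await orders
    .find({
      customerId: 12345,                          // equality
      status: 'shipped',                          // equality
      createdAt: { $gte: new Date('2025-01-01') } // range
    })
    .sort({ createdAt: -1 })                      // matches the key order
    .limit(20)
    .toArray();

  await client.close();
  return recentOrders;
}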

The Traditional Multi-Field Query Challenge

Conventional approaches to multi-field indexing and query optimization have significant limitations for modern applications:

-- Traditional relational multi-field indexing - limited and complex

-- PostgreSQL approach with multiple single indexes
CREATE TABLE user_activities (
    activity_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    application_id VARCHAR(100) NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    priority INTEGER DEFAULT 5,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP,

    -- User context
    session_id VARCHAR(100),
    ip_address INET,
    user_agent TEXT,

    -- Activity data
    activity_data JSONB,
    metadata JSONB,

    -- Performance tracking
    execution_time_ms INTEGER,
    error_count INTEGER DEFAULT 0,
    retry_count INTEGER DEFAULT 0,

    -- Categorization
    category VARCHAR(100),
    subcategory VARCHAR(100),
    tags TEXT[],

    -- Geographic data
    country_code CHAR(2),
    region VARCHAR(100),
    city VARCHAR(100)
);

-- Multiple single-field indexes (inefficient for compound queries)
CREATE INDEX idx_user_activities_user_id ON user_activities (user_id);
CREATE INDEX idx_user_activities_app_id ON user_activities (application_id);
CREATE INDEX idx_user_activities_type ON user_activities (activity_type);
CREATE INDEX idx_user_activities_status ON user_activities (status);
CREATE INDEX idx_user_activities_created ON user_activities (created_at);
CREATE INDEX idx_user_activities_priority ON user_activities (priority);

-- Attempt at compound indexes (order matters significantly)
CREATE INDEX idx_user_app_status ON user_activities (user_id, application_id, status);
CREATE INDEX idx_app_type_created ON user_activities (application_id, activity_type, created_at);
CREATE INDEX idx_status_priority_created ON user_activities (status, priority, created_at);

-- Complex multi-field query with suboptimal performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT 
    ua.activity_id,
    ua.user_id,
    ua.application_id,
    ua.activity_type,
    ua.status,
    ua.priority,
    ua.created_at,
    ua.execution_time_ms,
    ua.activity_data,

    -- Derived metrics
    CASE 
        WHEN ua.completed_at IS NOT NULL THEN 
            EXTRACT(EPOCH FROM (ua.completed_at - ua.created_at)) * 1000
        ELSE NULL 
    END as total_duration_ms,

    -- Window functions for ranking
    ROW_NUMBER() OVER (
        PARTITION BY ua.user_id, ua.application_id 
        ORDER BY ua.priority DESC, ua.created_at DESC
    ) as user_app_rank,

    -- Activity scoring
    CASE
        WHEN ua.error_count = 0 AND ua.status = 'completed' THEN 100
        WHEN ua.error_count = 0 AND ua.status = 'in_progress' THEN 75
        WHEN ua.error_count > 0 AND ua.retry_count <= 3 THEN 50
        ELSE 25
    END as activity_score

FROM user_activities ua
WHERE 
    -- Multi-field filtering (challenging for optimizer)
    ua.user_id IN (12345, 23456, 34567, 45678)
    AND ua.application_id IN ('web_app', 'mobile_app', 'api_service')
    AND ua.activity_type IN ('login', 'purchase', 'api_call', 'data_export')
    AND ua.status IN ('completed', 'in_progress', 'failed')
    AND ua.priority >= 3
    AND ua.created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND ua.created_at <= CURRENT_TIMESTAMP - INTERVAL '1 hour'

    -- Geographic filtering
    AND ua.country_code IN ('US', 'CA', 'GB', 'DE')
    AND ua.region IS NOT NULL

    -- Performance filtering
    AND (ua.execution_time_ms IS NULL OR ua.execution_time_ms < 10000)
    AND ua.error_count <= 5

    -- Category filtering
    AND ua.category IN ('user_interaction', 'system_process', 'data_operation')

    -- JSON data filtering (expensive)
    AND ua.activity_data->>'source' IN ('web', 'mobile', 'api')
    AND COALESCE((ua.activity_data->>'amount')::numeric, 0) > 10

ORDER BY 
    ua.priority DESC,
    ua.created_at DESC,
    ua.user_id ASC
LIMIT 50;

-- Problems with traditional compound indexing:
-- 1. Index order critically affects query performance
-- 2. Limited flexibility for varying query patterns
-- 3. Index intersection overhead for multiple conditions
-- 4. Complex query planning with unpredictable performance
-- 5. Maintenance overhead with multiple specialized indexes
-- 6. Poor support for mixed equality and range conditions
-- 7. Difficulty optimizing for sorting requirements
-- 8. Limited support for JSON/document field indexing

-- Query performance analysis
WITH index_usage AS (
    SELECT 
        schemaname,
        relname AS tablename,
        indexrelname AS indexname,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch,

        -- Index effectiveness metrics
        CASE 
            WHEN idx_scan > 0 THEN idx_tup_read::numeric / idx_scan 
            ELSE 0 
        END as avg_tuples_per_scan,

        CASE 
            WHEN idx_tup_read > 0 THEN idx_tup_fetch::numeric / idx_tup_read * 100
            ELSE 0 
        END as fetch_ratio_percent

    FROM pg_stat_user_indexes
    WHERE relname = 'user_activities'
),
table_performance AS (
    SELECT 
        schemaname,
        relname AS tablename,
        seq_scan,
        seq_tup_read,
        idx_scan,
        idx_tup_fetch,
        n_tup_ins,
        n_tup_upd,
        n_tup_del,

        -- Table scan ratios
        CASE 
            WHEN (seq_scan + idx_scan) > 0 
            THEN seq_scan::numeric / (seq_scan + idx_scan) * 100
            ELSE 0 
        END as seq_scan_ratio_percent

    FROM pg_stat_user_tables
    WHERE relname = 'user_activities'
)
SELECT 
    -- Index usage analysis
    iu.indexname,
    iu.idx_scan as index_scans,
    ROUND(iu.avg_tuples_per_scan, 2) as avg_tuples_per_scan,
    ROUND(iu.fetch_ratio_percent, 1) as fetch_efficiency_pct,

    -- Index effectiveness assessment
    CASE
        WHEN iu.idx_scan = 0 THEN 'unused'
        WHEN iu.avg_tuples_per_scan > 100 THEN 'inefficient'
        WHEN iu.fetch_ratio_percent < 50 THEN 'poor_selectivity'
        ELSE 'effective'
    END as index_status,

    -- Table-level performance
    tp.seq_scan as table_scans,
    ROUND(tp.seq_scan_ratio_percent, 1) as seq_scan_pct,

    -- Recommendations
    CASE 
        WHEN iu.idx_scan = 0 THEN 'Consider dropping unused index'
        WHEN iu.avg_tuples_per_scan > 100 THEN 'Improve index selectivity or reorder fields'
        WHEN tp.seq_scan_ratio_percent > 20 THEN 'Add missing indexes for common queries'
        ELSE 'Index performing within acceptable parameters'
    END as recommendation

FROM index_usage iu
CROSS JOIN table_performance tp
ORDER BY iu.idx_scan DESC, iu.avg_tuples_per_scan DESC;

-- MySQL compound indexing (more limited capabilities)
CREATE TABLE mysql_activities (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    app_id VARCHAR(100) NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    priority INT DEFAULT 5,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    activity_data JSON,

    -- Compound indexes (limited optimization capabilities)
    INDEX idx_user_app_status (user_id, app_id, status),
    INDEX idx_app_type_created (app_id, activity_type, created_at),
    INDEX idx_status_priority (status, priority)
);

-- Basic multi-field query in MySQL
SELECT 
    user_id,
    app_id,
    activity_type,
    status,
    priority,
    created_at,
    JSON_EXTRACT(activity_data, '$.source') as source
FROM mysql_activities
WHERE user_id IN (12345, 23456)
  AND app_id = 'web_app'
  AND status = 'completed'
  AND priority >= 3
  AND created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
ORDER BY priority DESC, created_at DESC
LIMIT 50;

-- MySQL limitations for compound indexing:
-- - Limited query optimization capabilities
-- - Poor JSON field indexing support
-- - Restrictive index intersection algorithms
-- - Basic query planning with limited statistics
-- - Limited support for complex sorting requirements
-- - Poor performance with large result sets
-- - Minimal support for index-only scans

MongoDB Compound Indexes provide comprehensive multi-field optimization:

// MongoDB Compound Indexes - advanced multi-field query optimization
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('optimization_platform');

// Create collection with comprehensive compound index strategy
const setupAdvancedIndexing = async () => {
  const userActivities = db.collection('user_activities');

  // 1. Primary compound index for user-centric queries
  await userActivities.createIndex(
    {
      userId: 1,
      applicationId: 1,
      status: 1,
      createdAt: -1
    },
    {
      name: 'idx_user_app_status_time',
      background: true
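      // Note: 'background' is ignored on MongoDB 4.2+, which always builds
      // indexes without holding an exclusive lock for the whole build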
    }
  );

  // 2. Application-centric compound index
  await userActivities.createIndex(
    {
      applicationId: 1,
      activityType: 1,
      priority: -1,
      createdAt: -1
    },
    {
      name: 'idx_app_type_priority_time',
      background: true
    }
  );

  // 3. Status and performance monitoring index
  await userActivities.createIndex(
    {
      status: 1,
      priority: -1,
      executionTimeMs: 1,
      createdAt: -1
    },
    {
      name: 'idx_status_priority_performance',
      background: true
    }
  );

  // 4. Geographic and categorization index
  await userActivities.createIndex(
    {
      countryCode: 1,
      region: 1,
      category: 1,
      subcategory: 1,
      createdAt: -1
    },
    {
      name: 'idx_geo_category_time',
      background: true
    }
  );

  // 5. Advanced compound index with embedded document fields
  await userActivities.createIndex(
    {
      'metadata.source': 1,
      activityType: 1,
      'activityData.amount': -1,
      createdAt: -1
    },
    {
      name: 'idx_source_type_amount_time',
      background: true,
      partialFilterExpression: {
        'metadata.source': { $exists: true },
        'activityData.amount': { $exists: true, $gt: 0 }
      }
    }
  );

  // 6. Text search compound index
  await userActivities.createIndex(
    {
      userId: 1,
      applicationId: 1,
      activityType: 1,
      title: 'text',
      description: 'text',
      'metadata.keywords': 'text'
    },
    {
      name: 'idx_user_app_type_text',
      background: true,
      weights: {
        title: 10,
        description: 5,
        'metadata.keywords': 3
      }
    }
  );

  // 7. Sparse index for optional fields
  await userActivities.createIndex(
    {
      completedAt: -1,
      userId: 1,
      'performance.totalDuration': -1
    },
    {
      name: 'idx_completed_user_duration',
      sparse: true,
      background: true
    }
  );

  // 8. TTL index for automatic data cleanup
  await userActivities.createIndex(
    {
      createdAt: 1
    },
    {
      name: 'idx_ttl_cleanup',
      expireAfterSeconds: 60 * 60 * 24 * 90, // 90 days
      background: true
    }
  );

  console.log('Advanced compound indexes created successfully');
};

// High-performance multi-field query examples
const performAdvancedQueries = async () => {
  const userActivities = db.collection('user_activities');

  // Query 1: User activity dashboard with compound index optimization
  const userDashboard = await userActivities.aggregate([
    // Stage 1: Efficient filtering using compound index
    {
      $match: {
        userId: { $in: [12345, 23456, 34567, 45678] },
        applicationId: { $in: ['web_app', 'mobile_app', 'api_service'] },
        status: { $in: ['completed', 'in_progress', 'failed'] },
        createdAt: {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000),
          $lte: new Date(Date.now() - 60 * 60 * 1000)
        }
      }
    },

    // Stage 2: Additional filtering leveraging partial indexes
    {
      $match: {
        priority: { $gte: 3 },
        countryCode: { $in: ['US', 'CA', 'GB', 'DE'] },
        region: { $exists: true },
        $or: [
          { executionTimeMs: null },
          { executionTimeMs: { $lt: 10000 } }
        ],
        errorCount: { $lte: 5 },
        category: { $in: ['user_interaction', 'system_process', 'data_operation'] },
        'metadata.source': { $in: ['web', 'mobile', 'api'] },
        'activityData.amount': { $gt: 10 }
      }
    },

    // Stage 3: Add computed fields
    {
      $addFields: {
        totalDurationMs: {
          $cond: {
            if: { $ne: ['$completedAt', null] },
            then: { $subtract: ['$completedAt', '$createdAt'] },
            else: null
          }
        },

        activityScore: {
          $switch: {
            branches: [
              {
                case: { 
                  $and: [
                    { $eq: ['$errorCount', 0] },
                    { $eq: ['$status', 'completed'] }
                  ]
                },
                then: 100
              },
              {
                case: { 
                  $and: [
                    { $eq: ['$errorCount', 0] },
                    { $eq: ['$status', 'in_progress'] }
                  ]
                },
                then: 75
              },
              {
                case: { 
                  $and: [
                    { $gt: ['$errorCount', 0] },
                    { $lte: ['$retryCount', 3] }
                  ]
                },
                then: 50
              }
            ],
            default: 25
          }
        }
      }
    },

    // Stage 4: Window functions for ranking
    {
      $setWindowFields: {
        partitionBy: { userId: '$userId', applicationId: '$applicationId' },
        sortBy: { priority: -1, createdAt: -1 },
        output: {
          userAppRank: {
            $denseRank: {}
          },

          // Rolling statistics
          rollingAvgDuration: {
            $avg: '$executionTimeMs',
            window: {
              documents: [-4, 0] // Last 5 activities
            }
          }
        }
      }
    },

    // Stage 5: Final sorting leveraging compound indexes
    {
      $sort: {
        priority: -1,
        createdAt: -1,
        userId: 1
      }
    },

    // Stage 6: Limit results
    {
      $limit: 50
    },

    // Stage 7: Project final structure
    {
      $project: {
        activityId: '$_id',
        userId: 1,
        applicationId: 1,
        activityType: 1,
        status: 1,
        priority: 1,
        createdAt: 1,
        executionTimeMs: 1,
        activityData: 1,
        totalDurationMs: 1,
        userAppRank: 1,
        activityScore: 1,
        rollingAvgDuration: { $round: ['$rollingAvgDuration', 2] },

        // Performance indicators
        isHighPriority: { $gte: ['$priority', 8] },
        isRecentActivity: { 
          $gte: ['$createdAt', new Date(Date.now() - 24 * 60 * 60 * 1000)]
        },
        hasPerformanceIssue: { $gt: ['$executionTimeMs', 5000] }
      }
    }
  ]).toArray();

  console.log('User dashboard query completed:', userDashboard.length, 'results');

  // Query 2: Application performance analysis with optimized grouping
  const appPerformanceAnalysis = await userActivities.aggregate([
    {
      $match: {
        createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
        executionTimeMs: { $exists: true }
      }
    },

    // Group by application and activity type
    {
      $group: {
        _id: {
          applicationId: '$applicationId',
          activityType: '$activityType',
          status: '$status'
        },

        // Volume metrics
        totalActivities: { $sum: 1 },
        uniqueUsers: { $addToSet: '$userId' },

        // Performance metrics
        avgExecutionTime: { $avg: '$executionTimeMs' },
        minExecutionTime: { $min: '$executionTimeMs' },
        maxExecutionTime: { $max: '$executionTimeMs' },
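        // Note: $percentile as a $group accumulator requires MongoDB 7.0+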
        p95ExecutionTime: { 
          $percentile: { 
            input: '$executionTimeMs', 
            p: [0.95], 
            method: 'approximate' 
          } 
        },

        // Error metrics
        errorCount: { $sum: '$errorCount' },
        retryCount: { $sum: '$retryCount' },

        // Success metrics
        successCount: {
          $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] }
        },

        // Time distribution
        activitiesByHour: {
          $push: { $hour: '$createdAt' }
        },

        // Priority distribution
        avgPriority: { $avg: '$priority' },
        maxPriority: { $max: '$priority' }
      }
    },

    // Calculate derived metrics
    {
      $addFields: {
        uniqueUserCount: { $size: '$uniqueUsers' },
        successRate: {
          $multiply: [
            { $divide: ['$successCount', '$totalActivities'] },
            100
          ]
        },
        errorRate: {
          $multiply: [
            { $divide: ['$errorCount', '$totalActivities'] },
            100
          ]
        },

        // Performance classification
        performanceCategory: {
          $switch: {
            branches: [
              {
                case: { $lt: ['$avgExecutionTime', 1000] },
                then: 'fast'
              },
              {
                case: { $lt: ['$avgExecutionTime', 5000] },
                then: 'moderate'
              },
              {
                case: { $lt: ['$avgExecutionTime', 10000] },
                then: 'slow'
              }
            ],
            default: 'critical'
          }
        }
      }
    },

    // Surface problem areas first. A descending string sort on
    // performanceCategory would not place 'critical' first, so sort on
    // the numeric metrics instead.
    {
      $sort: {
        errorRate: -1,
        avgExecutionTime: -1
      }
    }
  ]).toArray();

  console.log('Application performance analysis completed:', appPerformanceAnalysis.length, 'results');

  // Query 3: Advanced text search with compound index
  const textSearchResults = await userActivities.aggregate([
    {
      $match: {
        userId: { $in: [12345, 23456, 34567] },
        applicationId: 'web_app',
        activityType: 'search_query',
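        // Note: for a compound text index with leading non-text keys, the
        // planner generally requires equality matches on those keys; the
        // $in on userId above may prevent use of idx_user_app_type_text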
        $text: {
          $search: 'performance optimization mongodb',
          $caseSensitive: false,
          $diacriticSensitive: false
        }
      }
    },

    {
      $addFields: {
        textScore: { $meta: 'textScore' },
        relevanceScore: {
          $multiply: [
            { $meta: 'textScore' },
            {
              $switch: {
                branches: [
                  { case: { $eq: ['$priority', 10] }, then: 1.5 },
                  { case: { $gte: ['$priority', 8] }, then: 1.2 },
                  { case: { $gte: ['$priority', 5] }, then: 1.0 }
                ],
                default: 0.8
              }
            }
          ]
        }
      }
    },

    {
      $sort: {
        relevanceScore: -1,
        createdAt: -1
      }
    },

    {
      $limit: 20
    }
  ]).toArray();

  console.log('Text search results:', textSearchResults.length, 'matches');

  return {
    userDashboard,
    appPerformanceAnalysis,
    textSearchResults
  };
};

// Index performance analysis and optimization
const analyzeIndexPerformance = async () => {
  const userActivities = db.collection('user_activities');

  // Get index statistics
  const indexStats = await userActivities.aggregate([
    { $indexStats: {} }
  ]).toArray();

  // Analyze query execution plans
  const explainPlan = await userActivities.find({
    userId: { $in: [12345, 23456] },
    applicationId: 'web_app',
    status: 'completed',
    createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
  }).explain('executionStats');

  // Index usage recommendations
  const indexRecommendations = indexStats.map(index => {
    const usage = index.accesses;

    // accesses.since is when the server started tracking this index, so
    // derive a rate from the elapsed time rather than dividing by the epoch
    const daysTracked = Math.max(
      (Date.now() - usage.since.getTime()) / (1000 * 60 * 60 * 24),
      1 / 24
    );
    const opsPerDay = Number(usage.ops) / daysTracked;

    return {
      indexName: index.name,
      keyPattern: index.key,
      usage: usage,
      opsPerDay: Math.round(opsPerDay * 100) / 100,
      recommendation: opsPerDay < 1 ? 'Consider dropping - low usage' :
                      opsPerDay < 10 ? 'Monitor usage patterns' :
                      opsPerDay < 100 ? 'Review query patterns for further gains' :
                      'Performing well'
    };
  });

  console.log('Index Performance Analysis:');
  console.log(JSON.stringify(indexRecommendations, null, 2));

  return {
    indexStats,
    explainPlan,
    indexRecommendations
  };
};

// Advanced compound index patterns for specific use cases
const setupSpecializedIndexes = async () => {
  const userActivities = db.collection('user_activities');

  // 1. Multikey index for array fields
  await userActivities.createIndex(
    {
      tags: 1,
      category: 1,
      createdAt: -1
    },
    {
      name: 'idx_tags_category_time',
      background: true
    }
  );

  // 2. Compound index with hashed sharding key
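  // (compound indexes containing a hashed field require MongoDB 4.4+)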
  await userActivities.createIndex(
    {
      userId: 'hashed',
      createdAt: -1,
      applicationId: 1
    },
    {
      name: 'idx_user_hash_time_app',
      background: true
    }
  );

  // 3. Compound wildcard index for dynamic schemas
  // (compound wildcard indexes require MongoDB 7.0+; wildcardProjection is
  //  only valid when the wildcard term is '$**', so it is omitted here)
  await userActivities.createIndex(
    {
      'metadata.$**': 1,
      activityType: 1
    },
    {
      name: 'idx_metadata_wildcard_type',
      background: true
    }
  );

  // 4. Compound 2dsphere index for geospatial queries
  await userActivities.createIndex(
    {
      'location.coordinates': '2dsphere',
      activityType: 1,
      createdAt: -1
    },
    {
      name: 'idx_geo_type_time',
      background: true
    }
  );

  // 5. Compound partial index for conditional optimization
  await userActivities.createIndex(
    {
      status: 1,
      'performance.executionTimeMs': -1,
      userId: 1
    },
    {
      name: 'idx_status_performance_user_partial',
      background: true,
      partialFilterExpression: {
        status: { $in: ['failed', 'timeout'] },
        'performance.executionTimeMs': { $gt: 5000 }
      }
    }
  );

  console.log('Specialized compound indexes created');
};

// Benefits of MongoDB Compound Indexes:
// - Efficient multi-field query optimization with automatic index selection
// - Support for complex query patterns including range and equality conditions
// - Intelligent query planning with cost-based optimization
// - Index intersection capabilities for optimal query performance
// - Support for sorting and filtering in a single index scan
// - Flexible index ordering to match query patterns
// - Integration with aggregation pipeline optimization
// - Advanced index types including text, geospatial, and wildcard
// - Partial and sparse indexing for memory efficiency
// - Background index building for zero-downtime optimization

module.exports = {
  setupAdvancedIndexing,
  performAdvancedQueries,
  analyzeIndexPerformance,
  setupSpecializedIndexes
};

Understanding MongoDB Compound Index Architecture

Advanced Compound Index Design Patterns

Implement sophisticated compound indexing strategies for different query scenarios:

// Advanced compound indexing design patterns
class CompoundIndexOptimizer {
  constructor(db) {
    this.db = db;
    this.indexAnalytics = new Map();
    this.queryPatterns = new Map();
  }

  async analyzeQueryPatterns(collection, sampleSize = 10000) {
    console.log(`Analyzing query patterns for ${collection.collectionName}...`);

    // Capture query patterns from operations
    const operations = await this.db.admin().command({
      currentOp: 1,
      $all: true,
      ns: { $regex: collection.collectionName }
    });

    // Analyze existing queries from profiler data
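    // (the database profiler must be enabled, e.g. db.setProfilingLevel(1),
    //  for system.profile to contain entries)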
    const profilerData = await this.db.collection('system.profile')
      .find({
        ns: `${this.db.databaseName}.${collection.collectionName}`,
        op: { $in: ['query', 'find', 'aggregate'] }
      })
      .sort({ ts: -1 })
      .limit(sampleSize)
      .toArray();

    // Extract query patterns
    const queryPatterns = this.extractQueryPatterns(profilerData);

    console.log(`Found ${queryPatterns.length} unique query patterns`);
    return queryPatterns;
  }

  extractQueryPatterns(profilerData) {
    const patterns = new Map();

    profilerData.forEach(op => {
      if (op.command && op.command.filter) {
        const filterFields = Object.keys(op.command.filter);
        const sortFields = op.command.sort ? Object.keys(op.command.sort) : [];

        const patternKey = JSON.stringify({
          filter: filterFields.sort(),
          sort: sortFields
        });

        if (!patterns.has(patternKey)) {
          patterns.set(patternKey, {
            filterFields,
            sortFields,
            frequency: 0,
            avgExecutionTime: 0,
            totalExecutionTime: 0
          });
        }

        const pattern = patterns.get(patternKey);
        pattern.frequency++;
        pattern.totalExecutionTime += op.millis || 0;
        pattern.avgExecutionTime = pattern.totalExecutionTime / pattern.frequency;
      }
    });

    return Array.from(patterns.values());
  }

  async generateOptimalIndexes(collection, queryPatterns) {
    console.log('Generating optimal compound indexes...');

    const indexRecommendations = [];

    // Sort patterns by frequency and performance impact
    const sortedPatterns = queryPatterns.sort((a, b) => 
      (b.frequency * b.avgExecutionTime) - (a.frequency * a.avgExecutionTime)
    );

    for (const pattern of sortedPatterns.slice(0, 10)) { // Top 10 patterns
      const indexSpec = this.designCompoundIndex(pattern);

      if (indexSpec && indexSpec.fields.length > 0) {
        indexRecommendations.push({
          pattern: pattern,
          indexSpec: indexSpec,
          estimatedBenefit: pattern.frequency * pattern.avgExecutionTime,
          priority: this.calculateIndexPriority(pattern)
        });
      }
    }

    return indexRecommendations;
  }

  designCompoundIndex(queryPattern) {
    const { filterFields, sortFields } = queryPattern;

    // ESR rule: Equality, Sort, Range
    const equalityFields = [];
    const rangeFields = [];

    // Analyze field types (would need actual query analysis)
    filterFields.forEach(field => {
      // This is simplified - in practice, analyze actual query operators
      if (this.isEqualityField(field)) {
        equalityFields.push(field);
      } else {
        rangeFields.push(field);
      }
    });

    // Construct compound index following ESR rule
    const indexFields = [
      ...equalityFields,
      ...sortFields.filter(field => !equalityFields.includes(field)),
      ...rangeFields.filter(field => 
        !equalityFields.includes(field) && !sortFields.includes(field)
      )
    ];

    return {
      fields: indexFields,
      spec: this.buildIndexSpec(indexFields, sortFields),
      rule: 'ESR (Equality, Sort, Range)',
      rationale: this.explainIndexDesign(equalityFields, sortFields, rangeFields)
    };
  }

  buildIndexSpec(indexFields, sortFields) {
    const spec = {};

    indexFields.forEach(field => {
      // Determine sort order based on usage pattern
      if (sortFields.includes(field)) {
        // Use descending for time-based fields, ascending for others
        spec[field] = field.includes('time') || field.includes('date') || 
                     field.includes('created') || field.includes('updated') ? -1 : 1;
      } else {
        spec[field] = 1; // Default ascending for filtering
      }
    });

    return spec;
  }

  isEqualityField(field) {
    // Heuristic to determine if field is typically used for equality
    const equalityHints = ['id', 'status', 'type', 'category', 'code'];
    return equalityHints.some(hint => field.toLowerCase().includes(hint));
  }

  explainIndexDesign(equalityFields, sortFields, rangeFields) {
    return {
      equalityFields: equalityFields,
      sortFields: sortFields,
      rangeFields: rangeFields,
      reasoning: [
        'Equality fields placed first for maximum selectivity',
        'Sort fields positioned to enable index-based sorting',
        'Range fields placed last to minimize index scan overhead'
      ]
    };
  }

  calculateIndexPriority(pattern) {
    const frequencyWeight = 0.4;
    const performanceWeight = 0.6;

    const normalizedFrequency = Math.min(pattern.frequency / 100, 1);
    const normalizedPerformance = Math.min(pattern.avgExecutionTime / 1000, 1);

    return (normalizedFrequency * frequencyWeight) + 
           (normalizedPerformance * performanceWeight);
  }

  async implementIndexRecommendations(collection, recommendations) {
    console.log(`Implementing ${recommendations.length} index recommendations...`);

    const results = [];

    for (const rec of recommendations) {
      try {
        const indexName = `idx_optimized_${rec.pattern.filterFields.join('_')}`;

        await collection.createIndex(rec.indexSpec.spec, {
          name: indexName,
          background: true
        });

        results.push({
          indexName: indexName,
          spec: rec.indexSpec.spec,
          status: 'created',
          estimatedBenefit: rec.estimatedBenefit,
          priority: rec.priority
        });

        console.log(`Created index: ${indexName}`);

      } catch (error) {
        results.push({
          indexName: `idx_failed_${rec.pattern.filterFields.join('_')}`,
          spec: rec.indexSpec.spec,
          status: 'failed',
          error: error.message
        });

        console.error(`Failed to create index:`, error.message);
      }
    }

    return results;
  }

  async monitorIndexEffectiveness(collection, duration = 24 * 60 * 60 * 1000) {
    console.log('Starting index effectiveness monitoring...');

    const initialStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    // Wait for monitoring period
    await new Promise(resolve => setTimeout(resolve, duration));

    const finalStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    // Compare statistics
    const effectiveness = this.compareIndexStats(initialStats, finalStats, duration);

    return effectiveness;
  }

  compareIndexStats(initialStats, finalStats, durationMs) {
    const effectiveness = [];

    finalStats.forEach(finalStat => {
      const initialStat = initialStats.find(stat => stat.name === finalStat.name);

      if (initialStat) {
        const opsChange = Number(finalStat.accesses.ops) - Number(initialStat.accesses.ops);
        // accesses.since marks when tracking began and is identical in both
        // samples, so derive the rate from the monitoring window instead
        const opsPerHour = durationMs > 0 ? (opsChange / durationMs) * (60 * 60 * 1000) : 0;

        effectiveness.push({
          indexName: finalStat.name,
          keyPattern: finalStat.key,
          operationsChange: opsChange,
          operationsPerHour: Math.round(opsPerHour),
          effectiveness: this.assessEffectiveness(opsPerHour),
          recommendation: this.getEffectivenessRecommendation(opsPerHour)
        });
      }
    });

    return effectiveness;
  }

  assessEffectiveness(opsPerHour) {
    if (opsPerHour < 0.1) return 'unused';
    if (opsPerHour < 1) return 'low';
    if (opsPerHour < 10) return 'moderate';
    if (opsPerHour < 100) return 'high';
    return 'critical';
  }

  getEffectivenessRecommendation(opsPerHour) {
    if (opsPerHour < 0.1) return 'Consider dropping this index';
    if (opsPerHour < 1) return 'Monitor usage patterns';
    if (opsPerHour < 10) return 'Index is providing moderate benefit';
    return 'Index is highly effective';
  }

  async performCompoundIndexBenchmark(collection, testQueries) {
    console.log('Running compound index benchmark...');

    const benchmarkResults = [];

    for (const query of testQueries) {
      console.log(`Testing query: ${JSON.stringify(query.filter)}`);

      // Benchmark without hint (let MongoDB choose)
      const autoResult = await this.benchmarkQuery(collection, query, null);

      // Benchmark with different index hints
      const hintResults = [];
      const indexes = await collection.indexes();

      for (const index of indexes) {
        if (Object.keys(index.key).length > 1) { // Compound indexes only
          const hintResult = await this.benchmarkQuery(collection, query, index.key);
          hintResults.push({
            indexHint: index.key,
            indexName: index.name,
            ...hintResult
          });
        }
      }

      benchmarkResults.push({
        query: query,
        automatic: autoResult,
        withHints: hintResults.sort((a, b) => a.executionTime - b.executionTime)
      });
    }

    return benchmarkResults;
  }

  async benchmarkQuery(collection, query, indexHint, iterations = 5) {
    const times = [];

    for (let i = 0; i < iterations; i++) {
      const startTime = Date.now();

      let cursor = collection.find(query.filter);

      if (indexHint) {
        cursor = cursor.hint(indexHint);
      }

      if (query.sort) {
        cursor = cursor.sort(query.sort);
      }

      if (query.limit) {
        cursor = cursor.limit(query.limit);
      }

      const results = await cursor.toArray();
      const endTime = Date.now();

      times.push({
        executionTime: endTime - startTime,
        resultCount: results.length
      });
    }

    const avgTime = times.reduce((sum, t) => sum + t.executionTime, 0) / times.length;
    const minTime = Math.min(...times.map(t => t.executionTime));
    const maxTime = Math.max(...times.map(t => t.executionTime));

    return {
      averageExecutionTime: Math.round(avgTime),
      minExecutionTime: minTime,
      maxExecutionTime: maxTime,
      resultCount: times[0].resultCount,
      consistency: maxTime - minTime
    };
  }

  async optimizeExistingIndexes(collection) {
    console.log('Analyzing existing indexes for optimization opportunities...');

    const indexes = await collection.indexes();
    const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

    const optimizations = [];

    // Identify unused indexes
    const unusedIndexes = indexStats.filter(stat =>
      Number(stat.accesses.ops) === 0 && stat.name !== '_id_'
    );

    // Identify overlapping indexes
    const overlappingIndexes = this.findOverlappingIndexes(indexes);

    // Identify missing indexes based on query patterns
    const queryPatterns = await this.analyzeQueryPatterns(collection);
    const missingIndexes = this.identifyMissingIndexes(indexes, queryPatterns);

    optimizations.push({
      type: 'unused_indexes',
      count: unusedIndexes.length,
      indexes: unusedIndexes.map(idx => idx.name),
      recommendation: 'Consider dropping these indexes to save storage and maintenance overhead'
    });

    optimizations.push({
      type: 'overlapping_indexes',
      count: overlappingIndexes.length,
      indexes: overlappingIndexes,
      recommendation: 'Consolidate overlapping indexes to improve efficiency'
    });

    optimizations.push({
      type: 'missing_indexes',
      count: missingIndexes.length,
      recommendations: missingIndexes,
      recommendation: 'Create these indexes to improve query performance'
    });

    return optimizations;
  }

  findOverlappingIndexes(indexes) {
    const overlapping = [];

    for (let i = 0; i < indexes.length; i++) {
      for (let j = i + 1; j < indexes.length; j++) {
        const idx1 = indexes[i];
        const idx2 = indexes[j];

        if (this.areIndexesOverlapping(idx1.key, idx2.key)) {
          overlapping.push({
            index1: idx1.name,
            index2: idx2.name,
            keys1: idx1.key,
            keys2: idx2.key,
            overlapType: this.getOverlapType(idx1.key, idx2.key)
          });
        }
      }
    }

    return overlapping;
  }

  areIndexesOverlapping(keys1, keys2) {
    const fields1 = Object.keys(keys1);
    const fields2 = Object.keys(keys2);

    // Check if one index is a prefix of another
    return this.isPrefix(fields1, fields2) || this.isPrefix(fields2, fields1);
  }

  isPrefix(fields1, fields2) {
    if (fields1.length > fields2.length) return false;

    for (let i = 0; i < fields1.length; i++) {
      if (fields1[i] !== fields2[i]) return false;
    }

    return true;
  }

  getOverlapType(keys1, keys2) {
    const fields1 = Object.keys(keys1);
    const fields2 = Object.keys(keys2);

    if (this.isPrefix(fields1, fields2)) {
      return `${fields1.join(',')} is prefix of ${fields2.join(',')}`;
    } else if (this.isPrefix(fields2, fields1)) {
      return `${fields2.join(',')} is prefix of ${fields1.join(',')}`;
    }

    return 'partial_overlap';
  }

  identifyMissingIndexes(existingIndexes, queryPatterns) {
    const missing = [];
    const existingSpecs = existingIndexes.map(idx => JSON.stringify(idx.key));

    queryPatterns.forEach(pattern => {
      const recommendedIndex = this.designCompoundIndex(pattern);
      const specStr = JSON.stringify(recommendedIndex.spec);

      if (!existingSpecs.includes(specStr) && recommendedIndex.fields.length > 0) {
        missing.push({
          pattern: pattern,
          recommendedIndex: recommendedIndex,
          priority: this.calculateIndexPriority(pattern)
        });
      }
    });

    return missing.sort((a, b) => b.priority - a.priority);
  }
}

SQL-Style Compound Index Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB compound index management:

-- QueryLeaf compound index operations with SQL-familiar syntax

-- Create comprehensive compound indexes
CREATE COMPOUND INDEX idx_user_app_status_time ON user_activities (
  user_id ASC,
  application_id ASC, 
  status ASC,
  created_at DESC
) WITH (
  background = true,
  unique = false
);

CREATE COMPOUND INDEX idx_app_type_priority_performance ON user_activities (
  application_id ASC,
  activity_type ASC,
  priority DESC,
  execution_time_ms ASC,
  created_at DESC
) WITH (
  background = true,
  partial_filter = 'execution_time_ms IS NOT NULL AND priority >= 5'
);

-- Create compound text search index
CREATE COMPOUND INDEX idx_user_app_text_search ON user_activities (
  user_id ASC,
  application_id ASC,
  activity_type ASC,
  title TEXT,
  description TEXT,
  keywords TEXT
) WITH (
  weights = JSON_BUILD_OBJECT('title', 10, 'description', 5, 'keywords', 3),
  background = true
);

-- Optimized multi-field queries leveraging compound indexes
WITH user_activity_analysis AS (
  SELECT 
    user_id,
    application_id,
    activity_type,
    status,
    priority,
    created_at,
    execution_time_ms,
    error_count,
    retry_count,
    activity_data,

    -- Performance categorization
    CASE 
      WHEN execution_time_ms IS NULL THEN 'no_data'
      WHEN execution_time_ms < 1000 THEN 'fast'
      WHEN execution_time_ms < 5000 THEN 'moderate' 
      WHEN execution_time_ms < 10000 THEN 'slow'
      ELSE 'critical'
    END as performance_category,

    -- Activity scoring
    CASE
      WHEN error_count = 0 AND status = 'completed' THEN 100
      WHEN error_count = 0 AND status = 'in_progress' THEN 75
      WHEN error_count > 0 AND retry_count <= 3 THEN 50
      ELSE 25
    END as activity_score,

    -- Time-based metrics
    EXTRACT(hour FROM created_at) as activity_hour,
    DATE_TRUNC('day', created_at) as activity_date,

    -- User context
    activity_data->>'source' as source_system,
    CAST(activity_data->>'amount' AS NUMERIC) as transaction_amount,
    activity_data->>'category' as data_category

  FROM user_activities
  WHERE 
    -- Multi-field filtering optimized by compound index
    user_id IN (12345, 23456, 34567, 45678)
    AND application_id IN ('web_app', 'mobile_app', 'api_service')
    AND status IN ('completed', 'in_progress', 'failed')
    AND created_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND created_at <= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND priority >= 3
    AND (execution_time_ms IS NULL OR execution_time_ms < 30000)
    AND error_count <= 5
),

performance_metrics AS (
  SELECT 
    user_id,
    application_id,
    activity_type,

    -- Volume metrics
    COUNT(*) as total_activities,
    COUNT(DISTINCT DATE_TRUNC('day', created_at)) as active_days,
    COUNT(DISTINCT activity_hour) as active_hours,

    -- Performance distribution
    COUNT(*) FILTER (WHERE performance_category = 'fast') as fast_activities,
    COUNT(*) FILTER (WHERE performance_category = 'moderate') as moderate_activities,
    COUNT(*) FILTER (WHERE performance_category = 'slow') as slow_activities,
    COUNT(*) FILTER (WHERE performance_category = 'critical') as critical_activities,

    -- Execution time statistics
    AVG(execution_time_ms) as avg_execution_time,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY execution_time_ms) as median_execution_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY execution_time_ms) as p99_execution_time,
    MIN(execution_time_ms) as min_execution_time,
    MAX(execution_time_ms) as max_execution_time,
    STDDEV_POP(execution_time_ms) as execution_time_stddev,

    -- Status distribution
    COUNT(*) FILTER (WHERE status = 'completed') as completed_count,
    COUNT(*) FILTER (WHERE status = 'failed') as failed_count,
    COUNT(*) FILTER (WHERE status = 'in_progress') as in_progress_count,

    -- Error and retry analysis
    SUM(error_count) as total_errors,
    SUM(retry_count) as total_retries,
    AVG(error_count) as avg_error_rate,
    MAX(error_count) as max_errors_per_activity,

    -- Quality metrics
    AVG(activity_score) as avg_activity_score,
    MIN(activity_score) as min_activity_score,
    MAX(activity_score) as max_activity_score,

    -- Transaction analysis
    AVG(transaction_amount) FILTER (WHERE transaction_amount > 0) as avg_transaction_amount,
    SUM(transaction_amount) FILTER (WHERE transaction_amount > 0) as total_transaction_amount,
    COUNT(*) FILTER (WHERE transaction_amount > 100) as high_value_transactions,

    -- Activity timing patterns
    mode() WITHIN GROUP (ORDER BY activity_hour) as most_active_hour,
    COUNT(DISTINCT source_system) as unique_source_systems,

    -- Recent activity indicators
    MAX(created_at) as last_activity_time,
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours') as recent_24h_activities,
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as recent_1h_activities

  FROM user_activity_analysis
  GROUP BY user_id, application_id, activity_type
),

ranked_performance AS (
  SELECT *,
    -- Performance rankings
    ROW_NUMBER() OVER (
      PARTITION BY application_id 
      ORDER BY avg_execution_time DESC
    ) as slowest_rank,

    ROW_NUMBER() OVER (
      PARTITION BY application_id
      ORDER BY total_errors DESC
    ) as error_rank,

    ROW_NUMBER() OVER (
      PARTITION BY application_id
      ORDER BY total_activities DESC
    ) as volume_rank,

    -- Efficiency scoring
    CASE 
      WHEN avg_execution_time IS NULL THEN 0
      WHEN avg_execution_time > 0 THEN 
        (completed_count::numeric / total_activities) / (avg_execution_time / 1000.0) * 1000
      ELSE 0
    END as efficiency_score,

    -- Performance categorization
    CASE
      WHEN p95_execution_time > 10000 THEN 'critical'
      WHEN p95_execution_time > 5000 THEN 'poor'
      WHEN p95_execution_time > 2000 THEN 'moderate'
      WHEN p95_execution_time > 1000 THEN 'good'
      ELSE 'excellent'
    END as performance_grade,

    -- Error rate classification
    CASE 
      WHEN total_activities > 0 THEN
        CASE
          WHEN (total_errors::numeric / total_activities) > 0.1 THEN 'high_error'
          WHEN (total_errors::numeric / total_activities) > 0.05 THEN 'moderate_error'
          WHEN (total_errors::numeric / total_activities) > 0.01 THEN 'low_error'
          ELSE 'minimal_error'
        END
      ELSE 'no_data'
    END as error_grade

  FROM performance_metrics
),

final_analysis AS (
  SELECT 
    user_id,
    application_id,
    activity_type,
    total_activities,
    active_days,

    -- Performance summary
    ROUND(avg_execution_time::numeric, 2) as avg_execution_time_ms,
    ROUND(median_execution_time::numeric, 2) as median_execution_time_ms,
    ROUND(p95_execution_time::numeric, 2) as p95_execution_time_ms,
    ROUND(p99_execution_time::numeric, 2) as p99_execution_time_ms,
    performance_grade,

    -- Success metrics
    ROUND((completed_count::numeric / total_activities) * 100, 1) as success_rate_pct,
    ROUND((failed_count::numeric / total_activities) * 100, 1) as failure_rate_pct,
    error_grade,

    -- Volume and efficiency
    volume_rank,
    ROUND(efficiency_score::numeric, 2) as efficiency_score,

    -- Financial metrics
    ROUND(total_transaction_amount::numeric, 2) as total_transaction_value,
    high_value_transactions,

    -- Activity patterns
    most_active_hour,
    recent_24h_activities,
    recent_1h_activities,

    -- Rankings and alerts
    slowest_rank,
    error_rank,

    CASE 
      WHEN performance_grade = 'critical' OR error_grade = 'high_error' THEN 'immediate_attention'
      WHEN performance_grade = 'poor' OR error_grade = 'moderate_error' THEN 'needs_optimization'
      WHEN slowest_rank <= 3 OR error_rank <= 3 THEN 'monitor_closely'
      ELSE 'performing_normally'
    END as alert_level,

    -- Recommendations
    CASE 
      WHEN performance_grade = 'critical' THEN 'Investigate performance bottlenecks immediately'
      WHEN error_grade = 'high_error' THEN 'Review error patterns and implement fixes'
      WHEN efficiency_score < 50 THEN 'Optimize processing efficiency'
      WHEN recent_1h_activities = 0 AND recent_24h_activities > 0 THEN 'Monitor for potential issues'
      ELSE 'Continue normal monitoring'
    END as recommendation

  FROM ranked_performance
)
SELECT *
FROM final_analysis
ORDER BY 
  CASE alert_level
    WHEN 'immediate_attention' THEN 1
    WHEN 'needs_optimization' THEN 2
    WHEN 'monitor_closely' THEN 3
    ELSE 4
  END,
  performance_grade DESC,
  total_activities DESC;

-- Advanced compound index analysis and optimization
WITH index_performance AS (
  SELECT 
    index_name,
    key_pattern,
    index_size_mb,

    -- Usage statistics
    total_operations,
    operations_per_day,
    avg_operations_per_query,

    -- Performance impact
    index_hit_ratio,
    avg_query_time_with_index,
    avg_query_time_without_index,
    performance_improvement_pct,

    -- Maintenance overhead
    build_time_minutes,
    storage_overhead_pct,
    update_overhead_ms,

    -- Effectiveness scoring
    (operations_per_day * performance_improvement_pct * index_hit_ratio) / 
    (index_size_mb * update_overhead_ms) as effectiveness_score

  FROM INDEX_PERFORMANCE_STATS()
  WHERE index_type = 'compound'
),

index_recommendations AS (
  SELECT 
    index_name,
    key_pattern,
    operations_per_day,
    ROUND(effectiveness_score::numeric, 4) as effectiveness_score,

    -- Performance classification
    CASE 
      WHEN effectiveness_score > 1000 THEN 'highly_effective'
      WHEN effectiveness_score > 100 THEN 'effective'
      WHEN effectiveness_score > 10 THEN 'moderately_effective' 
      WHEN effectiveness_score > 1 THEN 'minimally_effective'
      ELSE 'ineffective'
    END as effectiveness_category,

    -- Optimization recommendations
    CASE
      WHEN operations_per_day < 1 AND index_size_mb > 100 THEN 'Consider dropping - low usage, high storage cost'
      WHEN effectiveness_score < 1 THEN 'Review index design and query patterns'
      WHEN performance_improvement_pct < 10 THEN 'Minimal performance benefit - evaluate necessity'
      WHEN index_hit_ratio < 0.5 THEN 'Poor selectivity - consider reordering fields'
      WHEN update_overhead_ms > 100 THEN 'High maintenance cost - optimize for write workload'
      ELSE 'Index performing within acceptable parameters'
    END as recommendation,

    -- Priority for attention
    CASE
      WHEN effectiveness_score < 0.1 THEN 'high_priority'
      WHEN effectiveness_score < 1 THEN 'medium_priority'
      ELSE 'low_priority'
    END as optimization_priority,

    -- Storage and performance details
    ROUND(index_size_mb::numeric, 2) as size_mb,
    ROUND(performance_improvement_pct::numeric, 1) as performance_gain_pct,
    ROUND(index_hit_ratio::numeric, 3) as selectivity_ratio,
    build_time_minutes

  FROM index_performance
)
SELECT 
  index_name,
  key_pattern,
  effectiveness_category,
  effectiveness_score,
  operations_per_day,
  performance_gain_pct,
  selectivity_ratio,
  size_mb,
  optimization_priority,
  recommendation

FROM index_recommendations
ORDER BY 
  CASE optimization_priority
    WHEN 'high_priority' THEN 1
    WHEN 'medium_priority' THEN 2
    ELSE 3
  END,
  effectiveness_score DESC;

-- Query execution plan analysis for compound indexes
EXPLAIN (ANALYZE true, VERBOSE true)
SELECT 
  user_id,
  application_id,
  activity_type,
  status,
  priority,
  execution_time_ms,
  created_at
FROM user_activities
WHERE user_id IN (12345, 23456, 34567)
  AND application_id = 'web_app'
  AND status IN ('completed', 'failed')
  AND priority >= 5
  AND created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY priority DESC, created_at DESC
LIMIT 100;

-- Index intersection analysis
WITH query_analysis AS (
  SELECT 
    query_pattern,
    execution_count,
    avg_execution_time_ms,
    index_used,
    index_intersection_count,

    -- Index effectiveness
    rows_examined,
    rows_returned, 
    CASE 
      WHEN rows_examined > 0 THEN rows_returned::numeric / rows_examined
      ELSE 0
    END as index_selectivity,

    -- Performance indicators
    CASE
      WHEN avg_execution_time_ms > 5000 THEN 'slow'
      WHEN avg_execution_time_ms > 1000 THEN 'moderate'
      ELSE 'fast'
    END as performance_category

  FROM QUERY_EXECUTION_STATS()
  WHERE query_type = 'multi_field'
    AND time_period >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)
SELECT 
  query_pattern,
  execution_count,
  ROUND(avg_execution_time_ms::numeric, 2) as avg_time_ms,
  performance_category,
  index_used,
  index_intersection_count,
  ROUND(index_selectivity::numeric, 4) as selectivity,

  -- Optimization opportunities
  CASE 
    WHEN index_selectivity < 0.1 THEN 'Poor index selectivity - consider compound index'
    WHEN index_intersection_count > 2 THEN 'Multiple index intersection - create compound index'
    WHEN performance_category = 'slow' THEN 'Performance issue - review indexing strategy'
    ELSE 'Acceptable performance'
  END as optimization_opportunity,

  rows_examined,
  rows_returned

FROM query_analysis
WHERE execution_count > 10  -- Focus on frequently executed queries
ORDER BY avg_execution_time_ms DESC, execution_count DESC;

-- QueryLeaf provides comprehensive compound indexing capabilities:
-- 1. SQL-familiar compound index creation with advanced options
-- 2. Multi-field query optimization with automatic index selection  
-- 3. Performance analysis and index effectiveness monitoring
-- 4. Query execution plan analysis with detailed statistics
-- 5. Index intersection detection and optimization recommendations
-- 6. Background index building for zero-downtime optimization
-- 7. Partial and sparse indexing for memory and storage efficiency
-- 8. Text search integration with compound field indexing
-- 9. Integration with MongoDB's query planner and optimization
-- 10. Familiar SQL syntax for complex multi-dimensional queries

Best Practices for Compound Index Implementation

Index Design Strategy

Essential principles for optimal compound index design:

  1. ESR Rule: Follow Equality, Sort, Range field ordering for maximum effectiveness (see the sketch after this list)
  2. Query Pattern Analysis: Analyze actual query patterns before designing indexes
  3. Cardinality Optimization: Place high-cardinality fields first for better selectivity
  4. Sort Integration: Design indexes that support both filtering and sorting requirements
  5. Prefix Optimization: Ensure indexes support multiple query patterns through prefixes
  6. Maintenance Balance: Balance query performance with index maintenance overhead
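
As a hedged sketch of the ESR ordering, reusing the db handle from the earlier Node.js examples (the index name and the priority threshold are illustrative):

// ESR sketch: Equality (applicationId, status) -> Sort (createdAt) -> Range (priority)
async function demonstrateEsrOrdering() {
  const activities = db.collection('user_activities');

  await activities.createIndex(
    { applicationId: 1, status: 1, createdAt: -1, priority: 1 },
    { name: 'idx_esr_example' }
  );

  // Confirm the planner uses the index and avoids an in-memory sort
  const plan = await activities
    .find({
      applicationId: 'web_app',     // equality
      status: 'completed',          // equality
      priority: { $gte: 5 }         // range
    })
    .sort({ createdAt: -1 })        // sort served by the index
    .explain('executionStats');

  console.log(JSON.stringify(plan.queryPlanner.winningPlan, null, 2));
}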

Performance and Scalability

Optimize compound indexes for production workloads:

  1. Index Intersection: Understand when MongoDB uses multiple indexes vs. compound indexes
  2. Memory Utilization: Monitor index memory usage and working set requirements
  3. Write Performance: Balance read optimization with write performance impact
  4. Partial Indexes: Use partial indexes to reduce storage and maintenance overhead (see the sketch after this list)
  5. Index Statistics: Regularly analyze index usage patterns and effectiveness
  6. Background Building: Use background index creation for zero-downtime deployments
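
The sketch below illustrates points 4 and 5: a partial compound index plus a quick usage check with $indexStats, again reusing the db handle from the earlier examples (index name and limits are illustrative):

// Partial index + usage review sketch
async function reviewPartialIndexUsage() {
  const activities = db.collection('user_activities');

  // Index only failed activities so the index stays small and cheap to maintain
  await activities.createIndex(
    { applicationId: 1, createdAt: -1 },
    {
      name: 'idx_failed_app_time',
      partialFilterExpression: { status: 'failed' }
    }
  );

  // Queries must include the partial filter predicate to be eligible for this index
  const recentFailures = await activities
    .find({ status: 'failed', applicationId: 'web_app' })
    .sort({ createdAt: -1 })
    .limit(25)
    .toArray();

  // Periodically review per-index access counters to spot unused indexes
  const usage = await activities.aggregate([{ $indexStats: {} }]).toArray();
  usage.forEach(ix => console.log(ix.name, ix.accesses.ops.toString()));

  return recentFailures;
}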

Conclusion

MongoDB Compound Indexes provide sophisticated multi-field query optimization that eliminates the complexity and limitations of traditional relational indexing approaches. The integration of intelligent query planning, automatic index selection, and flexible field ordering makes building high-performance multi-dimensional queries both powerful and efficient.

Key Compound Index benefits include:

  • Advanced Query Optimization: Intelligent index selection and query path optimization
  • Multi-Field Efficiency: Single index supporting complex filtering, sorting, and range queries
  • Flexible Design Patterns: Support for various query patterns through strategic field ordering
  • Performance Monitoring: Comprehensive index usage analytics and optimization recommendations
  • Scalable Architecture: Efficient performance across large datasets and high-concurrency workloads
  • Developer Familiarity: SQL-style compound index creation and management patterns

Whether you're building analytics platforms, real-time dashboards, e-commerce applications, or any system requiring complex multi-field queries, MongoDB Compound Indexes with QueryLeaf's familiar SQL interface provide the foundation for optimal query performance. This combination enables sophisticated indexing strategies while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB compound index operations while providing SQL-familiar index creation, query optimization, and performance analysis. Advanced indexing strategies, query planning, and index effectiveness monitoring are seamlessly handled through familiar SQL patterns, making sophisticated database optimization both powerful and accessible.

The integration of advanced compound indexing capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both complex multi-field query performance and familiar database interaction patterns, ensuring your optimization strategies remain both effective and maintainable as they scale and evolve.

MongoDB Change Streams and Event-Driven Architecture: Building Reactive Applications with SQL-Style Event Processing

Modern applications increasingly require real-time responsiveness and event-driven architectures that can react instantly to data changes across distributed systems. Traditional polling-based approaches for change detection introduce significant latency, resource overhead, and scaling challenges that make building responsive applications complex and inefficient.

MongoDB Change Streams provide native event streaming capabilities that enable applications to watch for data changes in real-time, triggering immediate reactions without polling overhead. Unlike traditional database triggers or external change data capture systems, MongoDB Change Streams offer a unified, scalable approach to event-driven architecture that works seamlessly across replica sets and sharded clusters.
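
As a minimal sketch (the shop database and orders collection are hypothetical; change streams require a replica set or sharded cluster), watching a collection looks like this:

// Minimal sketch: reacting to inserts and updates with a Change Stream
const { MongoClient } = require('mongodb');

async function watchOrders() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Only surface inserts and updates
  const pipeline = [
    { $match: { operationType: { $in: ['insert', 'update'] } } }
  ];

  const changeStream = orders.watch(pipeline, { fullDocument: 'updateLookup' });

  changeStream.on('change', event => {
    // React immediately - no polling loop or "processed" flag required
    console.log(event.operationType, event.documentKey._id);
  });

  return changeStream; // caller can close() it when done
}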

The Traditional Change Detection Challenge

Traditional approaches to detecting and reacting to data changes have significant architectural and performance limitations:

-- Traditional polling approach - inefficient and high-latency

-- PostgreSQL polling-based change detection
CREATE TABLE user_activities (
    activity_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

-- Polling query runs every few seconds
SELECT 
    activity_id,
    user_id,
    activity_type,
    activity_data,
    created_at
FROM user_activities 
WHERE processed = FALSE 
ORDER BY created_at ASC 
LIMIT 100;

-- Mark as processed after handling
UPDATE user_activities 
SET processed = TRUE, updated_at = CURRENT_TIMESTAMP
WHERE activity_id IN (1, 2, 3, ...);

-- Problems with polling approach:
-- 1. High latency - changes only detected on poll intervals
-- 2. Resource waste - constant querying even when no changes
-- 3. Scaling issues - increased polling frequency impacts performance
-- 4. Race conditions - multiple consumers competing for same records
-- 5. Complex state management - tracking processed vs unprocessed
-- 6. Poor real-time experience - delays in reaction to changes

-- Database trigger approach (limited and complex)
CREATE OR REPLACE FUNCTION notify_activity_change()
RETURNS TRIGGER AS $$
DECLARE
    rec RECORD;
BEGIN
    -- DELETE fires with no NEW row, so pick whichever record exists
    IF TG_OP = 'DELETE' THEN
        rec := OLD;
    ELSE
        rec := NEW;
    END IF;

    PERFORM pg_notify('activity_changes', 
        json_build_object(
            'activity_id', rec.activity_id,
            'user_id', rec.user_id,
            'activity_type', rec.activity_type,
            'operation', TG_OP
        )::text
    );
    RETURN rec;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activity_change_trigger
AFTER INSERT OR UPDATE OR DELETE ON user_activities
FOR EACH ROW EXECUTE FUNCTION notify_activity_change();

-- Trigger limitations:
-- - Limited to single database instance
-- - No ordering guarantees across tables
-- - Difficult error handling and retry logic
-- - Complex setup for distributed systems
-- - No built-in filtering or transformation
-- - Poor integration with modern event architectures

-- MySQL limitations (even more restrictive)
CREATE TABLE change_log (
    id INT AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(100),
    record_id VARCHAR(100), 
    operation VARCHAR(10),
    change_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Basic trigger for change tracking
DELIMITER $$
CREATE TRIGGER user_change_tracker
AFTER INSERT ON users
FOR EACH ROW
BEGIN
    INSERT INTO change_log (table_name, record_id, operation, change_data)
    VALUES ('users', NEW.id, 'INSERT', JSON_OBJECT('user_id', NEW.id));
END$$
DELIMITER ;

-- MySQL trigger limitations:
-- - Very limited JSON functionality
-- - No advanced event routing capabilities
-- - Poor performance with high-volume changes
-- - Complex maintenance and debugging
-- - No distributed system support

MongoDB Change Streams provide comprehensive event-driven capabilities:

// MongoDB Change Streams - native event-driven architecture
const { MongoClient } = require('mongodb');
const crypto = require('crypto'); // used later to generate event identifiers

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('event_driven_platform');

// Advanced Change Stream implementation for event-driven architecture
class EventDrivenMongoDBPlatform {
  constructor(db) {
    this.db = db;
    this.changeStreams = new Map();
    this.eventHandlers = new Map();
    this.metrics = {
      eventsProcessed: 0,
      lastEvent: null,
      errorCount: 0
    };
  }

  async setupEventDrivenCollections() {
    // Create collections for different event types
    const collections = {
      userActivities: this.db.collection('user_activities'),
      orderEvents: this.db.collection('order_events'),
      inventoryChanges: this.db.collection('inventory_changes'),
      systemEvents: this.db.collection('system_events'),
      auditLog: this.db.collection('audit_log')
    };

    // Create indexes for optimal change stream performance
    for (const [name, collection] of Object.entries(collections)) {
      await collection.createIndex({ userId: 1, timestamp: -1 });
      await collection.createIndex({ eventType: 1, status: 1 });
      await collection.createIndex({ createdAt: -1 });
    }

    return collections;
  }

  async startChangeStreamWatchers() {
    console.log('Starting change stream watchers...');

    // 1. Watch all changes across entire database
    await this.watchDatabaseChanges();

    // 2. Watch specific collection changes with filtering
    await this.watchUserActivityChanges();

    // 3. Watch order processing pipeline
    await this.watchOrderEvents();

    // 4. Watch inventory for real-time stock updates
    await this.watchInventoryChanges();

    console.log('All change stream watchers started');
  }

  async watchDatabaseChanges() {
    console.log('Setting up database-level change stream...');

    const changeStream = this.db.watch(
      [
        // Pipeline to filter and transform events
        {
          $match: {
            // Only watch insert, update, delete operations
            operationType: { $in: ['insert', 'update', 'delete', 'replace'] },

            // Exclude system collections and temporary data
            'ns.coll': {
              // RegExp literal keeps the dot escaped correctly
              $not: /^(system\.|temp_)/
            }
          }
        },
        {
          $addFields: {
            // Add event metadata
            eventId: { $toString: '$_id' },
            eventTimestamp: '$clusterTime',
            database: '$ns.db',
            collection: '$ns.coll',

            // Create standardized event structure
            eventData: {
              $switch: {
                branches: [
                  {
                    case: { $eq: ['$operationType', 'insert'] },
                    then: {
                      operation: 'created',
                      document: '$fullDocument'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'update'] },
                    then: {
                      operation: 'updated', 
                      documentKey: '$documentKey',
                      updatedFields: '$updateDescription.updatedFields',
                      removedFields: '$updateDescription.removedFields'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'delete'] },
                    then: {
                      operation: 'deleted',
                      documentKey: '$documentKey'
                    }
                  }
                ],
                default: {
                  operation: '$operationType',
                  documentKey: '$documentKey'
                }
              }
            }
          }
        }
      ],
      {
        fullDocument: 'updateLookup', // Include full document for updates
        fullDocumentBeforeChange: 'whenAvailable' // Include before state
      }
    );

    this.changeStreams.set('database', changeStream);

    // Handle database-level events
    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleDatabaseEvent(changeEvent);
        this.updateMetrics('database', changeEvent);
      } catch (error) {
        console.error('Error handling database event:', error);
        this.metrics.errorCount++;
      }
    });

    changeStream.on('error', (error) => {
      console.error('Database change stream error:', error);
      this.handleChangeStreamError('database', error);
    });
  }

  async watchUserActivityChanges() {
    console.log('Setting up user activity change stream...');

    const userActivities = this.db.collection('user_activities');

    const changeStream = userActivities.watch(
      [
        {
          $match: {
            operationType: { $in: ['insert', 'update'] },

            // Only watch for significant user activities
            $or: [
              { 'fullDocument.activityType': 'login' },
              { 'fullDocument.activityType': 'purchase' },
              { 'fullDocument.activityType': 'subscription_change' },
              { 'fullDocument.status': 'completed' },
              { 'updateDescription.updatedFields.status': 'completed' }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('userActivities', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleUserActivityEvent(changeEvent);

        // Trigger downstream events based on activity type
        await this.triggerDownstreamEvents('user_activity', changeEvent);

      } catch (error) {
        console.error('Error handling user activity event:', error);
        await this.logEventError('user_activities', changeEvent, error);
      }
    });
  }

  async watchOrderEvents() {
    console.log('Setting up order events change stream...');

    const orderEvents = this.db.collection('order_events');

    const changeStream = orderEvents.watch(
      [
        {
          $match: {
            operationType: 'insert',

            // Order lifecycle events
            'fullDocument.eventType': {
              $in: ['order_created', 'payment_processed', 'order_shipped', 
                   'order_delivered', 'order_cancelled', 'refund_processed']
            }
          }
        },
        {
          $addFields: {
            // Enrich with order context
            orderStage: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 'pending' },
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 'confirmed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_shipped'] }, then: 'in_transit' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_delivered'] }, then: 'completed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 'cancelled' }
                ],
                default: 'unknown'
              }
            },

            // Priority for event processing
            processingPriority: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 2 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'refund_processed'] }, then: 1 }
                ],
                default: 3
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    this.changeStreams.set('orderEvents', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Route to appropriate order processing handler
        await this.processOrderEventChange(changeEvent);

        // Update order state machine
        await this.updateOrderStateMachine(changeEvent);

        // Trigger business logic workflows
        await this.triggerOrderWorkflows(changeEvent);

      } catch (error) {
        console.error('Error processing order event:', error);
        await this.handleOrderEventError(changeEvent, error);
      }
    });
  }

  async watchInventoryChanges() {
    console.log('Setting up inventory change stream...');

    const inventoryChanges = this.db.collection('inventory_changes');

    const changeStream = inventoryChanges.watch(
      [
        {
          $match: {
            $or: [
              // Stock level changes
              { 
                operationType: 'update',
                'updateDescription.updatedFields.stockLevel': { $exists: true }
              },
              // New inventory items
              {
                operationType: 'insert',
                'fullDocument.itemType': 'product'
              },
              // Inventory alerts
              {
                operationType: 'insert',
                'fullDocument.alertType': { $in: ['low_stock', 'out_of_stock', 'restock'] }
              }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('inventoryChanges', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Real-time inventory updates
        await this.handleInventoryChange(changeEvent);

        // Check for low stock alerts
        await this.checkInventoryAlerts(changeEvent);

        // Update product availability in real-time
        await this.updateProductAvailability(changeEvent);

        // Notify relevant systems (pricing, recommendations, etc.)
        await this.notifyInventorySubscribers(changeEvent);

      } catch (error) {
        console.error('Error handling inventory change:', error);
        await this.logInventoryError(changeEvent, error);
      }
    });
  }

  async handleDatabaseEvent(changeEvent) {
    const { database, collection, eventData, operationType } = changeEvent;

    console.log(`Database Event: ${operationType} in ${database}.${collection}`);

    // Global event logging
    await this.logGlobalEvent({
      eventId: changeEvent.eventId,
      timestamp: new Date(changeEvent.clusterTime),
      database: database,
      collection: collection,
      operation: operationType,
      eventData: eventData
    });

    // Route to collection-specific handlers
    await this.routeCollectionEvent(collection, changeEvent);

    // Update global metrics and monitoring
    await this.updateGlobalMetrics(changeEvent);
  }

  async handleUserActivityEvent(changeEvent) {
    const { fullDocument, operationType } = changeEvent;
    const activity = fullDocument;

    console.log(`User Activity: ${activity.activityType} for user ${activity.userId}`);

    // Real-time user analytics
    if (activity.activityType === 'login') {
      await this.updateUserSession(activity);
      await this.trackUserLocation(activity);
    }

    // Purchase events
    if (activity.activityType === 'purchase') {
      await this.processRealtimePurchase(activity);
      await this.updateRecommendations(activity.userId);
      await this.triggerLoyaltyUpdates(activity);
    }

    // Subscription changes
    if (activity.activityType === 'subscription_change') {
      await this.processSubscriptionChange(activity);
      await this.updateBilling(activity);
    }

    // Create reactive events for downstream systems
    await this.publishUserEvent(activity, operationType);
  }

  async processOrderEventChange(changeEvent) {
    const { fullDocument: orderEvent } = changeEvent;

    console.log(`Order Event: ${orderEvent.eventType} for order ${orderEvent.orderId}`);

    switch (orderEvent.eventType) {
      case 'order_created':
        await this.processNewOrder(orderEvent);
        break;

      case 'payment_processed':
        await this.confirmOrderPayment(orderEvent);
        await this.triggerFulfillment(orderEvent);
        break;

      case 'order_shipped':
        await this.updateShippingTracking(orderEvent);
        await this.notifyCustomer(orderEvent);
        break;

      case 'order_delivered':
        await this.completeOrder(orderEvent);
        await this.triggerPostDeliveryWorkflow(orderEvent);
        break;

      case 'order_cancelled':
        await this.processCancellation(orderEvent);
        await this.handleRefund(orderEvent);
        break;
    }

    // Update order analytics in real-time
    await this.updateOrderAnalytics(orderEvent);
  }

  async handleInventoryChange(changeEvent) {
    const { fullDocument: inventory, operationType } = changeEvent;

    console.log(`Inventory Change: ${operationType} for item ${inventory.itemId}`);

    // Real-time stock updates
    if (changeEvent.updateDescription?.updatedFields?.stockLevel !== undefined) {
      const newStock = changeEvent.fullDocument.stockLevel;
      const previousStock = changeEvent.fullDocumentBeforeChange?.stockLevel || 0;

      await this.handleStockLevelChange({
        itemId: inventory.itemId,
        previousStock: previousStock,
        newStock: newStock,
        changeAmount: newStock - previousStock
      });
    }

    // Product availability updates
    await this.updateProductCatalog(inventory);

    // Pricing adjustments based on stock levels
    await this.updateDynamicPricing(inventory);
  }

  async triggerDownstreamEvents(eventType, changeEvent) {
    // Message queue integration for external systems
    const event = {
      eventId: crypto.randomUUID(), // unique event identifier (Node 14.17+)
      eventType: eventType,
      timestamp: new Date(),
      source: 'mongodb-change-stream',
      data: changeEvent,
      version: '1.0'
    };

    // Publish to different channels based on event type
    await this.publishToEventBus(event);
    await this.updateEventSourcing(event);
    await this.triggerWebhooks(event);
  }

  async publishToEventBus(event) {
    // Integration with message queues (Kafka, RabbitMQ, etc.)
    console.log(`Publishing event ${event.eventId} to event bus`);

    // Route to appropriate topics/queues
    const routingKey = `${event.eventType}.${event.data.operationType}`;

    // Simulate message queue publishing
    // await messageQueue.publish(routingKey, event);
  }

  async setupResumeTokenPersistence() {
    // Persist resume tokens for fault tolerance
    const resumeTokens = this.db.collection('change_stream_resume_tokens');

    // Save resume tokens periodically
    setInterval(async () => {
      for (const [streamName, changeStream] of this.changeStreams.entries()) {
        try {
          const resumeToken = changeStream.resumeToken;
          if (resumeToken) {
            await resumeTokens.updateOne(
              { streamName: streamName },
              {
                $set: {
                  resumeToken: resumeToken,
                  lastUpdated: new Date()
                }
              },
              { upsert: true }
            );
          }
        } catch (error) {
          console.error(`Error saving resume token for ${streamName}:`, error);
        }
      }
    }, 10000); // Every 10 seconds
  }

  async handleChangeStreamError(streamName, error) {
    console.error(`Change stream ${streamName} encountered error:`, error);

    // Implement retry logic with exponential backoff
    setTimeout(async () => {
      try {
        console.log(`Attempting to restart change stream: ${streamName}`);

        // Load last known resume token
        const resumeTokenDoc = await this.db.collection('change_stream_resume_tokens')
          .findOne({ streamName: streamName });

        // Restart stream from last known position
        if (resumeTokenDoc?.resumeToken) {
          // Restart with resume token
          await this.restartChangeStream(streamName, resumeTokenDoc.resumeToken);
        } else {
          // Restart from current time
          await this.restartChangeStream(streamName);
        }

      } catch (retryError) {
        console.error(`Failed to restart change stream ${streamName}:`, retryError);
        // Implement exponential backoff retry
      }
    }, 5000); // Initial 5-second delay
  }

  async getChangeStreamMetrics() {
    return {
      activeStreams: this.changeStreams.size,
      eventsProcessed: this.metrics.eventsProcessed,
      lastEventTime: this.metrics.lastEvent,
      errorCount: this.metrics.errorCount,

      streamHealth: Array.from(this.changeStreams.entries()).map(([name, stream]) => ({
        name: name,
        isActive: !stream.closed,
        hasResumeToken: !!stream.resumeToken
      }))
    };
  }

  updateMetrics(streamName, changeEvent) {
    this.metrics.eventsProcessed++;
    this.metrics.lastEvent = new Date();

    console.log(`Processed event from ${streamName}: ${changeEvent.operationType}`);
  }

  async shutdown() {
    console.log('Shutting down change streams...');

    // Close all change streams gracefully
    for (const [name, changeStream] of this.changeStreams.entries()) {
      try {
        await changeStream.close();
        console.log(`Closed change stream: ${name}`);
      } catch (error) {
        console.error(`Error closing change stream ${name}:`, error);
      }
    }

    this.changeStreams.clear();
    console.log('All change streams closed');
  }
}

// Usage example
const startEventDrivenPlatform = async () => {
  try {
    const platform = new EventDrivenMongoDBPlatform(db);

    // Setup collections and indexes
    await platform.setupEventDrivenCollections();

    // Start change stream watchers
    await platform.startChangeStreamWatchers();

    // Setup fault tolerance
    await platform.setupResumeTokenPersistence();

    // Monitor platform health
    setInterval(async () => {
      const metrics = await platform.getChangeStreamMetrics();
      console.log('Platform Metrics:', metrics);
    }, 30000); // Every 30 seconds

    console.log('Event-driven platform started successfully');
    return platform;

  } catch (error) {
    console.error('Error starting event-driven platform:', error);
    throw error;
  }
};

// Benefits of MongoDB Change Streams:
// - Real-time event processing without polling overhead
// - Ordered, durable event streams with resume token support  
// - Cluster-wide change detection across replica sets and shards
// - Rich filtering and transformation capabilities through aggregation pipelines
// - Built-in fault tolerance and automatic failover
// - Integration with MongoDB's ACID transactions
// - Scalable event-driven architecture foundation
// - Native integration with MongoDB ecosystem and tools

module.exports = {
  EventDrivenMongoDBPlatform,
  startEventDrivenPlatform
};

Understanding MongoDB Change Streams Architecture

Advanced Change Stream Patterns

Implement sophisticated change stream patterns for different event-driven scenarios:

// Advanced change stream patterns and event processing
class AdvancedChangeStreamPatterns {
  constructor(db) {
    this.db = db;
    this.eventProcessors = new Map();
    this.eventStore = db.collection('event_store');
    this.eventProjections = db.collection('event_projections');
  }

  async setupEventSourcingPattern() {
    // Event sourcing with change streams
    console.log('Setting up event sourcing pattern...');

    const aggregateCollections = [
      'user_aggregates',
      'order_aggregates', 
      'inventory_aggregates',
      'payment_aggregates'
    ];

    for (const collectionName of aggregateCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'replace'] }
            }
          },
          {
            $addFields: {
              // Create event sourcing envelope
              eventEnvelope: {
                eventId: { $toString: '$_id' },
                eventType: '$operationType',
                aggregateId: '$documentKey._id',
                aggregateType: collectionName,
                eventVersion: { $ifNull: ['$fullDocument.version', 1] },
                eventData: '$fullDocument',
                eventMetadata: {
                  timestamp: '$clusterTime',
                  source: 'change-stream',
                  causationId: '$fullDocument.causationId',
                  correlationId: '$fullDocument.correlationId'
                }
              }
            }
          }
        ],
        {
          fullDocument: 'updateLookup',
          fullDocumentBeforeChange: 'whenAvailable'
        }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processEventSourcingEvent(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_eventsourcing`, changeStream);
    }
  }

  async processEventSourcingEvent(changeEvent) {
    const { eventEnvelope } = changeEvent;

    // Store event in event store
    await this.eventStore.insertOne({
      ...eventEnvelope,
      storedAt: new Date(),
      processedBy: [],
      projectionStatus: 'pending'
    });

    // Update read model projections
    await this.updateProjections(eventEnvelope);

    // Trigger sagas and process managers
    await this.triggerSagas(eventEnvelope);
  }

  async setupCQRSPattern() {
    // Command Query Responsibility Segregation with change streams
    console.log('Setting up CQRS pattern...');

    const commandCollections = ['commands', 'command_results'];

    for (const collectionName of commandCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: 'insert',
              'fullDocument.status': { $ne: 'processed' }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCommand(changeEvent.fullDocument);
      });

      this.eventProcessors.set(`${collectionName}_cqrs`, changeStream);
    }
  }

  async setupSagaOrchestration() {
    // Saga pattern for distributed transaction coordination
    console.log('Setting up saga orchestration...');

    const sagaCollection = this.db.collection('sagas');

    const changeStream = sagaCollection.watch(
      [
        {
          $match: {
            $or: [
              { operationType: 'insert' },
              { 
                operationType: 'update',
                'updateDescription.updatedFields.status': { $exists: true }
              }
            ]
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    changeStream.on('change', async (changeEvent) => {
      await this.processSagaEvent(changeEvent);
    });

    this.eventProcessors.set('saga_orchestration', changeStream);
  }

  async processSagaEvent(changeEvent) {
    const saga = changeEvent.fullDocument;
    const { sagaId, status, currentStep, steps } = saga;

    console.log(`Processing saga ${sagaId}: ${status} at step ${currentStep}`);

    switch (status) {
      case 'started':
        await this.executeSagaStep(saga, 0);
        break;

      case 'step_completed':
        if (currentStep + 1 < steps.length) {
          await this.executeSagaStep(saga, currentStep + 1);
        } else {
          await this.completeSaga(sagaId);
        }
        break;

      case 'step_failed':
        await this.compensateSaga(saga, currentStep);
        break;

      case 'compensating':
        if (currentStep > 0) {
          await this.executeCompensation(saga, currentStep - 1);
        } else {
          await this.failSaga(sagaId);
        }
        break;
    }
  }

  async setupStreamProcessing() {
    // Stream processing with windowed aggregations
    console.log('Setting up stream processing...');

    const eventStream = this.db.collection('events');

    const changeStream = eventStream.watch(
      [
        {
          $match: {
            operationType: 'insert',
            'fullDocument.eventType': { $in: ['user_activity', 'transaction', 'system_event'] }
          }
        },
        {
          $addFields: {
            processingWindow: {
              $dateTrunc: {
                date: '$fullDocument.timestamp',
                unit: 'minute',
                binSize: 5 // 5-minute windows
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    let windowBuffer = new Map();

    changeStream.on('change', async (changeEvent) => {
      await this.processStreamEvent(changeEvent, windowBuffer);
    });

    // Process window aggregations every minute
    setInterval(async () => {
      await this.processWindowedAggregations(windowBuffer);
    }, 60000);

    this.eventProcessors.set('stream_processing', changeStream);
  }

  async processStreamEvent(changeEvent, windowBuffer) {
    const event = changeEvent.fullDocument;
    const window = changeEvent.processingWindow;
    const windowKey = window.toISOString();

    if (!windowBuffer.has(windowKey)) {
      windowBuffer.set(windowKey, {
        window: window,
        events: [],
        aggregations: {
          count: 0,
          userActivities: 0,
          transactions: 0,
          systemEvents: 0,
          totalValue: 0
        }
      });
    }

    const windowData = windowBuffer.get(windowKey);
    windowData.events.push(event);
    windowData.aggregations.count++;

    // Type-specific aggregations
    switch (event.eventType) {
      case 'user_activity':
        windowData.aggregations.userActivities++;
        break;
      case 'transaction':
        windowData.aggregations.transactions++;
        windowData.aggregations.totalValue += event.amount || 0;
        break;
      case 'system_event':
        windowData.aggregations.systemEvents++;
        break;
    }

    // Real-time alerting for anomalies
    if (windowData.aggregations.count > 1000) {
      await this.triggerVolumeAlert(windowKey, windowData);
    }
  }

  async setupMultiCollectionCoordination() {
    // Coordinate changes across multiple collections
    console.log('Setting up multi-collection coordination...');

    const coordinationConfig = [
      {
        collections: ['users', 'user_preferences', 'user_activities'],
        coordinator: 'userProfileCoordinator'
      },
      {
        collections: ['orders', 'order_items', 'payments', 'shipping'],
        coordinator: 'orderProcessingCoordinator' 
      },
      {
        collections: ['products', 'inventory', 'pricing', 'reviews'],
        coordinator: 'productManagementCoordinator'
      }
    ];

    for (const config of coordinationConfig) {
      await this.setupCollectionCoordinator(config);
    }
  }

  async setupCollectionCoordinator(config) {
    const { collections, coordinator } = config;

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'delete'] }
            }
          },
          {
            $addFields: {
              coordinationContext: {
                coordinator: coordinator,
                sourceCollection: collectionName,
                relatedCollections: collections.filter(c => c !== collectionName)
              }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCoordinatedChange(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_${coordinator}`, changeStream);
    }
  }

  async processCoordinatedChange(changeEvent) {
    const { coordinationContext, fullDocument, operationType } = changeEvent;
    const { coordinator, sourceCollection, relatedCollections } = coordinationContext;

    console.log(`Coordinated change in ${sourceCollection} via ${coordinator}`);

    // Execute coordination logic based on coordinator type
    switch (coordinator) {
      case 'userProfileCoordinator':
        await this.coordinateUserProfileChanges(changeEvent);
        break;

      case 'orderProcessingCoordinator':
        await this.coordinateOrderProcessing(changeEvent);
        break;

      case 'productManagementCoordinator':
        await this.coordinateProductManagement(changeEvent);
        break;
    }
  }

  async coordinateUserProfileChanges(changeEvent) {
    const { fullDocument, operationType, ns } = changeEvent;
    const sourceCollection = ns.coll;

    if (sourceCollection === 'users' && operationType === 'update') {
      // User profile updated - sync preferences and activities
      await this.syncUserPreferences(fullDocument._id);
      await this.updateUserActivityContext(fullDocument._id);
    }

    if (sourceCollection === 'user_activities' && operationType === 'insert') {
      // New activity - update user profile analytics
      await this.updateUserAnalytics(fullDocument.userId, fullDocument);
    }
  }

  async setupChangeStreamHealthMonitoring() {
    // Health monitoring and metrics collection
    console.log('Setting up change stream health monitoring...');

    const healthMetrics = {
      totalStreams: 0,
      activeStreams: 0,
      eventsProcessed: 0,
      errorCount: 0,
      lastProcessedEvent: null,
      streamLatency: new Map()
    };

    // Monitor each change stream
    for (const [streamName, changeStream] of this.eventProcessors.entries()) {
      healthMetrics.totalStreams++;

      if (!changeStream.closed) {
        healthMetrics.activeStreams++;
      }

      // Monitor stream latency
      const originalEmit = changeStream.emit;
      changeStream.emit = function(event, ...args) {
        if (event === 'change') {
          // clusterTime is a BSON Timestamp whose high 32 bits hold seconds since the epoch
          const latency = Date.now() - args[0].clusterTime.getHighBits() * 1000;
          healthMetrics.streamLatency.set(streamName, latency);
          healthMetrics.lastProcessedEvent = new Date();
          healthMetrics.eventsProcessed++;
        }
        return originalEmit.call(this, event, ...args);
      };

      // Monitor errors
      changeStream.on('error', (error) => {
        healthMetrics.errorCount++;
        console.error(`Stream ${streamName} error:`, error);
      });
    }

    // Periodic health reporting
    setInterval(() => {
      this.reportHealthMetrics(healthMetrics);
    }, 30000); // Every 30 seconds

    return healthMetrics;
  }

  reportHealthMetrics(metrics) {
    const avgLatency = Array.from(metrics.streamLatency.values())
      .reduce((sum, latency) => sum + latency, 0) / metrics.streamLatency.size || 0;

    console.log('Change Stream Health Report:', {
      totalStreams: metrics.totalStreams,
      activeStreams: metrics.activeStreams,
      eventsProcessed: metrics.eventsProcessed,
      errorCount: metrics.errorCount,
      averageLatency: Math.round(avgLatency) + 'ms',
      lastActivity: metrics.lastProcessedEvent
    });
  }

  async shutdown() {
    console.log('Shutting down advanced change stream patterns...');

    for (const [name, processor] of this.eventProcessors.entries()) {
      try {
        await processor.close();
        console.log(`Closed processor: ${name}`);
      } catch (error) {
        console.error(`Error closing processor ${name}:`, error);
      }
    }

    this.eventProcessors.clear();
  }
}

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Change Stream operations:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream watchers with SQL-style syntax
CREATE CHANGE_STREAM user_activity_watcher ON user_activities
WITH (
  operations = ['insert', 'update'],
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable'
)
FILTER (
  activity_type IN ('login', 'purchase', 'subscription_change')
  OR status = 'completed'
);

-- Advanced change stream with aggregation pipeline
CREATE CHANGE_STREAM order_processing_watcher ON order_events
WITH (
  operations = ['insert'],
  full_document = 'updateLookup'
)
PIPELINE (
  FILTER (
    event_type IN ('order_created', 'payment_processed', 'order_shipped', 'order_delivered')
  ),
  ADD_FIELDS (
    order_stage = CASE 
      WHEN event_type = 'order_created' THEN 'pending'
      WHEN event_type = 'payment_processed' THEN 'confirmed'
      WHEN event_type = 'order_shipped' THEN 'in_transit'
      WHEN event_type = 'order_delivered' THEN 'completed'
      ELSE 'unknown'
    END,
    processing_priority = CASE
      WHEN event_type = 'payment_processed' THEN 1
      WHEN event_type = 'order_created' THEN 2
      ELSE 3
    END
  )
);

-- Database-level change stream monitoring
CREATE CHANGE_STREAM database_monitor ON DATABASE
WITH (
  operations = ['insert', 'update', 'delete'],
  full_document = 'updateLookup'
)
FILTER (
  -- Exclude system collections
  ns.coll NOT LIKE 'system.%'
  AND ns.coll NOT LIKE 'temp_%'
)
PIPELINE (
  ADD_FIELDS (
    event_id = CAST(_id AS VARCHAR),
    event_timestamp = cluster_time,
    database_name = ns.db,
    collection_name = ns.coll,
    event_data = CASE operation_type
      WHEN 'insert' THEN JSON_BUILD_OBJECT('operation', 'created', 'document', full_document)
      WHEN 'update' THEN JSON_BUILD_OBJECT(
        'operation', 'updated',
        'document_key', document_key,
        'updated_fields', update_description.updated_fields,
        'removed_fields', update_description.removed_fields
      )
      WHEN 'delete' THEN JSON_BUILD_OBJECT('operation', 'deleted', 'document_key', document_key)
      ELSE JSON_BUILD_OBJECT('operation', operation_type, 'document_key', document_key)
    END
  )
);

-- Event-driven reactive queries
WITH CHANGE_STREAM inventory_changes AS (
  SELECT 
    document_key._id as item_id,
    full_document.item_name,
    full_document.stock_level,
    full_document_before_change.stock_level as previous_stock_level,
    operation_type,
    cluster_time as event_time,

    -- Calculate stock change
    full_document.stock_level - COALESCE(full_document_before_change.stock_level, 0) as stock_change

  FROM CHANGE_STREAM ON inventory 
  WHERE operation_type IN ('insert', 'update')
    AND (full_document.stock_level != full_document_before_change.stock_level OR operation_type = 'insert')
),
stock_alerts AS (
  SELECT *,
    CASE 
      WHEN stock_level = 0 THEN 'OUT_OF_STOCK'
      WHEN stock_level <= 10 THEN 'LOW_STOCK' 
      WHEN stock_change > 0 AND previous_stock_level = 0 THEN 'RESTOCKED'
      ELSE 'NORMAL'
    END as alert_type,

    CASE
      WHEN stock_level = 0 THEN 'critical'
      WHEN stock_level <= 10 THEN 'warning'
      WHEN stock_change > 100 THEN 'info'
      ELSE 'normal'
    END as alert_severity

  FROM inventory_changes
)
SELECT 
  item_id,
  item_name,
  stock_level,
  previous_stock_level,
  stock_change,
  alert_type,
  alert_severity,
  event_time,

  -- Generate alert message
  CASE alert_type
    WHEN 'OUT_OF_STOCK' THEN CONCAT('Item ', item_name, ' is now out of stock')
    WHEN 'LOW_STOCK' THEN CONCAT('Item ', item_name, ' is running low (', stock_level, ' remaining)')
    WHEN 'RESTOCKED' THEN CONCAT('Item ', item_name, ' has been restocked (', stock_level, ' units)')
    ELSE CONCAT('Stock updated for ', item_name, ': ', stock_change, ' units')
  END as alert_message

FROM stock_alerts
WHERE alert_type != 'NORMAL'
ORDER BY alert_severity DESC, event_time DESC;

-- Real-time user activity aggregation
WITH CHANGE_STREAM user_events AS (
  SELECT 
    full_document.user_id,
    full_document.activity_type,
    full_document.session_id,
    full_document.timestamp,
    full_document.metadata,
    cluster_time as event_time

  FROM CHANGE_STREAM ON user_activities
  WHERE operation_type = 'insert'
    AND full_document.activity_type IN ('page_view', 'click', 'purchase', 'login')
),
session_aggregations AS (
  SELECT 
    user_id,
    session_id,
    TIME_WINDOW('5 minutes', event_time) as time_window,

    -- Activity counts
    COUNT(*) as total_activities,
    COUNT(*) FILTER (WHERE activity_type = 'page_view') as page_views,
    COUNT(*) FILTER (WHERE activity_type = 'click') as clicks, 
    COUNT(*) FILTER (WHERE activity_type = 'purchase') as purchases,

    -- Session metrics
    MIN(timestamp) as session_start,
    MAX(timestamp) as session_end,
    MAX(timestamp) - MIN(timestamp) as session_duration,

    -- Engagement scoring
    COUNT(DISTINCT metadata.page_url) as unique_pages_visited,
    AVG(EXTRACT(EPOCH FROM (LEAD(timestamp) OVER (ORDER BY timestamp) - timestamp))) as avg_time_between_activities

  FROM user_events
  GROUP BY user_id, session_id, TIME_WINDOW('5 minutes', event_time)
),
user_behavior_insights AS (
  SELECT *,
    -- Engagement level
    CASE 
      WHEN session_duration > INTERVAL '30 minutes' AND clicks > 20 THEN 'highly_engaged'
      WHEN session_duration > INTERVAL '10 minutes' AND clicks > 5 THEN 'engaged'
      WHEN session_duration > INTERVAL '2 minutes' THEN 'browsing'
      ELSE 'quick_visit'
    END as engagement_level,

    -- Conversion indicators
    purchases > 0 as converted_session,
    clicks / GREATEST(page_views, 1) as click_through_rate,

    -- Behavioral patterns
    CASE 
      WHEN unique_pages_visited > 10 THEN 'explorer'
      WHEN avg_time_between_activities > 60 THEN 'reader'
      WHEN clicks > page_views * 2 THEN 'active_clicker'
      ELSE 'standard'
    END as behavior_pattern

  FROM session_aggregations
)
SELECT 
  user_id,
  session_id,
  time_window,
  total_activities,
  page_views,
  clicks,
  purchases,
  session_duration,
  engagement_level,
  behavior_pattern,
  converted_session,
  ROUND(click_through_rate, 3) as ctr,

  -- Real-time recommendations
  CASE behavior_pattern
    WHEN 'explorer' THEN 'Show product recommendations based on browsed categories'
    WHEN 'reader' THEN 'Provide detailed product information and reviews'
    WHEN 'active_clicker' THEN 'Present clear call-to-action buttons and offers'
    ELSE 'Standard personalization approach'
  END as recommendation_strategy

FROM user_behavior_insights
WHERE engagement_level IN ('engaged', 'highly_engaged')
ORDER BY session_start DESC;

-- Event sourcing with change streams
CREATE EVENT_STORE aggregate_events AS
SELECT 
  CAST(cluster_time AS VARCHAR) as event_id,
  operation_type as event_type,
  document_key._id as aggregate_id,
  ns.coll as aggregate_type,
  COALESCE(full_document.version, 1) as event_version,
  full_document as event_data,

  -- Event metadata
  JSON_BUILD_OBJECT(
    'timestamp', cluster_time,
    'source', 'change-stream',
    'causation_id', full_document.causation_id,
    'correlation_id', full_document.correlation_id,
    'user_id', full_document.user_id
  ) as event_metadata

FROM CHANGE_STREAM ON DATABASE
WHERE operation_type IN ('insert', 'update', 'replace')
  AND ns.coll LIKE '%_aggregates'
ORDER BY cluster_time ASC;

-- CQRS read model projections
CREATE MATERIALIZED VIEW user_profile_projection AS
WITH user_events AS (
  SELECT *
  FROM aggregate_events
  WHERE aggregate_type = 'user_aggregates'
    AND event_type IN ('insert', 'update')
  ORDER BY event_version ASC
),
profile_changes AS (
  SELECT 
    aggregate_id as user_id,
    event_data.email,
    event_data.first_name,
    event_data.last_name,
    event_data.preferences,
    event_data.subscription_status,
    event_data.total_orders,
    event_data.lifetime_value,
    event_metadata.timestamp as last_updated,

    -- Calculate derived fields
    ROW_NUMBER() OVER (PARTITION BY aggregate_id ORDER BY event_version DESC) as rn

  FROM user_events
)
SELECT 
  user_id,
  email,
  CONCAT(first_name, ' ', last_name) as full_name,
  preferences,
  subscription_status,
  total_orders,
  lifetime_value,
  last_updated,

  -- User segments
  CASE 
    WHEN lifetime_value > 1000 THEN 'premium'
    WHEN total_orders > 10 THEN 'loyal'
    WHEN total_orders > 0 THEN 'customer'
    ELSE 'prospect'
  END as user_segment,

  -- Activity status
  CASE 
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'active'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'recent'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'inactive'
    ELSE 'dormant'
  END as activity_status

FROM profile_changes
WHERE rn = 1; -- Latest version only

-- Saga orchestration monitoring
WITH CHANGE_STREAM saga_events AS (
  SELECT 
    full_document.saga_id,
    full_document.saga_type,
    full_document.status,
    full_document.current_step,
    full_document.steps,
    full_document.started_at,
    full_document.completed_at,
    cluster_time as event_time,
    operation_type

  FROM CHANGE_STREAM ON sagas
  WHERE operation_type IN ('insert', 'update')
),
saga_monitoring AS (
  SELECT 
    saga_id,
    saga_type,
    status,
    current_step,
    ARRAY_LENGTH(steps, 1) as total_steps,
    started_at,
    completed_at,
    event_time,

    -- Progress calculation
    CASE 
      WHEN status = 'completed' THEN 100.0
      WHEN status = 'failed' THEN 0.0
      WHEN total_steps > 0 THEN (current_step::numeric / total_steps) * 100.0
      ELSE 0.0
    END as progress_percentage,

    -- Duration tracking
    CASE 
      WHEN completed_at IS NOT NULL THEN completed_at - started_at
      ELSE CURRENT_TIMESTAMP - started_at
    END as duration,

    -- Status classification
    CASE status
      WHEN 'completed' THEN 'success'
      WHEN 'failed' THEN 'error'
      WHEN 'compensating' THEN 'warning'
      WHEN 'started' THEN 'in_progress'
      ELSE 'unknown'
    END as status_category

  FROM saga_events
),
saga_health AS (
  SELECT 
    saga_type,
    status_category,
    COUNT(*) as saga_count,
    AVG(progress_percentage) as avg_progress,
    AVG(EXTRACT(EPOCH FROM duration)) as avg_duration_seconds,

    -- Performance metrics
    COUNT(*) FILTER (WHERE status = 'completed') as success_count,
    COUNT(*) FILTER (WHERE status = 'failed') as failure_count,
    COUNT(*) FILTER (WHERE duration > INTERVAL '5 minutes') as slow_saga_count

  FROM saga_monitoring
  WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY saga_type, status_category
)
SELECT 
  saga_type,
  status_category,
  saga_count,
  ROUND(avg_progress, 1) as avg_progress_pct,
  ROUND(avg_duration_seconds, 2) as avg_duration_sec,
  success_count,
  failure_count,
  slow_saga_count,

  -- Health indicators
  CASE 
    WHEN failure_count > success_count THEN 'unhealthy'
    WHEN slow_saga_count > saga_count * 0.5 THEN 'degraded'
    ELSE 'healthy'
  END as health_status,

  -- Success rate
  CASE 
    WHEN (success_count + failure_count) > 0 
    THEN ROUND((success_count::numeric / (success_count + failure_count)) * 100, 1)
    ELSE 0.0
  END as success_rate_pct

FROM saga_health
ORDER BY saga_type, status_category;

-- Resume token management for fault tolerance
CREATE TABLE change_stream_resume_tokens (
  stream_name VARCHAR(100) PRIMARY KEY,
  resume_token DOCUMENT NOT NULL,
  last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  stream_config DOCUMENT,

  -- Health tracking
  last_event_time TIMESTAMP,
  error_count INTEGER DEFAULT 0,
  restart_count INTEGER DEFAULT 0
);

-- Monitoring and alerting for change streams
WITH stream_health AS (
  SELECT 
    stream_name,
    resume_token,
    last_updated,
    last_event_time,
    error_count,
    restart_count,

    -- Health calculation
    CURRENT_TIMESTAMP - last_event_time as time_since_last_event,
    CURRENT_TIMESTAMP - last_updated as time_since_update,

    CASE 
      WHEN last_event_time IS NULL THEN 'never_active'
      WHEN CURRENT_TIMESTAMP - last_event_time > INTERVAL '5 minutes' THEN 'stalled'
      WHEN error_count > 5 THEN 'error_prone'
      WHEN restart_count > 3 THEN 'unstable'
      ELSE 'healthy'
    END as health_status

  FROM change_stream_resume_tokens
)
SELECT 
  stream_name,
  health_status,
  EXTRACT(EPOCH FROM time_since_last_event) as seconds_since_last_event,
  error_count,
  restart_count,

  -- Alert conditions
  CASE health_status
    WHEN 'never_active' THEN 'Stream has never processed events - check configuration'
    WHEN 'stalled' THEN 'Stream has not processed events recently - investigate connectivity'
    WHEN 'error_prone' THEN 'High error rate - review error logs and handlers'
    WHEN 'unstable' THEN 'Frequent restarts - check resource limits and stability'
    ELSE 'Stream operating normally'
  END as alert_message,

  CASE health_status
    WHEN 'never_active' THEN 'critical'
    WHEN 'stalled' THEN 'warning'  
    WHEN 'error_prone' THEN 'warning'
    WHEN 'unstable' THEN 'info'
    ELSE 'normal'
  END as alert_severity

FROM stream_health
WHERE health_status != 'healthy'
ORDER BY 
  CASE health_status
    WHEN 'never_active' THEN 1
    WHEN 'stalled' THEN 2
    WHEN 'error_prone' THEN 3
    WHEN 'unstable' THEN 4
    ELSE 5
  END;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and management syntax
-- 2. Real-time event processing with filtering and transformation
-- 3. Event-driven architecture patterns (CQRS, Event Sourcing, Sagas)
-- 4. Advanced stream processing with windowed aggregations
-- 5. Fault tolerance with resume token management
-- 6. Health monitoring and alerting for change streams
-- 7. Integration with MongoDB's native change stream optimizations
-- 8. Reactive query patterns for real-time analytics
-- 9. Multi-collection coordination and event correlation
-- 10. Familiar SQL syntax for complex event-driven applications

Best Practices for Change Stream Implementation

Event-Driven Architecture Design

Essential patterns for building robust event-driven systems:

  1. Event Schema Design: Create consistent event schemas with proper versioning and backward compatibility
  2. Resume Token Management: Implement reliable resume token persistence for fault tolerance (see the sketch after this list)
  3. Error Handling: Design comprehensive error handling with retry logic and dead letter queues
  4. Ordering Guarantees: Understand MongoDB's ordering guarantees and design accordingly
  5. Filtering Optimization: Use aggregation pipelines to filter events at the database level
  6. Resource Management: Monitor memory usage and connection limits for change streams

Performance and Scalability

Optimize change streams for high-performance event processing:

  1. Connection Pooling: Use appropriate connection pooling for change stream connections
  2. Batch Processing: Process events in batches where possible to improve throughput (see the sketch after this list)
  3. Parallel Processing: Design for parallel event processing while maintaining ordering
  4. Resource Limits: Set appropriate limits on change stream cursors and connections
  5. Monitoring: Implement comprehensive monitoring for stream health and performance
  6. Graceful Degradation: Design fallback mechanisms for change stream failures
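
A rough sketch of item 2: buffer incoming change events and flush them periodically with a single bulk insert. The 'events' and 'event_projections' collection names, the 500-document batch size, and the 1-second flush interval are illustrative assumptions.

// Sketch only: batched processing of change events
const { MongoClient } = require('mongodb');

async function startBatchedProjection(uri) {
  const client = new MongoClient(uri);
  const db = client.db('app');
  const stream = db.collection('events').watch([], { batchSize: 500 });

  const buffer = [];
  stream.on('change', (change) => buffer.push(change));

  // Flush once per second; one insertMany per batch keeps write overhead low
  setInterval(async () => {
    if (buffer.length === 0) return;
    const batch = buffer.splice(0, buffer.length);
    await db.collection('event_projections').insertMany(
      batch.map((c) => ({
        op: c.operationType,
        key: c.documentKey,
        receivedAt: new Date()
      }))
    );
  }, 1000);
}

Insertion order within each batch preserves the stream's ordering; if strict cross-batch ordering matters downstream, flush and acknowledge before accepting further events.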

Conclusion

MongoDB Change Streams provide native event-driven architecture capabilities that eliminate the complexity and limitations of traditional polling and trigger-based approaches. The ability to react to data changes in real-time with ordered, resumable event streams makes building responsive, scalable applications both powerful and elegant.

Key Change Streams benefits include:

  • Real-Time Reactivity: Instant response to data changes without polling overhead
  • Ordered Event Processing: Guaranteed ordering within shards with resume token support
  • Scalable Architecture: Works seamlessly across replica sets and sharded clusters
  • Rich Filtering: Aggregation pipeline support for sophisticated event filtering and transformation
  • Fault Tolerance: Built-in resume capabilities and error handling for production reliability
  • Ecosystem Integration: Native integration with MongoDB's ACID transactions and tooling

Whether you're building microservices architectures, real-time dashboards, event sourcing systems, or any application requiring immediate response to data changes, MongoDB Change Streams with QueryLeaf's familiar SQL interface provide the foundation for modern event-driven applications.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB Change Streams while providing SQL-familiar event processing syntax, change detection patterns, and reactive query capabilities. Advanced event-driven architecture patterns including CQRS, Event Sourcing, and Sagas are elegantly handled through familiar SQL constructs, making sophisticated reactive applications both powerful and accessible to SQL-oriented development teams.

The combination of native change stream capabilities with SQL-style event processing makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven solutions remain both effective and maintainable as they evolve and scale.

MongoDB Capped Collections and Circular Buffers: High-Performance Logging and Event Storage with SQL-Style Data Management

High-performance applications generate massive volumes of log data, events, and operational metrics that require specialized storage patterns optimized for write-heavy workloads, automatic size management, and chronological data access. Traditional database approaches for logging and event storage struggle with write performance bottlenecks, complex rotation mechanisms, and inefficient space utilization when dealing with continuous data streams.

MongoDB Capped Collections provide purpose-built capabilities for circular buffer patterns, offering fixed-size collections with automatic document rotation, natural insertion-order preservation, and optimized write performance. Unlike traditional logging solutions that require complex partitioning schemes or external rotation tools, capped collections automatically manage storage limits while maintaining chronological access patterns essential for debugging, monitoring, and real-time analytics.
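
As a minimal sketch of the pattern, the Node.js snippet below creates a capped collection and tails it with a tailable cursor; the 'logging' database, 'app_log' collection, 100 MB size, and 500,000-document cap are illustrative assumptions.

// Minimal sketch: a capped collection used as a circular log buffer
const { MongoClient } = require('mongodb');

async function tailApplicationLog(uri) {
  const client = new MongoClient(uri);
  const db = client.db('logging');

  // Create the fixed-size collection once; oldest documents are evicted automatically
  await db.createCollection('app_log', {
    capped: true,
    size: 100 * 1024 * 1024, // maximum size in bytes
    max: 500000              // optional cap on document count
  });

  // A tailable cursor follows new inserts in natural (insertion) order
  const cursor = db.collection('app_log').find({}, { tailable: true, awaitData: true });
  for await (const entry of cursor) {
    console.log(entry.level, entry.message);
  }
}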

The Traditional Logging Storage Challenge

Conventional approaches to high-volume logging and event storage have significant limitations for modern applications:

-- Traditional relational logging approach - complex and performance-limited

-- PostgreSQL log storage with manual partitioning and rotation
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    service_name VARCHAR(100) NOT NULL,
    instance_id VARCHAR(100),
    log_level VARCHAR(20) NOT NULL,
    message TEXT NOT NULL,

    -- Structured log data
    request_id VARCHAR(100),
    user_id BIGINT,
    session_id VARCHAR(100),
    trace_id VARCHAR(100),
    span_id VARCHAR(100),

    -- Context information  
    source_file VARCHAR(255),
    source_line INTEGER,
    function_name VARCHAR(255),
    thread_id INTEGER,

    -- Metadata
    hostname VARCHAR(255),
    environment VARCHAR(50),
    version VARCHAR(50),

    -- Log data
    log_data JSONB,
    error_stack TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP

    -- Partition directly on the timestamp column
) PARTITION BY RANGE (created_at);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE application_logs_2024_01 PARTITION OF application_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE application_logs_2024_02 PARTITION OF application_logs  
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE application_logs_2024_03 PARTITION OF application_logs
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... manual partition creation continues

-- Indexes for log queries (high overhead on writes)
CREATE INDEX idx_logs_app_service_time ON application_logs (application_name, service_name, created_at);
CREATE INDEX idx_logs_level_time ON application_logs (log_level, created_at);
CREATE INDEX idx_logs_request_id ON application_logs (request_id) WHERE request_id IS NOT NULL;
CREATE INDEX idx_logs_user_id_time ON application_logs (user_id, created_at) WHERE user_id IS NOT NULL;
CREATE INDEX idx_logs_trace_id ON application_logs (trace_id) WHERE trace_id IS NOT NULL;

-- Complex log rotation and cleanup procedure
CREATE OR REPLACE FUNCTION cleanup_old_log_partitions()
RETURNS void AS $$
DECLARE
    partition_name TEXT;
    cutoff_date DATE;
BEGIN
    -- Calculate cutoff date (e.g., 90 days retention)
    cutoff_date := CURRENT_DATE - INTERVAL '90 days';

    -- Find and drop old partitions
    FOR partition_name IN 
        SELECT schemaname||'.'||tablename 
        FROM pg_tables 
        WHERE tablename ~ '^application_logs_\d{4}_\d{2}$'
        AND tablename < 'application_logs_' || to_char(cutoff_date, 'YYYY_MM')
    LOOP
        EXECUTE 'DROP TABLE IF EXISTS ' || partition_name || ' CASCADE';
        RAISE NOTICE 'Dropped old partition: %', partition_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Schedule cleanup job (requires external scheduler)
-- SELECT cron.schedule('cleanup-logs', '0 2 * * 0', 'SELECT cleanup_old_log_partitions();');

-- Complex log analysis query with performance issues
WITH recent_logs AS (
    SELECT 
        application_name,
        service_name,
        log_level,
        message,
        request_id,
        user_id,
        trace_id,
        log_data,
        created_at,

        -- Row number for chronological ordering
        ROW_NUMBER() OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at DESC
        ) as rn,

        -- Lag for time between log entries
        LAG(created_at) OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at
        ) as prev_log_time

    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
      AND log_level IN ('ERROR', 'WARN', 'INFO')
),
error_analysis AS (
    SELECT 
        application_name,
        service_name,
        COUNT(*) as total_logs,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as error_count,
        COUNT(*) FILTER (WHERE log_level = 'WARN') as warning_count,
        COUNT(*) FILTER (WHERE log_level = 'INFO') as info_count,

        -- Error patterns
        array_agg(DISTINCT message) FILTER (WHERE log_level = 'ERROR') as error_messages,
        COUNT(DISTINCT request_id) as unique_requests,
        COUNT(DISTINCT user_id) as affected_users,

        -- Timing analysis
        AVG(EXTRACT(EPOCH FROM (created_at - prev_log_time))) as avg_log_interval,

        -- Recent errors for immediate attention
        array_agg(
            json_build_object(
                'message', message,
                'created_at', created_at,
                'trace_id', trace_id,
                'request_id', request_id
            ) ORDER BY created_at DESC
        ) FILTER (WHERE log_level = 'ERROR' AND rn <= 10) as recent_errors

    FROM recent_logs
    GROUP BY application_name, service_name
),
log_volume_trends AS (
    SELECT 
        application_name,
        service_name,
        DATE_TRUNC('minute', created_at) as minute_bucket,
        COUNT(*) as logs_per_minute,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as errors_per_minute
    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
    GROUP BY application_name, service_name, DATE_TRUNC('minute', created_at)
)
SELECT 
    ea.application_name,
    ea.service_name,
    ea.total_logs,
    ea.error_count,
    ea.warning_count,
    ea.info_count,
    ROUND((ea.error_count::numeric / ea.total_logs) * 100, 2) as error_rate_percent,
    ea.unique_requests,
    ea.affected_users,
    ROUND(ea.avg_log_interval::numeric, 3) as avg_seconds_between_logs,

    -- Volume trend analysis
    (
        SELECT AVG(logs_per_minute)
        FROM log_volume_trends lvt 
        WHERE lvt.application_name = ea.application_name 
          AND lvt.service_name = ea.service_name
    ) as avg_logs_per_minute,

    (
        SELECT MAX(logs_per_minute)
        FROM log_volume_trends lvt
        WHERE lvt.application_name = ea.application_name
          AND lvt.service_name = ea.service_name  
    ) as peak_logs_per_minute,

    -- Top error messages
    (
        SELECT string_agg(error_msg, '; ')
        FROM (
            SELECT unnest(ea.error_messages) AS error_msg
            LIMIT 3
        ) top_errors
    ) as top_error_messages,

    ea.recent_errors

FROM error_analysis ea
ORDER BY ea.error_count DESC, ea.total_logs DESC;

-- Problems with traditional logging approach:
-- 1. Complex partition management and maintenance overhead
-- 2. Write performance degradation with increasing indexes
-- 3. Manual log rotation and cleanup procedures
-- 4. Storage space management challenges
-- 5. Query performance issues across multiple partitions
-- 6. Complex chronological ordering requirements
-- 7. High operational overhead for high-volume logging
-- 8. Scalability limitations with increasing log volumes
-- 9. Backup and restore complexity with partitioned tables
-- 10. Limited flexibility for varying log data structures

-- MySQL logging limitations (even more restrictive)
CREATE TABLE mysql_logs (
    id BIGINT AUTO_INCREMENT,
    app_name VARCHAR(100),
    level VARCHAR(20),
    message TEXT,
    log_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- MySQL requires the partitioning column in every unique key
    PRIMARY KEY (id, created_at),
    INDEX idx_time_level (created_at, level),
    INDEX idx_app_time (app_name, created_at)
) 
-- Basic range partitioning (limited functionality)
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p2024_q1 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION p2024_q2 VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01')),
    PARTITION p2024_q3 VALUES LESS THAN (UNIX_TIMESTAMP('2024-10-01')),
    PARTITION p2024_q4 VALUES LESS THAN (UNIX_TIMESTAMP('2025-01-01'))
);

-- Basic log query in MySQL (limited analytical capabilities)
SELECT 
    app_name,
    level,
    COUNT(*) as log_count,
    MAX(created_at) as latest_log
FROM mysql_logs
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
  AND level IN ('ERROR', 'WARN')
GROUP BY app_name, level
ORDER BY log_count DESC
LIMIT 20;

-- MySQL limitations:
-- - Limited JSON functionality compared to PostgreSQL
-- - Basic partitioning capabilities only  
-- - Poor performance with high-volume inserts
-- - Limited analytical query capabilities
-- - Window functions only available in MySQL 8.0+
-- - Complex maintenance procedures
-- - Storage engine limitations for write-heavy workloads

MongoDB Capped Collections provide optimized circular buffer capabilities:

// MongoDB Capped Collections - purpose-built for high-performance logging
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
// Call await client.connect() before issuing operations (recent drivers also connect lazily on first use)
const db = client.db('logging_platform');

// Create capped collections for different log types and performance requirements
const createOptimizedCappedCollections = async () => {
  try {
    // High-volume application logs - 1GB circular buffer
    await db.createCollection('application_logs', {
      capped: true,
      size: 1024 * 1024 * 1024, // 1GB maximum size
      max: 10000000 // Maximum 10 million documents (optional limit)
    });

    // Error logs - smaller, longer retention
    await db.createCollection('error_logs', {
      capped: true,
      size: 256 * 1024 * 1024, // 256MB maximum size
      max: 1000000 // Maximum 1 million error documents
    });

    // Access logs - high throughput, shorter retention
    await db.createCollection('access_logs', {
      capped: true,
      size: 2 * 1024 * 1024 * 1024, // 2GB maximum size
      // No max document limit for maximum throughput
    });

    // Performance metrics - structured time-series data
    await db.createCollection('performance_metrics', {
      capped: true,
      size: 512 * 1024 * 1024, // 512MB maximum size
      max: 5000000 // Maximum 5 million metric points
    });

    // Audit trail - compliance and security logs
    await db.createCollection('audit_logs', {
      capped: true,
      size: 128 * 1024 * 1024, // 128MB maximum size
      max: 500000 // Maximum 500k audit events
    });

    console.log('Capped collections created successfully');

    // Create indexes for common query patterns (minimal overhead)
    await createOptimalIndexes();

    return {
      applicationLogs: db.collection('application_logs'),
      errorLogs: db.collection('error_logs'),
      accessLogs: db.collection('access_logs'),
      performanceMetrics: db.collection('performance_metrics'),
      auditLogs: db.collection('audit_logs')
    };

  } catch (error) {
    console.error('Error creating capped collections:', error);
    throw error;
  }
};

async function createOptimalIndexes() {
  // Minimal indexes for capped collections to maintain write performance
  // Note: Capped collections maintain insertion order automatically

  // Application logs - service and level queries
  await db.collection('application_logs').createIndex({ 
    'service': 1, 
    'level': 1 
  });

  // Error logs - application and timestamp queries
  await db.collection('error_logs').createIndex({ 
    'application': 1, 
    'timestamp': -1 
  });

  // Access logs - endpoint performance analysis
  await db.collection('access_logs').createIndex({ 
    'endpoint': 1, 
    'status_code': 1 
  });

  // Performance metrics - metric type and timestamp
  await db.collection('performance_metrics').createIndex({ 
    'metric_type': 1, 
    'instance_id': 1 
  });

  // Audit logs - user and action queries
  await db.collection('audit_logs').createIndex({ 
    'user_id': 1, 
    'action': 1 
  });

  console.log('Optimal indexes created for capped collections');
}

// High-performance log ingestion with batch processing
const logIngestionSystem = {
  collections: null,
  buffers: new Map(),
  batchSizes: {
    application_logs: 1000,
    error_logs: 100,
    access_logs: 2000,
    performance_metrics: 500,
    audit_logs: 50
  },
  flushIntervals: new Map(),

  async initialize() {
    this.collections = await createOptimizedCappedCollections();

    // Start batch flush timers for each collection
    for (const [collectionName, batchSize] of Object.entries(this.batchSizes)) {
      this.buffers.set(collectionName, []);

      // Flush timer based on expected volume
      const flushInterval = collectionName === 'access_logs' ? 1000 : // 1 second
                           collectionName === 'application_logs' ? 2000 : // 2 seconds
                           5000; // 5 seconds for others

      const intervalId = setInterval(
        () => this.flushBuffer(collectionName), 
        flushInterval
      );

      this.flushIntervals.set(collectionName, intervalId);
    }

    console.log('Log ingestion system initialized');
  },

  async logApplicationEvent(logEntry) {
    // Structured application log entry
    const document = {
      timestamp: new Date(),
      application: logEntry.application || 'unknown',
      service: logEntry.service || 'unknown',
      instance: logEntry.instance || process.env.HOSTNAME || 'unknown',
      level: logEntry.level || 'INFO',
      message: logEntry.message,

      // Request context
      request: {
        id: logEntry.requestId,
        method: logEntry.method,
        endpoint: logEntry.endpoint,
        user_id: logEntry.userId,
        session_id: logEntry.sessionId,
        ip_address: logEntry.ipAddress
      },

      // Trace context
      trace: {
        trace_id: logEntry.traceId,
        span_id: logEntry.spanId,
        parent_span_id: logEntry.parentSpanId,
        flags: logEntry.traceFlags
      },

      // Source information
      source: {
        file: logEntry.sourceFile,
        line: logEntry.sourceLine,
        function: logEntry.functionName,
        thread: logEntry.threadId
      },

      // Environment context
      environment: {
        name: logEntry.environment || process.env.NODE_ENV || 'development',
        version: logEntry.version || process.env.APP_VERSION || '1.0.0',
        build: logEntry.build || process.env.BUILD_ID,
        commit: logEntry.commit || process.env.GIT_COMMIT
      },

      // Structured data
      data: logEntry.data || {},

      // Performance metrics
      metrics: {
        duration_ms: logEntry.duration,
        memory_mb: logEntry.memoryUsage,
        cpu_percent: logEntry.cpuUsage
      },

      // Error context (if applicable)
      error: logEntry.error ? {
        name: logEntry.error.name,
        message: logEntry.error.message,
        stack: logEntry.error.stack,
        code: logEntry.error.code,
        details: logEntry.error.details
      } : null
    };

    await this.bufferDocument('application_logs', document);
  },

  async logAccessEvent(accessEntry) {
    // HTTP access log optimized for high throughput
    const document = {
      timestamp: new Date(),

      // Request details
      method: accessEntry.method,
      endpoint: accessEntry.endpoint,
      path: accessEntry.path,
      query_string: accessEntry.queryString,

      // Response details
      status_code: accessEntry.statusCode,
      response_size: accessEntry.responseSize,
      content_type: accessEntry.contentType,

      // Timing information
      duration_ms: accessEntry.duration,
      queue_time_ms: accessEntry.queueTime,
      process_time_ms: accessEntry.processTime,

      // Client information
      client: {
        ip: accessEntry.clientIp,
        user_agent: accessEntry.userAgent,
        referer: accessEntry.referer,
        user_id: accessEntry.userId,
        session_id: accessEntry.sessionId
      },

      // Geographic data (if available)
      geo: accessEntry.geo ? {
        country: accessEntry.geo.country,
        region: accessEntry.geo.region,
        city: accessEntry.geo.city,
        coordinates: accessEntry.geo.coordinates
      } : null,

      // Application context
      application: accessEntry.application,
      service: accessEntry.service,
      instance: accessEntry.instance || process.env.HOSTNAME,
      version: accessEntry.version,

      // Cache information
      cache: {
        hit: accessEntry.cacheHit,
        key: accessEntry.cacheKey,
        ttl: accessEntry.cacheTTL
      },

      // Load balancing and routing
      routing: {
        backend: accessEntry.backend,
        upstream_time: accessEntry.upstreamTime,
        retry_count: accessEntry.retryCount
      }
    };

    await this.bufferDocument('access_logs', document);
  },

  async logPerformanceMetric(metricEntry) {
    // System and application performance metrics
    const document = {
      timestamp: new Date(),

      metric_type: metricEntry.type, // 'cpu', 'memory', 'disk', 'network', 'application'
      metric_name: metricEntry.name,
      value: metricEntry.value,
      unit: metricEntry.unit,

      // Instance information
      instance_id: metricEntry.instanceId || process.env.HOSTNAME,
      application: metricEntry.application,
      service: metricEntry.service,

      // Dimensional metadata
      dimensions: metricEntry.dimensions || {},

      // Aggregation information
      aggregation: {
        type: metricEntry.aggregationType, // 'gauge', 'counter', 'histogram', 'summary'
        interval_seconds: metricEntry.intervalSeconds,
        sample_count: metricEntry.sampleCount
      },

      // Statistical data (for histograms/summaries)
      statistics: metricEntry.statistics ? {
        min: metricEntry.statistics.min,
        max: metricEntry.statistics.max,
        mean: metricEntry.statistics.mean,
        median: metricEntry.statistics.median,
        p95: metricEntry.statistics.p95,
        p99: metricEntry.statistics.p99,
        std_dev: metricEntry.statistics.stdDev
      } : null,

      // Alerts and thresholds
      alerts: {
        warning_threshold: metricEntry.warningThreshold,
        critical_threshold: metricEntry.criticalThreshold,
        is_anomaly: metricEntry.isAnomaly,
        anomaly_score: metricEntry.anomalyScore
      }
    };

    await this.bufferDocument('performance_metrics', document);
  },

  async logAuditEvent(auditEntry) {
    // Security and compliance audit logging
    const document = {
      timestamp: new Date(),

      // Event classification
      event_type: auditEntry.eventType, // 'authentication', 'authorization', 'data_access', 'configuration'
      event_category: auditEntry.category, // 'security', 'compliance', 'operational'
      severity: auditEntry.severity || 'INFO',

      // Actor information
      actor: {
        user_id: auditEntry.userId,
        username: auditEntry.username,
        email: auditEntry.email,
        roles: auditEntry.roles || [],
        groups: auditEntry.groups || [],
        is_service_account: auditEntry.isServiceAccount || false,
        authentication_method: auditEntry.authMethod
      },

      // Target resource
      target: {
        resource_type: auditEntry.resourceType,
        resource_id: auditEntry.resourceId,
        resource_name: auditEntry.resourceName,
        owner: auditEntry.resourceOwner,
        classification: auditEntry.dataClassification
      },

      // Action details
      action: {
        type: auditEntry.action, // 'create', 'read', 'update', 'delete', 'login', 'logout'
        description: auditEntry.description,
        result: auditEntry.result, // 'success', 'failure', 'partial'
        reason: auditEntry.reason
      },

      // Request context
      request: {
        id: auditEntry.requestId,
        source_ip: auditEntry.sourceIp,
        user_agent: auditEntry.userAgent,
        session_id: auditEntry.sessionId,
        api_key: auditEntry.apiKey ? 'REDACTED' : null
      },

      // Data changes (for modification events)
      changes: auditEntry.changes ? {
        before: auditEntry.changes.before,
        after: auditEntry.changes.after,
        fields_changed: auditEntry.changes.fieldsChanged || []
      } : null,

      // Compliance and regulatory
      compliance: {
        regulation: auditEntry.regulation, // 'GDPR', 'SOX', 'HIPAA', 'PCI-DSS'
        retention_period: auditEntry.retentionPeriod,
        encryption_required: auditEntry.encryptionRequired || false
      },

      // Application context
      application: auditEntry.application,
      service: auditEntry.service,
      environment: auditEntry.environment
    };

    await this.bufferDocument('audit_logs', document);
  },

  async bufferDocument(collectionName, document) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer) {
      console.error(`Unknown collection: ${collectionName}`);
      return;
    }

    buffer.push(document);

    // Flush buffer if it reaches batch size
    if (buffer.length >= this.batchSizes[collectionName]) {
      await this.flushBuffer(collectionName);
    }
  },

  async flushBuffer(collectionName) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer || buffer.length === 0) {
      return;
    }

    // Move buffer contents to local array and clear buffer
    const documents = buffer.splice(0);

    try {
      const collection = this.collections[this.getCollectionProperty(collectionName)];
      if (!collection) {
        console.error(`Collection not found: ${collectionName}`);
        return;
      }

      // High-performance batch insert
      const result = await collection.insertMany(documents, {
        ordered: false, // Allow parallel inserts
        writeConcern: { w: 1, j: false } // Optimize for speed
      });

      if (result.insertedCount !== documents.length) {
        console.warn(`Partial insert: ${result.insertedCount}/${documents.length} documents inserted to ${collectionName}`);
      }

    } catch (error) {
      console.error(`Error flushing buffer for ${collectionName}:`, error);

      // Re-add documents to buffer for retry (optional)
      if (error.code !== 11000) { // Not a duplicate key error
        buffer.unshift(...documents);
      }
    }
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs',
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  },

  async shutdown() {
    console.log('Shutting down log ingestion system...');

    // Clear all flush intervals
    for (const intervalId of this.flushIntervals.values()) {
      clearInterval(intervalId);
    }

    // Flush all remaining buffers
    const flushPromises = [];
    for (const collectionName of this.buffers.keys()) {
      flushPromises.push(this.flushBuffer(collectionName));
    }

    await Promise.all(flushPromises);

    console.log('Log ingestion system shutdown complete');
  }
};

// Advanced log analysis and monitoring
const logAnalysisEngine = {
  collections: null,

  async initialize(collections) {
    this.collections = collections;
  },

  async analyzeRecentErrors(timeRangeMinutes = 60) {
    console.log(`Analyzing errors from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const errorAnalysis = await this.collections.applicationLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime },
          level: { $in: ['ERROR', 'FATAL'] }
        }
      },

      // Sort chronologically so the pushed error samples are in time order
      { $sort: { timestamp: 1 } },

      // Group by error patterns
      {
        $group: {
          _id: {
            application: '$application',
            service: '$service',
            errorMessage: {
              $substr: ['$message', 0, 100] // Truncate for grouping
            }
          },

          count: { $sum: 1 },
          firstOccurrence: { $min: '$timestamp' },
          lastOccurrence: { $max: '$timestamp' },
          affectedInstances: { $addToSet: '$instance' },
          affectedUsers: { $addToSet: '$request.user_id' },

          // Sample error details
          sampleErrors: {
            $push: {
              timestamp: '$timestamp',
              message: '$message',
              request_id: '$request.id',
              trace_id: '$trace.trace_id',
              stack: '$error.stack'
            }
          }
        }
      },

      // Calculate error characteristics
      {
        $addFields: {
          duration: {
            $divide: [
              { $subtract: ['$lastOccurrence', '$firstOccurrence'] },
              1000 // Convert to seconds
            ]
          },
          errorRate: {
            $divide: ['$count', timeRangeMinutes] // Errors per minute
          },
          instanceCount: { $size: '$affectedInstances' },
          userCount: { $size: '$affectedUsers' },

          // Take only recent sample errors
          recentSamples: { $slice: ['$sampleErrors', -5] }
        }
      },

      // Sort by error frequency and recency
      {
        $sort: {
          count: -1,
          lastOccurrence: -1
        }
      },

      {
        $limit: 50 // Top 50 error patterns
      },

      // Format for analysis output
      {
        $project: {
          application: '$_id.application',
          service: '$_id.service',
          errorPattern: '$_id.errorMessage',
          count: 1,
          errorRate: { $round: ['$errorRate', 2] },
          duration: { $round: ['$duration', 1] },
          firstOccurrence: 1,
          lastOccurrence: 1,
          instanceCount: 1,
          userCount: 1,
          affectedInstances: 1,
          recentSamples: 1,

          // Severity assessment
          severity: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$errorRate', 10] }, // > 10 errors/minute
                  then: 'CRITICAL'
                },
                {
                  case: { $gt: ['$errorRate', 5] }, // > 5 errors/minute
                  then: 'HIGH'
                },
                {
                  case: { $gt: ['$errorRate', 1] }, // > 1 error/minute
                  then: 'MEDIUM'
                }
              ],
              default: 'LOW'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Found ${errorAnalysis.length} error patterns`);
    return errorAnalysis;
  },

  async analyzeAccessPatterns(timeRangeMinutes = 30) {
    console.log(`Analyzing access patterns from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const accessAnalysis = await this.collections.accessLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Group by endpoint and status
      {
        $group: {
          _id: {
            endpoint: '$endpoint',
            method: '$method',
            statusClass: {
              $switch: {
                branches: [
                  { case: { $lt: ['$status_code', 300] }, then: '2xx' },
                  { case: { $lt: ['$status_code', 400] }, then: '3xx' },
                  { case: { $lt: ['$status_code', 500] }, then: '4xx' },
                  { case: { $gte: ['$status_code', 500] }, then: '5xx' }
                ],
                default: 'unknown'
              }
            }
          },

          requestCount: { $sum: 1 },
          avgDuration: { $avg: '$duration_ms' },
          minDuration: { $min: '$duration_ms' },
          maxDuration: { $max: '$duration_ms' },

          // Percentile approximations
          durations: { $push: '$duration_ms' },

          totalResponseSize: { $sum: '$response_size' },
          uniqueClients: { $addToSet: '$client.ip' },
          uniqueUsers: { $addToSet: '$client.user_id' },

          // Error details for non-2xx responses
          errorSamples: {
            $push: {
              $cond: [
                { $gte: ['$status_code', 400] },
                {
                  timestamp: '$timestamp',
                  status: '$status_code',
                  client_ip: '$client.ip',
                  user_id: '$client.user_id',
                  duration: '$duration_ms'
                },
                null
              ]
            }
          }
        }
      },

      // Calculate additional metrics
      {
        $addFields: {
          requestsPerMinute: { $divide: ['$requestCount', timeRangeMinutes] },
          avgResponseSize: { $divide: ['$totalResponseSize', '$requestCount'] },
          uniqueClientCount: { $size: '$uniqueClients' },
          uniqueUserCount: { $size: '$uniqueUsers' },

          // Filter out null error samples
          errorSamples: {
            $filter: {
              input: '$errorSamples',
              cond: { $ne: ['$$this', null] }
            }
          },

          // Approximate percentiles (simplified)
          p95Duration: {
            $let: {
              vars: {
                sortedDurations: {
                  $sortArray: {
                    input: '$durations',
                    sortBy: 1
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  '$$sortedDurations',
                  { $floor: { $multiply: [{ $size: '$$sortedDurations' }, 0.95] } }
                ]
              }
            }
          }
        }
      },

      // Sort by request volume
      {
        $sort: {
          requestCount: -1
        }
      },

      {
        $limit: 100 // Top 100 endpoints
      },

      // Format output
      {
        $project: {
          endpoint: '$_id.endpoint',
          method: '$_id.method',
          statusClass: '$_id.statusClass',
          requestCount: 1,
          requestsPerMinute: { $round: ['$requestsPerMinute', 2] },
          avgDuration: { $round: ['$avgDuration', 1] },
          minDuration: 1,
          maxDuration: 1,
          p95Duration: { $round: ['$p95Duration', 1] },
          avgResponseSize: { $round: ['$avgResponseSize', 0] },
          uniqueClientCount: 1,
          uniqueUserCount: 1,
          errorSamples: { $slice: ['$errorSamples', 5] }, // Recent 5 errors

          // Performance assessment
          performanceStatus: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$avgDuration', 5000] }, // > 5 seconds
                  then: 'SLOW'
                },
                {
                  case: { $gt: ['$avgDuration', 2000] }, // > 2 seconds
                  then: 'WARNING'
                }
              ],
              default: 'NORMAL'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Analyzed ${accessAnalysis.length} endpoint patterns`);
    return accessAnalysis;
  },

  async generatePerformanceReport(timeRangeMinutes = 60) {
    console.log(`Generating performance report for last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const performanceReport = await this.collections.performanceMetrics.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Sort chronologically so $last and the time series reflect insertion order
      { $sort: { timestamp: 1 } },

      // Group by metric type and instance
      {
        $group: {
          _id: {
            metricType: '$metric_type',
            metricName: '$metric_name',
            instanceId: '$instance_id'
          },

          sampleCount: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          latestValue: { $last: '$value' },

          // Time series data for trending
          timeSeries: {
            $push: {
              timestamp: '$timestamp',
              value: '$value'
            }
          },

          // Alert information
          alertCount: {
            $sum: {
              $cond: [
                {
                  $or: [
                    { $gte: ['$value', '$alerts.critical_threshold'] },
                    { $gte: ['$value', '$alerts.warning_threshold'] }
                  ]
                },
                1,
                0
              ]
            }
          }
        }
      },

      // Calculate trend and status
      {
        $addFields: {
          // Simple trend calculation (comparing first and last values)
          trend: {
            $let: {
              vars: {
                firstValue: { $arrayElemAt: ['$timeSeries', 0] },
                lastValue: { $arrayElemAt: ['$timeSeries', -1] }
              },
              in: {
                $cond: [
                  { $gt: ['$$lastValue.value', '$$firstValue.value'] },
                  'INCREASING',
                  {
                    $cond: [
                      { $lt: ['$$lastValue.value', '$$firstValue.value'] },
                      'DECREASING',
                      'STABLE'
                    ]
                  }
                ]
              }
            }
          },

          // Alert status
          alertStatus: {
            $cond: [
              { $gt: ['$alertCount', 0] },
              'ALERTS_TRIGGERED',
              'NORMAL'
            ]
          }
        }
      },

      // Group by metric type for summary
      {
        $group: {
          _id: '$_id.metricType',

          metrics: {
            $push: {
              name: '$_id.metricName',
              instance: '$_id.instanceId',
              sampleCount: '$sampleCount',
              avgValue: '$avgValue',
              minValue: '$minValue',
              maxValue: '$maxValue',
              latestValue: '$latestValue',
              trend: '$trend',
              alertStatus: '$alertStatus',
              alertCount: '$alertCount'
            }
          },

          totalSamples: { $sum: '$sampleCount' },
          instanceCount: { $addToSet: '$_id.instanceId' },
          totalAlerts: { $sum: '$alertCount' }
        }
      },

      {
        $addFields: {
          instanceCount: { $size: '$instanceCount' }
        }
      },

      {
        $sort: { _id: 1 }
      }
    ]).toArray();

    console.log(`Performance report generated for ${performanceReport.length} metric types`);
    return performanceReport;
  },

  async getTailLogs(collectionName, limit = 100) {
    // Get most recent logs (natural order in capped collections)
    const collection = this.collections[this.getCollectionProperty(collectionName)];
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Capped collections maintain insertion order, so we can use natural order
    const logs = await collection.find()
      .sort({ $natural: -1 }) // Reverse natural order (most recent first)
      .limit(limit)
      .toArray();

    return logs.reverse(); // Return in chronological order (oldest first)
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs', 
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  }
};

// Benefits of MongoDB Capped Collections:
// - Automatic size management with guaranteed space limits
// - Natural insertion order preservation without indexes
// - Optimized write performance for high-throughput logging
// - Circular buffer behavior with automatic old document removal
// - No fragmentation or maintenance overhead
// - Tailable cursors for real-time log streaming
// - Atomic document rotation without application logic
// - Consistent performance regardless of collection size
// - Integration with MongoDB ecosystem and tools
// - Built-in clustering and replication support

module.exports = {
  createOptimizedCappedCollections,
  logIngestionSystem,
  logAnalysisEngine
};
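
To tie these exports together, the following usage sketch shows one way to wire them up; the './capped-logging' module path, the sample log fields, and the SIGTERM handling are assumptions for illustration rather than part of the module above.

// Example wiring of the exported pieces (illustrative module path and values)
const { logIngestionSystem, logAnalysisEngine } = require('./capped-logging');

async function main() {
  await logIngestionSystem.initialize();
  await logAnalysisEngine.initialize(logIngestionSystem.collections);

  // Buffered, batched writes into the capped collections
  await logIngestionSystem.logApplicationEvent({
    application: 'billing',
    service: 'invoice-worker',
    level: 'ERROR',
    message: 'Failed to render invoice PDF',
    requestId: 'req_123',
    error: new Error('Template not found')
  });

  // Periodic analysis over the capped data
  const errorPatterns = await logAnalysisEngine.analyzeRecentErrors(30);
  console.log(`error patterns in the last 30 minutes: ${errorPatterns.length}`);

  // Flush buffers and stop timers on shutdown
  process.on('SIGTERM', () => logIngestionSystem.shutdown());
}

main().catch(console.error);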

Understanding MongoDB Capped Collections Architecture
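
Before layering management logic on top, it helps to see how a capped collection reports its own configuration and utilization. The sketch below is a minimal example (the connection string, database, and collection names are assumptions) that reads the creation options through listCollections and the live statistics through the collStats command.

// Inspect a capped collection's configuration and current utilization
const { MongoClient } = require('mongodb');

async function inspectCappedCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('logging_platform');

  // listCollections exposes the creation options, including capped, size, and max
  const [info] = await db.listCollections({ name: 'application_logs' }).toArray();
  console.log('creation options:', info.options);

  // collStats reports the current size, document count, and configured maximum
  const stats = await db.command({ collStats: 'application_logs' });
  console.log({
    capped: stats.capped,
    count: stats.count,
    size: stats.size,
    maxSize: stats.maxSize,
    utilizationPercent: ((stats.size / stats.maxSize) * 100).toFixed(1)
  });

  await client.close();
}

inspectCappedCollection().catch(console.error);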

Advanced Capped Collection Management and Patterns

Implement sophisticated capped collection strategies for different logging scenarios:

// Advanced capped collection management system
class CappedCollectionManager {
  constructor(db, options = {}) {
    this.db = db;
    this.options = {
      // Default configurations
      defaultSize: 100 * 1024 * 1024, // 100MB
      retentionPeriods: {
        application_logs: 7 * 24 * 60 * 60 * 1000, // 7 days
        error_logs: 30 * 24 * 60 * 60 * 1000, // 30 days  
        access_logs: 24 * 60 * 60 * 1000, // 24 hours
        audit_logs: 365 * 24 * 60 * 60 * 1000 // 1 year
      },
      ...options
    };

    this.collections = new Map();
    this.tails = new Map();
    this.statistics = new Map();
  }

  async createCappedCollectionHierarchy() {
    // Create hierarchical capped collections for different log levels and retention

    // Critical logs - smallest size, longest retention
    await this.createTieredCollection('critical_logs', {
      size: 50 * 1024 * 1024, // 50MB
      max: 100000,
      retention: 'critical'
    });

    // Error logs - medium size and retention  
    await this.createTieredCollection('error_logs', {
      size: 200 * 1024 * 1024, // 200MB
      max: 500000,
      retention: 'error'
    });

    // Warning logs - larger size, medium retention
    await this.createTieredCollection('warning_logs', {
      size: 300 * 1024 * 1024, // 300MB  
      max: 1000000,
      retention: 'warning'
    });

    // Info logs - large size, shorter retention
    await this.createTieredCollection('info_logs', {
      size: 500 * 1024 * 1024, // 500MB
      max: 2000000, 
      retention: 'info'
    });

    // Debug logs - largest size, shortest retention
    await this.createTieredCollection('debug_logs', {
      size: 1024 * 1024 * 1024, // 1GB
      max: 5000000,
      retention: 'debug'
    });

    // Specialized collections
    await this.createSpecializedCollections();

    console.log('Capped collection hierarchy created');
  }

  async createTieredCollection(name, config) {
    try {
      const collection = await this.db.createCollection(name, {
        capped: true,
        size: config.size,
        max: config.max
      });

      this.collections.set(name, collection);

      // Initialize statistics tracking
      this.statistics.set(name, {
        documentsInserted: 0,
        totalSize: 0,
        lastInsert: null,
        insertRate: 0,
        retentionType: config.retention
      });

      console.log(`Created capped collection: ${name} (${config.size} bytes, max ${config.max} docs)`);

    } catch (error) {
      if (error.code === 48) { // Collection already exists
        console.log(`Capped collection ${name} already exists`);
        const collection = this.db.collection(name);
        this.collections.set(name, collection);
      } else {
        throw error;
      }
    }
  }

  async createSpecializedCollections() {
    // Real-time metrics collection
    await this.createTieredCollection('realtime_metrics', {
      size: 100 * 1024 * 1024, // 100MB
      max: 1000000,
      retention: 'realtime'
    });

    // Security events collection
    await this.createTieredCollection('security_events', {
      size: 50 * 1024 * 1024, // 50MB
      max: 200000,
      retention: 'security'
    });

    // Business events collection  
    await this.createTieredCollection('business_events', {
      size: 200 * 1024 * 1024, // 200MB
      max: 1000000,
      retention: 'business'
    });

    // System health collection
    await this.createTieredCollection('system_health', {
      size: 150 * 1024 * 1024, // 150MB
      max: 500000,
      retention: 'system'
    });

    // Create minimal indexes for specialized queries
    await this.createSpecializedIndexes();
  }

  async createSpecializedIndexes() {
    // Minimal indexes to maintain write performance

    // Real-time metrics - by type and timestamp
    await this.collections.get('realtime_metrics').createIndex({
      metric_type: 1,
      timestamp: -1
    });

    // Security events - by severity and event type
    await this.collections.get('security_events').createIndex({
      severity: 1,
      event_type: 1
    });

    // Business events - by event category
    await this.collections.get('business_events').createIndex({
      category: 1,
      user_id: 1
    });

    // System health - by component and status
    await this.collections.get('system_health').createIndex({
      component: 1,
      status: 1
    });
  }

  async insertWithRouting(logLevel, document) {
    // Route documents to appropriate capped collection based on level
    const routingMap = {
      FATAL: 'critical_logs',
      ERROR: 'error_logs', 
      WARN: 'warning_logs',
      INFO: 'info_logs',
      DEBUG: 'debug_logs',
      TRACE: 'debug_logs'
    };

    const collectionName = routingMap[logLevel] || 'info_logs';
    const collection = this.collections.get(collectionName);

    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Add routing metadata
    const enrichedDocument = {
      ...document,
      _routed_to: collectionName,
      _inserted_at: new Date()
    };

    try {
      const result = await collection.insertOne(enrichedDocument);

      // Update statistics
      this.updateInsertionStatistics(collectionName, enrichedDocument);

      return result;
    } catch (error) {
      console.error(`Error inserting to ${collectionName}:`, error);
      throw error;
    }
  }

  updateInsertionStatistics(collectionName, document) {
    const stats = this.statistics.get(collectionName);
    if (!stats) return;

    stats.documentsInserted++;
    stats.totalSize += this.estimateDocumentSize(document);
    stats.lastInsert = new Date();

    // Calculate insertion rate (documents per second)
    if (stats.documentsInserted > 1) {
      const timeSpan = stats.lastInsert - stats.firstInsert || 1;
      stats.insertRate = (stats.documentsInserted / (timeSpan / 1000)).toFixed(2);
    } else {
      stats.firstInsert = stats.lastInsert;
    }
  }

  estimateDocumentSize(document) {
    // Rough estimation of document size in bytes
    return JSON.stringify(document).length * 2; // UTF-8 approximation
  }

  async setupTailableStreams() {
    // Set up tailable cursors for real-time log streaming
    console.log('Setting up tailable cursors for real-time streaming...');

    for (const [collectionName, collection] of this.collections.entries()) {
      const tail = collection.find().addCursorFlag('tailable', true)
                             .addCursorFlag('awaitData', true);

      this.tails.set(collectionName, tail);

      // Start async processing of tailable cursor
      this.processTailableStream(collectionName, tail);
    }
  }

  async processTailableStream(collectionName, cursor) {
    console.log(`Starting tailable stream for: ${collectionName}`);

    try {
      for await (const document of cursor) {
        // Process real-time log document
        await this.processRealtimeLog(collectionName, document);
      }
    } catch (error) {
      console.error(`Tailable stream error for ${collectionName}:`, error);

      // Attempt to restart the stream by recreating the tailable cursor
      setTimeout(() => {
        const collection = this.collections.get(collectionName);
        const newCursor = collection.find()
          .addCursorFlag('tailable', true)
          .addCursorFlag('awaitData', true);
        this.tails.set(collectionName, newCursor);
        this.processTailableStream(collectionName, newCursor);
      }, 5000);
    }
  }

  async processRealtimeLog(collectionName, document) {
    // Real-time processing of log entries
    const stats = this.statistics.get(collectionName);

    // Update real-time statistics
    if (stats) {
      stats.documentsInserted++;
      stats.lastInsert = new Date();
    }

    // Trigger alerts for critical conditions
    if (collectionName === 'critical_logs' || collectionName === 'error_logs') {
      await this.checkForAlertConditions(document);
    }

    // Real-time analytics
    if (collectionName === 'realtime_metrics') {
      await this.updateRealtimeMetrics(document);
    }

    // Security monitoring
    if (collectionName === 'security_events') {
      await this.analyzeSecurityEvent(document);
    }

    // Emit to external systems (WebSocket, message queues, etc.)
    this.emitRealtimeEvent(collectionName, document);
  }

  async checkForAlertConditions(document) {
    // Implement alert logic for critical conditions
    const alertConditions = [
      // High error rate
      document.level === 'ERROR' && document.error_count > 10,

      // Security incidents
      document.category === 'security' && document.severity === 'high',

      // System failures
      document.component === 'database' && document.status === 'down',

      // Performance degradation
      document.metric_type === 'response_time' && document.value > 10000
    ];

    if (alertConditions.some(condition => condition)) {
      await this.triggerAlert({
        type: 'critical_condition',
        document: document,
        timestamp: new Date()
      });
    }
  }

  async triggerAlert(alert) {
    console.log('ALERT TRIGGERED:', JSON.stringify(alert, null, 2));

    // Store alert in dedicated collection
    const alertsCollection = this.db.collection('alerts');
    await alertsCollection.insertOne({
      ...alert,
      // insertOne generates the _id automatically (ObjectId is not imported here)
      acknowledged: false,
      created_at: new Date()
    });

    // Send external notifications (email, Slack, PagerDuty, etc.)
    // Implementation depends on notification system
  }

  emitRealtimeEvent(collectionName, document) {
    // Emit to WebSocket connections, message queues, etc.
    console.log(`Real-time event: ${collectionName}`, {
      id: document._id,
      timestamp: document._inserted_at || document.timestamp,
      level: document.level,
      message: document.message ? document.message.substring(0, 100) + '...' : undefined
    });
  }

  async getCollectionStatistics(collectionName) {
    const collection = this.collections.get(collectionName);
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Get MongoDB collection statistics (the Node.js driver exposes this as db.command)
    const stats = await this.db.command({ collStats: collectionName });
    const customStats = this.statistics.get(collectionName);

    return {
      // MongoDB statistics
      size: stats.size,
      count: stats.count,
      avgObjSize: stats.avgObjSize,
      storageSize: stats.storageSize,
      capped: stats.capped,
      max: stats.max,
      maxSize: stats.maxSize,

      // Custom statistics
      insertRate: customStats?.insertRate || 0,
      lastInsert: customStats?.lastInsert,
      retentionType: customStats?.retentionType,

      // Calculated metrics
      utilizationPercent: ((stats.size / stats.maxSize) * 100).toFixed(2),
      documentsPerMB: Math.round(stats.count / (stats.size / 1024 / 1024)),

      // Health assessment
      healthStatus: this.assessCollectionHealth(stats, customStats)
    };
  }

  assessCollectionHealth(mongoStats, customStats) {
    const utilizationPercent = (mongoStats.size / mongoStats.maxSize) * 100;
    const timeSinceLastInsert = customStats?.lastInsert ? 
      Date.now() - customStats.lastInsert.getTime() : Infinity;

    if (utilizationPercent > 95) {
      return 'NEAR_CAPACITY';
    } else if (timeSinceLastInsert > 300000) { // 5 minutes
      return 'INACTIVE';
    } else if (customStats?.insertRate > 1000) {
      return 'HIGH_VOLUME';
    } else {
      return 'HEALTHY';
    }
  }

  async performMaintenance() {
    console.log('Performing capped collection maintenance...');

    const maintenanceReport = {
      timestamp: new Date(),
      collections: {},
      recommendations: []
    };

    for (const collectionName of this.collections.keys()) {
      const stats = await this.getCollectionStatistics(collectionName);
      maintenanceReport.collections[collectionName] = stats;

      // Generate recommendations based on statistics
      if (stats.healthStatus === 'NEAR_CAPACITY') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'SIZE_WARNING',
          message: `Collection ${collectionName} is at ${stats.utilizationPercent}% capacity`
        });
      }

      if (stats.healthStatus === 'INACTIVE') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'INACTIVE_WARNING',
          message: `Collection ${collectionName} has not received data recently`
        });
      }

      if (stats.insertRate > 1000) {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'HIGH_VOLUME',
          message: `Collection ${collectionName} has high insertion rate: ${stats.insertRate}/sec`
        });
      }
    }

    console.log('Maintenance report generated:', maintenanceReport);
    return maintenanceReport;
  }

  async shutdown() {
    console.log('Shutting down capped collection manager...');

    // Close all tailable cursors
    for (const [collectionName, cursor] of this.tails.entries()) {
      try {
        await cursor.close();
        console.log(`Closed tailable cursor for: ${collectionName}`);
      } catch (error) {
        console.error(`Error closing cursor for ${collectionName}:`, error);
      }
    }

    this.tails.clear();
    this.collections.clear();
    this.statistics.clear();

    console.log('Capped collection manager shutdown complete');
  }
}

// Real-time log aggregation and analysis
class RealtimeLogAggregator {
  constructor(cappedManager) {
    this.cappedManager = cappedManager;
    this.aggregationWindows = new Map();
    this.alertThresholds = {
      errorRate: 0.05, // 5% error rate
      responseTime: 5000, // 5 seconds
      memoryUsage: 0.85, // 85% memory usage
      cpuUsage: 0.90 // 90% CPU usage
    };
  }

  async startRealtimeAggregation() {
    console.log('Starting real-time log aggregation...');

    // Set up sliding window aggregations
    this.startSlidingWindow('error_rate', 300000); // 5-minute window
    this.startSlidingWindow('response_time', 60000); // 1-minute window
    this.startSlidingWindow('throughput', 60000); // 1-minute window
    this.startSlidingWindow('resource_usage', 120000); // 2-minute window

    console.log('Real-time aggregation started');
  }

  startSlidingWindow(metricType, windowSizeMs) {
    const windowData = {
      data: [],
      windowSize: windowSizeMs,
      lastCleanup: Date.now()
    };

    this.aggregationWindows.set(metricType, windowData);

    // Start cleanup interval
    setInterval(() => {
      this.cleanupWindow(metricType);
    }, windowSizeMs / 10); // Cleanup every 1/10th of window size
  }

  cleanupWindow(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    const cutoffTime = Date.now() - window.windowSize;
    window.data = window.data.filter(entry => entry.timestamp > cutoffTime);
    window.lastCleanup = Date.now();
  }

  addDataPoint(metricType, value, metadata = {}) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    window.data.push({
      timestamp: Date.now(),
      value: value,
      metadata: metadata
    });

    // Check for alerts
    this.checkAggregationAlerts(metricType);
  }

  checkAggregationAlerts(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) return;

    const recentData = window.data.slice(-10); // Last 10 data points
    const avgValue = recentData.reduce((sum, point) => sum + point.value, 0) / recentData.length;

    let alertTriggered = false;
    let alertMessage = '';

    switch (metricType) {
      case 'error_rate':
        if (avgValue > this.alertThresholds.errorRate) {
          alertTriggered = true;
          alertMessage = `High error rate: ${(avgValue * 100).toFixed(2)}%`;
        }
        break;

      case 'response_time':
        if (avgValue > this.alertThresholds.responseTime) {
          alertTriggered = true;
          alertMessage = `High response time: ${avgValue.toFixed(0)}ms`;
        }
        break;

      case 'resource_usage':
        const memoryAlert = recentData.some(p => p.metadata.memory > this.alertThresholds.memoryUsage);
        const cpuAlert = recentData.some(p => p.metadata.cpu > this.alertThresholds.cpuUsage);

        if (memoryAlert || cpuAlert) {
          alertTriggered = true;
          alertMessage = `High resource usage: Memory ${memoryAlert ? 'HIGH' : 'OK'}, CPU ${cpuAlert ? 'HIGH' : 'OK'}`;
        }
        break;
    }

    if (alertTriggered) {
      this.cappedManager.triggerAlert({
        type: 'aggregation_alert',
        metricType: metricType,
        message: alertMessage,
        value: avgValue,
        threshold: this.alertThresholds[metricType] || 'N/A',
        recentData: recentData.slice(-3) // Last 3 data points
      });
    }
  }

  getWindowSummary(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) {
      return { metricType, dataPoints: 0, summary: null };
    }

    const values = window.data.map(point => point.value);
    const sortedValues = [...values].sort((a, b) => a - b);

    return {
      metricType: metricType,
      dataPoints: window.data.length,
      windowSizeMs: window.windowSize,
      summary: {
        min: Math.min(...values),
        max: Math.max(...values),
        avg: values.reduce((sum, val) => sum + val, 0) / values.length,
        median: sortedValues[Math.floor(sortedValues.length / 2)],
        p95: sortedValues[Math.floor(sortedValues.length * 0.95)],
        p99: sortedValues[Math.floor(sortedValues.length * 0.99)]
      },
      trend: this.calculateTrend(window.data),
      lastUpdate: window.data[window.data.length - 1].timestamp
    };
  }

  calculateTrend(dataPoints) {
    if (dataPoints.length < 2) return 'INSUFFICIENT_DATA';

    const firstHalf = dataPoints.slice(0, Math.floor(dataPoints.length / 2));
    const secondHalf = dataPoints.slice(Math.floor(dataPoints.length / 2));

    const firstHalfAvg = firstHalf.reduce((sum, p) => sum + p.value, 0) / firstHalf.length;
    const secondHalfAvg = secondHalf.reduce((sum, p) => sum + p.value, 0) / secondHalf.length;

    const change = (secondHalfAvg - firstHalfAvg) / firstHalfAvg;

    if (Math.abs(change) < 0.05) return 'STABLE'; // Less than 5% change
    return change > 0 ? 'INCREASING' : 'DECREASING';
  }

  getAllWindowSummaries() {
    const summaries = {};
    for (const metricType of this.aggregationWindows.keys()) {
      summaries[metricType] = this.getWindowSummary(metricType);
    }
    return summaries;
  }
}
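
A short usage sketch for the two classes above, assuming an already-connected db handle from the MongoDB Node.js driver; the routed log entry, metric values, and endpoint name are illustrative.

// Wiring the capped collection manager and the real-time aggregator together
async function startLoggingPipeline(db) {
  const manager = new CappedCollectionManager(db);
  await manager.createCappedCollectionHierarchy();
  await manager.setupTailableStreams();

  const aggregator = new RealtimeLogAggregator(manager);
  await aggregator.startRealtimeAggregation();

  // Route a log entry by level and feed the sliding windows
  await manager.insertWithRouting('ERROR', {
    application: 'checkout',
    message: 'Inventory service timeout',
    timestamp: new Date()
  });
  aggregator.addDataPoint('response_time', 7200, { endpoint: '/api/checkout' });

  // Periodically review window summaries and collection health
  console.log(aggregator.getAllWindowSummaries());
  console.log(await manager.performMaintenance());
}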

SQL-Style Capped Collection Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Capped Collection management and querying:

-- QueryLeaf capped collection operations with SQL-familiar syntax

-- Create capped collections with size and document limits
CREATE CAPPED COLLECTION application_logs 
WITH (
  size = '1GB',
  max_documents = 10000000,
  auto_rotate = true
);

CREATE CAPPED COLLECTION error_logs 
WITH (
  size = '256MB', 
  max_documents = 1000000
);

CREATE CAPPED COLLECTION access_logs
WITH (
  size = '2GB'
  -- No document limit for maximum throughput
);

-- High-performance log insertion
INSERT INTO application_logs 
VALUES (
  CURRENT_TIMESTAMP,
  'user-service',
  'payment-processor', 
  'prod-instance-01',
  'ERROR',
  'Payment processing failed for transaction tx_12345',

  -- Structured request context
  ROW(
    'req_98765',
    'POST',
    '/api/payments/process',
    'user_54321',
    'sess_abcdef',
    '192.168.1.100'
  ) AS request_context,

  -- Trace information
  ROW(
    'trace_xyz789',
    'span_456',
    'span_123',
    1
  ) AS trace_info,

  -- Error details
  ROW(
    'PaymentValidationError',
    'Invalid payment method: expired_card',
    'PaymentProcessor.validateCard() line 245',
    'PM001'
  ) AS error_details,

  -- Additional data
  JSON_BUILD_OBJECT(
    'transaction_id', 'tx_12345',
    'user_id', 'user_54321', 
    'payment_amount', 299.99,
    'payment_method', 'card_****1234',
    'merchant_id', 'merchant_789'
  ) AS log_data
);

-- Real-time log tailing (most recent entries first)
SELECT 
  timestamp,
  service,
  level,
  message,
  request_context.request_id,
  request_context.user_id,
  trace_info.trace_id,
  error_details.error_code,
  log_data
FROM application_logs
ORDER BY $natural DESC  -- Reverse natural order (newest entries first) in capped collections
LIMIT 100;

-- Log analysis with time-based aggregation
WITH recent_logs AS (
  SELECT 
    service,
    level,
    timestamp,
    message,
    request_context.user_id,
    error_details.error_code,

    -- Time bucketing for analysis
    DATE_TRUNC('minute', timestamp) as minute_bucket,
    DATE_TRUNC('hour', timestamp) as hour_bucket
  FROM application_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
),

error_summary AS (
  SELECT 
    service,
    hour_bucket,
    level,
    COUNT(*) as log_count,
    COUNT(DISTINCT request_context.user_id) as affected_users,
    COUNT(DISTINCT error_details.error_code) as unique_errors,

    -- Error patterns
    mode() WITHIN GROUP (ORDER BY error_details.error_code) as most_common_error,
    array_agg(DISTINCT error_details.error_code) as error_codes,

    -- Sample messages for investigation
    array_agg(
      json_build_object(
        'timestamp', timestamp,
        'message', SUBSTRING(message, 1, 100),
        'user_id', request_context.user_id,
        'error_code', error_details.error_code
      ) ORDER BY timestamp DESC
    )[1:5] as recent_samples

  FROM recent_logs
  WHERE level IN ('ERROR', 'FATAL')
  GROUP BY service, hour_bucket, level
),

service_health AS (
  SELECT 
    service,
    hour_bucket,

    -- Overall metrics (computed from all recent logs, not just errors)
    COUNT(*) as total_logs,
    COUNT(*) FILTER (WHERE level = 'ERROR') as error_count,
    COUNT(*) FILTER (WHERE level = 'WARN') as warning_count,
    COUNT(DISTINCT request_context.user_id) as total_affected_users,

    -- Error rate calculation
    CASE 
      WHEN COUNT(*) > 0 THEN 
        (COUNT(*) FILTER (WHERE level = 'ERROR')::numeric / COUNT(*)) * 100
      ELSE 0
    END as error_rate_percent,

    -- Service status assessment
    CASE 
      WHEN COUNT(*) FILTER (WHERE level = 'ERROR') > 100 THEN 'CRITICAL'
      WHEN (COUNT(*) FILTER (WHERE level = 'ERROR')::numeric / NULLIF(COUNT(*), 0)) > 0.05 THEN 'DEGRADED'
      WHEN COUNT(*) FILTER (WHERE level = 'WARN') > 50 THEN 'WARNING'
      ELSE 'HEALTHY'
    END as service_status

  FROM recent_logs
  GROUP BY service, hour_bucket
)

SELECT 
  sh.service,
  sh.hour_bucket,
  sh.total_logs,
  sh.error_count,
  sh.warning_count,
  ROUND(sh.error_rate_percent, 2) as error_rate_pct,
  sh.total_affected_users,
  sh.service_status,

  -- Top error details
  es.most_common_error,
  es.unique_errors,
  es.error_codes,
  es.recent_samples,

  -- Trend analysis
  LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as prev_hour_errors,

  sh.error_count - LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as error_count_change

FROM service_health sh
LEFT JOIN error_summary es ON (
  sh.service = es.service AND 
  sh.hour_bucket = es.hour_bucket AND 
  es.level = 'ERROR'
)
WHERE sh.service_status != 'HEALTHY'
ORDER BY 
  CASE sh.service_status
    WHEN 'CRITICAL' THEN 1
    WHEN 'DEGRADED' THEN 2
    WHEN 'WARNING' THEN 3
    ELSE 4
  END,
  sh.error_rate_percent DESC, 
  sh.hour_bucket DESC;

-- Access log analysis for performance monitoring
WITH access_metrics AS (
  SELECT 
    endpoint,
    method,
    DATE_TRUNC('minute', timestamp) as minute_bucket,

    -- Request metrics
    COUNT(*) as request_count,
    AVG(duration_ms) as avg_duration,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as median_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99_duration,
    MIN(duration_ms) as min_duration,
    MAX(duration_ms) as max_duration,

    -- Status code distribution
    COUNT(*) FILTER (WHERE status_code < 300) as success_count,
    COUNT(*) FILTER (WHERE status_code >= 300 AND status_code < 400) as redirect_count,
    COUNT(*) FILTER (WHERE status_code >= 400 AND status_code < 500) as client_error_count,
    COUNT(*) FILTER (WHERE status_code >= 500) as server_error_count,

    -- Data transfer metrics
    AVG(response_size) as avg_response_size,
    SUM(response_size) as total_response_size,

    -- Client metrics
    COUNT(DISTINCT client.ip) as unique_clients,
    COUNT(DISTINCT client.user_id) as unique_users

  FROM access_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
  GROUP BY endpoint, method, minute_bucket
),

performance_analysis AS (
  SELECT 
    endpoint,
    method,

    -- Aggregated performance metrics
    SUM(request_count) as total_requests,
    AVG(avg_duration) as overall_avg_duration,
    MAX(p95_duration) as max_p95_duration,
    MAX(p99_duration) as max_p99_duration,

    -- Error rates
    (SUM(client_error_count + server_error_count)::numeric / SUM(request_count)) * 100 as error_rate_percent,
    SUM(server_error_count) as total_server_errors,

    -- Throughput metrics
    AVG(request_count) as avg_requests_per_minute,
    MAX(request_count) as peak_requests_per_minute,

    -- Data transfer
    AVG(avg_response_size) as avg_response_size,
    SUM(total_response_size) / (1024 * 1024) as total_mb_transferred,

    -- Client diversity
    AVG(unique_clients) as avg_unique_clients,
    AVG(unique_users) as avg_unique_users,

    -- Performance assessment
    CASE 
      WHEN AVG(avg_duration) > 5000 THEN 'SLOW'
      WHEN AVG(avg_duration) > 2000 THEN 'DEGRADED' 
      WHEN MAX(p95_duration) > 10000 THEN 'INCONSISTENT'
      ELSE 'NORMAL'
    END as performance_status,

    -- Time series data for trending
    array_agg(
      json_build_object(
        'minute', minute_bucket,
        'requests', request_count,
        'avg_duration', avg_duration,
        'p95_duration', p95_duration,
        'error_rate', (client_error_count + server_error_count)::numeric / request_count * 100
      ) ORDER BY minute_bucket
    ) as time_series_data

  FROM access_metrics
  GROUP BY endpoint, method
),

endpoint_ranking AS (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY total_requests DESC) as request_rank,
    ROW_NUMBER() OVER (ORDER BY error_rate_percent DESC) as error_rank,
    ROW_NUMBER() OVER (ORDER BY overall_avg_duration DESC) as duration_rank
  FROM performance_analysis
)

SELECT 
  endpoint,
  method,
  total_requests,
  ROUND(overall_avg_duration, 1) as avg_duration_ms,
  ROUND(max_p95_duration, 1) as max_p95_ms,
  ROUND(max_p99_duration, 1) as max_p99_ms,
  ROUND(error_rate_percent, 2) as error_rate_pct,
  total_server_errors,
  ROUND(avg_requests_per_minute, 1) as avg_rpm,
  peak_requests_per_minute as peak_rpm,
  ROUND(total_mb_transferred, 1) as total_mb,
  performance_status,

  -- Rankings
  request_rank,
  error_rank, 
  duration_rank,

  -- Alerts and recommendations
  CASE 
    WHEN performance_status = 'SLOW' THEN 'Optimize endpoint performance - average response time exceeds 5 seconds'
    WHEN performance_status = 'DEGRADED' THEN 'Monitor endpoint performance - response times elevated'
    WHEN performance_status = 'INCONSISTENT' THEN 'Investigate performance spikes - P95 latency exceeds 10 seconds'
    WHEN error_rate_percent > 5 THEN 'High error rate detected - investigate client and server errors'
    WHEN total_server_errors > 100 THEN 'Significant server errors detected - check application health'
    ELSE 'Performance within normal parameters'
  END as recommendation,

  time_series_data

FROM endpoint_ranking
WHERE (
  performance_status != 'NORMAL' OR 
  error_rate_percent > 1 OR 
  request_rank <= 20
)
ORDER BY 
  CASE performance_status
    WHEN 'SLOW' THEN 1
    WHEN 'DEGRADED' THEN 2
    WHEN 'INCONSISTENT' THEN 3
    ELSE 4
  END,
  error_rate_percent DESC,
  total_requests DESC;

-- Real-time metrics aggregation from capped collections
CREATE VIEW real_time_metrics AS
WITH metric_windows AS (
  SELECT 
    metric_type,
    metric_name,
    instance_id,

    -- Current values
    LAST_VALUE(value ORDER BY timestamp) as current_value,
    FIRST_VALUE(value ORDER BY timestamp) as first_value,

    -- Statistical aggregations
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV_POP(value) as stddev_value,
    COUNT(*) as sample_count,

    -- Trend calculation
    CASE 
      WHEN COUNT(*) >= 2 THEN
        (LAST_VALUE(value ORDER BY timestamp) - FIRST_VALUE(value ORDER BY timestamp)) / 
        NULLIF(FIRST_VALUE(value ORDER BY timestamp), 0) * 100
      ELSE 0
    END as trend_percent,

    -- Alert thresholds
    MAX(alerts.warning_threshold) as warning_threshold,
    MAX(alerts.critical_threshold) as critical_threshold,

    -- Time range
    MIN(timestamp) as window_start,
    MAX(timestamp) as window_end

  FROM performance_metrics
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
  GROUP BY metric_type, metric_name, instance_id
)

SELECT 
  metric_type,
  metric_name,
  instance_id,
  current_value,
  ROUND(avg_value::numeric, 2) as avg_value,
  min_value,
  max_value,
  ROUND(stddev_value::numeric, 2) as stddev,
  sample_count,
  ROUND(trend_percent::numeric, 1) as trend_pct,

  -- Alert status
  CASE 
    WHEN critical_threshold IS NOT NULL AND current_value >= critical_threshold THEN 'CRITICAL'
    WHEN warning_threshold IS NOT NULL AND current_value >= warning_threshold THEN 'WARNING'
    ELSE 'NORMAL'
  END as alert_status,

  warning_threshold,
  critical_threshold,
  window_start,
  window_end,

  -- Performance assessment
  CASE metric_type
    WHEN 'cpu_percent' THEN 
      CASE WHEN current_value > 90 THEN 'HIGH' 
           WHEN current_value > 70 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    WHEN 'memory_percent' THEN
      CASE WHEN current_value > 85 THEN 'HIGH'
           WHEN current_value > 70 THEN 'ELEVATED' 
           ELSE 'NORMAL' END
    WHEN 'response_time_ms' THEN
      CASE WHEN current_value > 5000 THEN 'SLOW'
           WHEN current_value > 2000 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    ELSE 'NORMAL'
  END as performance_status

FROM metric_windows
ORDER BY 
  CASE alert_status
    WHEN 'CRITICAL' THEN 1
    WHEN 'WARNING' THEN 2
    ELSE 3
  END,
  metric_type,
  metric_name;

-- Capped collection maintenance and monitoring
SELECT 
  collection_name,
  is_capped,
  max_size_bytes / (1024 * 1024) as max_size_mb,
  current_size_bytes / (1024 * 1024) as current_size_mb,
  document_count,
  max_documents,

  -- Utilization metrics
  ROUND((current_size_bytes::numeric / max_size_bytes) * 100, 1) as size_utilization_pct,
  ROUND((document_count::numeric / NULLIF(max_documents, 0)) * 100, 1) as document_utilization_pct,

  -- Health assessment
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 'NEAR_CAPACITY'
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.80 THEN 'HIGH_UTILIZATION'
    WHEN document_count = 0 THEN 'EMPTY'
    ELSE 'HEALTHY'
  END as health_status,

  -- Performance metrics
  avg_document_size_bytes,
  ROUND(avg_document_size_bytes / 1024.0, 1) as avg_document_size_kb,

  -- Recommendations
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 
      'Consider increasing collection size or reducing retention period'
    WHEN document_count = 0 THEN 
      'Collection is empty - verify data ingestion is working'
    WHEN avg_document_size_bytes > 16384 THEN 
      'Large average document size - consider data optimization'
    ELSE 'Collection operating within normal parameters'
  END as recommendation

FROM CAPPED_COLLECTION_STATS()
WHERE is_capped = true
ORDER BY size_utilization_pct DESC;

-- QueryLeaf provides comprehensive capped collection capabilities:
-- 1. SQL-familiar capped collection creation and management
-- 2. High-performance log insertion with structured data support
-- 3. Real-time log tailing and streaming with natural ordering
-- 4. Advanced log analysis with time-based aggregations
-- 5. Access pattern analysis for performance monitoring
-- 6. Real-time metrics aggregation and alerting
-- 7. Capped collection health monitoring and maintenance
-- 8. Integration with MongoDB's circular buffer optimizations
-- 9. Automatic size management without manual intervention
-- 10. Familiar SQL patterns for log analysis and troubleshooting

Best Practices for Capped Collection Implementation

Design Guidelines

Essential practices for optimal capped collection configuration (a sizing sketch follows the list):

  1. Size Planning: Calculate appropriate collection sizes based on expected data volume and retention requirements
  2. Index Strategy: Use minimal indexes to maintain write performance while supporting essential queries
  3. Document Structure: Design documents for optimal compression and query performance
  4. Retention Alignment: Align capped collection sizes with business retention and compliance requirements
  5. Monitoring Setup: Implement continuous monitoring of collection utilization and performance
  6. Alert Configuration: Set up alerts for capacity utilization and performance degradation
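
The sizing guidance above can be made concrete with a short sketch. This is a minimal example rather than a prescription: the database and collection names, ingestion rate, document size, and retention window are assumptions chosen only to show how the arithmetic feeds into createCollection().

// Minimal sizing sketch for a capped log collection
// (names and all volume figures below are hypothetical)
const { MongoClient } = require('mongodb');

async function createSizedLogCollection() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('logging');

  // Assumed workload: ~500 log documents/second, ~512 bytes each, 6 hours retained
  const docsPerSecond = 500;
  const avgDocBytes = 512;
  const retentionSeconds = 6 * 60 * 60;
  const sizeBytes = docsPerSecond * avgDocBytes * retentionSeconds; // ≈ 5.5 GB

  await db.createCollection('app_logs', {
    capped: true,
    size: sizeBytes,                        // hard byte limit (MongoDB rounds up internally)
    max: docsPerSecond * retentionSeconds   // optional document-count limit
  });

  // Keep secondary indexes minimal to protect write throughput (guideline 2 above)
  await db.collection('app_logs').createIndex({ service: 1, timestamp: -1 });

  await client.close();
}

createSizedLogCollection().catch(console.error);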

Performance and Scalability

Optimize capped collections for high-throughput logging scenarios (see the ingestion and tailing sketch after this list):

  1. Write Performance: Minimize indexes and use batch insertion for maximum throughput
  2. Tailable Cursors: Leverage tailable cursors for real-time log streaming and processing
  3. Collection Sizing: Balance collection size with query performance and storage efficiency
  4. Replica Set Configuration: Optimize replica set settings for write-heavy workloads
  5. Hardware Considerations: Use fast storage and adequate memory for optimal performance
  6. Network Optimization: Configure network settings for high-volume log ingestion
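
As a rough illustration of the write-throughput and tailable-cursor guidance above, the sketch below batches unordered inserts into the same hypothetical app_logs collection and then tails it for new documents. The names and payload shape are assumptions, not part of any API described elsewhere in this article.

// Minimal sketch: unordered batch ingestion plus tailable streaming on a capped collection
const { MongoClient } = require('mongodb');

async function ingestAndTail() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const logs = client.db('logging').collection('app_logs');

  // 1. Batch writes: unordered inserts let the server keep processing documents
  //    instead of stopping at the first failure, which helps sustained throughput
  const batch = Array.from({ length: 1000 }, (_, i) => ({
    timestamp: new Date(),
    level: 'INFO',
    service: 'checkout',
    message: `event ${i}`
  }));
  await logs.insertMany(batch, { ordered: false });

  // 2. Tailable cursor: behaves like `tail -f`; awaitData blocks briefly for new
  //    documents instead of returning an exhausted cursor (requires a non-empty capped collection)
  const cursor = logs.find({}, { tailable: true, awaitData: true });
  for await (const doc of cursor) {
    console.log(`[${doc.level}] ${doc.service}: ${doc.message}`);
  }

  await client.close();
}

ingestAndTail().catch(console.error);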

Conclusion

MongoDB Capped Collections provide purpose-built capabilities for high-performance logging and circular buffer patterns that eliminate the complexity and overhead of traditional database approaches while delivering consistent performance and automatic space management. The natural ordering preservation and optimized write characteristics make capped collections ideal for log processing, event storage, and real-time data applications.

Key Capped Collection benefits include:

  • Automatic Size Management: Fixed-size collections with automatic document rotation
  • Write-Optimized Performance: Optimized for high-throughput, sequential write operations
  • Natural Ordering: Insertion order preservation without additional indexing overhead
  • Circular Buffer Behavior: Automatic old document removal when size limits are reached
  • Real-Time Streaming: Tailable cursor support for live log streaming and processing
  • Operational Simplicity: No manual maintenance or complex rotation procedures required

Whether you're building logging systems, event processors, real-time analytics platforms, or any application requiring circular buffer patterns, MongoDB Capped Collections with QueryLeaf's familiar SQL interface provide the foundation for high-performance data storage. This combination enables you to implement sophisticated logging capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Capped Collection operations while providing SQL-familiar collection creation, log analysis, and real-time querying syntax. Advanced circular buffer management, performance monitoring, and maintenance operations are seamlessly handled through familiar SQL patterns, making high-performance logging both powerful and accessible.

The integration of native capped collection capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance logging and familiar database interaction patterns, ensuring your logging solutions remain both effective and maintainable as they scale and evolve.

MongoDB Geospatial Queries and Location-Based Services: SQL-Style Spatial Operations for Modern Applications

Location-aware applications have become fundamental to modern software experiences - from ride-sharing platforms and delivery services to social networks and retail applications. These applications require sophisticated spatial data processing capabilities including proximity searches, route optimization, geofencing, and real-time location tracking that traditional relational databases struggle to handle efficiently.

MongoDB provides comprehensive geospatial functionality with native GeoJSON support, planar (2d) and spherical (2dsphere) coordinate handling, and advanced spatial operations. Unlike traditional databases that require complex extensions for spatial data, MongoDB natively supports geospatial indexes, queries, and aggregation operations that can handle billions of location data points with sub-second query performance.
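
Before walking through the traditional alternative, here is a minimal preview of that native syntax: a proximity search with the Node.js driver against a hypothetical places collection carrying a 2dsphere index. The database, collection, and field names are illustrative assumptions.

// Minimal preview: native proximity search on GeoJSON points
const { MongoClient } = require('mongodb');

async function findNearby() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const places = client.db('geo_demo').collection('places');

  // A 2dsphere index enables spherical (Earth-based) distance queries
  await places.createIndex({ location: '2dsphere' });

  // Everything within 2 km of downtown San Francisco, closest first ($near sorts by distance)
  const nearby = await places.find({
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates: [-122.4194, 37.7749] }, // [longitude, latitude]
        $maxDistance: 2000 // meters
      }
    }
  }).limit(10).toArray();

  console.log(nearby.map(p => p.name));
  await client.close();
}

findNearby().catch(console.error);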

The Traditional Spatial Data Challenge

Relational databases face significant limitations when handling geospatial data and location-based queries:

-- Traditional PostgreSQL/PostGIS approach - complex setup and limited performance
-- Location-based application with spatial data

CREATE EXTENSION IF NOT EXISTS postgis;

-- Store locations with geometry data
CREATE TABLE locations (
    location_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category VARCHAR(100),
    address TEXT,
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),

    -- PostGIS geometry column (complex setup required)
    coordinates GEOMETRY(POINT, 4326), -- WGS84 coordinate system

    -- Additional spatial data
    service_area GEOMETRY(POLYGON, 4326), -- Service coverage area
    delivery_zones GEOMETRY(MULTIPOLYGON, 4326), -- Multiple delivery zones

    -- Business data
    rating DECIMAL(3,2),
    total_reviews INTEGER DEFAULT 0,
    is_active BOOLEAN DEFAULT true,
    hours_of_operation JSONB,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create spatial indexes (requires PostGIS extension)
CREATE INDEX idx_locations_coordinates ON locations USING GIST (coordinates);
CREATE INDEX idx_locations_service_area ON locations USING GIST (service_area);

-- Store user locations and activities
CREATE TABLE user_locations (
    user_location_id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    coordinates GEOMETRY(POINT, 4326),
    accuracy_meters DECIMAL(8,2),
    recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    activity_type VARCHAR(50), -- 'check-in', 'delivery', 'movement'
    device_info JSONB
);

CREATE INDEX idx_user_locations_coordinates ON user_locations USING GIST (coordinates);
CREATE INDEX idx_user_locations_user_time ON user_locations (user_id, recorded_at);

-- Complex proximity search query
WITH nearby_locations AS (
    SELECT 
        l.location_id,
        l.name,
        l.category,
        l.rating,

        -- Distance calculation in meters (geography cast required for metric units)
        ST_Distance(
            l.coordinates::geography,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography -- San Francisco coordinates
        ) as distance_meters,

        -- Check if point is within service area
        ST_Contains(
            l.service_area,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)
        ) as is_in_service_area,

        -- Convert coordinates back to lat/lng for application
        ST_Y(l.coordinates) as latitude,
        ST_X(l.coordinates) as longitude

    FROM locations l
    WHERE 
        l.is_active = true
        AND ST_DWithin(
            l.coordinates::geography,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,
            5000 -- 5km radius in meters
        )
),
location_analytics AS (
    -- Add user activity data for locations
    SELECT 
        nl.*,
        COUNT(DISTINCT ul.user_id) as unique_visitors_last_30_days,
        COUNT(ul.user_location_id) as total_activities_last_30_days,
        AVG(ul.accuracy_meters) as avg_location_accuracy
    FROM nearby_locations nl
    LEFT JOIN user_locations ul ON ST_DWithin(
        ST_SetSRID(ST_MakePoint(nl.longitude, nl.latitude), 4326)::geography,
        ul.coordinates::geography,
        100 -- Within 100 meters of location
    )
    AND ul.recorded_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY nl.location_id, nl.name, nl.category, nl.rating, 
             nl.distance_meters, nl.is_in_service_area, 
             nl.latitude, nl.longitude
)
SELECT 
    location_id,
    name,
    category,
    rating,
    ROUND(distance_meters::numeric, 0) as distance_meters,
    is_in_service_area,
    latitude,
    longitude,
    unique_visitors_last_30_days,
    total_activities_last_30_days,
    ROUND(avg_location_accuracy::numeric, 1) as avg_accuracy_meters,

    -- Relevance scoring based on distance, rating, and activity
    (
        (1000 - LEAST(distance_meters, 1000)) / 1000 * 0.4 + -- Distance factor (40%)
        (rating / 5.0) * 0.3 + -- Rating factor (30%)
        (LEAST(unique_visitors_last_30_days, 50) / 50.0) * 0.3 -- Activity factor (30%)
    ) as relevance_score

FROM location_analytics
ORDER BY relevance_score DESC, distance_meters ASC
LIMIT 20;

-- Problems with traditional spatial approach:
-- 1. Complex PostGIS extension setup and maintenance
-- 2. Requires specialized spatial database knowledge
-- 3. Limited coordinate system support without additional configuration
-- 4. Performance degrades with large datasets and complex queries
-- 5. Difficult integration with application object models
-- 6. Complex geometry data types and manipulation functions
-- 7. Limited aggregation capabilities for spatial analytics
-- 8. Challenging horizontal scaling for global applications
-- 9. Memory-intensive spatial operations
-- 10. Complex backup and restore procedures for spatial data

-- MySQL spatial limitations (even more restrictive):
CREATE TABLE locations_mysql (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    -- MySQL spatial support limited and less capable
    coordinates POINT NOT NULL,
    SPATIAL INDEX(coordinates)
);

-- Basic proximity query in MySQL (limited functionality)
SELECT 
    id, name,
    ST_Distance_Sphere(
        coordinates, 
        POINT(-122.4194, 37.7749)
    ) as distance_meters
FROM locations_mysql
WHERE ST_Distance_Sphere(
    coordinates, 
    POINT(-122.4194, 37.7749)
) < 5000
ORDER BY distance_meters
LIMIT 10;

-- MySQL limitations:
-- - Limited spatial functions compared to PostGIS
-- - Poor performance with large spatial datasets
-- - No advanced spatial analytics capabilities
-- - Limited coordinate system support
-- - Basic geometry types only
-- - No spatial aggregation functions
-- - Difficult to implement complex spatial business logic

MongoDB provides comprehensive geospatial capabilities with simple, intuitive syntax:

// MongoDB native geospatial support - powerful and intuitive
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('location_services');

// MongoDB geospatial document structure - native and flexible
const createLocationServiceDataModel = async () => {
  // Create locations collection with rich geospatial data
  const locations = db.collection('locations');

  // Example location document with geospatial data
  const locationDocument = {
    _id: new ObjectId(),

    // Basic business information
    name: "Blue Bottle Coffee - Ferry Building",
    category: "cafe",
    subcategory: "specialty_coffee",
    chain: "Blue Bottle Coffee",

    // Address information
    address: {
      street: "1 Ferry Building",
      unit: "Shop 7",
      city: "San Francisco",
      state: "CA",
      country: "USA",
      postalCode: "94111",
      formattedAddress: "1 Ferry Building, Shop 7, San Francisco, CA 94111"
    },

    // Primary location - GeoJSON Point format
    location: {
      type: "Point",
      coordinates: [-122.3937, 37.7955] // [longitude, latitude] - NOTE: MongoDB uses [lng, lat]
    },

    // Service area - GeoJSON Polygon format
    serviceArea: {
      type: "Polygon",
      coordinates: [[
        [-122.4050, 37.7850], // Southwest corner
        [-122.3850, 37.7850], // Southeast corner  
        [-122.3850, 37.8050], // Northeast corner
        [-122.4050, 37.8050], // Northwest corner
        [-122.4050, 37.7850]  // Close polygon
      ]]
    },

    // Multiple delivery zones - GeoJSON MultiPolygon
    deliveryZones: {
      type: "MultiPolygon", 
      coordinates: [
        [[ // First delivery zone
          [-122.4000, 37.7900],
          [-122.3900, 37.7900],
          [-122.3900, 37.8000],
          [-122.4000, 37.8000],
          [-122.4000, 37.7900]
        ]],
        [[ // Second delivery zone
          [-122.4100, 37.7800],
          [-122.3950, 37.7800],
          [-122.3950, 37.7900],
          [-122.4100, 37.7900],
          [-122.4100, 37.7800]
        ]]
      ]
    },

    // Business information
    business: {
      rating: 4.6,
      totalReviews: 1247,
      priceRange: "$$",
      phoneNumber: "+1-415-555-0123",
      website: "https://bluebottlecoffee.com",
      isActive: true,
      isChain: true,

      // Hours of operation with geospatial considerations
      hours: {
        monday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        tuesday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        wednesday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        thursday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        friday: { open: "06:00", close: "20:00", timezone: "America/Los_Angeles" },
        saturday: { open: "07:00", close: "20:00", timezone: "America/Los_Angeles" },
        sunday: { open: "07:00", close: "19:00", timezone: "America/Los_Angeles" }
      },

      // Services and amenities
      amenities: ["wifi", "outdoor_seating", "takeout", "delivery", "mobile_payment"],
      specialties: ["single_origin", "cold_brew", "espresso", "pour_over"]
    },

    // Geospatial metadata
    geoMetadata: {
      coordinateSystem: "WGS84",
      accuracyMeters: 5,
      elevationMeters: 15,
      dataSource: "GPS_verified",
      lastVerified: new Date("2024-09-01"),

      // Nearby landmarks for context
      nearbyLandmarks: [
        {
          name: "Ferry Building Marketplace",
          distance: 50,
          bearing: "north"
        },
        {
          name: "Embarcadero BART Station", 
          distance: 200,
          bearing: "west"
        }
      ]
    },

    // Analytics and performance data
    analytics: {
      monthlyVisitors: 12500,
      averageVisitDuration: 25, // minutes
      peakHours: ["08:00-09:00", "12:00-13:00", "15:00-16:00"],
      popularDays: ["monday", "tuesday", "wednesday", "friday"],

      // Location-specific metrics
      locationMetrics: {
        averageWalkingTime: 3.5, // minutes from nearest transit
        parkingAvailability: "limited",
        accessibilityRating: 4.2,
        noiseLevel: "moderate",
        crowdLevel: "busy"
      }
    },

    // SEO and discovery
    searchTerms: [
      "coffee shop ferry building", 
      "blue bottle san francisco",
      "specialty coffee embarcadero",
      "third wave coffee downtown sf"
    ],

    tags: ["coffee", "cafe", "specialty", "artisan", "downtown", "waterfront"],

    createdAt: new Date("2024-01-15"),
    updatedAt: new Date("2024-09-14")
  };

  // Insert the location document
  await locations.insertOne(locationDocument);

  // Create geospatial index - 2dsphere for spherical geometry (Earth)
  await locations.createIndex({ location: "2dsphere" });
  await locations.createIndex({ serviceArea: "2dsphere" });
  await locations.createIndex({ deliveryZones: "2dsphere" });

  // Additional indexes for common queries
  await locations.createIndex({ category: 1, "business.rating": -1 });
  await locations.createIndex({ "business.isActive": 1, "location": "2dsphere" });
  await locations.createIndex({ tags: 1, "location": "2dsphere" });

  console.log("Location document and indexes created successfully");
  return locations;
};

// Advanced geospatial queries and operations
const performGeospatialOperations = async () => {
  const locations = db.collection('locations');

  // 1. Proximity Search - Find nearby locations
  console.log("=== Proximity Search ===");
  const userLocation = [-122.4194, 37.7749]; // San Francisco coordinates [lng, lat]

  const nearbyLocations = await locations.find({
    location: {
      $near: {
        $geometry: {
          type: "Point",
          coordinates: userLocation
        },
        $maxDistance: 5000, // 5km in meters
        $minDistance: 0
      }
    },
    "business.isActive": true
  }).limit(10).toArray();

  console.log(`Found ${nearbyLocations.length} locations within 5km`);

  // 2. Geo Within - Find locations within a specific area
  console.log("\n=== Geo Within Search ===");
  const searchPolygon = {
    type: "Polygon", 
    coordinates: [[
      [-122.4270, 37.7609], // Southwest corner
      [-122.3968, 37.7609], // Southeast corner
      [-122.3968, 37.7908], // Northeast corner  
      [-122.4270, 37.7908], // Northwest corner
      [-122.4270, 37.7609]  // Close polygon
    ]]
  };

  const locationsInArea = await locations.find({
    location: {
      $geoWithin: {
        $geometry: searchPolygon
      }
    },
    category: "restaurant"
  }).toArray();

  console.log(`Found ${locationsInArea.length} restaurants in specified area`);

  // 3. Geospatial Aggregation - Complex analytics
  console.log("\n=== Geospatial Analytics ===");
  const geospatialAnalytics = await locations.aggregate([
    // Match active locations
    {
      $match: {
        "business.isActive": true,
        location: {
          $geoWithin: {
            $centerSphere: [userLocation, 10 / 3963.2] // 10 miles radius
          }
        }
      }
    },

    // Calculate distance from user location
    {
      $addFields: {
        distanceFromUser: {
          $divide: [
            {
              $sqrt: {
                $add: [
                  {
                    $pow: [
                      { $subtract: [{ $arrayElemAt: ["$location.coordinates", 0] }, userLocation[0]] },
                      2
                    ]
                  },
                  {
                    $pow: [
                      { $subtract: [{ $arrayElemAt: ["$location.coordinates", 1] }, userLocation[1]] },
                      2
                    ]
                  }
                ]
              }
            },
            0.000009 // Approximate degrees to meters conversion
          ]
        }
      }
    },

    // Group by category and analyze
    {
      $group: {
        _id: "$category",
        totalLocations: { $sum: 1 },
        averageRating: { $avg: "$business.rating" },
        averageDistance: { $avg: "$distanceFromUser" },
        closestLocation: {
          // distance first so $min's document comparison effectively orders by distance
          $min: {
            distance: "$distanceFromUser",
            name: "$name",
            coordinates: "$location.coordinates"
          }
        },

        // Collect all locations in category
        locations: {
          $push: {
            name: "$name",
            rating: "$business.rating",
            distance: "$distanceFromUser",
            coordinates: "$location.coordinates"
          }
        },

        // Rating distribution
        highRatedCount: {
          $sum: { $cond: [{ $gte: ["$business.rating", 4.5] }, 1, 0] }
        },
        mediumRatedCount: {
          $sum: { $cond: [{ $and: [{ $gte: ["$business.rating", 3.5] }, { $lt: ["$business.rating", 4.5] }] }, 1, 0] }
        },
        lowRatedCount: {
          $sum: { $cond: [{ $lt: ["$business.rating", 3.5] }, 1, 0] }
        }
      }
    },

    // Calculate additional metrics
    {
      $addFields: {
        categoryDensity: { $divide: ["$totalLocations", 314] }, // per square km (10 mile radius ≈ 314 sq km)
        highRatedPercentage: { $multiply: [{ $divide: ["$highRatedCount", "$totalLocations"] }, 100] },
        averageDistanceKm: { $divide: ["$averageDistance", 1000] } // distanceFromUser is approximated in meters above
      }
    },

    // Sort by total locations and rating
    {
      $sort: {
        totalLocations: -1,
        averageRating: -1
      }
    },

    // Format output
    {
      $project: {
        category: "$_id",
        totalLocations: 1,
        averageRating: { $round: ["$averageRating", 2] },
        averageDistanceKm: { $round: ["$averageDistanceKm", 2] },
        categoryDensity: { $round: ["$categoryDensity", 2] },
        highRatedPercentage: { $round: ["$highRatedPercentage", 1] },
        closestLocation: 1,
        ratingDistribution: {
          high: "$highRatedCount",
          medium: "$mediumRatedCount", 
          low: "$lowRatedCount"
        }
      }
    }
  ]).toArray();

  console.log("Geospatial Analytics Results:");
  console.log(JSON.stringify(geospatialAnalytics, null, 2));

  // 4. Route optimization - Find optimal path through multiple locations
  console.log("\n=== Route Optimization ===");
  const waypointLocations = [
    [-122.4194, 37.7749], // Start: San Francisco
    [-122.4094, 37.7849], // Waypoint 1
    [-122.3994, 37.7949], // Waypoint 2
    [-122.4194, 37.7749]  // End: Back to start
  ];

  // Find locations near each waypoint
  const routeAnalysis = await Promise.all(
    waypointLocations.map(async (waypoint, index) => {
      const nearbyOnRoute = await locations.find({
        location: {
          $near: {
            $geometry: {
              type: "Point",
              coordinates: waypoint
            },
            $maxDistance: 500 // 500m radius
          }
        },
        "business.isActive": true
      }).limit(5).toArray();

      return {
        waypointIndex: index,
        coordinates: waypoint,
        nearbyLocations: nearbyOnRoute.map(loc => ({
          name: loc.name,
          category: loc.category,
          rating: loc.business.rating,
          coordinates: loc.location.coordinates
        }))
      };
    })
  );

  console.log("Route Analysis:");
  console.log(JSON.stringify(routeAnalysis, null, 2));

  return {
    nearbyLocations: nearbyLocations.length,
    locationsInArea: locationsInArea.length,
    analyticsResults: geospatialAnalytics.length,
    routeWaypoints: routeAnalysis.length
  };
};

// Real-time location tracking and geofencing
const setupLocationTracking = async () => {
  const userLocations = db.collection('user_locations');
  const geofences = db.collection('geofences');

  // Create user location tracking document
  const userLocationDocument = {
    _id: new ObjectId(),
    userId: new ObjectId("64a1b2c3d4e5f6789012347a"),

    // Current location
    currentLocation: {
      type: "Point",
      coordinates: [-122.4194, 37.7749]
    },

    // Location metadata
    locationMetadata: {
      accuracy: 10, // meters
      altitude: 15, // meters above sea level
      heading: 45, // degrees from north
      speed: 1.5, // meters per second
      timestamp: new Date(),
      source: "GPS", // GPS, WiFi, Cellular, Manual
      batteryLevel: 85,

      // Device context
      device: {
        platform: "iOS",
        version: "17.1",
        model: "iPhone 15 Pro",
        appVersion: "2.1.0"
      }
    },

    // Location history (recent positions)
    locationHistory: [
      {
        location: {
          type: "Point", 
          coordinates: [-122.4204, 37.7739]
        },
        timestamp: new Date(Date.now() - 300000), // 5 minutes ago
        accuracy: 15,
        source: "GPS"
      },
      {
        location: {
          type: "Point",
          coordinates: [-122.4214, 37.7729] 
        },
        timestamp: new Date(Date.now() - 600000), // 10 minutes ago
        accuracy: 12,
        source: "GPS"
      }
    ],

    // Privacy and permissions
    privacy: {
      shareLocation: true,
      accuracyLevel: "precise", // precise, approximate, city
      shareWithFriends: true,
      shareWithBusiness: false,
      trackingEnabled: true
    },

    // Activity context
    activity: {
      type: "walking", // walking, driving, cycling, stationary
      confidence: 0.85,
      detectedTransition: null,
      lastActivity: "stationary"
    },

    createdAt: new Date(),
    updatedAt: new Date()
  };

  // Create indexes for location tracking
  await userLocations.createIndex({ currentLocation: "2dsphere" });
  await userLocations.createIndex({ userId: 1, "locationMetadata.timestamp": -1 });
  await userLocations.createIndex({ "locationHistory.location": "2dsphere" });

  await userLocations.insertOne(userLocationDocument);

  // Create geofence system
  const geofenceDocument = {
    _id: new ObjectId(),
    name: "Downtown Coffee Shop Promo Zone",
    description: "Special promotions for coffee shops in downtown area",

    // Geofence area
    area: {
      type: "Polygon",
      coordinates: [[
        [-122.4200, 37.7700],
        [-122.4100, 37.7700], 
        [-122.4100, 37.7800],
        [-122.4200, 37.7800],
        [-122.4200, 37.7700]
      ]]
    },

    // Geofence configuration
    config: {
      type: "promotional", // promotional, security, analytics, notification
      radius: null, // For circular geofences
      isActive: true,

      // Trigger conditions
      triggers: {
        onEnter: true,
        onExit: true,
        onDwell: true,
        dwellTimeMinutes: 5,

        // Rate limiting
        minTimeBetweenTriggers: 300, // seconds
        maxTriggersPerDay: 10
      },

      // Actions to take
      actions: {
        notification: {
          enabled: true,
          title: "Coffee Deals Nearby!",
          message: "Check out special offers at local coffee shops",
          deepLink: "app://offers/coffee"
        },
        analytics: {
          trackEntry: true,
          trackExit: true,
          trackDwellTime: true
        },
        webhook: {
          enabled: false,
          url: "https://api.example.com/geofence-trigger",
          method: "POST"
        }
      }
    },

    // Analytics
    analytics: {
      totalEnters: 1456,
      totalExits: 1423,
      avgDwellTimeMinutes: 12.5,
      uniqueUsers: 342,

      // Time-based patterns
      hourlyActivity: {
        "08": 45, "09": 78, "10": 23, "11": 34,
        "12": 89, "13": 67, "14": 45, "15": 56,
        "16": 78, "17": 123, "18": 89, "19": 34
      },

      dailyActivity: {
        "monday": 234, "tuesday": 189, "wednesday": 267,
        "thursday": 201, "friday": 298, "saturday": 156, "sunday": 111
      }
    },

    createdAt: new Date("2024-09-01"),
    updatedAt: new Date("2024-09-14")
  };

  await geofences.createIndex({ area: "2dsphere" });
  await geofences.createIndex({ "config.isActive": 1, "config.type": 1 });

  await geofences.insertOne(geofenceDocument);

  // Real-time geofence checking function
  const checkGeofences = async (userId, currentLocation) => {
    console.log("Checking geofences for user location...");

    // Find all active geofences that contain the user's location
    const triggeredGeofences = await geofences.find({
      "config.isActive": true,
      area: {
        $geoIntersects: {
          $geometry: {
            type: "Point",
            coordinates: currentLocation
          }
        }
      }
    }).toArray();

    console.log(`Found ${triggeredGeofences.length} triggered geofences`);

    // Process each triggered geofence
    for (const geofence of triggeredGeofences) {
      console.log(`Processing geofence: ${geofence.name}`);

      // Update analytics
      await geofences.updateOne(
        { _id: geofence._id },
        {
          $inc: { 
            "analytics.totalEnters": 1,
            [`analytics.hourlyActivity.${new Date().getHours().toString().padStart(2, '0')}`]: 1,
            [`analytics.dailyActivity.${new Date().toLocaleDateString('en-US', { weekday: 'long' }).toLowerCase()}`]: 1
          },
          $set: { updatedAt: new Date() }
        }
      );

      // Trigger actions (notifications, webhooks, etc.)
      if (geofence.config.actions.notification.enabled) {
        console.log(`Sending notification: ${geofence.config.actions.notification.title}`);
        // Implementation would send actual notification
      }
    }

    return triggeredGeofences;
  };

  // Test geofence checking
  const testLocation = [-122.4150, 37.7750]; // Point within the geofence
  const triggeredFences = await checkGeofences(userLocationDocument.userId, testLocation);

  return {
    userLocationDocument,
    geofenceDocument,
    triggeredGeofences: triggeredFences.length
  };
};

// Advanced spatial analytics and heatmap generation
const generateSpatialAnalytics = async () => {
  const locations = db.collection('locations');
  const userLocations = db.collection('user_locations');

  console.log("=== Generating Spatial Analytics ===");

  // 1. Location Density Analysis
  const locationDensityAnalysis = await locations.aggregate([
    {
      $match: {
        "business.isActive": true
      }
    },

    // Create grid cells for density analysis
    {
      $addFields: {
        gridCell: {
          lat: {
            $floor: {
              $multiply: [
                { $arrayElemAt: ["$location.coordinates", 1] }, // latitude
                1000 // Create 0.001 degree grid cells (~100m)
              ]
            }
          },
          lng: {
            $floor: {
              $multiply: [
                { $arrayElemAt: ["$location.coordinates", 0] }, // longitude  
                1000
              ]
            }
          }
        }
      }
    },

    // Group by grid cell
    {
      $group: {
        _id: "$gridCell",
        locationCount: { $sum: 1 },
        avgRating: { $avg: "$business.rating" },
        categories: { $push: "$category" },

        // Calculate center point of grid cell
        centerCoordinates: {
          $first: {
            type: "Point",
            coordinates: [
              { $divide: ["$gridCell.lng", 1000] },
              { $divide: ["$gridCell.lat", 1000] }
            ]
          }
        },

        // Business metrics
        totalReviews: { $sum: "$business.totalReviews" },
        uniqueCategories: { $addToSet: "$category" }
      }
    },

    // Calculate density metrics
    {
      $addFields: {
        densityScore: {
          $multiply: [
            "$locationCount",
            { $divide: ["$avgRating", 5] } // Weight by average rating
          ]
        },
        categoryDiversity: { $size: "$uniqueCategories" }
      }
    },

    // Sort by density
    {
      $sort: { densityScore: -1 }
    },

    {
      $limit: 20 // Top 20 densest areas
    },

    {
      $project: {
        gridId: "$_id",
        locationCount: 1,
        densityScore: { $round: ["$densityScore", 2] },
        avgRating: { $round: ["$avgRating", 2] },
        categoryDiversity: 1,
        totalReviews: 1,
        centerCoordinates: 1
      }
    }
  ]).toArray();

  console.log(`Location Density Analysis - Found ${locationDensityAnalysis.length} high-density areas`);

  // 2. User Movement Patterns
  const userMovementAnalysis = await userLocations.aggregate([
    {
      $match: {
        "locationMetadata.timestamp": {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) // Last 7 days
        }
      }
    },

    // Unwind location history
    { $unwind: "$locationHistory" },

    // Calculate movement vectors
    {
      $addFields: {
        movement: {
          fromLat: { $arrayElemAt: ["$locationHistory.location.coordinates", 1] },
          fromLng: { $arrayElemAt: ["$locationHistory.location.coordinates", 0] },
          toLat: { $arrayElemAt: ["$currentLocation.coordinates", 1] },
          toLng: { $arrayElemAt: ["$currentLocation.coordinates", 0] },
          timestamp: "$locationHistory.timestamp"
        }
      }
    },

    // Calculate distance and bearing
    {
      $addFields: {
        "movement.distance": {
          // Spherical law of cosines: R * acos(sin φ1 · sin φ2 + cos φ1 · cos φ2 · cos Δλ)
          $multiply: [
            6371000, // Earth radius in meters
            {
              $acos: {
                $add: [
                  {
                    $multiply: [
                      { $sin: { $multiply: [{ $degreesToRadians: "$movement.fromLat" }, 1] } },
                      { $sin: { $multiply: [{ $degreesToRadians: "$movement.toLat" }, 1] } }
                    ]
                  },
                  {
                    $multiply: [
                      { $cos: { $multiply: [{ $degreesToRadians: "$movement.fromLat" }, 1] } },
                      { $cos: { $multiply: [{ $degreesToRadians: "$movement.toLat" }, 1] } },
                      { $cos: {
                        $multiply: [
                          { $degreesToRadians: { $subtract: ["$movement.toLng", "$movement.fromLng"] } },
                          1
                        ]
                      } }
                    ]
                  }
                ]
              }
            }
          ]
        }
      }
    },

    // Group movement patterns
    {
      $group: {
        _id: {
          hour: { $hour: "$movement.timestamp" },
          dayOfWeek: { $dayOfWeek: "$movement.timestamp" }
        },

        totalMovements: { $sum: 1 },
        avgDistance: { $avg: "$movement.distance" },
        totalDistance: { $sum: "$movement.distance" },
        uniqueUsers: { $addToSet: "$userId" },

        // Movement characteristics
        shortMovements: {
          $sum: { $cond: [{ $lt: ["$movement.distance", 100] }, 1, 0] } // < 100m
        },
        mediumMovements: {
          $sum: { $cond: [
            { $and: [
              { $gte: ["$movement.distance", 100] },
              { $lt: ["$movement.distance", 1000] }
            ]}, 1, 0
          ] } // 100m - 1km
        },
        longMovements: {
          $sum: { $cond: [{ $gte: ["$movement.distance", 1000] }, 1, 0] } // > 1km
        }
      }
    },

    // Calculate additional metrics
    {
      $addFields: {
        uniqueUserCount: { $size: "$uniqueUsers" },
        avgMovementsPerUser: { $divide: ["$totalMovements", { $size: "$uniqueUsers" }] },
        movementDistribution: {
          short: { $divide: ["$shortMovements", "$totalMovements"] },
          medium: { $divide: ["$mediumMovements", "$totalMovements"] },
          long: { $divide: ["$longMovements", "$totalMovements"] }
        }
      }
    },

    {
      $sort: { totalMovements: -1 }
    },

    {
      $project: {
        hour: "$_id.hour",
        dayOfWeek: "$_id.dayOfWeek", 
        totalMovements: 1,
        uniqueUserCount: 1,
        avgDistance: { $round: ["$avgDistance", 1] },
        avgMovementsPerUser: { $round: ["$avgMovementsPerUser", 1] },
        movementDistribution: {
          short: { $round: ["$movementDistribution.short", 3] },
          medium: { $round: ["$movementDistribution.medium", 3] },
          long: { $round: ["$movementDistribution.long", 3] }
        }
      }
    }
  ]).toArray();

  console.log(`User Movement Analysis - Analyzed ${userMovementAnalysis.length} time periods`);

  // 3. Geographic Performance Analysis
  const geoPerformanceAnalysis = await locations.aggregate([
    {
      $match: {
        "business.isActive": true,
        "analytics.monthlyVisitors": { $exists: true }
      }
    },

    // Create geographic regions
    {
      $addFields: {
        region: {
          $switch: {
            branches: [
              {
                case: {
                  $and: [
                    { $gte: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] }, // North of 37.77°N
                    { $lte: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] } // West of -122.41°W
                  ]
                },
                then: "Northwest"
              },
              {
                case: {
                  $and: [
                    { $gte: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $gt: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Northeast"
              },
              {
                case: {
                  $and: [
                    { $lt: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $lte: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Southwest"
              },
              {
                case: {
                  $and: [
                    { $lt: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $gt: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Southeast"
              }
            ],
            default: "Other"
          }
        }
      }
    },

    // Group by region and category
    {
      $group: {
        _id: {
          region: "$region",
          category: "$category"
        },

        locationCount: { $sum: 1 },
        avgRating: { $avg: "$business.rating" },
        avgMonthlyVisitors: { $avg: "$analytics.monthlyVisitors" },
        totalMonthlyVisitors: { $sum: "$analytics.monthlyVisitors" },

        // Performance metrics
        highPerformers: {
          $sum: {
            $cond: [
              {
                $and: [
                  { $gte: ["$business.rating", 4.5] },
                  { $gte: ["$analytics.monthlyVisitors", 10000] }
                ]
              }, 1, 0
            ]
          }
        },

        topLocation: {
          // visitors first so $max's document comparison effectively orders by visitor count
          $max: {
            visitors: "$analytics.monthlyVisitors",
            name: "$name",
            rating: "$business.rating"
          }
        }
      }
    },

    // Calculate regional metrics
    {
      $group: {
        _id: "$_id.region",

        categories: {
          $push: {
            category: "$_id.category",
            locationCount: "$locationCount",
            avgRating: "$avgRating",
            avgMonthlyVisitors: "$avgMonthlyVisitors",
            totalMonthlyVisitors: "$totalMonthlyVisitors",
            highPerformers: "$highPerformers",
            topLocation: "$topLocation"
          }
        },

        regionalTotals: {
          totalLocations: { $sum: "$locationCount" },
          totalMonthlyVisitors: { $sum: "$totalMonthlyVisitors" },
          totalHighPerformers: { $sum: "$highPerformers" }
        }
      }
    },

    // Sort by total visitors
    {
      $sort: { "regionalTotals.totalMonthlyVisitors": -1 }
    },

    {
      $project: {
        region: "$_id",
        categories: 1,
        regionalTotals: 1,

        // Calculate regional performance metrics
        performanceMetrics: {
          avgVisitorsPerLocation: {
            $divide: ["$regionalTotals.totalMonthlyVisitors", "$regionalTotals.totalLocations"]
          },
          highPerformerRatio: {
            $divide: ["$regionalTotals.totalHighPerformers", "$regionalTotals.totalLocations"]
          }
        }
      }
    }
  ]).toArray();

  console.log(`Geographic Performance Analysis - Analyzed ${geoPerformanceAnalysis.length} regions`);

  return {
    densityAnalysis: locationDensityAnalysis,
    movementAnalysis: userMovementAnalysis,
    performanceAnalysis: geoPerformanceAnalysis,

    summary: {
      densityHotspots: locationDensityAnalysis.length,
      movementPatterns: userMovementAnalysis.length,
      regionalInsights: geoPerformanceAnalysis.length
    }
  };
};

// Benefits of MongoDB Geospatial Features:
// - Native GeoJSON support with automatic validation
// - Multiple coordinate reference systems (2D, 2dsphere)
// - Built-in spatial operators and aggregation functions
// - Automatic spatial indexing built on B-tree indexes over geohash/S2 cell coverings
// - Spherical geometry calculations for Earth-based applications
// - Integration with aggregation framework for complex analytics
// - Real-time geofencing and location tracking capabilities
// - Scalable to billions of location data points
// - Simple query syntax compared to PostGIS extensions
// - No additional setup required - works out of the box

module.exports = {
  createLocationServiceDataModel,
  performGeospatialOperations,
  setupLocationTracking,
  generateSpatialAnalytics
};

Understanding MongoDB Geospatial Architecture

Coordinate Systems and Indexing Strategies

MongoDB supports multiple geospatial indexing approaches optimized for different use cases:

// Advanced geospatial indexing and coordinate system management
class GeospatialIndexManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
  }

  async setupGeospatialIndexing() {
    // 1. 2dsphere Index - For spherical geometry (Earth-based coordinates)
    const locations = this.db.collection('locations');

    // Create 2dsphere index for GeoJSON objects
    await locations.createIndex({ location: "2dsphere" });

    // Compound index for filtered geospatial queries
    await locations.createIndex({ 
      category: 1, 
      "business.isActive": 1, 
      location: "2dsphere" 
    });

    // Wildcard text index for search (text and geospatial fields cannot share a compound index)
    await locations.createIndex({ "$**": "text" });

    console.log("2dsphere indexes created for global location queries");

    // 2. 2d Index - For flat geometry (game maps, floor plans)
    const gameLocations = this.db.collection('game_locations');

    // 2d index for flat coordinate system (e.g., game world coordinates)
    await gameLocations.createIndex({ position: "2d" });

    // Example game location document
    const gameLocationDoc = {
      _id: new ObjectId(),
      playerId: new ObjectId(),
      characterName: "DragonSlayer42",

      // Flat 2D coordinates for game world
      position: [1250.5, 875.2], // [x, y] coordinates in game units

      // Game-specific data
      level: 45,
      zone: "Enchanted Forest",
      server: "US-East-1",

      // Bounding box for area of influence
      areaOfInfluence: {
        bottomLeft: [1200, 825],
        topRight: [1300, 925]
      },

      lastUpdated: new Date()
    };

    await gameLocations.insertOne(gameLocationDoc);
    console.log("2d index created for flat coordinate system");

    // 3. Specialized indexing for different data patterns
    const trajectories = this.db.collection('vehicle_trajectories');

    // Index for trajectory lines and paths
    await trajectories.createIndex({ route: "2dsphere" });
    await trajectories.createIndex({ vehicleId: 1, timestamp: 1 });

    // Example trajectory document
    const trajectoryDoc = {
      _id: new ObjectId(),
      vehicleId: "TRUCK_001",
      driverId: new ObjectId(),

      // LineString geometry for route
      route: {
        type: "LineString",
        coordinates: [
          [-122.4194, 37.7749], // Start point
          [-122.4184, 37.7759], // Waypoint 1
          [-122.4174, 37.7769], // Waypoint 2
          [-122.4164, 37.7779]  // End point
        ]
      },

      // Route metadata
      routeMetadata: {
        totalDistance: 2.3, // km
        estimatedTime: 8, // minutes
        actualTime: 9.5, // minutes
        fuelUsed: 0.45, // liters
        trafficConditions: "moderate"
      },

      // Time-based tracking
      startTime: new Date("2024-09-18T14:30:00Z"),
      endTime: new Date("2024-09-18T14:39:30Z"),

      // Performance metrics
      metrics: {
        averageSpeed: 14.5, // km/h
        maxSpeed: 25.0,
        idleTime: 45, // seconds
        hardBrakingEvents: 1,
        hardAccelerationEvents: 0
      }
    };

    await trajectories.insertOne(trajectoryDoc);
    console.log("Trajectory tracking setup completed");

    return {
      sphericalIndexes: ["locations.location", "locations.compound"],
      flatIndexes: ["game_locations.position"],
      trajectoryIndexes: ["trajectories.route"]
    };
  }

  async performAdvancedSpatialQueries() {
    const locations = this.db.collection('locations');

    // 1. Multi-stage geospatial aggregation
    console.log("=== Advanced Spatial Aggregation ===");

    const complexSpatialAnalysis = await locations.aggregate([
      // Stage 1: Geospatial filtering
      {
        $geoNear: {
          near: {
            type: "Point",
            coordinates: [-122.4194, 37.7749]
          },
          distanceField: "calculatedDistance",
          maxDistance: 10000, // 10km
          spherical: true,
          query: { "business.isActive": true }
        }
      },

      // Stage 2: Spatial relationship analysis
      {
        $addFields: {
          // Distance categories
          distanceCategory: {
            $switch: {
              branches: [
                { case: { $lte: ["$calculatedDistance", 1000] }, then: "nearby" },
                { case: { $lte: ["$calculatedDistance", 5000] }, then: "moderate" },
                { case: { $lte: ["$calculatedDistance", 10000] }, then: "distant" }
              ],
              default: "very_distant"
            }
          },

          // Spatial density calculation
          spatialDensity: {
            $divide: ["$analytics.monthlyVisitors", { $add: ["$calculatedDistance", 1] }]
          }
        }
      },

      // Stage 3: Complex geospatial grouping
      {
        $group: {
          _id: {
            category: "$category",
            distanceCategory: "$distanceCategory"
          },

          locations: { $push: "$$ROOT" },
          avgDistance: { $avg: "$calculatedDistance" },
          avgRating: { $avg: "$business.rating" },
          avgDensity: { $avg: "$spatialDensity" },
          count: { $sum: 1 },

          // Geospatial aggregations
          // Centroid approximation: average longitude and latitude separately
          centroidLng: { $avg: { $arrayElemAt: ["$location.coordinates", 0] } },
          centroidLat: { $avg: { $arrayElemAt: ["$location.coordinates", 1] } },

          // Bounding box calculation
          minLat: { $min: { $arrayElemAt: ["$location.coordinates", 1] } },
          maxLat: { $max: { $arrayElemAt: ["$location.coordinates", 1] } },
          minLng: { $min: { $arrayElemAt: ["$location.coordinates", 0] } },
          maxLng: { $max: { $arrayElemAt: ["$location.coordinates", 0] } }
        }
      },

      // Stage 4: Spatial statistics
      {
        $addFields: {
          boundingBox: {
            type: "Polygon",
            coordinates: [[
              ["$minLng", "$minLat"],
              ["$maxLng", "$minLat"], 
              ["$maxLng", "$maxLat"],
              ["$minLng", "$maxLat"],
              ["$minLng", "$minLat"]
            ]]
          },

          // Geographic spread calculation
          geographicSpread: {
            $sqrt: {
              $add: [
                { $pow: [{ $subtract: ["$maxLat", "$minLat"] }, 2] },
                { $pow: [{ $subtract: ["$maxLng", "$minLng"] }, 2] }
              ]
            }
          }
        }
      },

      {
        $sort: { count: -1, avgDensity: -1 }
      }
    ]).toArray();

    console.log(`Complex Spatial Analysis - ${complexSpatialAnalysis.length} category/distance combinations`);

    // 2. Intersection and overlay queries
    console.log("\n=== Spatial Intersection Analysis ===");

    const intersectionAnalysis = await locations.aggregate([
      {
        $match: {
          "business.isActive": true,
          deliveryZones: { $exists: true }
        }
      },

      // Find intersections between delivery zones
      {
        $lookup: {
          from: "locations",
          let: { currentZones: "$deliveryZones", currentId: "$_id" },
          pipeline: [
            {
              $match: {
                $expr: {
                  $and: [
                    { $ne: ["$_id", "$$ROOT._id"] }, // Different location
                    { $ne: ["$$currentZones", null] },
                    {
                      $gt: [{
                        $size: {
                          $filter: {
                            input: "$deliveryZones.coordinates",
                            cond: {
                              // Simplified intersection check
                              $anyElementTrue: {
                                $map: {
                                  input: "$$currentZones.coordinates",
                                  in: { $ne: ["$$this", null] }
                                }
                              }
                            }
                          }
                        }
                      }, 0]
                    }
                  ]
                }
              }
            },
            {
              $project: {
                name: 1,
                category: 1,
                "business.rating": 1
              }
            }
          ],
          as: "overlappingLocations"
        }
      },

      // Calculate overlap metrics
      {
        $addFields: {
          overlapCount: { $size: "$overlappingLocations" },
          hasOverlap: { $gt: [{ $size: "$overlappingLocations" }, 0] },
          competitionLevel: {
            $switch: {
              branches: [
                { case: { $gte: [{ $size: "$overlappingLocations" }, 5] }, then: "high" },
                { case: { $gte: [{ $size: "$overlappingLocations" }, 2] }, then: "medium" },
                { case: { $gt: [{ $size: "$overlappingLocations" }, 0] }, then: "low" }
              ],
              default: "none"
            }
          }
        }
      },

      {
        $match: { hasOverlap: true }
      },

      {
        $group: {
          _id: "$category",
          avgOverlapCount: { $avg: "$overlapCount" },
          locationsWithOverlap: { $sum: 1 },
          highCompetitionAreas: {
            $sum: { $cond: [{ $eq: ["$competitionLevel", "high"] }, 1, 0] }
          }
        }
      },

      { $sort: { avgOverlapCount: -1 } }
    ]).toArray();

    console.log(`Intersection Analysis - ${intersectionAnalysis.length} categories with delivery zone overlaps`);

    // 3. Temporal-spatial analysis
    console.log("\n=== Temporal-Spatial Analysis ===");

    const temporalSpatialAnalysis = await this.db.collection('user_locations').aggregate([
      {
        $match: {
          "locationMetadata.timestamp": {
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) // Last 24 hours
          }
        }
      },

      // Unwind location history for temporal analysis
      { $unwind: "$locationHistory" },

      // Create time buckets
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: "$locationHistory.timestamp",
              unit: "hour"
            }
          },

          // Grid cell for spatial grouping
          spatialGrid: {
            lat: {
              $floor: {
                $multiply: [
                  { $arrayElemAt: ["$locationHistory.location.coordinates", 1] },
                  1000 // 0.001 degree precision
                ]
              }
            },
            lng: {
              $floor: {
                $multiply: [
                  { $arrayElemAt: ["$locationHistory.location.coordinates", 0] },
                  1000
                ]
              }
            }
          }
        }
      },

      // Group by time and space
      {
        $group: {
          _id: {
            timeBucket: "$timeBucket",
            spatialGrid: "$spatialGrid"
          },

          uniqueUsers: { $addToSet: "$userId" },
          totalEvents: { $sum: 1 },
          avgAccuracy: { $avg: "$locationHistory.accuracy" },

          // Location cluster center
          centerLat: { $avg: { $arrayElemAt: ["$locationHistory.location.coordinates", 1] } },
          centerLng: { $avg: { $arrayElemAt: ["$locationHistory.location.coordinates", 0] } }
        }
      },

      // Calculate density metrics
      {
        $addFields: {
          userDensity: { $size: "$uniqueUsers" },
          eventDensity: "$totalEvents",
          densityScore: { $multiply: [{ $size: "$uniqueUsers" }, { $ln: { $add: ["$totalEvents", 1] } }] }
        }
      },

      // Temporal pattern analysis
      {
        $group: {
          _id: { $hour: "$_id.timeBucket" },

          totalGridCells: { $sum: 1 },
          avgUserDensity: { $avg: "$userDensity" },
          maxUserDensity: { $max: "$userDensity" },
          totalUniqueUsers: { $sum: "$userDensity" },

          // Hotspot identification
          hotspots: {
            $push: {
              $cond: [
                { $gte: ["$densityScore", 10] },
                {
                  center: { type: "Point", coordinates: ["$centerLng", "$centerLat"] },
                  userDensity: "$userDensity",
                  densityScore: "$densityScore"
                },
                null
              ]
            }
          }
        }
      },

      // Clean up hotspots array
      {
        $addFields: {
          hotspots: {
            $filter: {
              input: "$hotspots",
              cond: { $ne: ["$$this", null] }
            }
          }
        }
      },

      { $sort: { "_id": 1 } },

      {
        $project: {
          hour: "$_id",
          totalGridCells: 1,
          avgUserDensity: { $round: ["$avgUserDensity", 2] },
          maxUserDensity: 1,
          totalUniqueUsers: 1,
          hotspotCount: { $size: "$hotspots" },
          topHotspots: { $slice: ["$hotspots", 5] }
        }
      }
    ]).toArray();

    console.log(`Temporal-Spatial Analysis - ${temporalSpatialAnalysis.length} hourly patterns`);

    return {
      complexSpatialResults: complexSpatialAnalysis.length,
      intersectionResults: intersectionAnalysis.length,  
      temporalSpatialResults: temporalSpatialAnalysis.length,

      insights: {
        spatialComplexity: complexSpatialAnalysis,
        deliveryOverlaps: intersectionAnalysis,
        hourlyPatterns: temporalSpatialAnalysis
      }
    };
  }

  async optimizeGeospatialPerformance() {
    console.log("=== Geospatial Performance Optimization ===");

    // 1. Index performance analysis
    const locations = this.db.collection('locations');

    // Test different query patterns
    const performanceTests = [
      {
        name: "Simple Proximity Query",
        query: {
          location: {
            $near: {
              $geometry: { type: "Point", coordinates: [-122.4194, 37.7749] },
              $maxDistance: 5000
            }
          }
        }
      },
      {
        name: "Filtered Proximity Query", 
        query: {
          location: {
            $near: {
              $geometry: { type: "Point", coordinates: [-122.4194, 37.7749] },
              $maxDistance: 5000
            }
          },
          category: "restaurant",
          "business.isActive": true
        }
      },
      {
        name: "Geo Within Query",
        query: {
          location: {
            $geoWithin: {
              $centerSphere: [[-122.4194, 37.7749], 5 / 3963.2] // 5 miles
            }
          }
        }
      }
    ];

    const performanceResults = [];

    for (const test of performanceTests) {
      const startTime = Date.now();

      const results = await locations.find(test.query)
        .limit(20)
        .explain("executionStats");

      const executionTime = Date.now() - startTime;

      performanceResults.push({
        testName: test.name,
        executionTimeMs: executionTime,
        documentsExamined: results.executionStats.totalDocsExamined,
        documentsReturned: results.executionStats.nReturned,
        indexUsed: results.executionStats.executionStages?.indexName ||
                   results.executionStats.executionStages?.inputStage?.indexName || "none",
        efficiency: results.executionStats.nReturned / Math.max(results.executionStats.totalDocsExamined, 1)
      });
    }

    console.log("Performance Test Results:");
    performanceResults.forEach(result => {
      console.log(`${result.testName}: ${result.executionTimeMs}ms, Efficiency: ${(result.efficiency * 100).toFixed(1)}%`);
    });

    // 2. Index recommendations
    const indexRecommendations = await this.analyzeIndexUsage(locations);

    // 3. Memory usage optimization
    const memoryOptimization = await this.optimizeMemoryUsage(locations);

    return {
      performanceResults,
      indexRecommendations,
      memoryOptimization,

      recommendations: [
        "Use 2dsphere indexes for Earth-based coordinates",
        "Include commonly filtered fields in compound indexes",
        "Limit result sets with appropriate $maxDistance values", 
        "Use $geoNear aggregation for complex distance-based analytics",
        "Monitor index usage and query patterns regularly"
      ]
    };
  }

  async analyzeIndexUsage(collection) {
    // Get index usage statistics
    const indexStats = await collection.aggregate([
      { $indexStats: {} }
    ]).toArray();

    const recommendations = [];

    indexStats.forEach(stat => {
      // Normalize to operations per hour since the server began tracking this index
      const elapsedHours = Math.max(
        (Date.now() - (stat.accesses.since?.getTime() ?? Date.now())) / (60 * 60 * 1000),
        1
      );
      const usageRatio = stat.accesses.ops / elapsedHours;

      if (usageRatio < 1) { // fewer than ~1 operation per hour
        recommendations.push({
          type: "remove",
          index: stat.name,
          reason: "Low usage index - consider removing",
          usage: usageRatio
        });
      } else if (usageRatio > 1000) { // sustained heavy usage
        recommendations.push({
          type: "optimize",
          index: stat.name, 
          reason: "High usage index - ensure optimal configuration",
          usage: usageRatio
        });
      }
    });

    return {
      totalIndexes: indexStats.length,
      recommendations: recommendations,
      indexStats: indexStats
    };
  }

  async optimizeMemoryUsage(collection) {
    // Analyze document sizes and memory patterns
    const sizeAnalysis = await collection.aggregate([
      {
        $project: {
          documentSize: { $bsonSize: "$$ROOT" },
          hasLocationHistory: { $ne: ["$locationHistory", null] },
          locationHistorySize: { $size: { $ifNull: ["$locationHistory", []] } },
          hasDeliveryZones: { $ne: ["$deliveryZones", null] }
        }
      },
      {
        $group: {
          _id: null,

          avgDocumentSize: { $avg: "$documentSize" },
          maxDocumentSize: { $max: "$documentSize" },
          minDocumentSize: { $min: "$documentSize" },

          largeDocuments: { $sum: { $cond: [{ $gt: ["$documentSize", 16384] }, 1, 0] } }, // > 16KB
          documentsWithHistory: { $sum: { $cond: ["$hasLocationHistory", 1, 0] } },
          avgHistorySize: { $avg: "$locationHistorySize" },

          totalDocuments: { $sum: 1 }
        }
      }
    ]).toArray();

    const analysis = sizeAnalysis[0] || {};

    const optimizationTips = [];

    if (analysis.avgDocumentSize > 8192) {
      optimizationTips.push("Consider splitting large documents or using references");
    }

    if (analysis.avgHistorySize > 100) {
      optimizationTips.push("Limit location history array size or archive old data");
    }

    if (analysis.largeDocuments > analysis.totalDocuments * 0.1) {
      optimizationTips.push("High number of large documents - review document structure");
    }

    return {
      sizeAnalysis: analysis,
      optimizationTips: optimizationTips,

      recommendations: {
        documentSize: "Keep documents under 16MB, optimal under 1MB",
        arrays: "Limit embedded arrays to prevent unbounded growth", 
        indexing: "Use partial indexes for sparse geospatial data",
        sharding: "Consider sharding key that includes geospatial distribution"
      }
    };
  }
}

SQL-Style Geospatial Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB's powerful geospatial capabilities:

-- QueryLeaf geospatial operations with SQL-familiar syntax

-- Create geospatial-enabled table/collection
CREATE TABLE locations (
  id OBJECTID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  category VARCHAR(100),

  -- Geospatial columns with native GeoJSON support
  location POINT NOT NULL, -- GeoJSON Point
  service_area POLYGON,    -- GeoJSON Polygon
  delivery_zones MULTIPOLYGON, -- GeoJSON MultiPolygon

  -- Business data
  rating DECIMAL(3,2),
  total_reviews INTEGER DEFAULT 0,
  is_active BOOLEAN DEFAULT true,

  -- Address information
  address DOCUMENT {
    street VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),
    postal_code VARCHAR(20)
  },

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create geospatial indexes
CREATE SPATIAL INDEX idx_locations_location ON locations (location);
CREATE SPATIAL INDEX idx_locations_service_area ON locations (service_area);
CREATE COMPOUND INDEX idx_locations_category_geo ON locations (category, location);

-- Insert location data with geospatial coordinates
INSERT INTO locations (name, category, location, service_area, address, rating, total_reviews)
VALUES (
  'Blue Bottle Coffee',
  'cafe', 
  ST_POINT(-122.3937, 37.7955), -- Longitude, Latitude
  ST_POLYGON(ARRAY[
    ARRAY[-122.4050, 37.7850], -- Southwest
    ARRAY[-122.3850, 37.7850], -- Southeast  
    ARRAY[-122.3850, 37.8050], -- Northeast
    ARRAY[-122.4050, 37.8050], -- Northwest
    ARRAY[-122.4050, 37.7850]  -- Close polygon
  ]),
  {
    street: '1 Ferry Building',
    city: 'San Francisco',
    state: 'CA',
    country: 'USA',
    postal_code: '94111'
  },
  4.6,
  1247
);

-- Proximity search - find nearby locations
SELECT 
  id,
  name,
  category,
  rating,

  -- Calculate distance in meters
  ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_meters,

  -- Extract coordinates for display
  ST_X(location) as longitude,
  ST_Y(location) as latitude,

  -- Address information
  address.street,
  address.city,
  address.state

FROM locations
WHERE 
  is_active = true
  AND ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) <= 5000 -- Within 5km
  AND category IN ('cafe', 'restaurant', 'retail')
ORDER BY ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749))
LIMIT 20;

-- Advanced proximity search with relevance scoring
WITH nearby_locations AS (
  SELECT 
    *,
    ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_meters
  FROM locations
  WHERE 
    is_active = true
    AND ST_DWITHIN(location, ST_POINT(-122.4194, 37.7749), 10000) -- 10km radius
),
scored_locations AS (
  SELECT *,
    -- Relevance scoring: distance (40%) + rating (30%) + reviews (30%)
    (
      (1000 - LEAST(distance_meters, 1000)) / 1000 * 0.4 +
      (rating / 5.0) * 0.3 +
      (LEAST(total_reviews, 1000) / 1000.0) * 0.3
    ) as relevance_score,

    -- Distance categories
    CASE 
      WHEN distance_meters <= 1000 THEN 'nearby'
      WHEN distance_meters <= 5000 THEN 'moderate'
      ELSE 'distant'
    END as distance_category

  FROM nearby_locations
)
SELECT 
  name,
  category, 
  rating,
  total_reviews,
  ROUND(distance_meters) as distance_m,
  distance_category,
  ROUND(relevance_score, 3) as relevance,

  -- Format coordinates for maps
  CONCAT(
    ROUND(ST_Y(location), 6), ',', 
    ROUND(ST_X(location), 6)
  ) as lat_lng

FROM scored_locations
ORDER BY relevance_score DESC, distance_meters ASC
LIMIT 25;

-- Geospatial area queries
SELECT 
  l.name,
  l.category,
  l.rating,

  -- Check if location is within specific area
  ST_CONTAINS(
    ST_POLYGON(ARRAY[
      ARRAY[-122.4270, 37.7609], -- Downtown SF polygon
      ARRAY[-122.3968, 37.7609],
      ARRAY[-122.3968, 37.7908], 
      ARRAY[-122.4270, 37.7908],
      ARRAY[-122.4270, 37.7609]
    ]),
    l.location
  ) as is_in_downtown,

  -- Check service area coverage
  ST_CONTAINS(l.service_area, ST_POINT(-122.4194, 37.7749)) as serves_user_location

FROM locations l
WHERE 
  l.is_active = true
  AND ST_INTERSECTS(
    l.location,
    ST_POLYGON(ARRAY[
      ARRAY[-122.4270, 37.7609],
      ARRAY[-122.3968, 37.7609], 
      ARRAY[-122.3968, 37.7908],
      ARRAY[-122.4270, 37.7908],
      ARRAY[-122.4270, 37.7609]
    ])
  );

-- Complex geospatial analytics with aggregation
WITH location_analytics AS (
  SELECT 
    category,

    -- Spatial clustering analysis
    ST_CLUSTERKMEANS(location, 5) OVER () as cluster_id,

    -- Distance from city center
    ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_from_center,

    -- Geospatial grid for density analysis
    ST_SNAPGRID(location, 0.001, 0.001) as grid_cell,

    name,
    rating,
    total_reviews,
    location

  FROM locations
  WHERE is_active = true
),
cluster_analysis AS (
  SELECT 
    cluster_id,
    category,
    COUNT(*) as location_count,
    AVG(rating) as avg_rating,
    AVG(distance_from_center) as avg_distance_from_center,

    -- Calculate cluster centroid
    ST_CENTROID(ST_COLLECT(location)) as cluster_center,

    -- Calculate cluster bounds
    ST_ENVELOPE(ST_COLLECT(location)) as cluster_bounds,

    -- Business metrics
    SUM(total_reviews) as total_reviews,
    AVG(total_reviews) as avg_reviews_per_location

  FROM location_analytics
  GROUP BY cluster_id, category
),
grid_density AS (
  SELECT 
    grid_cell,
    COUNT(DISTINCT category) as category_diversity,
    COUNT(*) as location_density,
    AVG(rating) as avg_rating,

    -- Calculate grid cell center
    ST_CENTROID(grid_cell) as grid_center

  FROM location_analytics
  GROUP BY grid_cell
  HAVING COUNT(*) >= 3 -- Only dense grid cells
)
SELECT 
  ca.cluster_id,
  ca.category,
  ca.location_count,
  ROUND(ca.avg_rating, 2) as avg_rating,
  ROUND(ca.avg_distance_from_center) as avg_distance_m,

  -- Cluster geographic data
  ST_X(ca.cluster_center) as cluster_lng,
  ST_Y(ca.cluster_center) as cluster_lat,

  -- Calculate cluster area in square meters
  ST_AREA(ca.cluster_bounds, true) as cluster_area_sqm,

  -- Density metrics
  ROUND(ca.location_count / ST_AREA(ca.cluster_bounds, true) * 1000000, 2) as density_per_sqkm,

  -- Business performance
  ca.total_reviews,
  ROUND(ca.avg_reviews_per_location) as avg_reviews,

  -- Nearby high-density areas
  (
    SELECT COUNT(*)
    FROM grid_density gd
    WHERE ST_DISTANCE(ca.cluster_center, gd.grid_center) <= 1000
  ) as nearby_dense_areas

FROM cluster_analysis ca
WHERE ca.location_count >= 2
ORDER BY ca.location_count DESC, ca.avg_rating DESC;

-- Geofencing and real-time location queries
CREATE TABLE geofences (
  id OBJECTID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  geofence_area POLYGON NOT NULL,
  geofence_type VARCHAR(50) DEFAULT 'notification',
  is_active BOOLEAN DEFAULT true,

  -- Trigger configuration
  config DOCUMENT {
    on_enter BOOLEAN DEFAULT true,
    on_exit BOOLEAN DEFAULT true,
    on_dwell BOOLEAN DEFAULT false,
    dwell_time_minutes INTEGER DEFAULT 5,
    max_triggers_per_day INTEGER DEFAULT 10
  },

  -- Analytics tracking
  analytics DOCUMENT {
    total_enters INTEGER DEFAULT 0,
    total_exits INTEGER DEFAULT 0,
    unique_users INTEGER DEFAULT 0,
    avg_dwell_minutes DECIMAL(8,2) DEFAULT 0
  },

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE SPATIAL INDEX idx_geofences_area ON geofences (geofence_area);

-- Check geofence triggers for user location
SELECT 
  gf.id,
  gf.name,
  gf.geofence_type,

  -- Check if user location triggers geofence
  ST_CONTAINS(gf.geofence_area, ST_POINT(-122.4150, 37.7750)) as is_triggered,

  -- Calculate distance to geofence edge
  ST_DISTANCE(
    ST_POINT(-122.4150, 37.7750),
    ST_BOUNDARY(gf.geofence_area)
  ) as distance_to_edge_m,

  -- Geofence area and perimeter
  ST_AREA(gf.geofence_area, true) as area_sqm,
  ST_PERIMETER(gf.geofence_area, true) as perimeter_m,

  -- Configuration and analytics
  gf.config,
  gf.analytics

FROM geofences gf
WHERE 
  gf.is_active = true
  AND (
    ST_CONTAINS(gf.geofence_area, ST_POINT(-122.4150, 37.7750)) -- Inside geofence
    OR ST_DISTANCE(
      ST_POINT(-122.4150, 37.7750), 
      gf.geofence_area
    ) <= 100 -- Within 100m of geofence
  );

-- Time-based geospatial analysis
CREATE TABLE user_location_history (
  id OBJECTID PRIMARY KEY,
  user_id OBJECTID NOT NULL,
  location POINT NOT NULL,
  recorded_at TIMESTAMP NOT NULL,
  accuracy_meters DECIMAL(8,2),
  activity_type VARCHAR(50),

  -- Movement data
  speed_mps DECIMAL(8,2), -- meters per second
  heading_degrees INTEGER, -- 0-360 degrees from north

  -- Context information
  context DOCUMENT {
    battery_level INTEGER,
    connection_type VARCHAR(50),
    app_state VARCHAR(50)
  }
);

CREATE COMPOUND INDEX idx_user_location_time_geo ON user_location_history (
  user_id, recorded_at, location
);

-- Movement pattern analysis
WITH user_movements AS (
  SELECT 
    user_id,
    location,
    recorded_at,

    -- Calculate distance from previous location
    ST_DISTANCE(
      location,
      LAG(location) OVER (
        PARTITION BY user_id 
        ORDER BY recorded_at
      )
    ) as movement_distance,

    -- Time since previous location
    EXTRACT(EPOCH FROM (
      recorded_at - LAG(recorded_at) OVER (
        PARTITION BY user_id 
        ORDER BY recorded_at
      )
    )) as time_elapsed_seconds,

    -- Previous location for trajectory analysis
    LAG(location) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at
    ) as previous_location

  FROM user_location_history
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),
movement_metrics AS (
  SELECT 
    user_id,
    COUNT(*) as location_points,
    SUM(movement_distance) as total_distance_m,
    AVG(movement_distance / NULLIF(time_elapsed_seconds, 0)) as avg_speed_mps,
    MAX(movement_distance / NULLIF(time_elapsed_seconds, 0)) as max_speed_mps,

    -- Create trajectory line
    ST_MAKELINE(ARRAY_AGG(location ORDER BY recorded_at)) as trajectory,

    -- Calculate bounding box of movement
    ST_ENVELOPE(ST_COLLECT(location)) as movement_bounds,

    -- Time-based metrics
    MIN(recorded_at) as journey_start,
    MAX(recorded_at) as journey_end,
    EXTRACT(EPOCH FROM (MAX(recorded_at) - MIN(recorded_at))) as journey_duration_seconds,

    -- Movement patterns
    COUNT(DISTINCT ST_SNAPGRID(location, 0.001, 0.001)) as unique_areas_visited

  FROM user_movements
  WHERE movement_distance IS NOT NULL
    AND time_elapsed_seconds > 0
    AND movement_distance < 10000 -- Filter out GPS errors
  GROUP BY user_id
)
SELECT 
  user_id,
  location_points,
  ROUND(total_distance_m) as total_distance_m,
  ROUND(total_distance_m / 1000.0, 2) as total_distance_km,
  ROUND(avg_speed_mps * 3.6, 1) as avg_speed_kmh, -- Convert to km/h
  ROUND(max_speed_mps * 3.6, 1) as max_speed_kmh,

  -- Journey characteristics
  journey_start,
  journey_end,
  ROUND(journey_duration_seconds / 3600.0, 1) as journey_hours,
  unique_areas_visited,

  -- Trajectory analysis
  ST_LENGTH(trajectory, true) as trajectory_length_m,
  ST_AREA(movement_bounds, true) as coverage_area_sqm,

  -- Movement efficiency (straight-line vs actual distance)
  ROUND(
    ST_DISTANCE(
      ST_STARTPOINT(trajectory),
      ST_ENDPOINT(trajectory)
    ) / NULLIF(ST_LENGTH(trajectory, true), 0) * 100, 1
  ) as movement_efficiency_pct,

  -- Geographic extent
  ST_XMIN(movement_bounds) as min_longitude,
  ST_XMAX(movement_bounds) as max_longitude, 
  ST_YMIN(movement_bounds) as min_latitude,
  ST_YMAX(movement_bounds) as max_latitude

FROM movement_metrics
WHERE total_distance_m > 100 -- Minimum movement threshold
ORDER BY total_distance_m DESC
LIMIT 50;

-- Location-based recommendations engine
WITH user_preferences AS (
  SELECT 
    u.user_id,
    u.location as current_location,

    -- User preference analysis based on visit history
    up.preferred_categories,
    up.avg_rating_threshold,
    up.max_distance_preference,
    up.price_range_preference

  FROM user_profiles u
  JOIN user_preferences up ON u.user_id = up.user_id
  WHERE u.is_active = true
),
location_scoring AS (
  SELECT 
    l.*,
    up.user_id,

    -- Distance scoring
    ST_DISTANCE(l.location, up.current_location) as distance_m,
    EXP(-ST_DISTANCE(l.location, up.current_location) / 2000.0) as distance_score,

    -- Category preference scoring
    CASE 
      WHEN l.category = ANY(up.preferred_categories) THEN 1.0
      WHEN ARRAY_LENGTH(up.preferred_categories, 1) = 0 THEN 0.5
      ELSE 0.2
    END as category_score,

    -- Rating scoring
    l.rating / 5.0 as rating_score,

    -- Popularity scoring based on reviews
    LN(l.total_reviews + 1) / LN(1000) as popularity_score,

    -- Time-based scoring (open/closed)
    CASE 
      WHEN EXTRACT(DOW FROM CURRENT_TIMESTAMP) = 0 THEN -- Sunday
        CASE WHEN l.hours.sunday.is_open THEN 1.0 ELSE 0.3 END
      WHEN EXTRACT(DOW FROM CURRENT_TIMESTAMP) = 1 THEN -- Monday
        CASE WHEN l.hours.monday.is_open THEN 1.0 ELSE 0.3 END
      -- ... other days
      ELSE 0.8
    END as availability_score

  FROM locations l
  CROSS JOIN user_preferences up
  WHERE 
    l.is_active = true
    AND ST_DISTANCE(l.location, up.current_location) <= up.max_distance_preference
    AND l.rating >= up.avg_rating_threshold
),
final_recommendations AS (
  SELECT *,
    -- Combined relevance score
    (
      distance_score * 0.25 +
      category_score * 0.30 +
      rating_score * 0.20 +
      popularity_score * 0.15 +
      availability_score * 0.10
    ) as relevance_score

  FROM location_scoring
)
SELECT 
  user_id,
  name as location_name,
  category,
  rating,
  total_reviews,
  ROUND(distance_m) as distance_meters,
  ROUND(relevance_score, 3) as relevance,

  -- Location details for display
  ST_X(location) as longitude,
  ST_Y(location) as latitude,
  address.street || ', ' || address.city as display_address,

  -- Recommendation reasoning
  CASE 
    WHEN category_score = 1.0 THEN 'Matches your preferences'
    WHEN distance_score > 0.8 THEN 'Very close to you'
    WHEN rating_score >= 0.9 THEN 'Highly rated'
    WHEN popularity_score > 0.5 THEN 'Popular destination'
    ELSE 'Good option nearby'
  END as recommendation_reason

FROM final_recommendations
WHERE relevance_score > 0.3
ORDER BY user_id, relevance_score DESC
LIMIT 10 PER user_id;

-- QueryLeaf geospatial features provide:
-- 1. Native GeoJSON support with SQL-familiar geometry functions
-- 2. Spatial indexing with automatic optimization for Earth-based coordinates
-- 3. Distance calculations and proximity queries with intuitive syntax
-- 4. Complex geospatial aggregations and analytics using familiar SQL patterns
-- 5. Geofencing capabilities with real-time trigger detection
-- 6. Movement pattern analysis and trajectory tracking
-- 7. Location-based recommendation engines with multi-factor scoring
-- 8. Integration with MongoDB's native geospatial operators and functions
-- 9. Performance optimization through intelligent query planning
-- 10. Seamless scaling from simple proximity queries to complex spatial analytics

Best Practices for Geospatial Implementation

Coordinate System Selection

Choose the appropriate coordinate system and indexing strategy; a short driver sketch follows the list:

  1. 2dsphere Index: Use for Earth-based coordinates with spherical geometry calculations
  2. 2d Index: Use for flat coordinate systems like game maps or floor plans
  3. Coordinate Format: MongoDB uses [longitude, latitude] format (opposite of many mapping APIs)
  4. Precision Considerations: Balance coordinate precision with storage and performance requirements
  5. Projection Selection: Choose appropriate coordinate reference system for your geographic region
  6. Distance Units: Ensure consistent distance units throughout your application
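
The coordinate-order and index-type choices above map directly onto driver calls. The following is a minimal sketch using the MongoDB Node.js driver; the database, collection, and field names are illustrative assumptions rather than part of the examples above.

// Minimal sketch: index type selection and [longitude, latitude] coordinate order
const { MongoClient } = require('mongodb');

async function setupSpatialIndexes() {
  const client = new MongoClient('mongodb://localhost:27017');
  const db = client.db('geo_demo');

  // 2dsphere index: Earth-based coordinates with spherical distance calculations
  await db.collection('stores').createIndex({ location: '2dsphere' });

  // 2d index: flat coordinate systems such as game maps or floor plans
  await db.collection('floor_plan_assets').createIndex({ position: '2d' });

  // GeoJSON always stores [longitude, latitude] - the reverse of many mapping APIs
  await db.collection('stores').insertOne({
    name: 'Example Store',
    location: { type: 'Point', coordinates: [-122.4194, 37.7749] } // [lng, lat]
  });

  // With a 2dsphere index, $maxDistance is expressed in meters
  const nearby = await db.collection('stores').find({
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates: [-122.4194, 37.7749] },
        $maxDistance: 2000 // 2 km
      }
    }
  }).toArray();

  await client.close();
  return nearby;
}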

Performance Optimization

Optimize geospatial queries for high performance and scalability; a query sketch follows the list:

  1. Index Strategy: Create compound indexes that support your most common query patterns
  2. Query Limits: Use $maxDistance and $minDistance to limit search scope
  3. Result Pagination: Implement proper pagination for large result sets
  4. Memory Management: Monitor working set size and optimize document structure
  5. Aggregation Optimization: Use $geoNear for distance-based aggregations when possible
  6. Sharding Strategy: Consider geospatial distribution when designing sharding keys
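
These recommendations combine naturally in a single query path: a compound 2dsphere index covering the common filters, a $geoNear stage that bounds the search with maxDistance, and an explicit limit for pagination. The sketch below is illustrative; the collection, field names, and thresholds are assumptions modeled on the earlier examples.

// Minimal sketch: bounded, index-backed proximity query via $geoNear
async function optimizedNearbySearch(db, center, maxDistanceMeters = 5000) {
  const locations = db.collection('locations');

  // Compound index: geospatial key plus the fields most queries filter on
  await locations.createIndex({ location: '2dsphere', category: 1, 'business.isActive': 1 });

  return locations.aggregate([
    {
      // $geoNear must be the first stage; it applies the filter and distance bound
      // while scanning the index and emits a computed distance field
      $geoNear: {
        near: { type: 'Point', coordinates: center }, // [lng, lat]
        distanceField: 'distanceMeters',
        maxDistance: maxDistanceMeters, // limit search scope
        query: { category: 'restaurant', 'business.isActive': true },
        spherical: true
      }
    },
    { $limit: 20 } // results arrive nearest-first; paginate explicitly
  ]).toArray();
}

Because $geoNear returns documents ordered by distance, pagination can be driven with $limit/$skip alone rather than an additional sort stage.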

Conclusion

MongoDB geospatial capabilities provide comprehensive location-aware functionality that eliminates the complexity of traditional spatial database extensions while delivering superior performance and scalability. The native support for GeoJSON, multiple coordinate systems, and sophisticated spatial operations makes building location-based applications both powerful and intuitive.

Key geospatial benefits include:

  • Native Spatial Support: Built-in GeoJSON support without additional extensions or setup
  • High Performance: Optimized spatial indexing and query execution for billions of documents
  • Rich Query Capabilities: Comprehensive spatial operators for proximity, intersection, and containment
  • Flexible Data Models: Store complex location data with business context in single documents
  • Real-time Processing: Efficient geofencing and location tracking for live applications
  • Scalable Architecture: Horizontal scaling across distributed clusters with location-aware sharding

Whether you're building ride-sharing platforms, delivery applications, location-based social networks, or IoT sensor networks, MongoDB's geospatial features with QueryLeaf's familiar SQL interface provide the foundation for sophisticated location-aware applications. This combination enables you to implement complex spatial functionality while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB geospatial operations while providing SQL-familiar spatial query syntax, coordinate system handling, and geographic analysis functions. Advanced geospatial indexing, proximity calculations, and spatial analytics are seamlessly handled through familiar SQL patterns, making location-based application development both powerful and accessible.

The integration of native geospatial capabilities with SQL-style spatial operations makes MongoDB an ideal platform for applications requiring both sophisticated location functionality and familiar database interaction patterns, ensuring your geospatial solutions remain both effective and maintainable as they scale and evolve.

MongoDB Time Series Collections and IoT Data Management: SQL-Style Time Series Analytics with High-Performance Data Ingestion

Modern IoT applications generate massive volumes of time-stamped data from sensors, devices, and monitoring systems, and that data demands specialized storage, querying, and analysis capabilities. Traditional relational databases struggle with time series workloads due to their rigid schema requirements, poor compression for temporal data, and inefficient querying patterns for time-based aggregations and analytics.

MongoDB Time Series Collections provide purpose-built capabilities for storing, querying, and analyzing time-stamped data with automatic partitioning, compression, and optimized indexing. Unlike traditional collection storage, time series collections automatically organize data by time ranges, apply sophisticated compression algorithms, and provide specialized query patterns optimized for temporal analytics and IoT workloads.

The Traditional Time Series Challenge

Relational database approaches to time series data have significant performance and scalability limitations:

-- Traditional relational time series design - inefficient and complex

-- PostgreSQL time series approach with partitioning
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL,
    sensor_id VARCHAR(100) NOT NULL,
    device_id VARCHAR(100) NOT NULL,
    location VARCHAR(200),
    timestamp TIMESTAMP NOT NULL,
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2),
    battery_level DECIMAL(3,2),
    signal_strength INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (timestamp);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE sensor_readings_2024_01 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE sensor_readings_2024_02 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE sensor_readings_2024_03 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... manual partition creation for each month

-- Indexes for time series queries
CREATE INDEX idx_sensor_readings_timestamp ON sensor_readings (timestamp);
CREATE INDEX idx_sensor_readings_sensor_id_timestamp ON sensor_readings (sensor_id, timestamp);
CREATE INDEX idx_sensor_readings_device_timestamp ON sensor_readings (device_id, timestamp);

-- Complex time series aggregation query
SELECT 
    sensor_id,
    device_id,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Statistical aggregations
    COUNT(*) as reading_count,
    AVG(temperature) as avg_temperature,
    MIN(temperature) as min_temperature,
    MAX(temperature) as max_temperature,
    STDDEV(temperature) as temp_stddev,

    AVG(humidity) as avg_humidity,
    AVG(pressure) as avg_pressure,
    AVG(battery_level) as avg_battery,

    -- Time-based calculations
    FIRST_VALUE(temperature) OVER (
        PARTITION BY sensor_id, DATE_TRUNC('hour', timestamp) 
        ORDER BY timestamp
    ) as first_temp,
    LAST_VALUE(temperature) OVER (
        PARTITION BY sensor_id, DATE_TRUNC('hour', timestamp) 
        ORDER BY timestamp 
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_temp,

    -- Lag calculations for trends
    LAG(AVG(temperature)) OVER (
        PARTITION BY sensor_id 
        ORDER BY DATE_TRUNC('hour', timestamp)
    ) as prev_hour_avg_temp

FROM sensor_readings
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND sensor_id IN ('TEMP_001', 'TEMP_002', 'TEMP_003')
GROUP BY sensor_id, device_id, DATE_TRUNC('hour', timestamp)
ORDER BY sensor_id, hour_bucket;

-- Problems with traditional time series approach:
-- 1. Manual partition management and maintenance overhead
-- 2. Poor compression ratios for time-stamped data
-- 3. Complex query patterns for time-based aggregations
-- 4. Limited scalability for high-frequency data ingestion
-- 5. Inefficient storage for sparse or irregular time series
-- 6. Difficult downsampling and data retention management
-- 7. Poor performance for cross-time-range analytics
-- 8. Complex indexing strategies for temporal queries

-- InfluxDB-style approach (specialized but limited)
-- INSERT INTO sensor_data,sensor_id=TEMP_001,device_id=DEV_001,location=warehouse_A 
--   temperature=23.5,humidity=65.2,pressure=1013.25,battery_level=85.3 1640995200000000000

-- InfluxDB limitations:
-- - Specialized query language (InfluxQL/Flux) not SQL compatible
-- - Limited JOIN capabilities across measurements
-- - Complex data modeling for hierarchical sensor networks
-- - Difficult integration with existing application stacks
-- - Limited support for complex business logic
-- - Vendor lock-in with proprietary tools and ecosystem
-- - Complex migration paths from existing SQL-based systems

MongoDB Time Series Collections provide comprehensive time series capabilities:

// MongoDB Time Series Collections - purpose-built for temporal data
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('iot_platform');

// Create time series collection with automatic optimization
const createTimeSeriesCollection = async () => {
  try {
    // Create time series collection with comprehensive configuration
    const collection = await db.createCollection('sensor_readings', {
      timeseries: {
        // Time field - required, used for automatic partitioning
        timeField: 'timestamp',

        // Meta field - optional, groups related time series together
        metaField: 'metadata',

        // Granularity for automatic bucketing and compression
        granularity: 'minutes' // 'seconds', 'minutes', 'hours'
      },

      // Automatic expiration for data retention (top-level createCollection option)
      expireAfterSeconds: 60 * 60 * 24 * 365 // 1 year retention
    });

    console.log('Time series collection created successfully');
    return collection;

  } catch (error) {
    console.error('Error creating time series collection:', error);
    throw error;
  }
};

// High-performance time series data ingestion
const ingestSensorData = async () => {
  const sensorReadings = db.collection('sensor_readings');

  // Batch insert for optimal performance
  const batchData = [];
  const batchSize = 1000;
  const currentTime = new Date();

  // Generate realistic IoT sensor data
  for (let i = 0; i < batchSize; i++) {
    const timestamp = new Date(currentTime.getTime() - (i * 60000)); // Every minute

    // Multiple sensors per batch
    ['TEMP_001', 'TEMP_002', 'TEMP_003', 'HUM_001', 'PRESS_001'].forEach(sensorId => {
      batchData.push({
        // Time field (required for time series)
        timestamp: timestamp,

        // Metadata field - groups related measurements
        metadata: {
          sensorId: sensorId,
          deviceId: sensorId.startsWith('TEMP') ? 'CLIMATE_DEV_001' : 
                   sensorId.startsWith('HUM') ? 'CLIMATE_DEV_001' : 'PRESSURE_DEV_001',
          location: {
            building: 'Warehouse_A',
            floor: 1,
            room: 'Storage_Room_1',
            coordinates: {
              x: Math.floor(Math.random() * 100),
              y: Math.floor(Math.random() * 100)
            }
          },
          sensorType: sensorId.startsWith('TEMP') ? 'temperature' :
                     sensorId.startsWith('HUM') ? 'humidity' : 'pressure',
          unit: sensorId.startsWith('TEMP') ? 'celsius' :
                sensorId.startsWith('HUM') ? 'percent' : 'hPa',
          calibrationDate: new Date('2024-01-01'),
          firmwareVersion: '2.1.3'
        },

        // Measurement data - varies by sensor type
        measurements: generateMeasurements(sensorId, timestamp),

        // System metadata
        ingestionTime: new Date(),
        dataQuality: {
          isValid: Math.random() > 0.02, // 2% invalid readings
          confidence: 0.95 + (Math.random() * 0.05), // 95-100% confidence
          calibrationStatus: 'valid',
          lastCalibration: new Date('2024-01-01')
        },

        // Device health metrics
        deviceHealth: {
          batteryLevel: 85 + Math.random() * 15, // 85-100%
          signalStrength: -30 - Math.random() * 40, // -30 to -70 dBm
          temperature: 20 + Math.random() * 10, // Device temp 20-30°C
          uptime: Math.floor(Math.random() * 86400 * 30) // Up to 30 days
        }
      });
    });
  }

  // Batch insert for optimal ingestion performance
  try {
    const result = await sensorReadings.insertMany(batchData, { 
      ordered: false, // Allow parallel insertions
      writeConcern: { w: 1 } // Optimize for ingestion speed
    });

    console.log(`Inserted ${result.insertedCount} sensor readings`);
    return result;

  } catch (error) {
    console.error('Error inserting sensor data:', error);
    throw error;
  }
};

function generateMeasurements(sensorId, timestamp) {
  const baseValues = {
    'TEMP_001': { value: 22, variance: 5 },
    'TEMP_002': { value: 24, variance: 3 },
    'TEMP_003': { value: 20, variance: 4 },
    'HUM_001': { value: 65, variance: 15 },
    'PRESS_001': { value: 1013.25, variance: 5 }
  };

  const base = baseValues[sensorId];
  if (!base) return {};

  // Add some realistic patterns and noise
  const hourOfDay = timestamp.getHours();
  const seasonalEffect = Math.sin((timestamp.getMonth() * Math.PI) / 6) * 2;
  const dailyEffect = Math.sin((hourOfDay * Math.PI) / 12) * 1.5;
  const randomNoise = (Math.random() - 0.5) * base.variance;

  const value = base.value + seasonalEffect + dailyEffect + randomNoise;

  return {
    value: Math.round(value * 100) / 100,
    rawValue: value,
    processed: true,

    // Statistical context
    range: {
      min: base.value - base.variance,
      max: base.value + base.variance
    },

    // Quality indicators
    outlierScore: Math.abs(randomNoise) / base.variance,
    trend: dailyEffect > 0 ? 'increasing' : 'decreasing'
  };
}

// Advanced time series queries and analytics
const performTimeSeriesAnalytics = async () => {
  const sensorReadings = db.collection('sensor_readings');

  // 1. Real-time dashboard data - last 24 hours
  const realtimeDashboard = await sensorReadings.aggregate([
    // Filter to last 24 hours
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 24 * 60 * 60 * 1000)
        },
        'dataQuality.isValid': true
      }
    },

    // Sort chronologically so $first/$last within each bucket are deterministic
    { $sort: { timestamp: 1 } },

    // Group by sensor and time bucket for aggregation
    {
      $group: {
        _id: {
          sensorId: '$metadata.sensorId',
          sensorType: '$metadata.sensorType',
          location: '$metadata.location.room',
          // 15-minute time buckets
          timeBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute',
              binSize: 15
            }
          }
        },

        // Statistical aggregations
        count: { $sum: 1 },
        avgValue: { $avg: '$measurements.value' },
        minValue: { $min: '$measurements.value' },
        maxValue: { $max: '$measurements.value' },
        stdDev: { $stdDevPop: '$measurements.value' },

        // First and last readings in bucket
        firstReading: { $first: '$measurements.value' },
        lastReading: { $last: '$measurements.value' },

        // Data quality metrics
        validReadings: {
          $sum: { $cond: ['$dataQuality.isValid', 1, 0] }
        },
        avgConfidence: { $avg: '$dataQuality.confidence' },

        // Device health aggregations
        avgBatteryLevel: { $avg: '$deviceHealth.batteryLevel' },
        avgSignalStrength: { $avg: '$deviceHealth.signalStrength' }
      }
    },

    // Calculate derived metrics
    {
      $addFields: {
        // Value change within bucket
        valueChange: { $subtract: ['$lastReading', '$firstReading'] },

        // Coefficient of variation (relative variability)
        coefficientOfVariation: {
          $cond: {
            if: { $ne: ['$avgValue', 0] },
            then: { $divide: ['$stdDev', '$avgValue'] },
            else: 0
          }
        },

        // Data quality ratio
        dataQualityRatio: { $divide: ['$validReadings', '$count'] },

        // Device health status
        deviceHealthStatus: {
          $switch: {
            branches: [
              {
                case: { 
                  $and: [
                    { $gte: ['$avgBatteryLevel', 80] },
                    { $gte: ['$avgSignalStrength', -50] }
                  ]
                },
                then: 'excellent'
              },
              {
                case: { 
                  $and: [
                    { $gte: ['$avgBatteryLevel', 50] },
                    { $gte: ['$avgSignalStrength', -65] }
                  ]
                },
                then: 'good'
              },
              {
                case: { 
                  $or: [
                    { $lt: ['$avgBatteryLevel', 20] },
                    { $lt: ['$avgSignalStrength', -80] }
                  ]
                },
                then: 'critical'
              }
            ],
            default: 'warning'
          }
        }
      }
    },

    // Sort by sensor and time
    {
      $sort: {
        '_id.sensorId': 1,
        '_id.timeBucket': 1
      }
    },

    // Format output for dashboard consumption
    {
      $group: {
        _id: '$_id.sensorId',
        sensorType: { $first: '$_id.sensorType' },
        location: { $first: '$_id.location' },

        // Time series data points
        timeSeries: {
          $push: {
            timestamp: '$_id.timeBucket',
            value: '$avgValue',
            min: '$minValue',
            max: '$maxValue',
            count: '$count',
            quality: '$dataQualityRatio',
            deviceHealth: '$deviceHealthStatus'
          }
        },

        // Aggregate statistics across all time buckets
        overallStats: {
          $push: {
            avg: '$avgValue',
            stdDev: '$stdDev',
            cv: '$coefficientOfVariation'
          }
        },

        // Latest values
        latestValue: { $last: '$avgValue' },
        latestChange: { $last: '$valueChange' },
        latestQuality: { $last: '$dataQualityRatio' }
      }
    },

    // Calculate final sensor-level statistics
    {
      $addFields: {
        overallAvg: { $avg: '$overallStats.avg' },
        overallStdDev: { $avg: '$overallStats.stdDev' },
        avgCV: { $avg: '$overallStats.cv' },

        // Trend analysis
        trend: {
          $cond: {
            if: { $gt: ['$latestChange', 0.1] },
            then: 'increasing',
            else: {
              $cond: {
                if: { $lt: ['$latestChange', -0.1] },
                then: 'decreasing',
                else: 'stable'
              }
            }
          }
        }
      }
    }
  ]).toArray();

  console.log('Real-time dashboard data:', JSON.stringify(realtimeDashboard, null, 2));

  // 2. Anomaly detection using statistical methods
  const anomalyDetection = await sensorReadings.aggregate([
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) // Last 7 days
        }
      }
    },

    // Calculate rolling statistics for anomaly detection
    {
      $setWindowFields: {
        partitionBy: '$metadata.sensorId',
        sortBy: { timestamp: 1 },
        output: {
          // Rolling 30-point average and standard deviation
          rollingAvg: {
            $avg: '$measurements.value',
            window: {
              documents: [-15, 15] // 30-point centered window
            }
          },
          rollingStdDev: {
            $stdDevPop: '$measurements.value',
            window: {
              documents: [-15, 15]
            }
          },

          // Previous values for change detection
          prevValue: {
            $first: '$measurements.value',
            window: {
              documents: [-1, -1]
            }
          }
        }
      }
    },

    // Identify anomalies using statistical thresholds
    {
      $addFields: {
        // Z-score calculation
        zScore: {
          $cond: {
            if: { $ne: ['$rollingStdDev', 0] },
            then: {
              $divide: [
                { $subtract: ['$measurements.value', '$rollingAvg'] },
                '$rollingStdDev'
              ]
            },
            else: 0
          }
        },

        // Rate of change
        rateOfChange: {
          $cond: {
            if: { $and: ['$prevValue', { $ne: ['$prevValue', 0] }] },
            then: {
              $divide: [
                { $subtract: ['$measurements.value', '$prevValue'] },
                '$prevValue'
              ]
            },
            else: 0
          }
        }
      }
    },

    // Filter to potential anomalies
    {
      $match: {
        $or: [
          { zScore: { $gt: 3 } }, // Values > 3 standard deviations
          { zScore: { $lt: -3 } },
          { rateOfChange: { $gt: 0.5 } }, // > 50% change
          { rateOfChange: { $lt: -0.5 } }
        ]
      }
    },

    // Classify anomaly types
    {
      $addFields: {
        anomalyType: {
          $switch: {
            branches: [
              {
                case: { $gt: ['$zScore', 3] },
                then: 'statistical_high'
              },
              {
                case: { $lt: ['$zScore', -3] },
                then: 'statistical_low'
              },
              {
                case: { $gt: ['$rateOfChange', 0.5] },
                then: 'rapid_increase'
              },
              {
                case: { $lt: ['$rateOfChange', -0.5] },
                then: 'rapid_decrease'
              }
            ],
            default: 'unknown'
          }
        },

        anomalySeverity: {
          $switch: {
            branches: [
              {
                case: { 
                  $or: [
                    { $gt: ['$zScore', 5] },
                    { $lt: ['$zScore', -5] }
                  ]
                },
                then: 'critical'
              },
              {
                case: { 
                  $or: [
                    { $gt: ['$zScore', 4] },
                    { $lt: ['$zScore', -4] }
                  ]
                },
                then: 'high'
              }
            ],
            default: 'medium'
          }
        }
      }
    },

    // Group anomalies by sensor and type
    {
      $group: {
        _id: {
          sensorId: '$metadata.sensorId',
          anomalyType: '$anomalyType'
        },
        count: { $sum: 1 },
        avgSeverity: { $avg: '$zScore' },
        latestAnomaly: { $max: '$timestamp' },
        anomalies: {
          $push: {
            timestamp: '$timestamp',
            value: '$measurements.value',
            zScore: '$zScore',
            rateOfChange: '$rateOfChange',
            severity: '$anomalySeverity'
          }
        }
      }
    },

    {
      $sort: {
        '_id.sensorId': 1,
        count: -1
      }
    }
  ]).toArray();

  console.log('Anomaly detection results:', JSON.stringify(anomalyDetection, null, 2));

  // 3. Predictive maintenance analysis
  const predictiveMaintenance = await sensorReadings.aggregate([
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) // Last 30 days
        }
      }
    },

    // Calculate device health trends
    {
      $group: {
        _id: {
          deviceId: '$metadata.deviceId',
          day: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'day'
            }
          }
        },

        avgBatteryLevel: { $avg: '$deviceHealth.batteryLevel' },
        avgSignalStrength: { $avg: '$deviceHealth.signalStrength' },
        readingCount: { $sum: 1 },
        errorRate: {
          $avg: { $cond: ['$dataQuality.isValid', 0, 1] }
        }
      }
    },

    // Approximate per-day trends using the $derivative window operator
    {
      $setWindowFields: {
        partitionBy: '$_id.deviceId',
        sortBy: { '_id.day': 1 },
        output: {
          batteryTrend: {
            $derivative: { input: '$avgBatteryLevel', unit: 'day' },
            window: { documents: [-7, 0] } // trailing 7-day slope, per day
          },
          signalTrend: {
            $derivative: { input: '$avgSignalStrength', unit: 'day' },
            window: { documents: [-7, 0] }
          }
        }
      }
    },

    // Predict maintenance needs
    {
      $addFields: {
        batteryDaysRemaining: {
          $cond: {
            if: { $lt: ['$batteryTrend', 0] },
            then: {
              $ceil: {
                $divide: ['$avgBatteryLevel', { $abs: '$batteryTrend' }]
              }
            },
            else: 365 // Battery not declining
          }
        },

        maintenanceRisk: {
          $switch: {
            branches: [
              {
                case: {
                  $or: [
                    { $lt: ['$avgBatteryLevel', 20] },
                    { $gt: ['$errorRate', 0.1] }
                  ]
                },
                then: 'immediate'
              },
              {
                case: {
                  $or: [
                    { $lt: ['$avgBatteryLevel', 40] },
                    { $lt: ['$avgSignalStrength', -70] }
                  ]
                },
                then: 'high'
              },
              {
                case: { $lt: ['$avgBatteryLevel', 60] },
                then: 'medium'
              }
            ],
            default: 'low'
          }
        }
      }
    },

    // Group by device with latest status
    {
      $group: {
        _id: '$_id.deviceId',
        latestBatteryLevel: { $last: '$avgBatteryLevel' },
        latestSignalStrength: { $last: '$avgSignalStrength' },
        batteryTrend: { $last: '$batteryTrend' },
        signalTrend: { $last: '$signalTrend' },
        estimatedBatteryDays: { $last: '$batteryDaysRemaining' },
        maintenanceRisk: { $last: '$maintenanceRisk' },
        avgErrorRate: { $avg: '$errorRate' }
      }
    },

    {
      $sort: {
        maintenanceRisk: 1, // sorts risk labels alphabetically; map to numeric ranks for a strict immediate-first order
        estimatedBatteryDays: 1
      }
    }
  ]).toArray();

  console.log('Predictive maintenance analysis:', JSON.stringify(predictiveMaintenance, null, 2));

  return {
    realtimeDashboard,
    anomalyDetection,
    predictiveMaintenance
  };
};

// Benefits of MongoDB Time Series Collections:
// - Automatic data partitioning and compression optimized for time-based data
// - Built-in retention policies with automatic expiration
// - Optimized indexes and query patterns for temporal analytics
// - High-performance ingestion with automatic bucketing
// - Native aggregation framework support for complex time series analysis
// - Flexible schema evolution for changing IoT device requirements
// - Horizontal scaling across sharded clusters
// - Integration with existing MongoDB ecosystem and tools
// - Real-time analytics with change streams for live dashboards
// - Cost-effective storage with intelligent compression algorithms

Understanding MongoDB Time Series Architecture

Time Series Collection Design Patterns

Implement comprehensive time series patterns for different IoT scenarios:

// Advanced time series collection design patterns
class IoTTimeSeriesManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.ingestionBuffers = new Map();
  }

  async createIoTTimeSeriesCollections() {
    // Pattern 1: High-frequency sensor data
    const highFrequencyConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'sensor',
        // granularity is mutually exclusive with bucketMaxSpanSeconds/bucketRoundingSeconds;
        // when using those custom bucketing parameters instead, both must be set to the same value
        granularity: 'seconds' // For sub-minute data
      },
      expireAfterSeconds: 60 * 60 * 24 * 30 // 30 days retention
    };

    const highFrequencySensors = await this.db.createCollection(
      'high_frequency_sensors', 
      highFrequencyConfig
    );

    // Pattern 2: Environmental monitoring (medium frequency)
    const environmentalConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'location',
        // Custom bucketing instead of granularity (both values must be equal)
        bucketMaxSpanSeconds: 86400, // 24-hour buckets
        bucketRoundingSeconds: 86400 // Bucket boundaries rounded to the day
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 // 1 year retention
    };

    const environmentalData = await this.db.createCollection(
      'environmental_monitoring',
      environmentalConfig
    );

    // Pattern 3: Device health metrics (low frequency)
    const deviceHealthConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'device',
        granularity: 'hours' // hourly data; omit custom bucket parameters when granularity is set
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 * 5 // 5 years retention
    };

    const deviceHealth = await this.db.createCollection(
      'device_health_metrics',
      deviceHealthConfig
    );

    // Pattern 4: Event-based time series (irregular intervals)
    const eventBasedConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'eventSource',
        granularity: 'minutes' // Flexible for irregular events
      },
      expireAfterSeconds: 60 * 60 * 24 * 90 // 90 days retention
    };

    const eventTimeSeries = await this.db.createCollection(
      'event_time_series',
      eventBasedConfig
    );

    // Store collection references
    this.collections.set('highFrequency', highFrequencySensors);
    this.collections.set('environmental', environmentalData);
    this.collections.set('deviceHealth', deviceHealth);
    this.collections.set('events', eventTimeSeries);

    console.log('Time series collections created successfully');
    return this.collections;
  }

  async setupOptimalIndexes() {
    // Create compound indexes for common query patterns
    for (const [name, collection] of this.collections.entries()) {
      try {
        // Metadata + time range queries
        await collection.createIndex({
          'sensor.id': 1,
          'timestamp': 1
        });

        // Location-based queries
        await collection.createIndex({
          'sensor.location.building': 1,
          'sensor.location.floor': 1,
          'timestamp': 1
        });

        // Device type queries
        await collection.createIndex({
          'sensor.type': 1,
          'timestamp': 1
        });

        // Data quality queries
        await collection.createIndex({
          'quality.isValid': 1,
          'timestamp': 1
        });

        console.log(`Indexes created for ${name} collection`);

      } catch (error) {
        console.error(`Error creating indexes for ${name}:`, error);
      }
    }
  }

  async ingestHighFrequencyData(sensorData) {
    // High-performance ingestion with batching
    const collection = this.collections.get('highFrequency');
    const batchSize = 10000;
    const batches = [];

    // Prepare optimized document structure
    const documents = sensorData.map(reading => ({
      timestamp: new Date(reading.timestamp),

      // Metadata field - groups related time series
      sensor: {
        id: reading.sensorId,
        type: reading.sensorType,
        model: reading.model || 'Unknown',
        location: {
          building: reading.building,
          floor: reading.floor,
          room: reading.room,
          coordinates: reading.coordinates
        },
        specifications: {
          accuracy: reading.accuracy,
          range: reading.range,
          units: reading.units
        }
      },

      // Measurements - optimized for compression
      temp: reading.temperature,
      hum: reading.humidity,
      press: reading.pressure,

      // Device status
      batt: reading.batteryLevel,
      signal: reading.signalStrength,

      // Data quality indicators
      quality: {
        isValid: reading.isValid !== false,
        confidence: reading.confidence || 1.0,
        source: reading.source || 'sensor'
      }
    }));

    // Split into batches for optimal ingestion
    for (let i = 0; i < documents.length; i += batchSize) {
      batches.push(documents.slice(i, i + batchSize));
    }

    // Parallel batch ingestion
    const ingestionPromises = batches.map(async (batch, index) => {
      try {
        const result = await collection.insertMany(batch, {
          ordered: false,
          writeConcern: { w: 1 }
        });

        console.log(`Batch ${index + 1}: Inserted ${result.insertedCount} documents`);
        return result.insertedCount;

      } catch (error) {
        console.error(`Batch ${index + 1} failed:`, error);
        return 0;
      }
    });

    const results = await Promise.all(ingestionPromises);
    const totalInserted = results.reduce((sum, count) => sum + count, 0);

    console.log(`Total documents inserted: ${totalInserted}`);
    return totalInserted;
  }

  async performRealTimeAnalytics(timeRange = '1h', sensorIds = []) {
    const collection = this.collections.get('highFrequency');

    // Calculate time range
    const timeRangeMs = {
      '15m': 15 * 60 * 1000,
      '1h': 60 * 60 * 1000,
      '6h': 6 * 60 * 60 * 1000,
      '24h': 24 * 60 * 60 * 1000
    };

    const startTime = new Date(Date.now() - (timeRangeMs[timeRange] || timeRangeMs['1h']));

    const pipeline = [
      // Time range and sensor filtering
      {
        $match: {
          timestamp: { $gte: startTime },
          ...(sensorIds.length > 0 && { 'sensor.id': { $in: sensorIds } }),
          'quality.isValid': true
        }
      },

      // Time-based bucketing for aggregation
      {
        $group: {
          _id: {
            sensorId: '$sensor.id',
            sensorType: '$sensor.type',
            location: '$sensor.location.room',
            // Dynamic time bucketing based on range
            timeBucket: {
              $dateTrunc: {
                date: '$timestamp',
                unit: 'minute',
                binSize: timeRange === '15m' ? 1 : 
                        timeRange === '1h' ? 5 : 
                        timeRange === '6h' ? 15 : 60
              }
            }
          },

          // Statistical aggregations
          count: { $sum: 1 },

          // Temperature metrics
          tempAvg: { $avg: '$temp' },
          tempMin: { $min: '$temp' },
          tempMax: { $max: '$temp' },
          tempStdDev: { $stdDevPop: '$temp' },

          // Humidity metrics
          humAvg: { $avg: '$hum' },
          humMin: { $min: '$hum' },
          humMax: { $max: '$hum' },

          // Pressure metrics
          pressAvg: { $avg: '$press' },
          pressMin: { $min: '$press' },
          pressMax: { $max: '$press' },

          // Device health metrics
          battAvg: { $avg: '$batt' },
          battMin: { $min: '$batt' },
          signalAvg: { $avg: '$signal' },
          signalMin: { $min: '$signal' },

          // Data quality metrics
          validReadings: { $sum: 1 },
          avgConfidence: { $avg: '$quality.confidence' },

          // First and last values for trend calculation
          firstTemp: { $first: '$temp' },
          lastTemp: { $last: '$temp' },
          firstTimestamp: { $first: '$timestamp' },
          lastTimestamp: { $last: '$timestamp' }
        }
      },

      // Calculate derived metrics
      {
        $addFields: {
          // Temperature trends
          tempTrend: { $subtract: ['$lastTemp', '$firstTemp'] },
          tempCV: {
            $cond: {
              if: { $ne: ['$tempAvg', 0] },
              then: { $divide: ['$tempStdDev', '$tempAvg'] },
              else: 0
            }
          },

          // Time span for rate calculations
          timeSpanMinutes: {
            $divide: [
              { $subtract: ['$lastTimestamp', '$firstTimestamp'] },
              60000
            ]
          },

          // Device health status
          deviceStatus: {
            $switch: {
              branches: [
                {
                  case: { 
                    $and: [
                      { $gte: ['$battAvg', 80] },
                      { $gte: ['$signalAvg', -50] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$battAvg', 50] },
                      { $gte: ['$signalAvg', -65] }
                    ]
                  },
                  then: 'good'
                },
                {
                  case: {
                    $or: [
                      { $lt: ['$battAvg', 20] },
                      { $lt: ['$signalAvg', -80] }
                    ]
                  },
                  then: 'critical'
                }
              ],
              default: 'warning'
            }
          }
        }
      },

      // Sort for time series presentation
      {
        $sort: {
          '_id.sensorId': 1,
          '_id.timeBucket': 1
        }
      },

      // Format for dashboard consumption
      {
        $group: {
          _id: '$_id.sensorId',
          sensorType: { $first: '$_id.sensorType' },
          location: { $first: '$_id.location' },

          // Time series data
          timeSeries: {
            $push: {
              timestamp: '$_id.timeBucket',
              temperature: {
                avg: '$tempAvg',
                min: '$tempMin',
                max: '$tempMax',
                trend: '$tempTrend',
                cv: '$tempCV'
              },
              humidity: {
                avg: '$humAvg',
                min: '$humMin',
                max: '$humMax'
              },
              pressure: {
                avg: '$pressAvg',
                min: '$pressMin',
                max: '$pressMax'
              },
              deviceHealth: {
                battery: '$battAvg',
                signal: '$signalAvg',
                status: '$deviceStatus'
              },
              dataQuality: {
                readingCount: '$count',
                confidence: '$avgConfidence'
              }
            }
          },

          // Summary statistics (accumulators must be top-level fields in $group)
          totalReadings: { $sum: '$count' },
          avgTemperature: { $avg: '$tempAvg' },
          maxTemperature: { $max: '$tempMax' },
          minTemperature: { $min: '$tempMin' },
          overallDeviceStatus: { $last: '$deviceStatus' }
        }
      },

      // Reassemble the nested summary object after grouping
      {
        $project: {
          sensorType: 1,
          location: 1,
          timeSeries: 1,
          summaryStats: {
            totalReadings: '$totalReadings',
            avgTemperature: '$avgTemperature',
            temperatureRange: { $subtract: ['$maxTemperature', '$minTemperature'] },
            overallDeviceStatus: '$overallDeviceStatus'
          }
        }
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    // Add metadata about the query
    return {
      timeRange: timeRange,
      queryTime: new Date(),
      startTime: startTime,
      endTime: new Date(),
      sensorCount: results.length,
      data: results
    };
  }

  async detectAnomaliesAdvanced(sensorId, lookbackHours = 168) { // 1 week default
    const collection = this.collections.get('highFrequency');
    const lookbackTime = new Date(Date.now() - lookbackHours * 60 * 60 * 1000);

    const pipeline = [
      {
        $match: {
          'sensor.id': sensorId,
          timestamp: { $gte: lookbackTime },
          'quality.isValid': true
        }
      },

      { $sort: { timestamp: 1 } },

      // Calculate rolling statistics using window functions
      {
        $setWindowFields: {
          sortBy: { timestamp: 1 },
          output: {
            // Rolling 50-point statistics for anomaly detection
            rollingMean: {
              $avg: '$temp',
              window: { documents: [-25, 25] }
            },
            rollingStd: {
              $stdDevPop: '$temp',
              window: { documents: [-25, 25] }
            },

            // Seasonal decomposition (24-hour pattern)
            dailyMean: {
              $avg: '$temp',
              window: { range: [-12, 12], unit: 'hour' }
            },

            // Trend analysis: temperature rate of change per hour
            // ($linearFill only interpolates missing values; $derivative yields a slope)
            trendSlope: {
              $derivative: { input: '$temp', unit: 'hour' },
              window: { range: [-1, 0], unit: 'hour' }
            },

            // Previous values for rate of change
            prevTemp: {
              $first: '$temp',
              window: { documents: [-1, -1] }
            }
          }
        }
      },

      // Calculate anomaly scores
      {
        $addFields: {
          // Z-score anomaly detection
          zScore: {
            $cond: {
              if: { $ne: ['$rollingStd', 0] },
              then: {
                $divide: [
                  { $subtract: ['$temp', '$rollingMean'] },
                  '$rollingStd'
                ]
              },
              else: 0
            }
          },

          // Seasonal anomaly (deviation from daily pattern)
          seasonalAnomaly: {
            $cond: {
              if: { $ne: ['$dailyMean', 0] },
              then: {
                $abs: {
                  $divide: [
                    { $subtract: ['$temp', '$dailyMean'] },
                    '$dailyMean'
                  ]
                }
              },
              else: 0
            }
          },

          // Rate of change anomaly
          rateOfChange: {
            $cond: {
              if: { $and: ['$prevTemp', { $ne: ['$prevTemp', 0] }] },
              then: {
                $abs: {
                  $divide: [
                    { $subtract: ['$temp', '$prevTemp'] },
                    '$prevTemp'
                  ]
                }
              },
              else: 0
            }
          }
        }
      },

      // Identify anomalies using multiple criteria
      {
        $addFields: {
          isAnomaly: {
            $or: [
              { $gt: [{ $abs: '$zScore' }, 3] }, // Statistical outlier
              { $gt: ['$seasonalAnomaly', 0.3] }, // 30% deviation from seasonal
              { $gt: ['$rateOfChange', 0.5] } // 50% rate of change
            ]
          },

          anomalyType: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$zScore', 3] },
                  then: 'statistical_high'
                },
                {
                  case: { $lt: ['$zScore', -3] },
                  then: 'statistical_low'
                },
                {
                  case: { $gt: ['$seasonalAnomaly', 0.3] },
                  then: 'seasonal_deviation'
                },
                {
                  case: { $gt: ['$rateOfChange', 0.5] },
                  then: 'rapid_change'
                }
              ],
              default: 'normal'
            }
          },

          anomalySeverity: {
            $switch: {
              branches: [
                {
                  case: { $gt: [{ $abs: '$zScore' }, 5] },
                  then: 'critical'
                },
                {
                  case: { $gt: [{ $abs: '$zScore' }, 4] },
                  then: 'high'
                },
                {
                  case: { $gt: [{ $abs: '$zScore' }, 3] },
                  then: 'medium'
                }
              ],
              default: 'low'
            }
          }
        }
      },

      // Filter to anomalies only
      { $match: { isAnomaly: true } },

      // Group anomalies into hourly event buckets
      {
        $group: {
          _id: {
            $dateToString: {
              format: '%Y-%m-%d-%H',
              date: '$timestamp'
            }
          },

          anomalyCount: { $sum: 1 },
          avgSeverityScore: { $avg: { $abs: '$zScore' } },

          anomalies: {
            $push: {
              timestamp: '$timestamp',
              value: '$temp',
              zScore: '$zScore',
              type: '$anomalyType',
              severity: '$anomalySeverity',
              seasonalDeviation: '$seasonalAnomaly',
              rateOfChange: '$rateOfChange'
            }
          },

          startTime: { $min: '$timestamp' },
          endTime: { $max: '$timestamp' }
        }
      },

      { $sort: { startTime: -1 } }
    ];

    return await collection.aggregate(pipeline).toArray();
  }

  async generatePerformanceReports(reportType = 'daily') {
    const collection = this.collections.get('highFrequency');

    // Calculate report time range
    const timeRanges = {
      'hourly': 60 * 60 * 1000,
      'daily': 24 * 60 * 60 * 1000,
      'weekly': 7 * 24 * 60 * 60 * 1000,
      'monthly': 30 * 24 * 60 * 60 * 1000
    };

    const startTime = new Date(Date.now() - timeRanges[reportType]);

    const pipeline = [
      {
        $match: {
          timestamp: { $gte: startTime }
        }
      },

      // Group by sensor and time period
      {
        $group: {
          _id: {
            sensorId: '$sensor.id',
            sensorType: '$sensor.type',
            location: '$sensor.location',
            period: {
              $dateTrunc: {
                date: '$timestamp',
                unit: reportType === 'hourly' ? 'hour' :
                      reportType === 'daily' ? 'day' :
                      reportType === 'weekly' ? 'week' : 'month'
              }
            }
          },

          // Data volume metrics
          totalReadings: { $sum: 1 },
          validReadings: {
            $sum: { $cond: ['$quality.isValid', 1, 0] }
          },

          // Data quality metrics
          avgConfidence: { $avg: '$quality.confidence' },
          dataQualityRatio: {
            $avg: { $cond: ['$quality.isValid', 1, 0] }
          },

          // Measurement statistics (accumulators cannot be nested inside $push)
          tempAvg: { $avg: '$temp' },
          tempMin: { $min: '$temp' },
          tempMax: { $max: '$temp' },
          tempStdDev: { $stdDevPop: '$temp' },

          // Device health metrics
          avgBatteryLevel: { $avg: '$batt' },
          minBatteryLevel: { $min: '$batt' },
          avgSignalStrength: { $avg: '$signal' },
          minSignalStrength: { $min: '$signal' },

          // Time coverage
          firstReading: { $min: '$timestamp' },
          lastReading: { $max: '$timestamp' }
        }
      },

      // Calculate performance indicators
      {
        $addFields: {
          // Coverage percentage
          coveragePercentage: {
            $multiply: [
              {
                $divide: [
                  { $subtract: ['$lastReading', '$firstReading'] },
                  timeRanges[reportType]
                ]
              },
              100
            ]
          },

          // Device health score
          deviceHealthScore: {
            $multiply: [
              {
                $add: [
                  { $divide: ['$avgBatteryLevel', 100] }, // Battery factor
                  { $divide: [{ $add: ['$avgSignalStrength', 100] }, 50] } // Signal factor
                ]
              },
              50
            ]
          },

          // Overall performance score
          performanceScore: {
            $multiply: [
              {
                $add: [
                  { $multiply: ['$dataQualityRatio', 0.4] },
                  { $multiply: [{ $divide: ['$avgConfidence', 1] }, 0.3] },
                  { $multiply: [{ $divide: ['$avgBatteryLevel', 100] }, 0.2] },
                  { $multiply: [{ $divide: [{ $add: ['$avgSignalStrength', 100] }, 50] }, 0.1] }
                ]
              },
              100
            ]
          }
        }
      },

      // Generate recommendations
      {
        $addFields: {
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$dataQualityRatio', 0.9] },
                  then: ['Investigate data quality issues', 'Check sensor calibration']
                },
                {
                  case: { $lt: ['$avgBatteryLevel', 30] },
                  then: ['Schedule battery replacement', 'Consider solar charging']
                },
                {
                  case: { $lt: ['$avgSignalStrength', -75] },
                  then: ['Check network connectivity', 'Consider signal boosters']
                },
                {
                  case: { $lt: ['$coveragePercentage', 95] },
                  then: ['Investigate data gaps', 'Check device uptime']
                }
              ],
              default: ['Performance within normal parameters']
            }
          },

          alertLevel: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$performanceScore', 60] },
                  then: 'critical'
                },
                {
                  case: { $lt: ['$performanceScore', 80] },
                  then: 'warning'
                }
              ],
              default: 'normal'
            }
          }
        }
      },

      {
        $sort: {
          performanceScore: 1, // Lowest scores first
          '_id.sensorId': 1
        }
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      reportType: reportType,
      generatedAt: new Date(),
      timeRange: {
        start: startTime,
        end: new Date()
      },
      summary: {
        totalSensors: results.length,
        criticalAlerts: results.filter(r => r.alertLevel === 'critical').length,
        warnings: results.filter(r => r.alertLevel === 'warning').length,
        avgPerformanceScore: results.length > 0
          ? results.reduce((sum, r) => sum + r.performanceScore, 0) / results.length
          : 0
      },
      sensorReports: results
    };
  }
}

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Time Series operations:

-- QueryLeaf time series operations with SQL-familiar syntax

-- Create time series collection
CREATE TIME_SERIES COLLECTION sensor_readings (
  timestamp TIMESTAMP NOT NULL, -- time field
  sensor_id VARCHAR(100) NOT NULL,
  location VARCHAR(200),
  device_id VARCHAR(100),

  -- Measurements
  temperature DECIMAL(5,2),
  humidity DECIMAL(5,2),
  pressure DECIMAL(7,2),

  -- Device health
  battery_level DECIMAL(5,2),
  signal_strength INTEGER,

  -- Data quality
  is_valid BOOLEAN DEFAULT true,
  confidence DECIMAL(3,2) DEFAULT 1.00
) WITH (
  meta_field = 'sensor_metadata',
  granularity = 'minutes',
  expire_after_seconds = 2678400 -- 31 days
);

-- High-performance batch insert for IoT data
INSERT INTO sensor_readings 
VALUES 
  ('2024-09-17 10:00:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.5, 65.2, 1013.25, 85.3, -45, true, 0.98),
  ('2024-09-17 10:01:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.7, 65.0, 1013.30, 85.2, -46, true, 0.97),
  ('2024-09-17 10:02:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.6, 64.8, 1013.28, 85.1, -44, true, 0.99);

-- Real-time dashboard query with time bucketing
SELECT 
  sensor_id,
  location,
  TIME_BUCKET('15 minutes', timestamp) as time_bucket,

  -- Statistical aggregations
  COUNT(*) as reading_count,
  AVG(temperature) as avg_temperature,
  MIN(temperature) as min_temperature,
  MAX(temperature) as max_temperature,
  STDDEV_POP(temperature) as temp_stddev,

  AVG(humidity) as avg_humidity,
  AVG(pressure) as avg_pressure,

  -- Device health metrics
  AVG(battery_level) as avg_battery,
  MIN(battery_level) as min_battery,
  AVG(signal_strength) as avg_signal,

  -- Data quality metrics
  SUM(CASE WHEN is_valid THEN 1 ELSE 0 END) as valid_readings,
  AVG(confidence) as avg_confidence,

  -- Trend indicators
  FIRST_VALUE(temperature ORDER BY timestamp) as first_temp,
  LAST_VALUE(temperature ORDER BY timestamp) as last_temp,
  LAST_VALUE(temperature ORDER BY timestamp) - FIRST_VALUE(temperature ORDER BY timestamp) as temp_change

FROM sensor_readings
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  AND sensor_id IN ('TEMP_001', 'TEMP_002', 'TEMP_003')
  AND is_valid = true
GROUP BY sensor_id, location, TIME_BUCKET('15 minutes', timestamp)
ORDER BY sensor_id, time_bucket;

-- Advanced anomaly detection with window functions
WITH statistical_baseline AS (
  SELECT 
    sensor_id,
    timestamp,
    temperature,

    -- Rolling statistics for anomaly detection
    AVG(temperature) OVER (
      PARTITION BY sensor_id
      ORDER BY timestamp
      ROWS BETWEEN 25 PRECEDING AND 25 FOLLOWING
    ) as rolling_avg,

    STDDEV_POP(temperature) OVER (
      PARTITION BY sensor_id  
      ORDER BY timestamp
      ROWS BETWEEN 25 PRECEDING AND 25 FOLLOWING
    ) as rolling_stddev,

    -- Seasonal baseline (same hour of day pattern)
    AVG(temperature) OVER (
      PARTITION BY sensor_id, EXTRACT(hour FROM timestamp)
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '7 days' PRECEDING AND INTERVAL '7 days' FOLLOWING
    ) as seasonal_avg,

    -- Previous value for rate of change
    LAG(temperature, 1) OVER (
      PARTITION BY sensor_id 
      ORDER BY timestamp
    ) as prev_temperature

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND is_valid = true
),
anomaly_scores AS (
  SELECT *,
    -- Z-score calculation
    CASE 
      WHEN rolling_stddev > 0 THEN (temperature - rolling_avg) / rolling_stddev
      ELSE 0 
    END as z_score,

    -- Seasonal deviation
    ABS(temperature - seasonal_avg) / GREATEST(seasonal_avg, 0.1) as seasonal_deviation,

    -- Rate of change
    CASE 
      WHEN prev_temperature IS NOT NULL AND prev_temperature != 0 
      THEN ABS(temperature - prev_temperature) / ABS(prev_temperature)
      ELSE 0 
    END as rate_of_change

  FROM statistical_baseline
),
classified_anomalies AS (
  SELECT *,
    -- Anomaly classification
    CASE
      WHEN ABS(z_score) > 3 OR seasonal_deviation > 0.3 OR rate_of_change > 0.5 THEN true
      ELSE false
    END as is_anomaly,

    CASE 
      WHEN z_score > 3 THEN 'statistical_high'
      WHEN z_score < -3 THEN 'statistical_low'
      WHEN seasonal_deviation > 0.3 THEN 'seasonal_deviation'
      WHEN rate_of_change > 0.5 THEN 'rapid_change'
      ELSE 'normal'
    END as anomaly_type,

    CASE
      WHEN ABS(z_score) > 5 THEN 'critical'
      WHEN ABS(z_score) > 4 THEN 'high'
      WHEN ABS(z_score) > 3 THEN 'medium'
      ELSE 'low'
    END as severity

  FROM anomaly_scores
)
SELECT 
  sensor_id,
  DATE_TRUNC('hour', timestamp) as anomaly_hour,
  COUNT(*) as anomaly_count,
  AVG(ABS(z_score)) as avg_severity_score,

  -- Anomaly details
  json_agg(
    json_build_object(
      'timestamp', timestamp,
      'temperature', temperature,
      'z_score', ROUND(z_score::numeric, 3),
      'type', anomaly_type,
      'severity', severity
    ) ORDER BY timestamp
  ) as anomalies,

  MIN(timestamp) as first_anomaly,
  MAX(timestamp) as last_anomaly

FROM classified_anomalies
WHERE is_anomaly = true
GROUP BY sensor_id, DATE_TRUNC('hour', timestamp)
ORDER BY sensor_id, anomaly_hour DESC;

-- Predictive maintenance analysis
WITH device_health_trends AS (
  SELECT 
    device_id,
    sensor_id,
    DATE_TRUNC('day', timestamp) as day,

    AVG(battery_level) as daily_battery_avg,
    MIN(battery_level) as daily_battery_min,
    AVG(signal_strength) as daily_signal_avg,
    MIN(signal_strength) as daily_signal_min,
    COUNT(*) as daily_reading_count,

    -- Data quality metrics
    AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as data_quality_ratio,
    AVG(confidence) as avg_confidence

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY device_id, sensor_id, DATE_TRUNC('day', timestamp)
),
trend_analysis AS (
  SELECT *,
    -- Linear trend approximation using least squares (window over each device)
    REGR_SLOPE(daily_battery_avg, EXTRACT(epoch FROM day))
      OVER (PARTITION BY device_id, sensor_id) * 86400 as battery_daily_slope,
    REGR_SLOPE(daily_signal_avg, EXTRACT(epoch FROM day))
      OVER (PARTITION BY device_id, sensor_id) * 86400 as signal_daily_slope,

    -- Device health scoring
    (daily_battery_avg * 0.4 + 
     (daily_signal_avg + 100) / 50 * 100 * 0.3 +
     data_quality_ratio * 100 * 0.3) as health_score

  FROM device_health_trends
),
maintenance_predictions AS (
  SELECT 
    device_id,

    -- Latest status
    LAST_VALUE(daily_battery_avg ORDER BY day) as current_battery,
    LAST_VALUE(daily_signal_avg ORDER BY day) as current_signal,
    LAST_VALUE(data_quality_ratio ORDER BY day) as current_quality,
    LAST_VALUE(health_score ORDER BY day) as current_health_score,

    -- Trends
    AVG(battery_daily_slope) as battery_trend,
    AVG(signal_daily_slope) as signal_trend,

    -- Predictions
    CASE 
      WHEN AVG(battery_daily_slope) < -0.5 THEN 
        CEIL(LAST_VALUE(daily_battery_avg ORDER BY day) / ABS(AVG(battery_daily_slope)))
      ELSE 365 
    END as estimated_battery_days,

    -- Risk assessment
    CASE
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 20 OR 
           LAST_VALUE(data_quality_ratio ORDER BY day) < 0.8 THEN 'immediate'
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 40 OR 
           LAST_VALUE(daily_signal_avg ORDER BY day) < -70 THEN 'high'
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 60 THEN 'medium'
      ELSE 'low'
    END as maintenance_risk,

    COUNT(*) as days_monitored

  FROM trend_analysis
  GROUP BY device_id
)
SELECT 
  device_id,
  ROUND(current_battery, 1) as battery_level,
  ROUND(current_signal, 1) as signal_strength,
  ROUND(current_quality * 100, 1) as data_quality_pct,
  ROUND(current_health_score, 1) as health_score,

  -- Trends
  CASE 
    WHEN battery_trend < -0.1 THEN 'declining'
    WHEN battery_trend > 0.1 THEN 'improving'
    ELSE 'stable'
  END as battery_trend_status,

  estimated_battery_days,
  maintenance_risk,

  -- Recommendations
  CASE maintenance_risk
    WHEN 'immediate' THEN 'Schedule maintenance within 24 hours'
    WHEN 'high' THEN 'Schedule maintenance within 1 week'  
    WHEN 'medium' THEN 'Schedule maintenance within 1 month'
    ELSE 'Monitor normal schedule'
  END as recommendation,

  days_monitored

FROM maintenance_predictions
ORDER BY 
  CASE maintenance_risk
    WHEN 'immediate' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  estimated_battery_days ASC;

-- Time series downsampling and data retention
CREATE MATERIALIZED VIEW hourly_sensor_summary AS
SELECT 
  sensor_id,
  location,
  device_id,
  TIME_BUCKET('1 hour', timestamp) as hour_bucket,

  -- Statistical summaries
  COUNT(*) as reading_count,
  AVG(temperature) as avg_temperature,
  MIN(temperature) as min_temperature,  
  MAX(temperature) as max_temperature,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) as median_temperature,
  STDDEV_POP(temperature) as temp_stddev,

  AVG(humidity) as avg_humidity,
  AVG(pressure) as avg_pressure,

  -- Device health summaries
  AVG(battery_level) as avg_battery,
  MIN(battery_level) as min_battery,
  AVG(signal_strength) as avg_signal,

  -- Quality metrics
  AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as data_quality,
  AVG(confidence) as avg_confidence,

  -- Time range
  MIN(timestamp) as period_start,
  MAX(timestamp) as period_end

FROM sensor_readings
WHERE is_valid = true
GROUP BY sensor_id, location, device_id, TIME_BUCKET('1 hour', timestamp);

-- Performance monitoring and optimization
WITH collection_stats AS (
  SELECT 
    'sensor_readings' as collection_name,
    COUNT(*) as total_documents,

    -- Time range analysis
    MIN(timestamp) as earliest_data,
    MAX(timestamp) as latest_data,
    MAX(timestamp) - MIN(timestamp) as time_span,

    -- Data volume analysis  
    COUNT(*) / GREATEST(EXTRACT(days FROM (MAX(timestamp) - MIN(timestamp))), 1) as avg_docs_per_day,

    -- Quality metrics
    AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as overall_quality,
    COUNT(DISTINCT sensor_id) as unique_sensors,
    COUNT(DISTINCT device_id) as unique_devices

  FROM sensor_readings
),
performance_metrics AS (
  SELECT 
    cs.*,

    -- Storage efficiency estimates
    total_documents * 200 as estimated_storage_bytes, -- Rough estimate

    -- Query performance indicators
    CASE 
      WHEN avg_docs_per_day > 100000 THEN 'high_volume'
      WHEN avg_docs_per_day > 10000 THEN 'medium_volume'
      ELSE 'low_volume'
    END as volume_category,

    -- Recommendations
    CASE
      WHEN overall_quality < 0.9 THEN 'Review data validation and sensor calibration'
      WHEN avg_docs_per_day > 100000 THEN 'Consider additional indexing and archiving strategy'
      WHEN time_span > INTERVAL '6 months' THEN 'Implement data lifecycle management'
      ELSE 'Performance within normal parameters'
    END as recommendation

  FROM collection_stats cs
)
SELECT 
  collection_name,
  total_documents,
  TO_CHAR(earliest_data, 'YYYY-MM-DD HH24:MI') as data_start,
  TO_CHAR(latest_data, 'YYYY-MM-DD HH24:MI') as data_end,
  EXTRACT(days FROM time_span) as retention_days,
  ROUND(avg_docs_per_day::numeric, 0) as daily_ingestion_rate,
  ROUND(overall_quality * 100, 1) as quality_percentage,
  unique_sensors,
  unique_devices,
  volume_category,
  ROUND(estimated_storage_bytes / 1024.0 / 1024.0, 1) as estimated_storage_mb,
  recommendation
FROM performance_metrics;

-- QueryLeaf provides comprehensive time series capabilities:
-- 1. SQL-familiar time series collection creation and management
-- 2. High-performance batch data ingestion optimized for IoT workloads  
-- 3. Advanced time bucketing and statistical aggregations
-- 4. Sophisticated anomaly detection using multiple algorithms
-- 5. Predictive maintenance analysis with trend forecasting
-- 6. Automatic data lifecycle management and retention policies
-- 7. Performance monitoring and optimization recommendations
-- 8. Integration with MongoDB's native time series optimizations
-- 9. Real-time analytics with materialized view support
-- 10. Familiar SQL syntax for complex temporal queries and analysis

Best Practices for Time Series Implementation

Data Modeling and Schema Design

Essential practices for optimal time series performance (a brief configuration sketch follows the list):

  1. Granularity Selection: Choose appropriate time granularity based on data frequency and query patterns
  2. Metadata Organization: Structure metadata fields to optimize automatic bucketing and compression
  3. Measurement Optimization: Use efficient data types and avoid deep nesting for measurements
  4. Index Strategy: Create compound indexes supporting common time range and metadata queries
  5. Retention Policies: Implement automatic expiration aligned with business requirements
  6. Batch Ingestion: Use bulk operations for high-throughput IoT data ingestion
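
The sketch below ties several of these practices together: explicit granularity selection, a retention policy, and batched ingestion. It is a minimal illustration under stated assumptions rather than a drop-in implementation; the connection string, the machine_telemetry collection name, and the reading fields are invented for the example.

// Minimal sketch: time series collection with explicit granularity, metadata
// field, and retention, followed by a bulk insert (all names are assumptions)
const { MongoClient } = require('mongodb');

async function createAndLoadTelemetry() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('iot_platform');

  // Granularity and retention chosen for roughly one reading per minute per machine
  await db.createCollection('machine_telemetry', {
    timeseries: {
      timeField: 'timestamp',
      metaField: 'machine',   // flat, stable metadata keeps bucketing efficient
      granularity: 'minutes'
    },
    expireAfterSeconds: 60 * 60 * 24 * 30 // 30-day retention
  });

  // Batch ingestion with unordered writes for throughput
  const readings = Array.from({ length: 1000 }, (_, i) => ({
    timestamp: new Date(Date.now() - i * 60000),
    machine: { id: 'M-001', line: 'A' },
    temp: 20 + Math.random() * 5,
    vibration: Math.random()
  }));

  await db.collection('machine_telemetry').insertMany(readings, { ordered: false });
  await client.close();
}

createAndLoadTelemetry().catch(console.error);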

Performance and Scalability

Optimize time series collections for high-performance analytics (a downsampling sketch follows the list):

  1. Bucket Sizing: Configure bucket parameters for optimal compression and query performance
  2. Query Optimization: Leverage time series specific aggregation patterns and operators
  3. Resource Planning: Size clusters appropriately for expected data volumes and query loads
  4. Archival Strategy: Implement data lifecycle management with cold storage integration
  5. Monitoring Setup: Track collection performance and optimize based on usage patterns
  6. Downsampling: Use materialized views and pre-aggregated summaries for historical analysis
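
As a sketch of point 6 above, the pipeline below rolls raw readings up into hourly summaries and writes them to a pre-aggregated collection with $merge, which dashboards can query instead of the raw data. The collection names (machine_telemetry, machine_telemetry_hourly) and measurement fields are assumptions carried over from the previous sketch.

// Minimal downsampling sketch: hourly rollups merged into a summary collection
// (collection names and fields are assumptions, not part of the original example)
const { MongoClient } = require('mongodb');

async function downsampleHourly() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('iot_platform');

  await db.collection('machine_telemetry').aggregate([
    // Only roll up the most recent 24 hours on each run
    { $match: { timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) } } },

    // One summary document per machine per hour
    {
      $group: {
        _id: {
          machineId: '$machine.id',
          hour: { $dateTrunc: { date: '$timestamp', unit: 'hour' } }
        },
        readingCount: { $sum: 1 },
        tempAvg: { $avg: '$temp' },
        tempMax: { $max: '$temp' },
        vibrationAvg: { $avg: '$vibration' }
      }
    },

    // Upsert into the rollup collection so repeated runs stay idempotent
    {
      $merge: {
        into: 'machine_telemetry_hourly',
        on: '_id',
        whenMatched: 'replace',
        whenNotMatched: 'insert'
      }
    }
  ]).toArray();

  await client.close();
}

downsampleHourly().catch(console.error);

In production this job would typically run on a schedule (cron, an Atlas scheduled trigger, or a worker process), and the rollup collection can carry its own, longer retention policy than the raw collection.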

Conclusion

MongoDB Time Series Collections provide purpose-built capabilities for IoT data management and temporal analytics that eliminate the complexity and limitations of traditional relational approaches. The integration of automatic compression, optimized indexing, and specialized query patterns makes building high-performance time series applications both powerful and efficient.

Key Time Series benefits include:

  • Purpose-Built Storage: Automatic partitioning and compression optimized for temporal data
  • High-Performance Ingestion: Optimized for high-frequency IoT data streams
  • Advanced Analytics: Native support for complex time-based aggregations and window functions
  • Automatic Lifecycle: Built-in retention policies and data expiration management
  • Scalable Architecture: Horizontal scaling across sharded clusters for massive datasets
  • Developer-Friendly Interface: SQL-style query patterns with specialized time series operations

Whether you're building IoT monitoring platforms, sensor networks, financial trading systems, or applications requiring time-based analytics, MongoDB Time Series Collections with QueryLeaf's familiar SQL interface provide the foundation for modern temporal data management. This combination enables you to implement sophisticated time series capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Time Series Collections while providing SQL-familiar time bucketing, statistical aggregations, and temporal analytics. Advanced time series features, anomaly detection, and performance optimization are seamlessly handled through familiar SQL patterns, making high-performance time series analytics both powerful and accessible.

The integration of native time series capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both sophisticated temporal analytics and familiar database interaction patterns, ensuring your time series solutions remain both effective and maintainable as they scale and evolve.

MongoDB Change Streams and Real-Time Data Processing: SQL-Style Event-Driven Architecture for Reactive Applications

Modern applications require real-time responsiveness to data changes - instant notifications, live dashboards, automatic workflow triggers, and synchronized data across distributed systems. Traditional approaches that poll the database for changes create significant performance overhead, add latency, and consume unnecessary resources while missing the precision and immediacy that users expect from contemporary applications.

MongoDB Change Streams provide enterprise-grade real-time data processing capabilities that monitor database changes as they occur, delivering instant event notifications with complete change context, ordering guarantees, and resumability features. Unlike polling-based approaches or complex trigger systems, Change Streams integrate seamlessly with application architectures to enable reactive programming patterns and event-driven workflows.
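
Before looking at the traditional alternatives below, a minimal sketch shows the core interaction model: open a stream, iterate events as they arrive, and persist the resume token so processing can continue after a restart. The connection string and the 'orders' collection are assumptions for the example; the production-oriented examples later in this article build on the same pattern.

// Minimal change stream sketch: watch a collection, react to events, and keep
// the resume token so processing can continue after a crash or redeploy.
const { MongoClient } = require('mongodb');

async function watchOrders() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Optionally pass { resumeAfter: savedToken } to continue from a prior session
  const changeStream = orders.watch([
    { $match: { operationType: { $in: ['insert', 'update'] } } }
  ], { fullDocument: 'updateLookup' });

  for await (const change of changeStream) {
    console.log(change.operationType, change.documentKey._id);
    const resumeToken = change._id; // persist this somewhere durable
    // ... handle the event, then save resumeToken before acknowledging
  }
}

watchOrders().catch(console.error);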

The Traditional Change Detection Challenge

Conventional approaches to detecting data changes have significant limitations for real-time applications:

-- Traditional polling approach - inefficient and high-latency
-- Application repeatedly queries database for changes

-- PostgreSQL change detection with polling
CREATE TABLE user_activities (
    activity_id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    activity_type VARCHAR(100) NOT NULL,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP,
    is_processed BOOLEAN DEFAULT false
);

-- Trigger to update timestamp on changes
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = CURRENT_TIMESTAMP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_user_activities_updated_at
    BEFORE UPDATE ON user_activities
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

-- Application polling for changes (inefficient)
-- This query runs continuously every few seconds
SELECT 
    activity_id,
    user_id,
    activity_type,
    activity_data,
    created_at,
    updated_at
FROM user_activities 
WHERE (updated_at > @last_poll_time OR created_at > @last_poll_time)
  AND is_processed = false
ORDER BY created_at, updated_at
LIMIT 1000;

-- Update processed records
UPDATE user_activities 
SET is_processed = true, processed_at = CURRENT_TIMESTAMP
WHERE activity_id IN (@processed_ids);

-- Problems with polling approach:
-- 1. High database load from constant polling queries
-- 2. Polling frequency vs. latency tradeoff (faster polling = more load)
-- 3. Potential race conditions with concurrent processors
-- 4. No ordering guarantees across multiple tables
-- 5. Missed changes during application downtime
-- 6. Complex state management for resuming processing
-- 7. Difficult to scale across multiple application instances
-- 8. Resource waste during periods of no activity

-- Database triggers approach - limited and fragile
CREATE OR REPLACE FUNCTION notify_change()
RETURNS TRIGGER AS $$
BEGIN
    -- Limited payload size in PostgreSQL notifications
    PERFORM pg_notify(
        'user_activity_change',
        json_build_object(
            'operation', TG_OP,
            'table', TG_TABLE_NAME,
            'id', COALESCE(NEW.activity_id, OLD.activity_id)
        )::text
    );

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_activities_change_trigger
    AFTER INSERT OR UPDATE OR DELETE ON user_activities
    FOR EACH ROW EXECUTE FUNCTION notify_change();

-- Application listening for notifications
-- Limited payload, no automatic reconnection, fragile connections
LISTEN user_activity_change;

-- Trigger limitations:
-- - Limited payload size (8000 bytes in PostgreSQL)
-- - Connection-based, not resilient to network issues  
-- - No built-in resume capability after disconnection
-- - Complex coordination across multiple database connections
-- - Difficult to filter events at database level
-- - No ordering guarantees across transactions
-- - Performance impact on write operations

MongoDB Change Streams provide comprehensive real-time change processing:

// MongoDB Change Streams - enterprise-grade real-time data processing
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('production_app');

// Comprehensive change stream with advanced filtering and processing
async function setupAdvancedChangeStream() {
  // Create change stream with sophisticated pipeline filtering
  const changeStream = db.collection('user_activities').watch([
    // Match specific operations and conditions
    {
      $match: {
        $and: [
          // Only monitor insert and update operations
          { operationType: { $in: ['insert', 'update', 'replace'] } },

          // Filter by activity types we care about
          {
            $or: [
              { 'fullDocument.activity_type': { $in: ['purchase', 'login', 'signup'] } },
              { 'updateDescription.updatedFields.status': { $exists: true } },
              { 'fullDocument.priority': 'high' }
            ]
          },

          // Only process activities for active users
          { 'fullDocument.user_status': 'active' },

          // Exclude system-generated activities
          { 'fullDocument.source': { $ne: 'system_maintenance' } }
        ]
      }
    },

    // Enrich change events with additional context
    // NOTE: $lookup is not among the stages MongoDB permits in change stream
    // pipelines ($match, $project, $addFields, $set, $unset, $replaceRoot,
    // $replaceWith, $redact); in production the user lookup below would be
    // performed inside the event handler instead.
    {
      $lookup: {
        from: 'users',
        localField: 'fullDocument.user_id',
        foreignField: '_id',
        as: 'user_info'
      }
    },

    // Add computed fields for processing
    {
      $addFields: {
        processedAt: new Date(),
        changeId: { $toString: '$_id' },
        user: { $arrayElemAt: ['$user_info', 0] },

        // Categorize change types
        changeCategory: {
          $switch: {
            branches: [
              { case: { $eq: ['$operationType', 'insert'] }, then: 'new_activity' },
              { 
                case: { 
                  $and: [
                    { $eq: ['$operationType', 'update'] },
                    { $ifNull: ['$updateDescription.updatedFields.status', false] }
                  ]
                }, 
                then: 'status_change' 
              },
              { case: { $eq: ['$operationType', 'replace'] }, then: 'activity_replaced' }
            ],
            default: 'other_change'
          }
        },

        // Priority scoring
        priorityScore: {
          $switch: {
            branches: [
              { case: { $eq: ['$fullDocument.activity_type', 'purchase'] }, then: 10 },
              { case: { $eq: ['$fullDocument.activity_type', 'signup'] }, then: 8 },
              { case: { $eq: ['$fullDocument.activity_type', 'login'] }, then: 3 },
              { case: { $eq: ['$fullDocument.priority', 'high'] }, then: 9 }
            ],
            default: 5
          }
        }
      }
    },

    // Project final change document structure
    {
      $project: {
        changeId: 1,
        operationType: 1,
        changeCategory: 1,
        priorityScore: 1,
        processedAt: 1,
        clusterTime: 1,

        // Original document data
        documentKey: 1,
        fullDocument: 1,
        updateDescription: 1,

        // User context
        'user.username': 1,
        'user.email': 1,
        'user.subscription_type': 1,
        'user.segment': 1,

        // Metadata
        ns: 1,
        to: 1
      }
    }
  ], {
    // Change stream options
    fullDocument: 'updateLookup',        // Always include full document
    fullDocumentBeforeChange: 'whenAvailable', // Include before-change document
    resumeAfter: null,                   // Resume token (set from previous session)
    startAtOperationTime: null,          // Start from specific time
    maxAwaitTimeMS: 1000,               // Maximum time to wait for changes
    batchSize: 100,                      // Batch size for change events
    collation: { locale: 'en', strength: 2 } // Collation for text matching
  });

  // Process change stream events
  console.log('Monitoring user activities for real-time changes...');

  for await (const change of changeStream) {
    try {
      await processChangeEvent(change);

      // Store resume token for fault tolerance
      await storeResumeToken(change._id);

    } catch (error) {
      console.error('Error processing change event:', error);

      // Implement error handling strategy
      await handleChangeProcessingError(change, error);
    }
  }
}

// Sophisticated change event processing
async function processChangeEvent(change) {
  console.log(`Processing ${change.changeCategory} event:`, {
    changeId: change.changeId,
    operationType: change.operationType,
    priority: change.priorityScore,
    user: change.user?.username,
    timestamp: change.processedAt
  });

  // Route change events based on type and priority
  switch (change.changeCategory) {
    case 'new_activity':
      await handleNewActivity(change);
      break;

    case 'status_change':
      await handleStatusChange(change);
      break;

    case 'activity_replaced':
      await handleActivityReplacement(change);
      break;

    default:
      await handleGenericChange(change);
  }

  // Emit real-time event to connected clients
  await emitRealTimeEvent(change);

  // Update analytics and metrics
  await updateRealtimeMetrics(change);
}

async function handleNewActivity(change) {
  const activity = change.fullDocument;
  const user = change.user;

  // Process high-priority activities immediately
  if (change.priorityScore >= 8) {
    await processHighPriorityActivity(activity, user);
  }

  // Trigger automated workflows
  switch (activity.activity_type) {
    case 'purchase':
      await triggerPurchaseWorkflow(activity, user);
      break;

    case 'signup':
      await triggerOnboardingWorkflow(activity, user);
      break;

    case 'login':
      await updateUserSession(activity, user);
      break;
  }

  // Update real-time dashboards
  await updateLiveDashboard('new_activity', {
    activityType: activity.activity_type,
    userId: activity.user_id,
    userSegment: user.segment,
    timestamp: activity.created_at
  });
}

async function handleStatusChange(change) {
  const updatedFields = change.updateDescription.updatedFields;
  const activity = change.fullDocument;

  // Process status-specific logic
  if (updatedFields.status) {
    console.log(`Activity status changed: ${updatedFields.status}`);

    switch (updatedFields.status) {
      case 'completed':
        await handleActivityCompletion(activity);
        break;

      case 'failed':
        await handleActivityFailure(activity);
        break;

      case 'cancelled':
        await handleActivityCancellation(activity);
        break;
    }
  }

  // Notify interested parties
  await sendStatusChangeNotification(change);
}

// Benefits of MongoDB Change Streams:
// - Real-time event delivery with sub-second latency
// - Complete change context including before/after state
// - Resumable streams with automatic fault tolerance
// - Advanced filtering and transformation capabilities
// - Ordering guarantees within and across collections
// - Integration with existing MongoDB infrastructure
// - Scalable across sharded clusters and replica sets
// - Built-in authentication and authorization
// - No polling overhead or resource waste
// - Developer-friendly API with powerful aggregation pipeline

Understanding MongoDB Change Streams Architecture

Advanced Change Stream Configuration and Management

Implement comprehensive change stream management for production environments:

// Advanced change stream management system
class MongoChangeStreamManager {
  constructor(client, options = {}) {
    this.client = client;
    this.db = client.db(options.database || 'production');
    this.options = {
      // Stream configuration
      maxRetries: options.maxRetries || 10,
      retryDelay: options.retryDelay || 1000,
      batchSize: options.batchSize || 100,
      maxAwaitTimeMS: options.maxAwaitTimeMS || 1000,

      // Resume configuration
      enableResume: options.enableResume !== false,
      resumeTokenStorage: options.resumeTokenStorage || 'mongodb',

      // Error handling
      errorRetryStrategies: options.errorRetryStrategies || ['exponential_backoff', 'circuit_breaker'],

      // Monitoring
      enableMetrics: options.enableMetrics !== false,
      metricsInterval: options.metricsInterval || 30000,

      ...options
    };

    this.activeStreams = new Map();
    this.resumeTokens = new Map();
    this.streamMetrics = new Map();
    this.eventHandlers = new Map();
    this.redisClient = options.redisClient || null; // optional Redis client for resume token storage
    this.isShuttingDown = false;
  }

  async createChangeStream(streamConfig) {
    const {
      streamId,
      collection,
      pipeline = [],
      options = {},
      eventHandlers = {}
    } = streamConfig;

    if (this.activeStreams.has(streamId)) {
      throw new Error(`Change stream with ID '${streamId}' already exists`);
    }

    // Build comprehensive change stream pipeline
    const baseFilters = [
      // Operation type filtering
      streamConfig.operationTypes ? {
        operationType: { $in: streamConfig.operationTypes }
      } : {},

      // Namespace filtering
      streamConfig.namespaces ? {
        'ns.coll': { $in: streamConfig.namespaces.map(ns => ns.collection || ns) }
      } : {},

      // Custom filtering
      ...(streamConfig.filters || [])
    ].filter(filter => Object.keys(filter).length > 0);

    const changeStreamPipeline = [
      // Base filtering ($and must be non-empty, so omit $match when no filters apply)
      ...(baseFilters.length > 0 ? [{ $match: { $and: baseFilters } }] : []),

      // Enrichment lookups
      // (note: as above, $lookup is not a supported change stream pipeline stage;
      // configured enrichments would normally be resolved inside event handlers)
      ...(streamConfig.enrichments || []).map(enrichment => ({
        $lookup: {
          from: enrichment.from,
          localField: enrichment.localField,
          foreignField: enrichment.foreignField,
          as: enrichment.as,
          pipeline: enrichment.pipeline || []
        }
      })),

      // Computed fields
      {
        $addFields: {
          streamId: streamId,
          processedAt: new Date(),
          changeId: { $toString: '$_id' },

          // Change categorization
          changeCategory: streamConfig.categorization || {
            $switch: {
              branches: [
                { case: { $eq: ['$operationType', 'insert'] }, then: 'create' },
                { case: { $eq: ['$operationType', 'update'] }, then: 'update' },
                { case: { $eq: ['$operationType', 'replace'] }, then: 'replace' },
                { case: { $eq: ['$operationType', 'delete'] }, then: 'delete' }
              ],
              default: 'other'
            }
          },

          // Priority scoring
          priority: streamConfig.priorityScoring || 5,

          // Custom computed fields
          ...streamConfig.computedFields || {}
        }
      },

      // Additional pipeline stages
      ...pipeline,

      // Final projection
      {
        $project: {
          _id: 1,
          streamId: 1,
          changeId: 1,
          processedAt: 1,
          operationType: 1,
          changeCategory: 1,
          priority: 1,
          clusterTime: 1,
          documentKey: 1,
          fullDocument: 1,
          updateDescription: 1,
          ns: 1,
          to: 1,
          ...streamConfig.additionalProjection || {}
        }
      }
    ];

    // Configure change stream options
    const changeStreamOptions = {
      fullDocument: streamConfig.fullDocument || 'updateLookup',
      fullDocumentBeforeChange: streamConfig.fullDocumentBeforeChange || 'whenAvailable',
      resumeAfter: await this.getStoredResumeToken(streamId),
      maxAwaitTimeMS: this.options.maxAwaitTimeMS,
      batchSize: this.options.batchSize,
      ...options
    };

    // Create change stream
    const changeStream = collection ? 
      this.db.collection(collection).watch(changeStreamPipeline, changeStreamOptions) :
      this.db.watch(changeStreamPipeline, changeStreamOptions);

    // Store stream configuration (keep the fully built pipeline for restarts)
    this.activeStreams.set(streamId, {
      stream: changeStream,
      config: streamConfig,
      pipeline: changeStreamPipeline,
      options: changeStreamOptions,
      createdAt: new Date(),
      lastEventAt: null,
      eventCount: 0,
      errorCount: 0,
      retryCount: 0
    });

    // Initialize metrics
    this.streamMetrics.set(streamId, {
      eventsProcessed: 0,
      errorsEncountered: 0,
      avgProcessingTime: 0,
      lastProcessingTime: 0,
      throughputHistory: [],
      errorHistory: [],
      resumeHistory: []
    });

    // Store event handlers
    this.eventHandlers.set(streamId, eventHandlers);

    // Start processing
    this.processChangeStream(streamId);

    console.log(`Change stream '${streamId}' created and started`);
    return streamId;
  }

  async processChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    const metrics = this.streamMetrics.get(streamId);
    const handlers = this.eventHandlers.get(streamId);

    if (!streamInfo) {
      console.error(`Change stream '${streamId}' not found`);
      return;
    }

    const { stream, config } = streamInfo;

    try {
      console.log(`Starting event processing for stream: ${streamId}`);

      for await (const change of stream) {
        if (this.isShuttingDown) {
          console.log(`Shutting down stream: ${streamId}`);
          break;
        }

        const processingStartTime = Date.now();

        try {
          // Process the change event
          await this.processChangeEvent(streamId, change, handlers);

          // Update metrics
          const processingTime = Date.now() - processingStartTime;
          this.updateStreamMetrics(streamId, processingTime, true);

          // Store resume token
          await this.storeResumeToken(streamId, change._id);

          // Update stream info
          streamInfo.lastEventAt = new Date();
          streamInfo.eventCount++;

        } catch (error) {
          console.error(`Error processing change event in stream '${streamId}':`, error);

          // Update error metrics
          const processingTime = Date.now() - processingStartTime;
          this.updateStreamMetrics(streamId, processingTime, false);

          streamInfo.errorCount++;

          // Handle processing error
          await this.handleProcessingError(streamId, change, error);
        }
      }

    } catch (error) {
      console.error(`Change stream '${streamId}' encountered error:`, error);

      if (!this.isShuttingDown) {
        await this.handleStreamError(streamId, error);
      }
    }
  }

  async processChangeEvent(streamId, change, handlers) {
    // Route to appropriate handler based on change type
    const handlerKey = change.changeCategory || change.operationType;
    const handler = handlers[handlerKey] || handlers.default || this.defaultEventHandler;

    if (typeof handler === 'function') {
      await handler(change, {
        streamId,
        metrics: this.streamMetrics.get(streamId),
        resumeToken: change._id
      });
    } else {
      console.warn(`No handler found for change type '${handlerKey}' in stream '${streamId}'`);
    }
  }

  async defaultEventHandler(change, context) {
    console.log(`Default handler processing change:`, {
      streamId: context.streamId,
      changeId: change.changeId,
      operationType: change.operationType,
      collection: change.ns?.coll
    });
  }

  updateStreamMetrics(streamId, processingTime, success) {
    const metrics = this.streamMetrics.get(streamId);
    if (!metrics) return;

    metrics.eventsProcessed++;
    metrics.lastProcessingTime = processingTime;

    // Update average processing time (exponential moving average)
    metrics.avgProcessingTime = (metrics.avgProcessingTime * 0.9) + (processingTime * 0.1);

    if (success) {
      // Update throughput history
      metrics.throughputHistory.push({
        timestamp: Date.now(),
        processingTime: processingTime
      });

      // Keep only recent history
      if (metrics.throughputHistory.length > 1000) {
        metrics.throughputHistory.shift();
      }
    } else {
      metrics.errorsEncountered++;

      // Record error
      metrics.errorHistory.push({
        timestamp: Date.now(),
        processingTime: processingTime
      });

      // Keep only recent error history
      if (metrics.errorHistory.length > 100) {
        metrics.errorHistory.shift();
      }
    }
  }

  async handleProcessingError(streamId, change, error) {
    const streamInfo = this.activeStreams.get(streamId);
    const config = streamInfo?.config;

    // Log error details
    console.error(`Processing error in stream '${streamId}':`, {
      changeId: change.changeId,
      operationType: change.operationType,
      error: error.message
    });

    // Apply error handling strategies
    if (config?.errorHandling) {
      const strategy = config.errorHandling.strategy || 'log';

      switch (strategy) {
        case 'retry':
          await this.retryChangeEvent(streamId, change, error);
          break;

        case 'deadletter':
          await this.sendToDeadLetter(streamId, change, error);
          break;

        case 'skip':
          console.warn(`Skipping failed change event: ${change.changeId}`);
          break;

        case 'stop_stream':
          console.error(`Stopping stream '${streamId}' due to processing error`);
          await this.stopChangeStream(streamId);
          break;

        default:
          console.error(`Unhandled processing error in stream '${streamId}'`);
      }
    }
  }

  async handleStreamError(streamId, error) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) return;

    console.error(`Stream error in '${streamId}':`, error.message);

    // Increment retry count
    streamInfo.retryCount++;

    // Check if we should retry
    if (streamInfo.retryCount <= this.options.maxRetries) {
      console.log(`Retrying stream '${streamId}' (attempt ${streamInfo.retryCount})`);

      // Exponential backoff
      const delay = this.options.retryDelay * Math.pow(2, streamInfo.retryCount - 1);
      await this.sleep(delay);

      // Record resume attempt
      const metrics = this.streamMetrics.get(streamId);
      if (metrics) {
        metrics.resumeHistory.push({
          timestamp: Date.now(),
          attempt: streamInfo.retryCount,
          error: error.message
        });
      }

      // Restart the stream
      await this.restartChangeStream(streamId);
    } else {
      console.error(`Maximum retries exceeded for stream '${streamId}'. Marking as failed.`);
      streamInfo.status = 'failed';
      streamInfo.lastError = error;
    }
  }

  async restartChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) return;

    console.log(`Restarting change stream: ${streamId}`);

    try {
      // Close existing stream
      await streamInfo.stream.close();
    } catch (closeError) {
      console.warn(`Error closing stream '${streamId}':`, closeError.message);
    }

    // Update stream options with resume token
    const resumeToken = await this.getStoredResumeToken(streamId);
    if (resumeToken) {
      streamInfo.options.resumeAfter = resumeToken;
      console.log(`Resuming stream '${streamId}' from stored token`);
    }

    // Create new change stream
    const changeStreamPipeline = streamInfo.config.pipeline || [];
    const newStream = streamInfo.config.collection ? 
      this.db.collection(streamInfo.config.collection).watch(changeStreamPipeline, streamInfo.options) :
      this.db.watch(changeStreamPipeline, streamInfo.options);

    // Update stream reference
    streamInfo.stream = newStream;
    streamInfo.restartedAt = new Date();

    // Resume processing
    this.processChangeStream(streamId);
  }

  async storeResumeToken(streamId, resumeToken) {
    if (!this.options.enableResume) return;

    this.resumeTokens.set(streamId, {
      token: resumeToken,
      timestamp: new Date()
    });

    // Store persistently based on configuration
    if (this.options.resumeTokenStorage === 'mongodb') {
      await this.db.collection('change_stream_resume_tokens').updateOne(
        { streamId: streamId },
        {
          $set: {
            resumeToken: resumeToken,
            updatedAt: new Date()
          }
        },
        { upsert: true }
      );
    } else if (this.options.resumeTokenStorage === 'redis' && this.redisClient) {
      await this.redisClient.set(
        `resume_token:${streamId}`,
        JSON.stringify({
          token: resumeToken,
          timestamp: new Date()
        })
      );
    }
  }

  async getStoredResumeToken(streamId) {
    if (!this.options.enableResume) return null;

    // Check memory first
    const memoryToken = this.resumeTokens.get(streamId);
    if (memoryToken) {
      return memoryToken.token;
    }

    // Load from persistent storage
    try {
      if (this.options.resumeTokenStorage === 'mongodb') {
        const tokenDoc = await this.db.collection('change_stream_resume_tokens').findOne(
          { streamId: streamId }
        );
        return tokenDoc?.resumeToken || null;
      } else if (this.options.resumeTokenStorage === 'redis' && this.redisClient) {
        const tokenData = await this.redisClient.get(`resume_token:${streamId}`);
        return tokenData ? JSON.parse(tokenData).token : null;
      }
    } catch (error) {
      console.warn(`Error loading resume token for stream '${streamId}':`, error.message);
    }

    return null;
  }

  async stopChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) {
      console.warn(`Change stream '${streamId}' not found`);
      return;
    }

    console.log(`Stopping change stream: ${streamId}`);

    try {
      await streamInfo.stream.close();
      streamInfo.stoppedAt = new Date();
      streamInfo.status = 'stopped';

      console.log(`Change stream '${streamId}' stopped successfully`);
    } catch (error) {
      console.error(`Error stopping stream '${streamId}':`, error);
    }
  }

  async getStreamMetrics(streamId) {
    if (streamId) {
      return {
        streamInfo: this.activeStreams.get(streamId),
        metrics: this.streamMetrics.get(streamId)
      };
    } else {
      // Return metrics for all streams
      const allMetrics = {};
      for (const [id, streamInfo] of this.activeStreams.entries()) {
        allMetrics[id] = {
          streamInfo: streamInfo,
          metrics: this.streamMetrics.get(id)
        };
      }
      return allMetrics;
    }
  }

  async startMonitoring() {
    if (this.monitoringInterval) return;

    console.log('Starting change stream monitoring');

    this.monitoringInterval = setInterval(async () => {
      try {
        await this.performHealthCheck();
      } catch (error) {
        console.error('Monitoring check failed:', error);
      }
    }, this.options.metricsInterval);
  }

  async performHealthCheck() {
    for (const [streamId, streamInfo] of this.activeStreams.entries()) {
      const metrics = this.streamMetrics.get(streamId);
      if (!metrics) continue;

      // Check stream health
      const health = this.assessStreamHealth(streamId, streamInfo, metrics);

      if (health.status !== 'healthy') {
        console.warn(`Stream '${streamId}' health check:`, health);
      }

      // Log throughput metrics
      if (metrics.throughputHistory.length > 0) {
        const recentEvents = metrics.throughputHistory.filter(
          event => Date.now() - event.timestamp < 60000 // Last minute
        );

        if (recentEvents.length > 0) {
          const eventsPerMinute = recentEvents.length;
          console.log(`Stream '${streamId}' throughput: ${eventsPerMinute} events/minute`);
        }
      }
    }
  }

  assessStreamHealth(streamId, streamInfo, metrics) {
    const health = {
      streamId: streamId,
      status: 'healthy',
      issues: [],
      recommendations: []
    };

    // Check error rate
    if (metrics.errorsEncountered > 0 && metrics.eventsProcessed > 0) {
      const errorRate = (metrics.errorsEncountered / metrics.eventsProcessed) * 100;
      if (errorRate > 10) {
        health.status = 'unhealthy';
        health.issues.push(`High error rate: ${errorRate.toFixed(2)}%`);
        health.recommendations.push('Investigate error patterns and processing logic');
      } else if (errorRate > 5) {
        health.status = 'warning';
        health.issues.push(`Elevated error rate: ${errorRate.toFixed(2)}%`);
      }
    }

    // Check processing performance
    if (metrics.avgProcessingTime > 5000) {
      health.issues.push(`Slow processing: ${metrics.avgProcessingTime.toFixed(0)}ms average`);
      health.recommendations.push('Optimize event processing logic');
      if (health.status === 'healthy') health.status = 'warning';
    }

    // Check stream activity
    const timeSinceLastEvent = streamInfo.lastEventAt ? 
      Date.now() - streamInfo.lastEventAt.getTime() : 
      Date.now() - streamInfo.createdAt.getTime();

    if (timeSinceLastEvent > 3600000) { // 1 hour
      health.issues.push(`No events for ${Math.round(timeSinceLastEvent / 60000)} minutes`);
      health.recommendations.push('Verify data source and stream configuration');
      if (health.status === 'healthy') health.status = 'warning';
    }

    // Check retry count
    if (streamInfo.retryCount > 3) {
      health.issues.push(`Multiple retries: ${streamInfo.retryCount} attempts`);
      health.recommendations.push('Investigate connection stability and error causes');
      if (health.status === 'healthy') health.status = 'warning';
    }

    return health;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  async shutdown() {
    console.log('Shutting down change stream manager...');

    this.isShuttingDown = true;

    // Stop monitoring
    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
    }

    // Close all active streams
    const closePromises = [];
    for (const [streamId] of this.activeStreams.entries()) {
      closePromises.push(this.stopChangeStream(streamId));
    }

    await Promise.all(closePromises);

    console.log('Change stream manager shutdown complete');
  }
}
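
Before building higher-level patterns on top of the manager, here is a minimal usage sketch. The class name and constructor signature (a connected Db plus an options object) are assumptions inferred from the fields the methods above read, so adjust them to the actual constructor defined earlier in this article:

// Minimal usage sketch -- 'ChangeStreamManager' is a placeholder name for the
// manager class above; option names mirror the fields its methods read
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const manager = new ChangeStreamManager(client.db('ecommerce'), {
    maxRetries: 5,            // used by handleStreamError()
    retryDelay: 1000,         // base delay in ms, doubled on each retry
    metricsInterval: 30000,   // health check frequency for startMonitoring()
    enableResume: true,
    resumeTokenStorage: 'mongodb'
  });

  await manager.startMonitoring();

  // Close all streams and stop monitoring on shutdown
  process.on('SIGTERM', async () => {
    await manager.shutdown();
    await client.close();
    process.exit(0);
  });
}

main().catch(console.error);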

Real-Time Event Processing Patterns

Implement sophisticated event processing patterns for different application scenarios:

// Specialized change stream patterns for different use cases
const { EventEmitter } = require('events');

class RealtimeEventPatterns {
  constructor(changeStreamManager) {
    this.csm = changeStreamManager;
    this.eventBus = new EventEmitter();
    this.processors = new Map();

    // Optional external integrations used by emitToExternalSystems(); set by the host application
    this.wsServer = null;
    this.messageQueue = null;
    this.webhookHandler = null;
  }

  async setupUserActivityStream() {
    // Real-time user activity monitoring
    return await this.csm.createChangeStream({
      streamId: 'user_activities',
      collection: 'user_activities',
      operationTypes: ['insert', 'update'],

      filters: [
        { 'fullDocument.activity_type': { $in: ['login', 'purchase', 'view', 'search'] } },
        { 'fullDocument.user_id': { $exists: true } }
      ],

      enrichments: [
        {
          from: 'users',
          localField: 'fullDocument.user_id',
          foreignField: '_id',
          as: 'user_data'
        },
        {
          from: 'user_sessions',
          localField: 'fullDocument.session_id',
          foreignField: '_id',
          as: 'session_data'
        }
      ],

      computedFields: {
        activityScore: {
          $switch: {
            branches: [
              { case: { $eq: ['$fullDocument.activity_type', 'purchase'] }, then: 100 },
              { case: { $eq: ['$fullDocument.activity_type', 'login'] }, then: 10 },
              { case: { $eq: ['$fullDocument.activity_type', 'search'] }, then: 5 },
              { case: { $eq: ['$fullDocument.activity_type', 'view'] }, then: 1 }
            ],
            default: 0
          }
        },

        userSegment: { $arrayElemAt: ['$user_data.segment', 0] },
        sessionDuration: { $arrayElemAt: ['$session_data.duration', 0] }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleNewUserActivity(change);
        },
        update: async (change, context) => {
          await this.handleUserActivityUpdate(change);
        }
      },

      errorHandling: {
        strategy: 'retry',
        maxRetries: 3
      }
    });
  }

  async handleNewUserActivity(change) {
    const activity = change.fullDocument;
    const user = change.user_data?.[0];

    console.log(`New user activity: ${activity.activity_type}`, {
      userId: activity.user_id,
      username: user?.username,
      activityScore: change.activityScore,
      timestamp: activity.created_at
    });

    // Real-time user engagement tracking
    await this.updateUserEngagement(activity, user);

    // Trigger personalization engine
    if (change.activityScore >= 5) {
      await this.triggerPersonalizationUpdate(activity, user);
    }

    // Real-time recommendations
    if (activity.activity_type === 'view' || activity.activity_type === 'search') {
      await this.updateRecommendations(activity, user);
    }

    // Fraud detection for high-value activities
    if (activity.activity_type === 'purchase') {
      await this.analyzeFraudRisk(activity, user, change.session_data?.[0]);
    }

    // Live dashboard updates
    this.eventBus.emit('user_activity', {
      type: 'new_activity',
      activity: activity,
      user: user,
      score: change.activityScore
    });
  }

  async setupOrderProcessingStream() {
    // Real-time order processing and fulfillment
    return await this.csm.createChangeStream({
      streamId: 'order_processing',
      collection: 'orders',
      operationTypes: ['insert', 'update'],

      filters: [
        {
          $or: [
            { operationType: 'insert' },
            { 'updateDescription.updatedFields.status': { $exists: true } }
          ]
        }
      ],

      enrichments: [
        {
          from: 'customers',
          localField: 'fullDocument.customer_id',
          foreignField: '_id',
          as: 'customer_data'
        },
        {
          from: 'inventory',
          localField: 'fullDocument.items.product_id',
          foreignField: '_id',
          as: 'inventory_data'
        }
      ],

      computedFields: {
        orderValue: '$fullDocument.total_amount',
        orderPriority: {
          $switch: {
            branches: [
              { case: { $gt: ['$fullDocument.total_amount', 1000] }, then: 'high' },
              { case: { $gt: ['$fullDocument.total_amount', 500] }, then: 'medium' }
            ],
            default: 'normal'
          }
        },
        customerTier: { $arrayElemAt: ['$customer_data.tier', 0] }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleNewOrder(change);
        },
        update: async (change, context) => {
          await this.handleOrderStatusChange(change);
        }
      }
    });
  }

  async handleNewOrder(change) {
    const order = change.fullDocument;
    const customer = change.customer_data?.[0];

    console.log(`New order received:`, {
      orderId: order._id,
      customerId: order.customer_id,
      customerTier: change.customerTier,
      orderValue: change.orderValue,
      priority: change.orderPriority
    });

    // Inventory allocation
    await this.allocateInventory(order, change.inventory_data);

    // Payment processing
    if (order.payment_method) {
      await this.processPayment(order, customer);
    }

    // Shipping calculation
    await this.calculateShipping(order, customer);

    // Notification systems
    await this.sendOrderConfirmation(order, customer);

    // Analytics and reporting
    this.eventBus.emit('new_order', {
      order: order,
      customer: customer,
      priority: change.orderPriority,
      value: change.orderValue
    });
  }

  async handleOrderStatusChange(change) {
    const updatedFields = change.updateDescription.updatedFields;
    const order = change.fullDocument;

    if (updatedFields.status) {
      console.log(`Order status changed: ${order._id} -> ${updatedFields.status}`);

      switch (updatedFields.status) {
        case 'confirmed':
          await this.handleOrderConfirmation(order);
          break;
        case 'shipped':
          await this.handleOrderShipment(order);
          break;
        case 'delivered':
          await this.handleOrderDelivery(order);
          break;
        case 'cancelled':
          await this.handleOrderCancellation(order);
          break;
      }

      // Customer notifications
      await this.sendStatusUpdateNotification(order, updatedFields.status);
    }
  }

  async setupInventoryManagementStream() {
    // Real-time inventory tracking and alerts
    return await this.csm.createChangeStream({
      streamId: 'inventory_management',
      collection: 'inventory',
      operationTypes: ['update'],

      filters: [
        {
          $or: [
            { 'updateDescription.updatedFields.quantity': { $exists: true } },
            { 'updateDescription.updatedFields.reserved_quantity': { $exists: true } },
            { 'updateDescription.updatedFields.available_quantity': { $exists: true } }
          ]
        }
      ],

      enrichments: [
        {
          from: 'products',
          localField: 'documentKey._id',
          foreignField: 'inventory_id',
          as: 'product_data'
        }
      ],

      computedFields: {
        stockLevel: '$fullDocument.available_quantity',
        reorderThreshold: '$fullDocument.reorder_level',
        stockStatus: {
          $cond: {
            if: { $lte: ['$fullDocument.available_quantity', '$fullDocument.reorder_level'] },
            then: 'low_stock',
            else: 'in_stock'
          }
        }
      },

      eventHandlers: {
        update: async (change, context) => {
          await this.handleInventoryChange(change);
        }
      }
    });
  }

  async handleInventoryChange(change) {
    const inventory = change.fullDocument;
    const updatedFields = change.updateDescription.updatedFields;
    const product = change.product_data?.[0];

    console.log(`Inventory updated:`, {
      productId: product?._id,
      productName: product?.name,
      updatedQuantity: updatedFields.quantity, // updatedFields carries post-update values
      currentQuantity: inventory.available_quantity,
      stockStatus: change.stockStatus
    });

    // Low stock alerts
    if (change.stockStatus === 'low_stock') {
      await this.triggerLowStockAlert(inventory, product);
    }

    // Out of stock handling
    if (inventory.available_quantity <= 0) {
      await this.handleOutOfStock(inventory, product);
    }

    // Automatic reordering
    if (inventory.auto_reorder && inventory.available_quantity <= inventory.reorder_level) {
      await this.triggerAutomaticReorder(inventory, product);
    }

    // Live inventory dashboard
    this.eventBus.emit('inventory_change', {
      inventory: inventory,
      product: product,
      stockStatus: change.stockStatus,
      // Approximation: updatedFields holds post-update values, so a precise
      // delta would require pre-images (fullDocumentBeforeChange)
      quantityChange: updatedFields.quantity ?
        inventory.available_quantity - updatedFields.quantity : 0
    });
  }

  async setupMultiCollectionStream() {
    // Monitor changes across multiple collections
    return await this.csm.createChangeStream({
      streamId: 'multi_collection_monitor',
      operationTypes: ['insert', 'update', 'delete'],

      filters: [
        {
          'ns.coll': { 
            $in: ['users', 'orders', 'products', 'reviews'] 
          }
        }
      ],

      computedFields: {
        collectionType: '$ns.coll',
        businessImpact: {
          $switch: {
            branches: [
              { case: { $eq: ['$ns.coll', 'orders'] }, then: 'high' },
              { case: { $eq: ['$ns.coll', 'users'] }, then: 'medium' },
              { case: { $eq: ['$ns.coll', 'products'] }, then: 'medium' },
              { case: { $eq: ['$ns.coll', 'reviews'] }, then: 'low' }
            ],
            default: 'unknown'
          }
        }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleMultiCollectionInsert(change);
        },
        update: async (change, context) => {
          await this.handleMultiCollectionUpdate(change);
        },
        delete: async (change, context) => {
          await this.handleMultiCollectionDelete(change);
        }
      }
    });
  }

  async handleMultiCollectionInsert(change) {
    const collection = change.ns.coll;

    switch (collection) {
      case 'users':
        await this.handleNewUser(change.fullDocument);
        break;
      case 'orders':
        await this.handleNewOrder(change);
        break;
      case 'products':
        await this.handleNewProduct(change.fullDocument);
        break;
      case 'reviews':
        await this.handleNewReview(change.fullDocument);
        break;
    }

    // Cross-collection analytics
    await this.updateCrossCollectionMetrics(collection, 'insert');
  }

  async setupAggregationUpdateStream() {
    // Monitor changes that require aggregation updates
    return await this.csm.createChangeStream({
      streamId: 'aggregation_updates',
      operationTypes: ['insert', 'update', 'delete'],

      filters: [
        {
          $or: [
            // Order changes affecting customer metrics
            { 
              $and: [
                { 'ns.coll': 'orders' },
                { 'fullDocument.status': 'completed' }
              ]
            },
            // Review changes affecting product ratings
            { 'ns.coll': 'reviews' },
            // Activity changes affecting user engagement
            { 
              $and: [
                { 'ns.coll': 'user_activities' },
                { 'fullDocument.activity_type': { $in: ['purchase', 'view', 'like'] } }
              ]
            }
          ]
        }
      ],

      eventHandlers: {
        default: async (change, context) => {
          await this.handleAggregationUpdate(change);
        }
      }
    });
  }

  async handleAggregationUpdate(change) {
    const collection = change.ns.coll;
    const document = change.fullDocument;

    switch (collection) {
      case 'orders':
        if (document.status === 'completed') {
          await this.updateCustomerMetrics(document.customer_id);
          await this.updateProductSalesMetrics(document.items);
        }
        break;

      case 'reviews':
        await this.updateProductRatings(document.product_id);
        break;

      case 'user_activities':
        await this.updateUserEngagementMetrics(document.user_id);
        break;
    }
  }

  // Analytics and Metrics Updates
  async updateUserEngagement(activity, user) {
    // Update real-time user engagement metrics
    const engagementUpdate = {
      $inc: {
        'metrics.total_activities': 1,
        [`metrics.activity_counts.${activity.activity_type}`]: 1
      },
      $set: {
        'metrics.last_activity': activity.created_at,
        'metrics.updated_at': new Date()
      }
    };

    await this.csm.db.collection('user_engagement').updateOne(
      { user_id: activity.user_id },
      engagementUpdate,
      { upsert: true }
    );
  }

  async updateCustomerMetrics(customerId) {
    // Recalculate customer lifetime value and order metrics
    const pipeline = [
      { $match: { customer_id: customerId, status: 'completed' } },
      {
        $group: {
          _id: '$customer_id',
          totalOrders: { $sum: 1 },
          totalSpent: { $sum: '$total_amount' },
          avgOrderValue: { $avg: '$total_amount' },
          lastOrderDate: { $max: '$created_at' },
          firstOrderDate: { $min: '$created_at' }
        }
      }
    ];

    const result = await this.csm.db.collection('orders').aggregate(pipeline).toArray();

    if (result.length > 0) {
      const metrics = result[0];
      await this.csm.db.collection('customer_metrics').updateOne(
        { customer_id: customerId },
        {
          $set: {
            ...metrics,
            updated_at: new Date()
          }
        },
        { upsert: true }
      );
    }
  }

  // Event Bus Integration
  setupEventBusHandlers() {
    this.eventBus.on('user_activity', (data) => {
      // Emit to external systems (WebSocket, message queue, etc.)
      this.emitToExternalSystems('user_activity', data);
    });

    this.eventBus.on('new_order', (data) => {
      this.emitToExternalSystems('new_order', data);
    });

    this.eventBus.on('inventory_change', (data) => {
      this.emitToExternalSystems('inventory_change', data);
    });
  }

  async emitToExternalSystems(eventType, data) {
    // WebSocket broadcasting
    if (this.wsServer) {
      this.wsServer.broadcast(JSON.stringify({
        type: eventType,
        data: data,
        timestamp: new Date()
      }));
    }

    // Message queue publishing
    if (this.messageQueue) {
      await this.messageQueue.publish(eventType, data);
    }

    // Webhook notifications
    if (this.webhookHandler) {
      await this.webhookHandler.notify(eventType, data);
    }
  }

  async shutdown() {
    console.log('Shutting down real-time event patterns...');
    this.eventBus.removeAllListeners();
    await this.csm.shutdown();
  }
}
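
Wiring these patterns into an application is then mostly a matter of constructing the class, attaching the optional external consumers that emitToExternalSystems() looks for, and starting the individual streams. A minimal sketch, where wsServer and messageQueue are placeholder objects exposing broadcast() and publish():

// Wiring sketch: wsServer and messageQueue are placeholders for any objects
// exposing broadcast(message) and publish(topic, data) respectively
async function startRealtimePipelines(changeStreamManager, wsServer, messageQueue) {
  const patterns = new RealtimeEventPatterns(changeStreamManager);

  // Optional fan-out targets used by emitToExternalSystems()
  patterns.wsServer = wsServer;
  patterns.messageQueue = messageQueue;
  patterns.setupEventBusHandlers();

  // Start the individual streams defined above
  await patterns.setupUserActivityStream();
  await patterns.setupOrderProcessingStream();
  await patterns.setupInventoryManagementStream();

  return patterns;
}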

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB Change Stream configuration and monitoring:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream with advanced filtering
CREATE CHANGE_STREAM user_activities_stream ON user_activities
WITH (
  operations = ARRAY['insert', 'update'],
  resume_token_storage = 'mongodb',
  batch_size = 100,
  max_await_time_ms = 1000
)
FILTER (
  activity_type IN ('login', 'purchase', 'view', 'search') AND
  user_id IS NOT NULL AND
  created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
ENRICH WITH (
  users ON user_activities.user_id = users._id AS user_data,
  user_sessions ON user_activities.session_id = user_sessions._id AS session_data
)
COMPUTE (
  activity_score = CASE 
    WHEN activity_type = 'purchase' THEN 100
    WHEN activity_type = 'login' THEN 10
    WHEN activity_type = 'search' THEN 5
    WHEN activity_type = 'view' THEN 1
    ELSE 0
  END,
  user_segment = user_data.segment,
  session_duration = session_data.duration
);

-- Monitor change stream with real-time processing
SELECT 
  change_id,
  operation_type,
  collection_name,
  document_key,
  cluster_time,

  -- Document data
  full_document,
  update_description,

  -- Computed fields from stream
  activity_score,
  user_segment,
  session_duration,

  -- Change categorization
  CASE 
    WHEN operation_type = 'insert' THEN 'new_activity'
    WHEN operation_type = 'update' AND update_description.updated_fields ? 'status' THEN 'status_change'
    WHEN operation_type = 'update' THEN 'activity_updated'
    ELSE 'other'
  END as change_category,

  -- Priority assessment
  CASE
    WHEN activity_score >= 50 THEN 'high'
    WHEN activity_score >= 10 THEN 'medium'
    ELSE 'low'
  END as priority_level,

  processed_at

FROM CHANGE_STREAM('user_activities_stream')
WHERE activity_score > 0
ORDER BY activity_score DESC, cluster_time ASC;

-- Multi-collection change stream monitoring
CREATE CHANGE_STREAM business_events_stream
WITH (
  operations = ARRAY['insert', 'update', 'delete'],
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable'
)
FILTER (
  collection_name IN ('orders', 'users', 'products', 'inventory') AND
  (
    -- High-impact order changes
    (collection_name = 'orders' AND operation_type IN ('insert', 'update')) OR
    -- User registration and profile updates
    (collection_name = 'users' AND (operation_type = 'insert' OR update_description.updated_fields ? 'subscription_type')) OR
    -- Product catalog changes
    (collection_name = 'products' AND update_description.updated_fields ? 'price') OR
    -- Inventory level changes
    (collection_name = 'inventory' AND update_description.updated_fields ? 'available_quantity')
  )
);

-- Real-time analytics from change streams
WITH change_stream_analytics AS (
  SELECT 
    collection_name,
    operation_type,
    DATE_TRUNC('minute', cluster_time) as time_bucket,

    -- Event counts
    COUNT(*) as event_count,
    COUNT(*) FILTER (WHERE operation_type = 'insert') as inserts,
    COUNT(*) FILTER (WHERE operation_type = 'update') as updates,
    COUNT(*) FILTER (WHERE operation_type = 'delete') as deletes,

    -- Business metrics
    CASE collection_name
      WHEN 'orders' THEN 
        SUM(CASE WHEN operation_type = 'insert' THEN (full_document->>'total_amount')::numeric ELSE 0 END)
      ELSE 0
    END as revenue_impact,

    CASE collection_name
      WHEN 'inventory' THEN
        SUM(CASE 
          WHEN update_description.updated_fields ? 'available_quantity' 
          THEN (full_document->>'available_quantity')::int - (update_description.updated_fields->>'available_quantity')::int
          ELSE 0
        END)
      ELSE 0  
    END as inventory_change,

    -- Processing performance
    AVG(EXTRACT(EPOCH FROM (processed_at - cluster_time))) as avg_processing_latency_seconds,
    MAX(EXTRACT(EPOCH FROM (processed_at - cluster_time))) as max_processing_latency_seconds

  FROM CHANGE_STREAM('business_events_stream')
  WHERE cluster_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, operation_type, DATE_TRUNC('minute', cluster_time)
),

real_time_dashboard AS (
  SELECT 
    time_bucket,

    -- Overall activity metrics
    SUM(event_count) as total_events,
    SUM(inserts) as total_inserts,
    SUM(updates) as total_updates,
    SUM(deletes) as total_deletes,

    -- Business KPIs
    SUM(revenue_impact) as minute_revenue,
    SUM(inventory_change) as net_inventory_change,

    -- Performance metrics
    AVG(avg_processing_latency_seconds) as avg_latency,
    MAX(max_processing_latency_seconds) as max_latency,

    -- Collection breakdown
    json_object_agg(
      collection_name,
      json_build_object(
        'events', event_count,
        'inserts', inserts,
        'updates', updates,
        'deletes', deletes
      )
    ) as collection_breakdown,

    -- Alerts and anomalies
    CASE 
      WHEN SUM(event_count) > 1000 THEN 'high_volume'
      WHEN AVG(avg_processing_latency_seconds) > 5 THEN 'high_latency'
      WHEN SUM(revenue_impact) < 0 THEN 'revenue_concern'
      ELSE 'normal'
    END as alert_status

  FROM change_stream_analytics
  GROUP BY time_bucket
)

SELECT 
  time_bucket,
  total_events,
  total_inserts,
  total_updates,
  total_deletes,
  ROUND(minute_revenue, 2) as revenue_per_minute,
  net_inventory_change,
  ROUND(avg_latency, 3) as avg_processing_seconds,
  ROUND(max_latency, 3) as max_processing_seconds,
  collection_breakdown,
  alert_status,

  -- Trend indicators
  LAG(total_events, 1) OVER (ORDER BY time_bucket) as prev_minute_events,
  ROUND(
    (total_events - LAG(total_events, 1) OVER (ORDER BY time_bucket))::numeric / 
    NULLIF(LAG(total_events, 1) OVER (ORDER BY time_bucket), 0) * 100,
    1
  ) as event_growth_pct,

  ROUND(
    (minute_revenue - LAG(minute_revenue, 1) OVER (ORDER BY time_bucket))::numeric / 
    NULLIF(LAG(minute_revenue, 1) OVER (ORDER BY time_bucket), 0) * 100,
    1
  ) as revenue_growth_pct

FROM real_time_dashboard
ORDER BY time_bucket DESC
LIMIT 60; -- Last hour of minute-by-minute data

-- Change stream error handling and monitoring
SELECT 
  stream_name,
  stream_status,
  created_at,
  last_event_at,
  event_count,
  error_count,
  retry_count,

  -- Health assessment
  CASE 
    WHEN error_count::float / NULLIF(event_count, 0) > 0.1 THEN 'UNHEALTHY'
    WHEN error_count::float / NULLIF(event_count, 0) > 0.05 THEN 'WARNING'  
    WHEN last_event_at < CURRENT_TIMESTAMP - INTERVAL '1 hour' THEN 'INACTIVE'
    ELSE 'HEALTHY'
  END as health_status,

  -- Performance metrics
  ROUND(error_count::numeric / NULLIF(event_count, 0) * 100, 2) as error_rate_pct,
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_event_at)) / 60 as minutes_since_last_event,

  -- Resume token status
  CASE 
    WHEN resume_token IS NOT NULL THEN 'RESUMABLE'
    ELSE 'NOT_RESUMABLE'
  END as resume_status,

  -- Recommendations
  CASE 
    WHEN error_count::float / NULLIF(event_count, 0) > 0.1 THEN 'Investigate error patterns and processing logic'
    WHEN retry_count > 5 THEN 'Check connection stability and resource limits'
    WHEN last_event_at < CURRENT_TIMESTAMP - INTERVAL '2 hours' THEN 'Verify data source and stream configuration'
    ELSE 'Stream operating normally'
  END as recommendation

FROM CHANGE_STREAM_STATUS()
ORDER BY 
  CASE health_status
    WHEN 'UNHEALTHY' THEN 1
    WHEN 'WARNING' THEN 2
    WHEN 'INACTIVE' THEN 3
    ELSE 4
  END,
  error_rate_pct DESC NULLS LAST;

-- Event-driven workflow triggers
CREATE TRIGGER real_time_order_processing
ON CHANGE_STREAM('business_events_stream')
WHEN (
  collection_name = 'orders' AND 
  operation_type = 'insert' AND
  full_document->>'status' = 'pending'
)
EXECUTE PROCEDURE (
  -- Inventory allocation
  UPDATE inventory 
  SET reserved_quantity = reserved_quantity + (
    SELECT SUM((item->>'quantity')::int)
    FROM json_array_elements(NEW.full_document->'items') AS item
    WHERE inventory.product_id = (item->>'product_id')::uuid
  ),
  available_quantity = available_quantity - (
    SELECT SUM((item->>'quantity')::int) 
    FROM json_array_elements(NEW.full_document->'items') AS item
    WHERE inventory.product_id = (item->>'product_id')::uuid
  )
  WHERE product_id IN (
    SELECT DISTINCT (item->>'product_id')::uuid
    FROM json_array_elements(NEW.full_document->'items') AS item
  );

  -- Payment processing trigger
  INSERT INTO payment_processing_queue (
    order_id,
    customer_id,
    amount,
    payment_method,
    priority,
    created_at
  )
  VALUES (
    (NEW.full_document->>'_id')::uuid,
    (NEW.full_document->>'customer_id')::uuid,
    (NEW.full_document->>'total_amount')::numeric,
    NEW.full_document->>'payment_method',
    CASE 
      WHEN (NEW.full_document->>'total_amount')::numeric > 1000 THEN 'high'
      ELSE 'normal'
    END,
    CURRENT_TIMESTAMP
  );

  -- Customer notification
  INSERT INTO notification_queue (
    recipient_id,
    notification_type,
    channel,
    message_data,
    created_at
  )
  VALUES (
    (NEW.full_document->>'customer_id')::uuid,
    'order_confirmation',
    'email',
    json_build_object(
      'order_id', NEW.full_document->>'_id',
      'order_total', NEW.full_document->>'total_amount',
      'items_count', json_array_length(NEW.full_document->'items')
    ),
    CURRENT_TIMESTAMP
  );
);

-- Change stream performance optimization
WITH stream_performance AS (
  SELECT 
    stream_name,
    AVG(processing_time_ms) as avg_processing_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms) as p95_processing_time,
    MAX(processing_time_ms) as max_processing_time,
    COUNT(*) as total_events,
    SUM(CASE WHEN processing_time_ms > 1000 THEN 1 ELSE 0 END) as slow_events,
    AVG(batch_size) as avg_batch_size
  FROM CHANGE_STREAM_METRICS()
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY stream_name
)
SELECT 
  stream_name,
  ROUND(avg_processing_time, 2) as avg_processing_ms,
  ROUND(p95_processing_time, 2) as p95_processing_ms,
  max_processing_time as max_processing_ms,
  total_events,
  ROUND((slow_events::numeric / total_events) * 100, 2) as slow_event_pct,
  ROUND(avg_batch_size, 1) as avg_batch_size,

  -- Performance assessment
  CASE 
    WHEN avg_processing_time > 2000 THEN 'SLOW'
    WHEN slow_events::numeric / total_events > 0.1 THEN 'INCONSISTENT'  
    WHEN avg_batch_size < 10 THEN 'UNDERUTILIZED'
    ELSE 'OPTIMAL'
  END as performance_status,

  -- Optimization recommendations
  CASE
    WHEN avg_processing_time > 2000 THEN 'Optimize event processing logic and reduce complexity'
    WHEN slow_events::numeric / total_events > 0.1 THEN 'Investigate processing bottlenecks and resource constraints'
    WHEN avg_batch_size < 10 THEN 'Increase batch size for better throughput'
    WHEN p95_processing_time > 5000 THEN 'Add error handling and timeout management'
    ELSE 'Performance is within acceptable limits'
  END as optimization_recommendation

FROM stream_performance
ORDER BY avg_processing_time DESC;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and configuration
-- 2. Advanced filtering with complex business logic
-- 3. Real-time enrichment with related collection data
-- 4. Computed fields for event categorization and scoring
-- 5. Multi-collection monitoring with unified interface
-- 6. Real-time analytics and dashboard integration
-- 7. Event-driven workflow automation and triggers
-- 8. Performance monitoring and optimization recommendations
-- 9. Error handling and automatic retry mechanisms
-- 10. Resume capability for fault-tolerant processing

Best Practices for Change Stream Implementation

Design Guidelines

Essential practices for optimal change stream configuration:

  1. Strategic Filtering: Design filters to process only relevant changes and minimize resource usage
  2. Resume Strategy: Implement robust resume token storage for fault-tolerant processing
  3. Error Handling: Build comprehensive error handling with retry strategies and dead letter queues (see the sketch after this list)
  4. Performance Monitoring: Track processing latency, throughput, and error rates continuously
  5. Resource Management: Size change stream configurations based on expected data volumes
  6. Event Ordering: Understand and leverage MongoDB's ordering guarantees within and across collections
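
Several of these guidelines, particularly points 1 through 3, map directly onto the createChangeStream() configuration shape used throughout this article. A minimal sketch, assuming the manager instance above and illustrative collection, field, and handler names:

// Narrow filtering plus retry-based error handling; resume token persistence
// is enabled globally on the manager (resumeTokenStorage: 'mongodb')
async function watchFailedPayments(changeStreamManager) {
  return changeStreamManager.createChangeStream({
    streamId: 'payment_failures',
    collection: 'payments',
    operationTypes: ['insert', 'update'],

    // Strategic filtering: only failed payments reach the handler
    filters: [
      { 'fullDocument.status': 'failed' }
    ],

    eventHandlers: {
      default: async (change, context) => {
        // route to an alerting or reconciliation workflow
      }
    },

    // 'retry', 'deadletter', 'skip', and 'stop_stream' are the strategies
    // supported by handleProcessingError() above
    errorHandling: {
      strategy: 'retry',
      maxRetries: 3
    }
  });
}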

Scalability and Performance

Optimize change streams for high-throughput, low-latency processing:

  1. Batch Processing: Configure appropriate batch sizes for optimal throughput (see the sketch after this list)
  2. Parallel Processing: Distribute change processing across multiple consumers when possible
  3. Resource Allocation: Ensure adequate compute and network resources for real-time processing
  4. Connection Management: Use connection pooling and proper resource cleanup
  5. Monitoring Integration: Integrate with observability tools for production monitoring
  6. Load Testing: Test change stream performance under expected and peak loads
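
As a concrete example of point 1, the MongoDB Node.js driver accepts batchSize and maxAwaitTimeMS options directly on watch(). The values below are illustrative starting points to validate under load, not recommendations:

// Throughput-oriented watch() options (illustrative values; db is a connected Db)
async function consumeOrderChanges(db) {
  const cursor = db.collection('orders').watch(
    [{ $match: { operationType: { $in: ['insert', 'update'] } } }],
    {
      fullDocument: 'updateLookup',   // include the current document on updates
      batchSize: 500,                 // larger batches reduce getMore round trips
      maxAwaitTimeMS: 1000            // how long the server waits before returning an empty batch
    }
  );

  for await (const change of cursor) {
    // hand each event to a worker pool or queue for parallel downstream processing
  }
}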

Conclusion

MongoDB Change Streams provide enterprise-grade real-time data processing capabilities that eliminate the complexity and overhead of polling-based change detection while delivering immediate, ordered, and resumable event notifications. The integration of sophisticated filtering, enrichment, and processing capabilities makes building reactive applications and event-driven architectures both powerful and maintainable.

Key Change Streams benefits include:

  • Real-Time Processing: Sub-second latency for immediate response to data changes
  • Complete Change Context: Full document state and change details for comprehensive processing
  • Fault Tolerance: Automatic resume capability and robust error handling mechanisms
  • Scalable Architecture: Support for high-throughput processing across sharded clusters
  • Developer Experience: Intuitive API with powerful aggregation pipeline integration
  • Production Ready: Built-in monitoring, authentication, and operational capabilities

Whether you're building live dashboards, automated workflows, real-time analytics, or event-driven microservices, MongoDB Change Streams with QueryLeaf's familiar SQL interface provides the foundation for reactive data processing. This combination enables you to implement sophisticated real-time capabilities while preserving familiar development patterns and operational approaches.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Change Stream operations while providing SQL-familiar change detection, event filtering, and real-time processing syntax. Advanced stream configuration, error handling, and performance optimization are seamlessly handled through familiar SQL patterns, making real-time data processing both powerful and accessible.

The integration of native change stream capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven architecture remains both effective and maintainable as it scales and evolves.

MongoDB Data Modeling and Schema Design Patterns: SQL-Style Database Design for NoSQL Performance and Flexibility

Modern applications require database designs that can handle complex data relationships, evolving requirements, and massive scale while maintaining query performance and data consistency. Traditional relational database design relies on normalization principles and rigid schema constraints, but often struggles with nested data structures, dynamic attributes, and horizontal scaling demands that characterize modern applications.

MongoDB's document-based data model provides flexible schema design that can adapt to changing requirements while delivering high performance through strategic denormalization and document structure optimization. Unlike relational databases that require complex joins to reassemble related data, MongoDB document modeling can embed related data within single documents, reducing query complexity and improving performance for read-heavy workloads.

The Relational Database Design Challenge

Traditional relational database design approaches face significant limitations with modern application requirements:

-- Traditional relational database design - rigid and join-heavy
-- E-commerce product catalog with complex relationships

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL,
    parent_category_id INTEGER REFERENCES categories(category_id),
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE brands (
    brand_id SERIAL PRIMARY KEY,
    brand_name VARCHAR(100) NOT NULL UNIQUE,
    brand_description TEXT,
    brand_website VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL,
    product_description TEXT,
    category_id INTEGER NOT NULL REFERENCES categories(category_id),
    brand_id INTEGER NOT NULL REFERENCES brands(brand_id),
    base_price DECIMAL(10, 2) NOT NULL,
    weight DECIMAL(8, 3),
    dimensions_length DECIMAL(8, 2),
    dimensions_width DECIMAL(8, 2), 
    dimensions_height DECIMAL(8, 2),
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_attributes (
    attribute_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    attribute_name VARCHAR(100) NOT NULL,
    attribute_value TEXT NOT NULL,
    attribute_type VARCHAR(50) DEFAULT 'string',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE(product_id, attribute_name)
);

CREATE TABLE product_images (
    image_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    image_url VARCHAR(500) NOT NULL,
    image_alt_text VARCHAR(255),
    display_order INTEGER DEFAULT 0,
    is_primary BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_variants (
    variant_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    variant_name VARCHAR(255) NOT NULL,
    sku VARCHAR(100) UNIQUE,
    price_adjustment DECIMAL(10, 2) DEFAULT 0,
    stock_quantity INTEGER DEFAULT 0,
    variant_attributes JSONB,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    rating INTEGER CHECK (rating >= 1 AND rating <= 5),
    review_title VARCHAR(200),
    review_text TEXT,
    is_verified_purchase BOOLEAN DEFAULT false,
    helpful_votes INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex query to get product details with all related data
SELECT 
    p.product_id,
    p.product_name,
    p.product_description,
    p.base_price,

    -- Category hierarchy (requires recursive CTE for full path)
    c.category_name,
    parent_c.category_name as parent_category,

    -- Brand information
    b.brand_name,
    b.brand_description,

    -- Product dimensions
    CASE 
        WHEN p.dimensions_length IS NOT NULL THEN 
            CONCAT(p.dimensions_length, ' x ', p.dimensions_width, ' x ', p.dimensions_height)
        ELSE NULL
    END as dimensions,

    -- Aggregate attributes (problematic with large numbers)
    STRING_AGG(
        CONCAT(pa.attribute_name, ': ', pa.attribute_value), 
        ', ' 
        ORDER BY pa.attribute_name
    ) as attributes,

    -- Primary image
    pi_primary.image_url as primary_image,

    -- Review statistics
    COUNT(DISTINCT pr.review_id) as review_count,
    ROUND(AVG(pr.rating), 2) as average_rating,

    -- Variant count
    COUNT(DISTINCT pv.variant_id) as variant_count,

    -- Stock availability across variants
    SUM(pv.stock_quantity) as total_stock

FROM products p
JOIN categories c ON p.category_id = c.category_id
LEFT JOIN categories parent_c ON c.parent_category_id = parent_c.category_id
JOIN brands b ON p.brand_id = b.brand_id
LEFT JOIN product_attributes pa ON p.product_id = pa.product_id
LEFT JOIN product_images pi_primary ON p.product_id = pi_primary.product_id 
    AND pi_primary.is_primary = true
LEFT JOIN product_variants pv ON p.product_id = pv.product_id 
    AND pv.is_active = true
LEFT JOIN product_reviews pr ON p.product_id = pr.product_id

WHERE p.is_active = true
    AND p.product_id = $1

GROUP BY 
    p.product_id, p.product_name, p.product_description, p.base_price,
    c.category_name, parent_c.category_name,
    b.brand_name, b.brand_description,
    p.dimensions_length, p.dimensions_width, p.dimensions_height,
    pi_primary.image_url;

-- Problems with relational approach:
-- 1. Complex multi-table joins for simple product queries
-- 2. Difficult to add new product attributes without schema changes
-- 3. Poor performance with large numbers of attributes and images
-- 4. Rigid schema prevents storing varying product structures
-- 5. N+1 query problems when loading product catalogs
-- 6. Difficult to handle hierarchical categories efficiently
-- 7. Complex aggregation queries for review statistics
-- 8. Schema migrations required for new product types
-- 9. Inefficient storage of sparse attributes
-- 10. Challenging to implement full-text search across attributes

MongoDB's document-based design eliminates many of these issues:

// MongoDB optimized document design - flexible and performance-oriented
// Single document contains all product information

// Example product document with embedded data
const productDocument = {
  _id: ObjectId("64a1b2c3d4e5f6789012345a"),

  // Basic product information
  name: "MacBook Pro 16-inch M3 Max",
  description: "Powerful laptop for professional workflows with M3 Max chip, stunning Liquid Retina XDR display, and all-day battery life.",
  sku: "MACBOOK-PRO-16-M3MAX-512GB",

  // Category with embedded hierarchy
  category: {
    primary: "Electronics",
    secondary: "Computers & Tablets", 
    tertiary: "Laptops",
    path: ["Electronics", "Computers & Tablets", "Laptops"],
    categoryId: "electronics-computers-laptops"
  },

  // Brand information embedded
  brand: {
    name: "Apple",
    description: "Innovative technology products and solutions",
    website: "https://www.apple.com",
    brandId: "apple"
  },

  // Pricing structure
  pricing: {
    basePrice: 3499.00,
    currency: "USD",
    priceHistory: [
      { price: 3499.00, effectiveDate: ISODate("2024-01-15"), reason: "launch_price" },
      { price: 3299.00, effectiveDate: ISODate("2024-06-01"), reason: "promotional_discount" }
    ],
    currentPrice: 3299.00,
    msrp: 3499.00
  },

  // Physical specifications
  specifications: {
    dimensions: {
      length: 35.57,
      width: 24.81,
      height: 1.68,
      unit: "cm"
    },
    weight: {
      value: 2.16,
      unit: "kg"
    },

    // Technical specifications as flexible object
    technical: {
      processor: "Apple M3 Max chip with 12-core CPU and 38-core GPU",
      memory: "36GB unified memory",
      storage: "512GB SSD storage",
      display: {
        size: "16.2-inch",
        resolution: "3456 x 2234",
        technology: "Liquid Retina XDR",
        brightness: "1000 nits sustained, 1600 nits peak"
      },
      connectivity: [
        "Three Thunderbolt 4 ports",
        "HDMI port", 
        "SDXC card slot",
        "MagSafe 3 charging port",
        "3.5mm headphone jack"
      ],
      wireless: {
        wifi: "Wi-Fi 6E",
        bluetooth: "Bluetooth 5.3"
      },
      operatingSystem: "macOS Sonoma"
    }
  },

  // Flexible attributes array for varying product features
  attributes: [
    { name: "Color", value: "Space Black", type: "string", searchable: true },
    { name: "Screen Size", value: 16.2, type: "number", unit: "inches" },
    { name: "Battery Life", value: "Up to 22 hours", type: "string" },
    { name: "Warranty", value: "1 Year Limited", type: "string" },
    { name: "Touch ID", value: true, type: "boolean" }
  ],

  // Images embedded for faster loading
  images: [
    {
      url: "https://images.example.com/macbook-pro-16-space-black-1.jpg",
      altText: "MacBook Pro 16-inch in Space Black - front view",
      isPrimary: true,
      displayOrder: 1,
      imageType: "product_shot",
      dimensions: { width: 2000, height: 1500 }
    },
    {
      url: "https://images.example.com/macbook-pro-16-space-black-2.jpg", 
      altText: "MacBook Pro 16-inch in Space Black - side view",
      isPrimary: false,
      displayOrder: 2,
      imageType: "product_shot",
      dimensions: { width: 2000, height: 1500 }
    }
  ],

  // Product variants embedded for related configurations
  variants: [
    {
      _id: ObjectId("64a1b2c3d4e5f6789012345b"),
      name: "MacBook Pro 16-inch M3 Max - 1TB",
      sku: "MACBOOK-PRO-16-M3MAX-1TB",
      priceAdjustment: 500.00,
      specifications: {
        storage: "1TB SSD storage",
        memory: "36GB unified memory"
      },
      stockQuantity: 45,
      isActive: true,
      attributes: [
        { name: "Storage", value: "1TB", type: "string" }
      ]
    },
    {
      _id: ObjectId("64a1b2c3d4e5f6789012345c"),
      name: "MacBook Pro 16-inch M3 Max - Silver",
      sku: "MACBOOK-PRO-16-M3MAX-SILVER",
      priceAdjustment: 0.00,
      attributes: [
        { name: "Color", value: "Silver", type: "string" }
      ],
      stockQuantity: 23,
      isActive: true
    }
  ],

  // Inventory and availability
  inventory: {
    stockQuantity: 67,
    reservedQuantity: 3,
    availableQuantity: 64,
    reorderLevel: 10,
    reorderQuantity: 50,
    lastRestocked: ISODate("2024-09-01"),
    supplier: {
      name: "Apple Inc.",
      supplierId: "APPLE_DIRECT",
      leadTimeDays: 7
    }
  },

  // Reviews embedded with summary statistics
  reviews: {
    // Summary statistics for quick access
    summary: {
      totalReviews: 347,
      averageRating: 4.7,
      ratingDistribution: {
        "5": 245,
        "4": 78, 
        "3": 18,
        "2": 4,
        "1": 2
      },
      lastUpdated: ISODate("2024-09-14")
    },

    // Recent reviews embedded (with pagination for full list)
    recent: [
      {
        _id: ObjectId("64a1b2c3d4e5f6789012346a"),
        customerId: ObjectId("64a1b2c3d4e5f678901234aa"),
        customerName: "Sarah Chen",
        rating: 5,
        title: "Exceptional performance for video editing",
        text: "The M3 Max chip handles 4K video editing effortlessly. Battery life is impressive for such a powerful machine.",
        isVerifiedPurchase: true,
        helpfulVotes: 23,
        createdAt: ISODate("2024-09-10"),
        updatedAt: ISODate("2024-09-10")
      }
    ]
  },

  // SEO and search optimization
  seo: {
    metaTitle: "MacBook Pro 16-inch M3 Max - Professional Performance",
    metaDescription: "Experience unmatched performance with the MacBook Pro featuring M3 Max chip, 36GB memory, and stunning 16-inch Liquid Retina XDR display.",
    keywords: ["MacBook Pro", "M3 Max", "16-inch", "laptop", "Apple", "professional"],
    searchTerms: [
      "macbook pro 16 inch",
      "apple laptop", 
      "m3 max",
      "professional laptop",
      "video editing laptop"
    ]
  },

  // Status and metadata
  status: {
    isActive: true,
    isPublished: true,
    isFeatured: true,
    publishedAt: ISODate("2024-01-15"),
    lastModified: ISODate("2024-09-14"),
    version: 3
  },

  // Analytics and performance tracking
  analytics: {
    views: {
      total: 15420,
      thisMonth: 2341,
      uniqueVisitors: 12087
    },
    conversions: {
      addToCart: 892,
      purchases: 156,
      conversionRate: 17.5
    },
    searchPerformance: {
      avgPosition: 2.3,
      clickThroughRate: 8.7,
      impressions: 45230
    }
  },

  // Timestamps for auditing and tracking
  createdAt: ISODate("2024-01-15"),
  updatedAt: ISODate("2024-09-14")
};

// Benefits of MongoDB document design:
// - Single query retrieves complete product information
// - Flexible schema accommodates different product types
// - Embedded related data eliminates joins
// - Rich nested structures for complex specifications
// - Easy to add new attributes without schema changes
// - Efficient storage and retrieval of product hierarchies
// - Native support for arrays and nested objects
// - Simplified application logic with document-oriented design
// - Better performance for product catalog queries
// - Natural fit for JSON-based APIs and front-end applications
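
The first benefit above is easy to demonstrate: a single findOne with a projection returns everything a product detail page needs, with no joins. A mongosh-style sketch with an illustrative projection:

// One round trip returns the complete product page payload
// (collection name and projection fields are illustrative)
const product = db.products.findOne(
  { sku: 'MACBOOK-PRO-16-M3MAX-512GB' },
  {
    name: 1,
    'pricing.currentPrice': 1,
    'category.path': 1,
    images: { $slice: 2 },            // first two images only
    'reviews.summary': 1,
    'inventory.availableQuantity': 1
  }
);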

Understanding MongoDB Data Modeling Patterns

Document Structure and Embedding Strategies

Strategic document design patterns for optimal performance and maintainability:

// Advanced MongoDB data modeling patterns for different use cases
class MongoDataModelingPatterns {
  constructor(db) {
    this.db = db;
    this.modelingPatterns = new Map();
  }

  // Pattern 1: Embedded Document Pattern
  // Use when: Related data is accessed together, 1:1 or 1:few relationships
  createUserProfileEmbeddedPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Basic user information
      username: "sarah_dev",
      email: "sarah.johnson@example.com",

      // Embedded profile information (1:1 relationship)
      profile: {
        firstName: "Sarah",
        lastName: "Johnson",
        dateOfBirth: ISODate("1990-05-15"),
        avatar: {
          url: "https://images.example.com/avatars/sarah_dev.jpg",
          uploadedAt: ISODate("2024-03-12"),
          size: { width: 200, height: 200 }
        },
        bio: "Full-stack developer passionate about clean code and user experience",
        location: {
          city: "San Francisco",
          state: "CA",
          country: "USA",
          timezone: "America/Los_Angeles"
        },
        socialMedia: {
          github: "https://github.com/sarahdev",
          linkedin: "https://linkedin.com/in/sarah-johnson-dev",
          twitter: "@sarah_codes"
        }
      },

      // Embedded preferences (1:1 relationship)
      preferences: {
        theme: "dark",
        language: "en",
        notifications: {
          email: true,
          push: false,
          sms: false
        },
        privacy: {
          profileVisibility: "public",
          showEmail: false,
          showLocation: true
        }
      },

      // Embedded contact methods (1:few relationship)  
      contactMethods: [
        {
          type: "email",
          value: "sarah.johnson@example.com",
          isPrimary: true,
          isVerified: true,
          verifiedAt: ISODate("2024-01-15")
        },
        {
          type: "phone",
          value: "+1-555-123-4567",
          isPrimary: false,
          isVerified: true,
          verifiedAt: ISODate("2024-01-20")
        }
      ],

      // Embedded skills (1:many but limited)
      skills: [
        { name: "JavaScript", level: "expert", yearsExperience: 8 },
        { name: "Python", level: "advanced", yearsExperience: 5 },
        { name: "MongoDB", level: "intermediate", yearsExperience: 3 },
        { name: "React", level: "expert", yearsExperience: 6 }
      ],

      // Account status and metadata
      account: {
        status: "active",
        type: "premium",
        createdAt: ISODate("2024-01-15"),
        lastLoginAt: ISODate("2024-09-14"),
        loginCount: 342,
        isEmailVerified: true,
        twoFactorEnabled: true
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14")
    };
  }
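
  // Because profile, preferences, contact methods, and skills live in one
  // document, reads and targeted writes stay single-document operations.
  // For example (collection name is illustrative):
  //   db.users.updateOne(
  //     { _id: userId },
  //     {
  //       $set: { 'preferences.theme': 'light' },
  //       $push: { skills: { name: 'Go', level: 'beginner', yearsExperience: 1 } }
  //     }
  //   );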

  // Pattern 2: Reference Pattern  
  // Use when: Large documents, many:many relationships, frequently changing data
  createBlogPostReferencePattern() {
    // Main blog post document
    const blogPost = {
      _id: ObjectId("64a1b2c3d4e5f6789012348a"),
      title: "Advanced MongoDB Data Modeling Techniques",
      slug: "advanced-mongodb-data-modeling-techniques",
      content: "Content of the blog post...",
      excerpt: "Learn advanced techniques for MongoDB data modeling...",

      // Reference to author (many posts : 1 author)
      authorId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Reference to category (many posts : 1 category)
      categoryId: ObjectId("64a1b2c3d4e5f6789012349a"),

      // References to tags (many posts : many tags)
      tagIds: [
        ObjectId("64a1b2c3d4e5f67890123401"),
        ObjectId("64a1b2c3d4e5f67890123402"), 
        ObjectId("64a1b2c3d4e5f67890123403")
      ],

      // Post metadata
      metadata: {
        publishedAt: ISODate("2024-09-10"),
        status: "published",
        featuredImageUrl: "https://images.example.com/blog/mongodb-modeling.jpg",
        readingTime: 12,
        wordCount: 2400
      },

      // SEO information
      seo: {
        metaTitle: "Advanced MongoDB Data Modeling - Complete Guide",
        metaDescription: "Master MongoDB data modeling with patterns, best practices, and real-world examples.",
        keywords: ["MongoDB", "data modeling", "NoSQL", "database design"]
      },

      // Analytics data
      stats: {
        views: 2340,
        likes: 89,
        shares: 23,
        commentsCount: 15, // Computed field updated by triggers
        averageRating: 4.6
      },

      createdAt: ISODate("2024-09-08"),
      updatedAt: ISODate("2024-09-14")
    };

    // Separate comments collection for scalability
    const blogComments = [
      {
        _id: ObjectId("64a1b2c3d4e5f67890123501"),
        postId: ObjectId("64a1b2c3d4e5f6789012348a"), // Reference to blog post
        authorId: ObjectId("64a1b2c3d4e5f67890123470"), // Reference to user
        content: "Great article! Very helpful examples.",

        // Embedded author info for faster loading (denormalization)
        author: {
          username: "dev_mike",
          avatar: "https://images.example.com/avatars/dev_mike.jpg",
          displayName: "Mike Chen"
        },

        // Support for nested replies
        parentCommentId: null, // Top-level comment
        replyCount: 2,

        // Comment moderation
        status: "approved",
        moderatedBy: ObjectId("64a1b2c3d4e5f67890123500"),
        moderatedAt: ISODate("2024-09-11"),

        // Engagement metrics
        likes: 5,
        dislikes: 0,
        isReported: false,

        createdAt: ISODate("2024-09-11"),
        updatedAt: ISODate("2024-09-11")
      }
    ];

    return { blogPost, blogComments };
  }
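
  // Because comments live in their own collection (reference pattern), they can
  // be paginated independently of the post document (hypothetical collection
  // names); an index on { postId: 1, createdAt: -1 } keeps this query efficient:
  //   db.blogComments.find({ postId: post._id, parentCommentId: null })
  //     .sort({ createdAt: -1 })
  //     .limit(20)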

  // Pattern 3: Hybrid Pattern (Embedding + Referencing)
  // Use when: Need benefits of both patterns for different aspects
  createOrderHybridPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012350a"),
      orderNumber: "ORD-2024-091401",

      // Customer reference (frequent lookups, separate profile management)
      customerId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Embedded customer snapshot for order history queries
      customerSnapshot: {
        name: "Sarah Johnson",
        email: "[email protected]",
        phone: "+1-555-123-4567",
        // Capture customer state at time of order
        membershipLevel: "gold",
        snapshotDate: ISODate("2024-09-14")
      },

      // Embedded order items (order-specific, not shared)
      items: [
        {
          productId: ObjectId("64a1b2c3d4e5f6789012345a"), // Reference for inventory updates

          // Embedded product snapshot to preserve order history
          productSnapshot: {
            name: "MacBook Pro 16-inch M3 Max",
            sku: "MACBOOK-PRO-16-M3MAX-512GB",
            description: "Powerful laptop for professional workflows...",
            image: "https://images.example.com/macbook-pro-16-1.jpg",
            // Capture product state at time of order
            snapshotDate: ISODate("2024-09-14")
          },

          quantity: 1,
          unitPrice: 3299.00,
          totalPrice: 3299.00,

          // Item-specific information
          selectedVariant: {
            color: "Space Black",
            storage: "512GB",
            variantId: ObjectId("64a1b2c3d4e5f6789012345b")
          },

          // Embedded pricing breakdown
          pricing: {
            basePrice: 3499.00,
            discount: 200.00,
            discountReason: "promotional_discount",
            finalPrice: 3299.00,
            tax: 263.92,
            taxRate: 8.0
          }
        }
      ],

      // Embedded shipping information
      shipping: {
        method: "express",
        carrier: "FedEx",
        trackingNumber: "1234567890123456",
        cost: 15.99,

        // Embedded shipping address (snapshot)
        address: {
          name: "Sarah Johnson",
          company: null,
          addressLine1: "123 Tech Street",
          addressLine2: "Apt 4B",
          city: "San Francisco",
          state: "CA",
          postalCode: "94107",
          country: "USA",
          phone: "+1-555-123-4567"
        },

        estimatedDelivery: ISODate("2024-09-16"),
        actualDelivery: null,
        deliveryInstructions: "Leave at door if not home"
      },

      // Embedded billing information
      billing: {
        // Reference to payment method for future use
        paymentMethodId: ObjectId("64a1b2c3d4e5f67890123600"),

        // Embedded payment snapshot
        paymentSnapshot: {
          method: "credit_card",
          last4: "4242",
          brand: "visa",
          expiryMonth: 12,
          expiryYear: 2027,
          // Capture payment method state at time of order
          snapshotDate: ISODate("2024-09-14")
        },

        // Billing address (may differ from shipping)
        address: {
          name: "Sarah Johnson",
          addressLine1: "456 Billing Ave",
          city: "San Francisco",
          state: "CA", 
          postalCode: "94107",
          country: "USA"
        },

        // Payment processing details
        transactionId: "txn_1234567890abcdef",
        processorResponse: "approved",
        authorizationCode: "AUTH123456",
        capturedAt: ISODate("2024-09-14")
      },

      // Order totals and calculations
      totals: {
        subtotal: 3299.00,        // Item prices already reflect the 200.00 discount
        taxAmount: 263.92,
        shippingAmount: 15.99,
        discountAmount: 200.00,   // Informational only; included in the discounted subtotal
        totalAmount: 3578.91,     // subtotal + tax + shipping
        currency: "USD"
      },

      // Order status and timeline
      status: {
        current: "processing",
        timeline: [
          {
            status: "placed",
            timestamp: ISODate("2024-09-14T10:30:00Z"),
            note: "Order successfully placed"
          },
          {
            status: "paid", 
            timestamp: ISODate("2024-09-14T10:30:15Z"),
            note: "Payment processed successfully"
          },
          {
            status: "processing",
            timestamp: ISODate("2024-09-14T11:15:00Z"),
            note: "Order sent to fulfillment center"
          }
        ]
      },

      // Order metadata
      metadata: {
        source: "web",
        campaign: "fall_promotion_2024",
        referrer: "google_ads",
        userAgent: "Mozilla/5.0...",
        ipAddress: "192.168.1.1",
        sessionId: "sess_abcd1234efgh5678"
      },

      createdAt: ISODate("2024-09-14T10:30:00Z"),
      updatedAt: ISODate("2024-09-14T11:15:00Z")
    };
  }
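
  // The embedded snapshots make order-history reads self-contained, while the
  // stored references can still be resolved when fresh data is required
  // (hypothetical collection names):
  //   db.orders.aggregate([
  //     { $match: { orderNumber: "ORD-2024-091401" } },
  //     { $lookup: { from: "customers", localField: "customerId",
  //                  foreignField: "_id", as: "currentCustomer" } }
  //   ])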

  // Pattern 4: Polymorphic Pattern
  // Use when: Similar documents have different structures based on type
  createNotificationPolymorphicPattern() {
    const notifications = [
      // Email notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351a"),
        type: "email",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Welcome to our platform!",
        priority: "normal",
        status: "sent",
        createdAt: ISODate("2024-09-14T10:00:00Z"),

        // Email-specific fields
        emailData: {
          from: "[email protected]",
          to: "[email protected]",
          subject: "Welcome to our platform!",
          templateId: "welcome_email_v2",
          templateVariables: {
            firstName: "Sarah",
            activationLink: "https://example.com/activate/abc123"
          },
          deliveryAttempts: 1,
          deliveredAt: ISODate("2024-09-14T10:01:30Z"),
          openedAt: ISODate("2024-09-14T10:15:22Z"),
          clickedAt: ISODate("2024-09-14T10:16:10Z")
        }
      },

      // Push notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351b"),
        type: "push",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Your order has shipped!",
        priority: "high",
        status: "delivered",
        createdAt: ISODate("2024-09-14T14:30:00Z"),

        // Push-specific fields
        pushData: {
          deviceTokens: [
            "device_token_1234567890abcdef",
            "device_token_abcdef1234567890"
          ],
          payload: {
            alert: {
              title: "Order Shipped",
              body: "Your MacBook Pro is on the way! Track: 1234567890123456"
            },
            badge: 1,
            sound: "default",
            category: "order_update",
            customData: {
              orderId: "ORD-2024-091401",
              trackingNumber: "1234567890123456",
              deepLink: "app://orders/ORD-2024-091401"
            }
          },
          deliveryResults: [
            {
              deviceToken: "device_token_1234567890abcdef",
              status: "delivered",
              deliveredAt: ISODate("2024-09-14T14:31:15Z")
            },
            {
              deviceToken: "device_token_abcdef1234567890", 
              status: "failed",
              error: "invalid_token",
              attemptedAt: ISODate("2024-09-14T14:31:15Z")
            }
          ]
        }
      },

      // SMS notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351c"),
        type: "sms",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Security Alert",
        priority: "urgent",
        status: "sent",
        createdAt: ISODate("2024-09-14T16:45:00Z"),

        // SMS-specific fields
        smsData: {
          to: "+15551234567",
          from: "+15559876543",
          message: "Security Alert: New login detected from San Francisco, CA. If this wasn't you, secure your account immediately.",
          provider: "twilio",
          messageId: "SMabcdef1234567890",
          segments: 1,
          cost: 0.0075,
          deliveredAt: ISODate("2024-09-14T16:45:12Z"),
          deliveryStatus: "delivered"
        }
      }
    ];

    return notifications;
  }
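
  // Queries against the polymorphic collection typically filter on the shared
  // "type" discriminator before touching type-specific fields, and partial
  // indexes keep each variant's index small (hypothetical "notifications"
  // collection):
  //   db.notifications.createIndex(
  //     { userId: 1, "pushData.deliveryResults.status": 1 },
  //     { partialFilterExpression: { type: "push" } }
  //   )
  //   db.notifications.find({ userId: userId, type: "push",
  //     "pushData.deliveryResults.status": "failed" })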

  // Pattern 5: Bucket Pattern
  // Use when: Time-series data or high-volume data needs grouping
  createMetricsBucketPattern() {
    // Group metrics by hour to reduce document count
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012352a"),

      // Bucket identifier
      type: "user_activity_metrics",
      userId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Time bucket information
      bucketDate: ISODate("2024-09-14T10:00:00Z"), // Hour bucket start
      bucketSize: "hourly",

      // Metadata for the bucket
      metadata: {
        userName: "sarah_dev",
        userSegment: "premium",
        deviceType: "desktop",
        location: "San Francisco, CA"
      },

      // Count of events in this bucket
      eventCount: 45,

      // Array of individual events within the time bucket
      events: [
        {
          timestamp: ISODate("2024-09-14T10:05:23Z"),
          eventType: "page_view",
          page: "/dashboard",
          sessionId: "sess_abc123",
          loadTime: 1250,
          userAgent: "Mozilla/5.0..."
        },
        {
          timestamp: ISODate("2024-09-14T10:07:45Z"),
          eventType: "click",
          element: "export_button",
          page: "/reports",
          sessionId: "sess_abc123"
        },
        {
          timestamp: ISODate("2024-09-14T10:12:10Z"),
          eventType: "api_call",
          endpoint: "/api/v1/reports/generate",
          responseTime: 2340,
          statusCode: 200,
          sessionId: "sess_abc123"
        }
        // ... more events up to reasonable bucket size (e.g., 100-1000 events)
      ],

      // Pre-aggregated summary statistics for the bucket
      summary: {
        pageViews: 15,
        clicks: 8,
        apiCalls: 12,
        errors: 2,
        uniquePages: 6,
        totalLoadTime: 18750,
        avgLoadTime: 1250,
        maxLoadTime: 3200,
        minLoadTime: 450,
        totalSessionTime: 1800000 // 30 minutes
      },

      // Bucket management
      bucketMetadata: {
        isFull: false,
        maxEvents: 1000,
        createdAt: ISODate("2024-09-14T10:05:23Z"),
        lastUpdated: ISODate("2024-09-14T10:59:45Z"),
        nextBucketId: null // Set when bucket is full
      }
    };
  }
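
  // New events are appended to the current bucket with a single upsert;
  // $setOnInsert seeds a fresh bucket when the hour rolls over
  // (hypothetical "activity_buckets" collection and variable names):
  //   db.activity_buckets.updateOne(
  //     { userId: userId, bucketDate: hourStart, "bucketMetadata.isFull": false },
  //     {
  //       $push: { events: event },
  //       $inc: { eventCount: 1 },
  //       $setOnInsert: { type: "user_activity_metrics", bucketSize: "hourly" }
  //     },
  //     { upsert: true }
  //   )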

  // Pattern 6: Attribute Pattern  
  // Use when: Documents have many similar fields or sparse attributes
  createProductAttributePattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012353a"),
      productName: "Gaming Desktop Computer",
      category: "Electronics",

      // Attribute pattern for flexible, searchable specifications
      attributes: [
        {
          key: "processor",
          value: "Intel Core i9-13900K",
          type: "string",
          unit: null,
          isSearchable: true,
          isFilterable: true,
          displayOrder: 1,
          category: "performance"
        },
        {
          key: "ram",
          value: 32,
          type: "number",
          unit: "GB",
          isSearchable: true,
          isFilterable: true,
          displayOrder: 2,
          category: "performance"
        },
        {
          key: "storage",
          value: "1TB NVMe SSD + 2TB HDD",
          type: "string", 
          unit: null,
          isSearchable: true,
          isFilterable: false,
          displayOrder: 3,
          category: "storage"
        },
        {
          key: "graphics_card",
          value: "NVIDIA GeForce RTX 4080",
          type: "string",
          unit: null,
          isSearchable: true,
          isFilterable: true,
          displayOrder: 4,
          category: "performance"
        },
        {
          key: "power_consumption",
          value: 750,
          type: "number",
          unit: "watts",
          isSearchable: false,
          isFilterable: true,
          displayOrder: 10,
          category: "specifications"
        },
        {
          key: "warranty_years",
          value: 3,
          type: "number", 
          unit: "years",
          isSearchable: false,
          isFilterable: true,
          displayOrder: 15,
          category: "warranty"
        },
        {
          key: "rgb_lighting",
          value: true,
          type: "boolean",
          unit: null,
          isSearchable: false,
          isFilterable: true,
          displayOrder: 20,
          category: "aesthetics"
        }
      ],

      // Pre-computed attribute indexes for faster queries
      attributeIndex: {
        // String attributes for text search
        stringAttributes: {
          "processor": "Intel Core i9-13900K",
          "storage": "1TB NVMe SSD + 2TB HDD",
          "graphics_card": "NVIDIA GeForce RTX 4080"
        },

        // Numeric attributes for range queries
        numericAttributes: {
          "ram": 32,
          "power_consumption": 750,
          "warranty_years": 3
        },

        // Boolean attributes for exact matching
        booleanAttributes: {
          "rgb_lighting": true
        },

        // Searchable attribute values for text search
        searchableValues: [
          "Intel Core i9-13900K",
          "1TB NVMe SSD + 2TB HDD", 
          "NVIDIA GeForce RTX 4080"
        ],

        // Filterable attributes for faceted search
        filterableAttributes: [
          "processor", "ram", "graphics_card", 
          "power_consumption", "warranty_years", "rgb_lighting"
        ]
      },

      createdAt: ISODate("2024-09-14"),
      updatedAt: ISODate("2024-09-14")
    };
  }
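
  // A single multikey compound index on the key/value pair supports lookups
  // across every attribute without one index per field (hypothetical
  // "products" collection):
  //   db.products.createIndex({ "attributes.key": 1, "attributes.value": 1 })
  //   db.products.find({
  //     attributes: { $elemMatch: { key: "ram", value: { $gte: 32 } } }
  //   })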

  // Pattern 7: Computed Pattern
  // Use when: Expensive calculations need to be pre-computed and stored
  createUserAnalyticsComputedPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012354a"),
      userId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Computed metrics updated periodically
      computedMetrics: {
        // User engagement metrics
        engagement: {
          totalSessions: 342,
          totalSessionTime: 45600000, // milliseconds
          avgSessionDuration: 133333, // milliseconds (~2.2 minutes)
          lastActiveDate: ISODate("2024-09-14"),
          daysSinceLastActive: 0,

          // Activity patterns
          mostActiveHour: 14, // 2 PM
          mostActiveDay: "tuesday",
          peakActivityScore: 8.7,

          // Engagement trends (last 30 days)
          dailyAverages: {
            sessions: 11.4,
            sessionTime: 1520000, // milliseconds
            pageViews: 23.7
          }
        },

        // Purchase behavior analytics
        purchasing: {
          totalOrders: 23,
          totalSpent: 12485.67,
          avgOrderValue: 543.29,
          daysSinceLastPurchase: 12,

          // Purchase patterns
          preferredCategories: [
            { category: "Electronics", orderCount: 12, totalSpent: 8234.50 },
            { category: "Books", orderCount: 8, totalSpent: 2145.32 },
            { category: "Clothing", orderCount: 3, totalSpent: 2105.85 }
          ],

          // Customer lifecycle metrics  
          lifetimeValue: 12485.67,
          predictedLifetimeValue: 24750.00,
          churnProbability: 0.15,
          nextPurchasePrediction: ISODate("2024-09-28"),

          // RFM scores
          rfmScores: {
            recency: 4, // Recent purchase
            frequency: 3, // Moderate purchase frequency
            monetary: 5, // High spending
            combined: "435",
            segment: "Loyal Customer"
          }
        },

        // Content interaction metrics
        contentEngagement: {
          articlesRead: 45,
          videosWatched: 23,
          totalReadingTime: 54000000, // milliseconds (15 hours)
          avgReadingSpeed: 250, // words per minute

          // Content preferences
          preferredTopics: [
            { topic: "Technology", interactionScore: 9.2, articles: 18 },
            { topic: "Programming", interactionScore: 8.8, articles: 15 },
            { topic: "Career", interactionScore: 7.5, articles: 12 }
          ],

          // Engagement quality
          completionRate: 0.78, // 78% of articles read to completion
          shareRate: 0.12, // 12% of articles shared
          bookmarkRate: 0.25 // 25% of articles bookmarked
        },

        // Social interaction metrics
        socialMetrics: {
          connectionsCount: 156,
          followersCount: 234,
          followingCount: 189,

          // Interaction patterns
          postsCreated: 67,
          commentsPosted: 234,
          likesGiven: 1567,
          sharesGiven: 89,

          // Influence metrics
          avgLikesPerPost: 12.4,
          avgCommentsPerPost: 3.8,
          influenceScore: 7.3,
          engagementRate: 0.065 // 6.5%
        }
      },

      // Computation metadata
      computationMetadata: {
        lastComputedAt: ISODate("2024-09-14T06:00:00Z"),
        nextComputationAt: ISODate("2024-09-15T06:00:00Z"),
        computationFrequency: "daily",
        computationDuration: 2340, // milliseconds
        dataFreshness: "6_hours", // Data is 6 hours old

        // Data sources used in computation
        dataSources: [
          {
            collection: "user_sessions",
            lastProcessedRecord: ISODate("2024-09-14T00:00:00Z"),
            recordsProcessed: 342
          },
          {
            collection: "orders",
            lastProcessedRecord: ISODate("2024-09-13T23:59:59Z"),
            recordsProcessed: 23
          },
          {
            collection: "content_interactions", 
            lastProcessedRecord: ISODate("2024-09-14T00:00:00Z"),
            recordsProcessed: 1456
          }
        ],

        // Computation version for tracking changes
        version: "2.1.0",
        algorithmVersion: "analytics_v2_1"
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14T06:00:00Z")
    };
  }
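
  // The computed document is typically refreshed by a scheduled job that
  // re-aggregates the source collections and overwrites the metrics in place
  // (hypothetical "user_analytics" collection; freshMetrics produced elsewhere):
  //   db.user_analytics.updateOne(
  //     { userId: userId },
  //     { $set: { computedMetrics: freshMetrics,
  //               "computationMetadata.lastComputedAt": new Date(),
  //               updatedAt: new Date() } },
  //     { upsert: true }
  //   )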

  // Method to choose optimal pattern based on use case
  recommendDataPattern(useCase) {
    const recommendations = {
      "user_profile": {
        pattern: "embedded",
        reason: "Related data accessed together, relatively small size",
        example: "createUserProfileEmbeddedPattern()"
      },
      "blog_system": {
        pattern: "reference",
        reason: "Large documents, many-to-many relationships, separate lifecycle",
        example: "createBlogPostReferencePattern()"
      },
      "ecommerce_order": {
        pattern: "hybrid",
        reason: "Need historical snapshots and current references",
        example: "createOrderHybridPattern()"
      },
      "notification_system": {
        pattern: "polymorphic", 
        reason: "Different document structures based on notification type",
        example: "createNotificationPolymorphicPattern()"
      },
      "time_series_data": {
        pattern: "bucket",
        reason: "High-volume data with time-based grouping",
        example: "createMetricsBucketPattern()"
      },
      "product_catalog": {
        pattern: "attribute",
        reason: "Flexible attributes with search and filtering needs",
        example: "createProductAttributePattern()"
      },
      "user_analytics": {
        pattern: "computed",
        reason: "Expensive calculations need pre-computation",
        example: "createUserAnalyticsComputedPattern()"
      }
    };

    return recommendations[useCase] || {
      pattern: "hybrid",
      reason: "Consider combining patterns based on specific requirements",
      example: "Analyze access patterns and choose appropriate combination"
    };
  }
}
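
Putting the recommender to work is straightforward. A minimal usage sketch, assuming the patterns class above has been instantiated as patterns (the class declaration itself appears earlier in the article):

// Choosing a modeling pattern for a new feature (illustrative only)
const recommendation = patterns.recommendDataPattern('time_series_data');
console.log(recommendation.pattern); // "bucket"
console.log(recommendation.reason);  // "High-volume data with time-based grouping"

// Unrecognized use cases fall back to the hybrid suggestion
const fallback = patterns.recommendDataPattern('iot_device_registry');
console.log(fallback.pattern);       // "hybrid"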

Schema Design and Migration Strategies

Implement effective schema evolution and migration patterns:

// Advanced schema design and migration strategies
class MongoSchemaManager {
  constructor(db) {
    this.db = db;
    this.schemaVersions = new Map();
    this.migrationHistory = [];
  }

  async createSchemaVersioningSystem(collection) {
    // Schema versioning pattern for gradual migrations
    const schemaVersionedDocument = {
      _id: ObjectId("64a1b2c3d4e5f6789012355a"),

      // Schema version metadata
      _schema: {
        version: "2.1.0",
        createdAt: ISODate("2024-09-14"),
        lastMigrated: ISODate("2024-09-14T08:30:00Z"),
        migrationHistory: [
          {
            fromVersion: "1.0.0",
            toVersion: "2.0.0",
            migratedAt: ISODate("2024-08-15T10:00:00Z"),
            migrationId: "migration_20240815_v2",
            changes: ["Added user preferences", "Restructured contact methods"]
          },
          {
            fromVersion: "2.0.0",
            toVersion: "2.1.0",
            migratedAt: ISODate("2024-09-14T08:30:00Z"),
            migrationId: "migration_20240914_v21",
            changes: ["Added analytics tracking", "Enhanced profile structure"]
          }
        ]
      },

      // Document data with current schema structure
      username: "sarah_dev",
      email: "[email protected]",
      profile: {
        firstName: "Sarah",
        lastName: "Johnson",
        // ... rest of profile data
      },

      // Optional: Keep old field names for backward compatibility during transition
      _deprecated: {
        // Old structure maintained during migration period
        full_name: "Sarah Johnson", // Deprecated in v2.0.0
        user_preferences: { /* old structure */ }, // Deprecated in v2.1.0
        deprecatedFields: ["full_name", "user_preferences"],
        removalScheduled: ISODate("2024-12-01") // When to remove deprecated fields
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14")
    };

    return schemaVersionedDocument;
  }

  async performGradualMigration(collection, fromVersion, toVersion, migrationConfig) {
    // Gradual migration strategy to avoid downtime
    const migrationPlan = {
      migrationId: `migration_${Date.now()}`,
      collection: collection, // Recorded so getMigrationStatus() can filter history by collection
      fromVersion: fromVersion,
      toVersion: toVersion,
      startedAt: new Date(),

      // Migration phases
      phases: [
        {
          phase: 1,
          name: "preparation",
          description: "Create indexes and validate migration logic",
          status: "pending"
        },
        {
          phase: 2,
          name: "gradual_migration", 
          description: "Migrate documents in batches",
          batchSize: migrationConfig.batchSize || 1000,
          status: "pending"
        },
        {
          phase: 3,
          name: "validation",
          description: "Validate migrated data integrity",
          status: "pending"
        },
        {
          phase: 4,
          name: "cleanup",
          description: "Remove deprecated fields and indexes",
          status: "pending"
        }
      ]
    };

    try {
      // Phase 1: Preparation
      console.log("Phase 1: Preparing migration...");
      migrationPlan.phases[0].status = "in_progress";

      // Create necessary indexes for migration
      if (migrationConfig.newIndexes) {
        for (const index of migrationConfig.newIndexes) {
          await this.db.collection(collection).createIndex(index.fields, index.options);
          console.log(`Created index: ${JSON.stringify(index.fields)}`);
        }
      }

      migrationPlan.phases[0].status = "completed";
      migrationPlan.phases[0].completedAt = new Date();

      // Phase 2: Gradual migration in batches
      console.log("Phase 2: Starting gradual migration...");
      migrationPlan.phases[1].status = "in_progress";
      migrationPlan.phases[1].startedAt = new Date();

      let totalProcessed = 0;
      let batchNumber = 0;

      while (true) {
        batchNumber++;

        // Find documents that need migration
        const documentsToMigrate = await this.db.collection(collection).find({
          "_schema.version": { $ne: toVersion },
          "_migrationLock": { $exists: false } // Avoid concurrent migration
        })
        .limit(migrationConfig.batchSize || 1000)
        .toArray();

        if (documentsToMigrate.length === 0) {
          break; // No more documents to migrate
        }

        console.log(`Processing batch ${batchNumber}: ${documentsToMigrate.length} documents`);

        // Process batch with write concern for durability
        const bulkOperations = [];

        for (const doc of documentsToMigrate) {
          // Set migration lock to prevent concurrent updates
          await this.db.collection(collection).updateOne(
            { _id: doc._id },
            { $set: { "_migrationLock": true } }
          );

          try {
            // Apply migration transformation
            const migratedDoc = await this.applyMigrationTransformation(doc, fromVersion, toVersion);

            // Drop _id so $set never touches the immutable field, and append the
            // migration history entry directly, since combining $set on the whole
            // _schema object with $push on "_schema.migrationHistory" in one
            // update would raise a path conflict
            delete migratedDoc._id;
            migratedDoc._schema.migrationHistory = [
              ...(migratedDoc._schema.migrationHistory || []),
              {
                fromVersion: fromVersion,
                toVersion: toVersion,
                migratedAt: new Date(),
                migrationId: migrationPlan.migrationId
              }
            ];

            bulkOperations.push({
              updateOne: {
                filter: { _id: doc._id },
                update: {
                  $set: migratedDoc,
                  $unset: { "_migrationLock": 1 }
                }
              }
            });

          } catch (error) {
            console.error(`Migration failed for document ${doc._id}:`, error);

            // Remove migration lock on failure
            await this.db.collection(collection).updateOne(
              { _id: doc._id },
              { $unset: { "_migrationLock": 1 } }
            );
          }
        }

        // Execute bulk operations
        if (bulkOperations.length > 0) {
          const result = await this.db.collection(collection).bulkWrite(bulkOperations, {
            writeConcern: { w: "majority" }
          });

          totalProcessed += result.modifiedCount;
          console.log(`Batch ${batchNumber} completed: ${result.modifiedCount} documents migrated`);
        }

        // Add delay between batches to reduce system load
        if (migrationConfig.batchDelayMs) {
          await new Promise(resolve => setTimeout(resolve, migrationConfig.batchDelayMs));
        }
      }

      migrationPlan.phases[1].status = "completed";
      migrationPlan.phases[1].completedAt = new Date();
      migrationPlan.phases[1].documentsProcessed = totalProcessed;

      // Phase 3: Validation
      console.log("Phase 3: Validating migration...");
      migrationPlan.phases[2].status = "in_progress";

      const validationResult = await this.validateMigration(collection, toVersion);

      if (validationResult.success) {
        migrationPlan.phases[2].status = "completed";
        migrationPlan.phases[2].validationResult = validationResult;
        console.log("Migration validation successful");
      } else {
        migrationPlan.phases[2].status = "failed";
        migrationPlan.phases[2].validationResult = validationResult;
        throw new Error(`Migration validation failed: ${validationResult.errors.join(", ")}`);
      }

      // Phase 4: Cleanup (optional, scheduled for later)
      if (migrationConfig.immediateCleanup) {
        console.log("Phase 4: Cleanup...");
        migrationPlan.phases[3].status = "in_progress";

        await this.cleanupDeprecatedFields(collection, migrationConfig.fieldsToRemove);

        migrationPlan.phases[3].status = "completed";
        migrationPlan.phases[3].completedAt = new Date();
      } else {
        migrationPlan.phases[3].status = "scheduled";
        migrationPlan.phases[3].scheduledFor = migrationConfig.cleanupScheduledFor;
      }

      migrationPlan.status = "completed";
      migrationPlan.completedAt = new Date();

      // Record migration in history
      this.migrationHistory.push(migrationPlan);

      return migrationPlan;

    } catch (error) {
      migrationPlan.status = "failed";
      migrationPlan.error = error.message;
      migrationPlan.failedAt = new Date();

      console.error("Migration failed:", error);

      // Attempt to clean up any migration locks
      await this.db.collection(collection).updateMany(
        { "_migrationLock": true },
        { $unset: { "_migrationLock": 1 } }
      );

      throw error;
    }
  }

  async applyMigrationTransformation(document, fromVersion, toVersion) {
    // Apply specific transformation based on version upgrade path
    const transformations = {
      "1.0.0_to_2.0.0": (doc) => {
        // Example: Restructure user contact information
        if (doc.full_name && !doc.profile) {
          const nameParts = doc.full_name.split(" ");
          doc.profile = {
            firstName: nameParts[0] || "",
            lastName: nameParts.slice(1).join(" ") || ""
          };

          // Mark old field as deprecated but keep for backward compatibility
          doc._deprecated = doc._deprecated || {};
          doc._deprecated.full_name = doc.full_name;
          delete doc.full_name;
        }

        // Update schema version
        doc._schema = doc._schema || {};
        doc._schema.version = "2.0.0";
        doc._schema.lastMigrated = new Date();

        return doc;
      },

      "2.0.0_to_2.1.0": (doc) => {
        // Example: Add analytics tracking structure
        if (!doc.analytics) {
          doc.analytics = {
            totalLogins: 0,
            lastLoginAt: null,
            createdAt: doc.createdAt,
            engagement: {
              level: "new",
              score: 0
            }
          };
        }

        // Migrate user preferences structure
        if (doc.user_preferences && !doc.preferences) {
          doc.preferences = {
            theme: doc.user_preferences.theme || "light",
            language: doc.user_preferences.lang || "en",
            notifications: doc.user_preferences.notifications || {}
          };

          // Mark old field as deprecated
          doc._deprecated = doc._deprecated || {};
          doc._deprecated.user_preferences = doc.user_preferences;
          delete doc.user_preferences;
        }

        // Update schema version
        doc._schema.version = "2.1.0";
        doc._schema.lastMigrated = new Date();

        return doc;
      }
    };

    const transformationKey = `${fromVersion}_to_${toVersion}`;
    const transformation = transformations[transformationKey];

    if (!transformation) {
      throw new Error(`No transformation defined for ${transformationKey}`);
    }

    return transformation({ ...document }); // Work with copy to avoid mutations
  }

  async validateMigration(collection, expectedVersion) {
    const validationResult = {
      success: true,
      errors: [],
      warnings: [],
      statistics: {}
    };

    try {
      // Check all documents have the correct schema version
      const totalDocuments = await this.db.collection(collection).countDocuments({});
      const migratedDocuments = await this.db.collection(collection).countDocuments({
        "_schema.version": expectedVersion
      });

      validationResult.statistics.totalDocuments = totalDocuments;
      validationResult.statistics.migratedDocuments = migratedDocuments;
      validationResult.statistics.migrationCompleteness =
        totalDocuments > 0 ? migratedDocuments / totalDocuments : 1;

      if (migratedDocuments !== totalDocuments) {
        validationResult.errors.push(
          `Migration incomplete: ${migratedDocuments}/${totalDocuments} documents migrated`
        );
        validationResult.success = false;
      }

      // Check for migration locks (indicates failed migrations)
      const lockedDocuments = await this.db.collection(collection).countDocuments({
        "_migrationLock": true
      });

      if (lockedDocuments > 0) {
        validationResult.warnings.push(
          `${lockedDocuments} documents have migration locks - may indicate failed migrations`
        );
      }

      // Validate sample documents have expected structure
      const sampleSize = Math.min(100, migratedDocuments);
      const sampleDocuments = await this.db.collection(collection).aggregate([
        { $match: { "_schema.version": expectedVersion } },
        { $sample: { size: sampleSize } }
      ]).toArray();

      let structureValidationErrors = 0;

      for (const doc of sampleDocuments) {
        try {
          await this.validateDocumentStructure(doc, expectedVersion);
        } catch (error) {
          structureValidationErrors++;
        }
      }

      if (structureValidationErrors > 0) {
        validationResult.errors.push(
          `${structureValidationErrors}/${sampleSize} sample documents have structure validation errors`
        );
        validationResult.success = false;
      }

      validationResult.statistics.sampleSize = sampleSize;
      validationResult.statistics.structureValidationErrors = structureValidationErrors;

    } catch (error) {
      validationResult.success = false;
      validationResult.errors.push(`Validation error: ${error.message}`);
    }

    return validationResult;
  }

  async validateDocumentStructure(document, schemaVersion) {
    // Define expected structure for each schema version
    const schemaValidators = {
      "2.1.0": (doc) => {
        // Required fields for version 2.1.0
        const requiredFields = ["_schema", "username", "email", "profile", "createdAt"];

        for (const field of requiredFields) {
          if (!doc.hasOwnProperty(field)) {
            throw new Error(`Missing required field: ${field}`);
          }
        }

        // Validate _schema structure
        if (!doc._schema.version || !doc._schema.lastMigrated) {
          throw new Error("Invalid _schema structure");
        }

        // Validate profile structure
        if (!doc.profile.firstName || !doc.profile.lastName) {
          throw new Error("Invalid profile structure");
        }

        return true;
      }
    };

    const validator = schemaValidators[schemaVersion];
    if (!validator) {
      throw new Error(`No validator defined for schema version ${schemaVersion}`);
    }

    return validator(document);
  }

  async cleanupDeprecatedFields(collection, fieldsToRemove) {
    // Remove deprecated fields after successful migration
    console.log(`Cleaning up deprecated fields: ${fieldsToRemove.join(", ")}`);

    const unsetFields = fieldsToRemove.reduce((acc, field) => {
      acc[field] = 1;
      acc[`_deprecated.${field}`] = 1;
      return acc;
    }, {});

    const result = await this.db.collection(collection).updateMany(
      {}, // Update all documents
      {
        $unset: unsetFields,
        $set: {
          "cleanupCompletedAt": new Date()
        }
      }
    );

    console.log(`Cleanup completed: ${result.modifiedCount} documents updated`);
    return result;
  }

  async createSchemaValidationRules(collection, schemaVersion) {
    // Create MongoDB schema validation rules
    const validationRules = {
      "2.1.0": {
        $jsonSchema: {
          bsonType: "object",
          required: ["_schema", "username", "email", "profile", "createdAt"],
          properties: {
            _schema: {
              bsonType: "object",
              required: ["version"],
              properties: {
                version: {
                  bsonType: "string",
                  enum: ["2.1.0"]
                },
                lastMigrated: {
                  bsonType: "date"
                }
              }
            },
            username: {
              bsonType: "string",
              minLength: 3,
              maxLength: 30
            },
            email: {
              bsonType: "string",
              pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
            },
            profile: {
              bsonType: "object",
              required: ["firstName", "lastName"],
              properties: {
                firstName: { bsonType: "string", maxLength: 50 },
                lastName: { bsonType: "string", maxLength: 50 },
                dateOfBirth: { bsonType: "date" },
                avatar: {
                  bsonType: "object",
                  properties: {
                    url: { bsonType: "string" },
                    uploadedAt: { bsonType: "date" }
                  }
                }
              }
            },
            createdAt: { bsonType: "date" },
            updatedAt: { bsonType: "date" }
          }
        }
      }
    };

    const rule = validationRules[schemaVersion];
    if (!rule) {
      throw new Error(`No validation rule defined for schema version ${schemaVersion}`);
    }

    // Apply validation rule to collection
    await this.db.command({
      collMod: collection,
      validator: rule,
      validationLevel: "moderate", // Validate inserts, and only updates to documents that already satisfy the rules
      validationAction: "warn" // Log validation errors but allow operations
    });

    console.log(`Schema validation rules applied to ${collection} for version ${schemaVersion}`);
    return rule;
  }

  async getMigrationStatus(collection) {
    // Get comprehensive migration status for a collection
    const status = {
      collection: collection,
      currentTime: new Date(),
      schemaVersions: {},
      totalDocuments: 0,
      migrationLocks: 0,
      deprecatedFields: [],
      recentMigrations: []
    };

    // Count documents by schema version
    const versionCounts = await this.db.collection(collection).aggregate([
      {
        $group: {
          _id: "$_schema.version",
          count: { $sum: 1 },
          lastMigrated: { $max: "$_schema.lastMigrated" }
        }
      },
      { $sort: { "_id": 1 } }
    ]).toArray();

    versionCounts.forEach(version => {
      status.schemaVersions[version._id || "unknown"] = {
        count: version.count,
        lastMigrated: version.lastMigrated
      };
      status.totalDocuments += version.count;
    });

    // Count migration locks
    status.migrationLocks = await this.db.collection(collection).countDocuments({
      "_migrationLock": true
    });

    // Find documents with deprecated fields
    const deprecatedFieldsAnalysis = await this.db.collection(collection).aggregate([
      { $match: { "_deprecated": { $exists: true } } },
      {
        $project: {
          deprecatedFields: { $objectToArray: "$_deprecated" }
        }
      },
      { $unwind: "$deprecatedFields" },
      {
        $group: {
          _id: "$deprecatedFields.k",
          count: { $sum: 1 }
        }
      }
    ]).toArray();

    status.deprecatedFields = deprecatedFieldsAnalysis.map(field => ({
      fieldName: field._id,
      documentCount: field.count
    }));

    // Get recent migration history
    status.recentMigrations = this.migrationHistory
      .filter(migration => migration.collection === collection)
      .slice(-5) // Last 5 migrations
      .map(migration => ({
        migrationId: migration.migrationId,
        fromVersion: migration.fromVersion,
        toVersion: migration.toVersion,
        status: migration.status,
        completedAt: migration.completedAt,
        documentsProcessed: migration.phases[1]?.documentsProcessed
      }));

    return status;
  }
}
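
A hedged usage sketch for the migration runner above. The migrationConfig fields mirror the options the method actually reads (batchSize, batchDelayMs, newIndexes, immediateCleanup, fieldsToRemove, cleanupScheduledFor); the connection string, database name, and index name are placeholders:

// Running a gradual 2.0.0 -> 2.1.0 migration (illustrative values)
const { MongoClient } = require('mongodb');

async function runMigration() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const schemaManager = new MongoSchemaManager(client.db('app'));

  const plan = await schemaManager.performGradualMigration('users', '2.0.0', '2.1.0', {
    batchSize: 500,
    batchDelayMs: 250, // throttle between batches to reduce cluster load
    newIndexes: [
      { fields: { '_schema.version': 1 }, options: { name: 'idx_schema_version' } }
    ],
    immediateCleanup: false, // schedule deprecated-field cleanup for later
    cleanupScheduledFor: new Date('2024-12-01'),
    fieldsToRemove: ['user_preferences']
  });

  console.log(`Migration ${plan.migrationId} finished with status: ${plan.status}`);
  await client.close();
}

runMigration().catch(console.error);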

SQL-Style Data Modeling with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB data modeling and schema design:

-- QueryLeaf data modeling with SQL-familiar schema design syntax

-- Define document structure similar to CREATE TABLE
CREATE DOCUMENT_SCHEMA users (
  _id OBJECTID PRIMARY KEY,
  username VARCHAR(30) NOT NULL UNIQUE,
  email VARCHAR(255) NOT NULL UNIQUE,

  -- Embedded document structure
  profile DOCUMENT {
    firstName VARCHAR(50) NOT NULL,
    lastName VARCHAR(50) NOT NULL,
    dateOfBirth DATE,
    avatar DOCUMENT {
      url VARCHAR(500),
      uploadedAt TIMESTAMP,
      size DOCUMENT {
        width INTEGER,
        height INTEGER
      }
    },
    bio TEXT,
    location DOCUMENT {
      city VARCHAR(100),
      state VARCHAR(50),
      country VARCHAR(100),
      timezone VARCHAR(50)
    }
  },

  -- Array of embedded documents
  contactMethods ARRAY OF DOCUMENT {
    type ENUM('email', 'phone', 'address'),
    value VARCHAR(255) NOT NULL,
    isPrimary BOOLEAN DEFAULT false,
    isVerified BOOLEAN DEFAULT false,
    verifiedAt TIMESTAMP
  },

  -- Array of simple values with constraints
  skills ARRAY OF DOCUMENT {
    name VARCHAR(100) NOT NULL,
    level ENUM('beginner', 'intermediate', 'advanced', 'expert'),
    yearsExperience INTEGER CHECK (yearsExperience >= 0)
  },

  -- Reference to another collection
  departmentId OBJECTID REFERENCES departments(_id),

  -- Embedded metadata
  account DOCUMENT {
    status ENUM('active', 'inactive', 'suspended') DEFAULT 'active',
    type ENUM('free', 'premium', 'enterprise') DEFAULT 'free',
    createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    lastLoginAt TIMESTAMP,
    loginCount INTEGER DEFAULT 0,
    isEmailVerified BOOLEAN DEFAULT false,
    twoFactorEnabled BOOLEAN DEFAULT false
  },

  -- Flexible attributes using attribute pattern
  attributes ARRAY OF DOCUMENT {
    key VARCHAR(100) NOT NULL,
    value MIXED, -- Can be string, number, boolean, etc.
    type ENUM('string', 'number', 'boolean', 'date'),
    isSearchable BOOLEAN DEFAULT false,
    isFilterable BOOLEAN DEFAULT false,
    category VARCHAR(50)
  },

  -- Timestamps for auditing
  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes for optimal query performance
CREATE INDEX idx_users_username ON users (username);
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_profile_name ON users (profile.firstName, profile.lastName);
CREATE INDEX idx_users_skills ON users (skills.name, skills.level);
CREATE INDEX idx_users_location ON users (profile.location.city, profile.location.state);

-- Compound index for complex queries
CREATE INDEX idx_users_active_premium ON users (account.status, account.type, createdAt);

-- Text index for full-text search
CREATE TEXT INDEX idx_users_search ON users (
  username,
  profile.firstName,
  profile.lastName,
  profile.bio,
  skills.name
);

-- Schema versioning and migration management
ALTER DOCUMENT_SCHEMA users ADD COLUMN analytics DOCUMENT {
  totalLogins INTEGER DEFAULT 0,
  lastLoginAt TIMESTAMP,
  engagement DOCUMENT {
    level ENUM('new', 'active', 'power', 'inactive') DEFAULT 'new',
    score DECIMAL(3,2) DEFAULT 0.00
  }
} WITH MIGRATION_STRATEGY gradual;

-- Polymorphic document schema for notifications
CREATE DOCUMENT_SCHEMA notifications (
  _id OBJECTID PRIMARY KEY,
  userId OBJECTID NOT NULL REFERENCES users(_id),
  type ENUM('email', 'push', 'sms') NOT NULL,

  -- Common fields for all notification types
  title VARCHAR(200) NOT NULL,
  priority ENUM('low', 'normal', 'high', 'urgent') DEFAULT 'normal',
  status ENUM('pending', 'sent', 'delivered', 'failed') DEFAULT 'pending',

  -- Polymorphic data based on type using VARIANT
  notificationData VARIANT {
    WHEN type = 'email' THEN DOCUMENT {
      from VARCHAR(255) NOT NULL,
      to VARCHAR(255) NOT NULL,
      subject VARCHAR(500) NOT NULL,
      templateId VARCHAR(100),
      templateVariables DOCUMENT,
      deliveryAttempts INTEGER DEFAULT 0,
      deliveredAt TIMESTAMP,
      openedAt TIMESTAMP,
      clickedAt TIMESTAMP
    },

    WHEN type = 'push' THEN DOCUMENT {
      deviceTokens ARRAY OF VARCHAR(255),
      payload DOCUMENT {
        alert DOCUMENT {
          title VARCHAR(200),
          body VARCHAR(500)
        },
        badge INTEGER,
        sound VARCHAR(50),
        category VARCHAR(100),
        customData DOCUMENT
      },
      deliveryResults ARRAY OF DOCUMENT {
        deviceToken VARCHAR(255),
        status ENUM('delivered', 'failed'),
        error VARCHAR(255),
        timestamp TIMESTAMP
      }
    },

    WHEN type = 'sms' THEN DOCUMENT {
      to VARCHAR(20) NOT NULL,
      from VARCHAR(20),
      message VARCHAR(1600) NOT NULL,
      provider VARCHAR(50),
      messageId VARCHAR(255),
      segments INTEGER DEFAULT 1,
      cost DECIMAL(6,4),
      deliveredAt TIMESTAMP,
      deliveryStatus VARCHAR(50)
    }
  },

  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Bucket pattern for time-series metrics
CREATE DOCUMENT_SCHEMA user_activity_buckets (
  _id OBJECTID PRIMARY KEY,

  -- Bucket identification
  userId OBJECTID NOT NULL REFERENCES users(_id),
  bucketDate TIMESTAMP NOT NULL, -- Hour/day bucket start time
  bucketType ENUM('hourly', 'daily') NOT NULL,

  -- Bucket metadata
  metadata DOCUMENT {
    userName VARCHAR(30),
    userSegment VARCHAR(50),
    deviceType VARCHAR(50),
    location VARCHAR(100)
  },

  -- Event counter
  eventCount INTEGER DEFAULT 0,

  -- Array of events within the bucket
  events ARRAY OF DOCUMENT {
    timestamp TIMESTAMP NOT NULL,
    eventType ENUM('page_view', 'click', 'api_call', 'error') NOT NULL,
    page VARCHAR(500),
    element VARCHAR(200),
    sessionId VARCHAR(100),
    responseTime INTEGER,
    statusCode INTEGER,
    userAgent TEXT
  } VALIDATE (ARRAY_LENGTH(events) <= 1000), -- Limit bucket size

  -- Pre-computed summary statistics
  summary DOCUMENT {
    pageViews INTEGER DEFAULT 0,
    clicks INTEGER DEFAULT 0,
    apiCalls INTEGER DEFAULT 0,
    errors INTEGER DEFAULT 0,
    uniquePages INTEGER DEFAULT 0,
    totalResponseTime BIGINT DEFAULT 0,
    avgResponseTime DECIMAL(8,2),
    maxResponseTime INTEGER,
    minResponseTime INTEGER
  },

  -- Bucket management
  bucketMetadata DOCUMENT {
    isFull BOOLEAN DEFAULT false,
    maxEvents INTEGER DEFAULT 1000,
    nextBucketId OBJECTID
  },

  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Compound index for efficient bucket queries
CREATE INDEX idx_activity_buckets_user_time ON user_activity_buckets (
  userId, bucketType, bucketDate
);

-- Complex analytics queries with document modeling
WITH user_engagement AS (
  SELECT 
    u._id as user_id,
    u.username,
    u.profile.firstName || ' ' || u.profile.lastName as full_name,
    u.account.type as account_type,

    -- Aggregate metrics from activity buckets
    SUM(ab.summary.pageViews) as total_page_views,
    SUM(ab.summary.clicks) as total_clicks,
    AVG(ab.summary.avgResponseTime) as avg_response_time,
    COUNT(DISTINCT ab.bucketDate) as active_days,

    -- Calculate engagement score
    (SUM(ab.summary.pageViews) * 0.1 + 
     SUM(ab.summary.clicks) * 0.3 + 
     COUNT(DISTINCT ab.bucketDate) * 0.6) as engagement_score,

    -- User profile attributes
    ARRAY_AGG(
      CASE WHEN ua.attributes->key = 'department' 
           THEN ua.attributes->value 
      END
    ) FILTER (WHERE ua.attributes->key = 'department') as departments,

    -- Location information
    u.profile.location.city as city,
    u.profile.location.state as state

  FROM users u
  LEFT JOIN user_activity_buckets ab ON u._id = ab.userId
    AND ab.bucketDate >= CURRENT_DATE - INTERVAL '30 days'
  LEFT JOIN UNNEST(u.attributes) as ua ON true

  WHERE u.account.status = 'active'
    AND u.createdAt >= CURRENT_DATE - INTERVAL '1 year'

  GROUP BY u._id, u.username, u.profile.firstName, u.profile.lastName,
           u.account.type, u.profile.location.city, u.profile.location.state
),

engagement_segments AS (
  SELECT *,
    CASE 
      WHEN engagement_score >= 50 THEN 'High Engagement'
      WHEN engagement_score >= 20 THEN 'Medium Engagement' 
      WHEN engagement_score >= 5 THEN 'Low Engagement'
      ELSE 'Inactive'
    END as engagement_segment,

    -- Percentile ranking within account type
    PERCENT_RANK() OVER (
      PARTITION BY account_type 
      ORDER BY engagement_score
    ) as engagement_percentile

  FROM user_engagement
)

SELECT 
  engagement_segment,
  account_type,
  COUNT(*) as user_count,
  AVG(engagement_score) as avg_engagement_score,
  AVG(total_page_views) as avg_page_views,
  AVG(active_days) as avg_active_days,

  -- Top cities by user count in each segment
  ARRAY_AGG(
    JSON_BUILD_OBJECT(
      'city', city,
      'state', state,
      'count', COUNT(*) OVER (PARTITION BY city, state)
    ) ORDER BY COUNT(*) OVER (PARTITION BY city, state) DESC LIMIT 5
  ) as top_locations,

  -- Engagement distribution
  JSON_BUILD_OBJECT(
    'min', MIN(engagement_score),
    'p25', PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY engagement_score),
    'p50', PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY engagement_score),
    'p75', PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY engagement_score),
    'p95', PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY engagement_score),
    'max', MAX(engagement_score)
  ) as engagement_distribution

FROM engagement_segments
GROUP BY engagement_segment, account_type
ORDER BY engagement_segment, account_type;

-- Schema validation and data quality checks
SELECT 
  collection_name,
  schema_version,
  document_count,

  -- Data quality metrics
  (SELECT COUNT(*) FROM users WHERE username IS NULL) as missing_usernames,
  (SELECT COUNT(*) FROM users WHERE email IS NULL) as missing_emails,
  (SELECT COUNT(*) FROM users WHERE profile IS NULL) as missing_profiles,

  -- Schema compliance
  (SELECT COUNT(*) FROM users WHERE _schema.version != '2.1.0') as outdated_schema,
  (SELECT COUNT(*) FROM users WHERE _migrationLock = true) as migration_locks,

  -- Index usage analysis
  JSON_BUILD_OBJECT(
    'username_index_usage', INDEX_USAGE_STATS('users', 'idx_users_username'),
    'email_index_usage', INDEX_USAGE_STATS('users', 'idx_users_email'),
    'profile_name_index_usage', INDEX_USAGE_STATS('users', 'idx_users_profile_name')
  ) as index_statistics,

  -- Storage efficiency metrics
  AVG_DOCUMENT_SIZE('users') as avg_document_size_kb,
  DOCUMENT_SIZE_DISTRIBUTION('users') as size_distribution,

  CURRENT_TIMESTAMP as analysis_timestamp

FROM DOCUMENT_SCHEMA_STATS('users');

-- Migration management with SQL-style syntax
CREATE MIGRATION migrate_users_v2_to_v3 AS
BEGIN
  -- Add new analytics structure
  ALTER DOCUMENT_SCHEMA users 
  ADD COLUMN detailed_analytics DOCUMENT {
    sessions ARRAY OF DOCUMENT {
      sessionId VARCHAR(100),
      startTime TIMESTAMP,
      endTime TIMESTAMP,
      pageViews INTEGER,
      actions ARRAY OF VARCHAR(100)
    },
    preferences DOCUMENT {
      communicationChannels ARRAY OF ENUM('email', 'sms', 'push'),
      contentTopics ARRAY OF VARCHAR(100),
      frequencySettings DOCUMENT {
        marketing ENUM('never', 'weekly', 'monthly'),
        updates ENUM('immediate', 'daily', 'weekly')
      }
    }
  };

  -- Update existing documents with default values
  UPDATE users 
  SET detailed_analytics = {
    sessions: [],
    preferences: {
      communicationChannels: ['email'],
      contentTopics: [],
      frequencySettings: {
        marketing: 'monthly',
        updates: 'weekly'
      }
    }
  }
  WHERE detailed_analytics IS NULL;

  -- Update schema version
  UPDATE users 
  SET 
    _schema.version = '3.0.0',
    _schema.lastMigrated = CURRENT_TIMESTAMP,
    updatedAt = CURRENT_TIMESTAMP;

END;

-- Execute migration with options
EXECUTE MIGRATION migrate_users_v2_to_v3 WITH OPTIONS (
  batch_size = 1000,
  batch_delay_ms = 100,
  validation_sample_size = 50,
  cleanup_schedule = '2024-12-01'
);

-- Monitor migration progress
SELECT 
  migration_name,
  status,
  current_phase,
  documents_processed,
  estimated_completion,
  error_count,
  last_error_message
FROM MIGRATION_STATUS('migrate_users_v2_to_v3');

-- QueryLeaf data modeling provides:
-- 1. SQL-familiar schema definition with document structure support
-- 2. Flexible embedded documents and arrays with validation
-- 3. Polymorphic schemas with variant types based on discriminator fields
-- 4. Advanced indexing strategies for document queries
-- 5. Schema versioning and gradual migration management
-- 6. Data quality validation and compliance checking
-- 7. Storage efficiency analysis and optimization recommendations
-- 8. Integration with MongoDB's native document features
-- 9. SQL-style complex queries across embedded structures
-- 10. Automated migration execution with rollback capabilities

Best Practices for MongoDB Data Modeling

Design Decision Framework

Strategic approach to document design decisions:

  1. Access Pattern Analysis: Design documents based on how data will be queried and updated
  2. Cardinality Considerations: Choose embedding vs. referencing based on relationship cardinality
  3. Data Growth Patterns: Consider how document size and collection size will grow over time
  4. Update Frequency: Factor in how often different parts of documents will be updated
  5. Consistency Requirements: Balance performance with data consistency needs
  6. Query Performance: Optimize document structure for most common query patterns

Performance Optimization Guidelines

Essential practices for high-performance document modeling:

  1. Document Size Management: Keep documents well under the 16MB BSON limit and optimize for the working set (see the monitoring sketch after this list)
  2. Index Strategy: Create indexes that support your access patterns and query requirements
  3. Denormalization Strategy: Strategic denormalization for read performance vs. update complexity
  4. Array Size Limits: Monitor array growth to prevent performance degradation
  5. Embedding Depth: Limit nesting levels to maintain query performance and readability
  6. Schema Evolution: Plan for schema changes without downtime using versioning strategies
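
The sketch below shows one way to flag documents that are drifting toward the BSON size limit or accumulating oversized arrays. It assumes a connected db handle and an async context; the collection and field names are placeholders drawn from the bucket example earlier, and $bsonSize requires MongoDB 4.4 or later.

// Flag documents approaching the BSON size limit or carrying oversized arrays
const oversized = await db.collection('user_activity_buckets').aggregate([
  {
    $project: {
      docSizeBytes: { $bsonSize: '$$ROOT' },
      eventCount: { $size: { $ifNull: ['$events', []] } }
    }
  },
  {
    $match: {
      $or: [
        { docSizeBytes: { $gt: 8 * 1024 * 1024 } }, // over half the 16MB limit
        { eventCount: { $gt: 1000 } }               // exceeds the intended bucket cap
      ]
    }
  },
  { $sort: { docSizeBytes: -1 } },
  { $limit: 25 }
]).toArray();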

Conclusion

MongoDB data modeling requires a fundamental shift from relational thinking to document-oriented design principles. By understanding when to embed versus reference data, how to structure documents for optimal performance, and how to implement effective schema evolution strategies, you can create database designs that are both flexible and performant.

Key data modeling benefits include:

  • Flexible Schema Design: Documents can evolve naturally with application requirements
  • Optimal Performance: Strategic embedding eliminates complex joins for read-heavy workloads
  • Natural Data Structures: Document structure aligns with object-oriented programming models
  • Horizontal Scalability: Document design supports sharding and distributed architectures
  • Rich Data Types: Native support for arrays, nested objects, and complex data structures
  • Schema Evolution: Gradual migration strategies enable schema changes without downtime

Whether you're building content management systems, e-commerce platforms, real-time analytics applications, or any system requiring flexible data structures, MongoDB's document modeling with QueryLeaf's familiar SQL interface provides the foundation for scalable, maintainable database designs. This combination enables you to leverage advanced NoSQL capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar schema definitions into optimal MongoDB document structures while providing familiar syntax for complex document queries, schema evolution, and migration management. Advanced document patterns, validation rules, and performance optimization are seamlessly handled through SQL-style operations, making flexible schema design both powerful and accessible.

The integration of flexible document modeling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both sophisticated data structures and familiar database interaction patterns, ensuring your data models remain both efficient and maintainable as they scale and evolve.

MongoDB Atlas Search and Full-Text Indexing: SQL-Style Text Search with Advanced Analytics and Ranking

Modern applications require sophisticated search capabilities that go beyond simple text matching - semantic understanding, relevance scoring, faceted search, auto-completion, and real-time search analytics. Traditional relational databases provide basic full-text search through extensions like PostgreSQL's pg_trgm or MySQL's MATCH AGAINST, but struggle with advanced search features, relevance ranking, and the performance demands of modern search applications.

MongoDB Atlas Search provides enterprise-grade search capabilities built on Apache Lucene, delivering advanced full-text search, semantic search, vector search, and search analytics directly integrated with your MongoDB data. Unlike external search engines that require complex data synchronization, Atlas Search maintains real-time consistency with your database while providing powerful search features typically found only in dedicated search platforms.

The Traditional Search Challenge

Relational database search approaches have significant limitations for modern applications:

-- Traditional SQL full-text search - limited and inefficient

-- PostgreSQL full-text search approach
CREATE TABLE articles (
    article_id SERIAL PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    author_id INTEGER REFERENCES users(user_id),
    category VARCHAR(100),
    tags TEXT[],
    published_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    view_count INTEGER DEFAULT 0,

    -- Full-text search vectors
    title_tsvector TSVECTOR,
    content_tsvector TSVECTOR,
    combined_tsvector TSVECTOR
);

-- Create full-text search indexes
CREATE INDEX idx_articles_title_fts ON articles USING GIN(title_tsvector);
CREATE INDEX idx_articles_content_fts ON articles USING GIN(content_tsvector);
CREATE INDEX idx_articles_combined_fts ON articles USING GIN(combined_tsvector);

-- Maintain search vectors with triggers
CREATE OR REPLACE FUNCTION update_article_search_vectors()
RETURNS TRIGGER AS $$
BEGIN
    NEW.title_tsvector := to_tsvector('english', NEW.title);
    NEW.content_tsvector := to_tsvector('english', NEW.content);
    NEW.combined_tsvector := to_tsvector('english', 
        NEW.title || ' ' || NEW.content || ' ' || array_to_string(NEW.tags, ' '));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_update_search_vectors
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION update_article_search_vectors();

-- Basic full-text search query
SELECT 
    a.article_id,
    a.title,
    a.published_date,
    a.view_count,

    -- Simple relevance ranking
    ts_rank(a.combined_tsvector, query) as relevance_score,

    -- Highlight search terms (basic)
    ts_headline('english', a.content, query, 
        'MaxWords=50, MinWords=10, ShortWord=3') as snippet

FROM articles a,
     plainto_tsquery('english', 'machine learning algorithms') as query
WHERE a.combined_tsvector @@ query
ORDER BY ts_rank(a.combined_tsvector, query) DESC
LIMIT 20;

-- Problems with traditional full-text search:
-- 1. Limited language support and stemming capabilities
-- 2. Basic relevance scoring without advanced ranking factors
-- 3. No semantic understanding or synonym handling
-- 4. Limited faceting and aggregation capabilities
-- 5. Poor auto-completion and suggestion features
-- 6. No built-in analytics or search performance metrics
-- 7. Complex maintenance of search vectors and triggers
-- 8. Limited scalability for large document collections

-- MySQL full-text search (even more limited)
CREATE TABLE documents (
    doc_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    content LONGTEXT,
    category VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FULLTEXT(title, content)
) ENGINE=InnoDB;

-- Basic MySQL full-text search
SELECT 
    doc_id,
    title,
    created_at,
    MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE) as score
FROM documents 
WHERE MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC
LIMIT 20;

-- MySQL limitations:
-- - Minimum word length restrictions
-- - Limited boolean query syntax
-- - Poor performance with large datasets
-- - No advanced ranking or analytics
-- - Limited customization options

MongoDB Atlas Search provides comprehensive search capabilities:

// MongoDB Atlas Search - enterprise-grade search with advanced features
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb+srv://cluster.mongodb.net');
const db = client.db('content_platform');
const articles = db.collection('articles');

// Advanced Atlas Search query with multiple search techniques
const searchQuery = [
  {
    $search: {
      index: "articles_search_index", // Custom search index
      compound: {
        must: [
          // Text search with fuzzy matching
          {
            text: {
              query: "machine learning algorithms",
              path: ["title", "content"],
              fuzzy: {
                maxEdits: 2,
                prefixLength: 1,
                maxExpansions: 50
              }
            }
          }
        ],
        should: [
          // Boost title matches
          {
            text: {
              query: "machine learning algorithms",
              path: "title",
              score: { boost: { value: 3.0 } }
            }
          },
          // Phrase matching with slop
          {
            phrase: {
              query: "machine learning",
              path: ["title", "content"],
              slop: 2,
              score: { boost: { value: 2.0 } }
            }
          },
          // Semantic search using synonyms
          {
            text: {
              query: "machine learning algorithms",
              path: ["title", "content"],
              synonyms: "tech_synonyms"
            }
          }
        ],
        filter: [
          // Date range filtering
          {
            range: {
              path: "publishedDate",
              gte: new Date("2023-01-01"),
              lte: new Date("2025-12-31")
            }
          },
          // Category filtering
          {
            text: {
              query: ["technology", "science", "ai"],
              path: "category"
            }
          }
        ],
        mustNot: [
          // Exclude draft articles
          {
            equals: {
              path: "status",
              value: "draft"
            }
          }
        ]
      },

      // Advanced highlighting
      highlight: {
        path: ["title", "content"],
        maxCharsToExamine: 500000,
        maxNumPassages: 3
      },

      // Count total matches
      count: {
        type: "total"
      }
    }
  },

  // Add computed relevance and metadata
  {
    $addFields: {
      searchScore: { $meta: "searchScore" },
      searchHighlights: { $meta: "searchHighlights" },

      // Custom scoring factors
      popularityScore: {
        $divide: [
          { $add: ["$viewCount", "$likeCount"] },
          { $max: [{ $divide: [{ $subtract: [new Date(), "$publishedDate"] }, 86400000] }, 1] }
        ]
      },

      // Content quality indicators
      contentQuality: {
        $cond: {
          if: { $gte: [{ $strLenCP: "$content" }, 1000] },
          then: { $min: [{ $divide: [{ $strLenCP: "$content" }, 500] }, 5] },
          else: 1
        }
      }
    }
  },

  // Faceted aggregations for search filters
  {
    $facet: {
      // Main search results
      results: [
        {
          $addFields: {
            finalScore: {
              $add: [
                "$searchScore",
                { $multiply: ["$popularityScore", 0.2] },
                { $multiply: ["$contentQuality", 0.1] }
              ]
            }
          }
        },
        { $sort: { finalScore: -1 } },
        { $limit: 20 },
        {
          $project: {
            articleId: "$_id",
            title: 1,
            author: 1,
            category: 1,
            tags: 1,
            publishedDate: 1,
            viewCount: 1,
            searchScore: 1,
            finalScore: 1,
            searchHighlights: 1,
            snippet: { $substr: ["$content", 0, 200] }
          }
        }
      ],

      // Category facets
      categoryFacets: [
        {
          $group: {
            _id: "$category",
            count: { $sum: 1 },
            avgScore: { $avg: "$searchScore" }
          }
        },
        { $sort: { count: -1 } },
        { $limit: 10 }
      ],

      // Author facets
      authorFacets: [
        {
          $group: {
            _id: "$author.name",
            count: { $sum: 1 },
            articles: { $push: "$title" }
          }
        },
        { $sort: { count: -1 } },
        { $limit: 10 }
      ],

      // Date range facets
      dateFacets: [
        {
          $group: {
            _id: {
              year: { $year: "$publishedDate" },
              month: { $month: "$publishedDate" }
            },
            count: { $sum: 1 },
            avgScore: { $avg: "$searchScore" }
          }
        },
        { $sort: { "_id.year": -1, "_id.month": -1 } }
      ],

      // Search analytics
      searchAnalytics: [
        {
          $group: {
            _id: null,
            totalResults: { $sum: 1 },
            avgScore: { $avg: "$searchScore" },
            maxScore: { $max: "$searchScore" },
            scoreDistribution: {
              $push: {
                $switch: {
                  branches: [
                    { case: { $gte: ["$searchScore", 10] }, then: "excellent" },
                    { case: { $gte: ["$searchScore", 5] }, then: "good" },
                    { case: { $gte: ["$searchScore", 2] }, then: "fair" }
                  ],
                  default: "poor"
                }
              }
            }
          }
        }
      ]
    }
  }
];

// Execute search with comprehensive results
const searchResults = await articles.aggregate(searchQuery).toArray();

// Benefits of MongoDB Atlas Search:
// - Advanced relevance scoring with custom ranking factors
// - Semantic search with synonym support and fuzzy matching
// - Real-time search index updates synchronized with data changes
// - Faceted search with complex aggregations
// - Advanced highlighting and snippet generation
// - Built-in analytics and search performance metrics
// - Support for multiple languages and custom analyzers
// - Vector search capabilities for AI and machine learning
// - Auto-completion and suggestion features
// - Geospatial search integration
// - Security and access control integration

Understanding MongoDB Atlas Search Architecture

Search Index Creation and Management

Implement comprehensive search indexes for optimal performance:

// Advanced Atlas Search index management system
class AtlasSearchManager {
  constructor(db) {
    this.db = db;
    this.searchIndexes = new Map();
    this.searchAnalytics = db.collection('search_analytics');
  }

  async createComprehensiveSearchIndex(collection, indexName, indexDefinition) {
    // Create sophisticated search index with multiple field types
    const advancedIndexDefinition = {
      name: indexName,
      definition: {
        // Text search fields with different analyzers
        mappings: {
          dynamic: false,
          fields: {
            // Title field with enhanced text analysis
            title: {
              type: "string",
              analyzer: "lucene.english",
              searchAnalyzer: "lucene.keyword",
              highlightAnalyzer: "lucene.english",
              store: true,
              indexOptions: "freqs"
            },

            // Content field with full-text capabilities
            content: {
              type: "string",
              analyzer: "content_analyzer",
              maxGrams: 15,
              minGrams: 2,
              store: true
            },

            // Category as both text and facet
            category: [
              {
                type: "string",
                analyzer: "lucene.keyword"
              },
              {
                type: "stringFacet"
              }
            ],

            // Tags for exact and fuzzy matching
            tags: {
              type: "string",
              analyzer: "lucene.standard",
              multi: {
                keyword: {
                  type: "string",
                  analyzer: "lucene.keyword"
                }
              }
            },

            // Author information
            "author.name": {
              type: "string",
              analyzer: "lucene.standard",
              store: true
            },

            "author.expertise": {
              type: "stringFacet"
            },

            // Numeric fields for sorting and filtering
            publishedDate: {
              type: "date"
            },

            viewCount: {
              type: "number",
              indexIntegers: true,
              indexDoubles: false
            },

            likeCount: {
              type: "number"
            },

            readingTime: {
              type: "number"
            },

            // Geospatial data
            "location.coordinates": {
              type: "geo"
            },

            // Vector field for semantic search
            contentEmbedding: {
              type: "knnVector",
              dimensions: 1536,
              similarity: "cosine"
            }
          }
        },

        // Custom analyzers
        analyzers: [
          {
            name: "content_analyzer",
            charFilters: [
              {
                type: "htmlStrip"
              },
              {
                type: "mapping",
                mappings: {
                  "&": "and",
                  "@": "at"
                }
              }
            ],
            tokenizer: {
              type: "standard"
            },
            tokenFilters: [
              {
                type: "lowercase"
              },
              {
                type: "stopword",
                tokens: ["the", "a", "an", "and", "or", "but"]
              },
              {
                type: "snowballStemming",
                stemmerName: "english"
              },
              {
                type: "length",
                min: 2,
                max: 100
              }
            ]
          },

          {
            name: "autocomplete_analyzer",
            tokenizer: {
              type: "edgeGram",
              minGrams: 1,
              maxGrams: 20
            },
            tokenFilters: [
              {
                type: "lowercase"
              }
            ]
          }
        ],

        // Synonym mappings
        synonyms: [
          {
            name: "tech_synonyms",
            source: {
              collection: "synonyms",
              analyzer: "lucene.standard"
            }
          }
        ],

        // Search configuration
        storedSource: {
          include: ["title", "author.name", "category", "publishedDate"],
          exclude: ["content", "internalNotes"]
        }
      }
    };

    try {
      // Create the search index
      const result = await this.db.collection(collection).createSearchIndex(advancedIndexDefinition);

      // Store index metadata
      this.searchIndexes.set(indexName, {
        collection: collection,
        indexName: indexName,
        definition: advancedIndexDefinition,
        createdAt: new Date(),
        status: 'creating'
      });

      console.log(`Search index '${indexName}' created for collection '${collection}'`);
      return result;

    } catch (error) {
      console.error(`Failed to create search index '${indexName}':`, error);
      throw error;
    }
  }

  async createAutoCompleteIndex(collection, fields, indexName = 'autocomplete_index') {
    // Create specialized index for auto-completion
    const autoCompleteIndex = {
      name: indexName,
      definition: {
        mappings: {
          dynamic: false,
          fields: fields.reduce((acc, field) => {
            acc[field.path] = {
              type: "autocomplete",
              analyzer: "autocomplete_analyzer",
              tokenization: "edgeGram",
              maxGrams: field.maxGrams || 15,
              minGrams: field.minGrams || 2,
              foldDiacritics: true
            };
            return acc;
          }, {})
        },
        analyzers: [
          {
            name: "autocomplete_analyzer",
            tokenizer: {
              type: "edgeGram",
              minGrams: 2,
              maxGrams: 15
            },
            tokenFilters: [
              {
                type: "lowercase"
              },
              {
                type: "diacriticFolding"
              }
            ]
          }
        ]
      }
    };

    return await this.db.collection(collection).createSearchIndex(autoCompleteIndex);
  }

  async performAdvancedSearch(collection, searchParams) {
    // Execute sophisticated search with multiple techniques
    const pipeline = [];

    // Build complex search stage
    const searchStage = {
      $search: {
        index: searchParams.index || 'default_search_index',
        compound: {
          must: [],
          should: [],
          filter: [],
          mustNot: []
        }
      }
    };

    // Text search with boosting
    if (searchParams.query) {
      searchStage.$search.compound.must.push({
        text: {
          query: searchParams.query,
          path: searchParams.searchFields || ['title', 'content'],
          fuzzy: searchParams.fuzzy || {
            maxEdits: 2,
            prefixLength: 1
          }
        }
      });

      // Boost title matches
      searchStage.$search.compound.should.push({
        text: {
          query: searchParams.query,
          path: 'title',
          score: { boost: { value: 3.0 } }
        }
      });

      // Phrase matching
      if (searchParams.phraseSearch) {
        searchStage.$search.compound.should.push({
          phrase: {
            query: searchParams.query,
            path: ['title', 'content'],
            slop: 2,
            score: { boost: { value: 2.0 } }
          }
        });
      }
    }

    // Vector search for semantic similarity
    if (searchParams.vectorQuery) {
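      // Note: assigning knnBeta here replaces the entire $search stage, so the compound
      // text clauses built above are discarded; the filter blocks below also assume the
      // compound form, so vector queries in this example should be issued without them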
      searchStage.$search = {
        knnBeta: {
          vector: searchParams.vectorQuery,
          path: "contentEmbedding",
          k: searchParams.vectorK || 50,
          score: {
            boost: {
              value: searchParams.vectorBoost || 1.5
            }
          }
        }
      };
    }

    // Filters
    if (searchParams.filters) {
      if (searchParams.filters.category) {
        searchStage.$search.compound.filter.push({
          text: {
            query: searchParams.filters.category,
            path: "category"
          }
        });
      }

      if (searchParams.filters.dateRange) {
        searchStage.$search.compound.filter.push({
          range: {
            path: "publishedDate",
            gte: new Date(searchParams.filters.dateRange.start),
            lte: new Date(searchParams.filters.dateRange.end)
          }
        });
      }

      if (searchParams.filters.author) {
        searchStage.$search.compound.filter.push({
          text: {
            query: searchParams.filters.author,
            path: "author.name"
          }
        });
      }

      if (searchParams.filters.minViewCount) {
        searchStage.$search.compound.filter.push({
          range: {
            path: "viewCount",
            gte: searchParams.filters.minViewCount
          }
        });
      }
    }

    // Highlighting
    if (searchParams.highlight !== false) {
      searchStage.$search.highlight = {
        path: searchParams.highlightFields || ['title', 'content'],
        maxCharsToExamine: 500000,
        maxNumPassages: 5
      };
    }

    // Count configuration
    if (searchParams.count) {
      searchStage.$search.count = {
        type: searchParams.count.type || 'total',
        threshold: searchParams.count.threshold || 1000
      };
    }

    pipeline.push(searchStage);

    // Add scoring and ranking
    pipeline.push({
      $addFields: {
        searchScore: { $meta: "searchScore" },
        searchHighlights: { $meta: "searchHighlights" },

        // Custom relevance scoring
        relevanceScore: {
          $add: [
            "$searchScore",
            // Boost recent content
            {
              $multiply: [
                {
                  $max: [
                    0,
                    {
                      $subtract: [
                        30,
                        {
                          $divide: [
                            { $subtract: [new Date(), "$publishedDate"] },
                            86400000
                          ]
                        }
                      ]
                    }
                  ]
                },
                0.1
              ]
            },
            // Boost popular content
            {
              $multiply: [
                { $log10: { $max: [1, "$viewCount"] } },
                0.2
              ]
            },
            // Boost quality content
            {
              $multiply: [
                { $min: [{ $divide: [{ $strLenCP: "$content" }, 1000] }, 3] },
                0.15
              ]
            }
          ]
        }
      }
    });

    // Faceted search results
    if (searchParams.facets) {
      pipeline.push({
        $facet: {
          results: [
            { $sort: { relevanceScore: -1 } },
            { $skip: searchParams.skip || 0 },
            { $limit: searchParams.limit || 20 },
            {
              $project: {
                _id: 1,
                title: 1,
                author: 1,
                category: 1,
                tags: 1,
                publishedDate: 1,
                viewCount: 1,
                likeCount: 1,
                searchScore: 1,
                relevanceScore: 1,
                searchHighlights: 1,
                snippet: { $substr: ["$content", 0, 250] },
                readingTime: 1
              }
            }
          ],

          facets: this.buildFacetPipeline(searchParams.facets),

          totalCount: [
            { $count: "total" }
          ]
        }
      });
    } else {
      // Simple results without faceting
      pipeline.push(
        { $sort: { relevanceScore: -1 } },
        { $skip: searchParams.skip || 0 },
        { $limit: searchParams.limit || 20 }
      );
    }

    // Execute search and track analytics
    const startTime = Date.now();
    const results = await this.db.collection(collection).aggregate(pipeline).toArray();
    const executionTime = Date.now() - startTime;

    // Log search analytics
    await this.logSearchAnalytics(searchParams, results, executionTime);

    return results;
  }

  buildFacetPipeline(facetConfig) {
    const facetPipeline = {};

    if (facetConfig.category) {
      facetPipeline.categories = [
        {
          $group: {
            _id: "$category",
            count: { $sum: 1 },
            avgScore: { $avg: "$searchScore" }
          }
        },
        { $sort: { count: -1 } },
        { $limit: 20 }
      ];
    }

    if (facetConfig.author) {
      facetPipeline.authors = [
        {
          $group: {
            _id: "$author.name",
            count: { $sum: 1 },
            avgScore: { $avg: "$searchScore" },
            expertise: { $first: "$author.expertise" }
          }
        },
        { $sort: { count: -1 } },
        { $limit: 15 }
      ];
    }

    if (facetConfig.tags) {
      facetPipeline.tags = [
        { $unwind: "$tags" },
        {
          $group: {
            _id: "$tags",
            count: { $sum: 1 },
            avgScore: { $avg: "$searchScore" }
          }
        },
        { $sort: { count: -1 } },
        { $limit: 25 }
      ];
    }

    if (facetConfig.dateRanges) {
      facetPipeline.dateRanges = [
        {
          $bucket: {
            groupBy: "$publishedDate",
            boundaries: [
              new Date("2020-01-01"),
              new Date("2022-01-01"),
              new Date("2023-01-01"),
              new Date("2024-01-01"),
              new Date("2025-01-01"),
              new Date("2030-01-01")
            ],
            default: "older",
            output: {
              count: { $sum: 1 },
              avgScore: { $avg: "$searchScore" }
            }
          }
        }
      ];
    }

    if (facetConfig.viewRanges) {
      facetPipeline.viewRanges = [
        {
          $bucket: {
            groupBy: "$viewCount",
            boundaries: [0, 100, 1000, 10000, 100000, 1000000],
            default: "very_popular",
            output: {
              count: { $sum: 1 },
              avgScore: { $avg: "$searchScore" }
            }
          }
        }
      ];
    }

    return facetPipeline;
  }

  async performAutoComplete(collection, query, field, limit = 10) {
    // Auto-completion search
    const pipeline = [
      {
        $search: {
          index: 'autocomplete_index',
          autocomplete: {
            query: query,
            path: field,
            tokenOrder: "sequential",
            fuzzy: {
              maxEdits: 1,
              prefixLength: 1
            }
          }
        }
      },
      {
        $group: {
          _id: `$${field}`,
          score: { $max: { $meta: "searchScore" } },
          count: { $sum: 1 }
        }
      },
      { $sort: { score: -1, count: -1 } },
      { $limit: limit },
      {
        $project: {
          suggestion: "$_id",
          score: 1,
          frequency: "$count",
          _id: 0
        }
      }
    ];

    return await this.db.collection(collection).aggregate(pipeline).toArray();
  }

  async performSemanticSearch(collection, queryVector, filters = {}, limit = 20) {
    // Vector-based semantic search
    const pipeline = [
      {
        $vectorSearch: {
          index: "vector_search_index",
          path: "contentEmbedding",
          queryVector: queryVector,
          numCandidates: limit * 10,
          limit: limit,
          filter: filters
        }
      },
      {
        $addFields: {
          vectorScore: { $meta: "vectorSearchScore" }
        }
      },
      {
        $project: {
          title: 1,
          content: { $substr: ["$content", 0, 200] },
          author: 1,
          category: 1,
          publishedDate: 1,
          vectorScore: 1,
          similarity: { $multiply: ["$vectorScore", 100] }
        }
      }
    ];

    return await this.db.collection(collection).aggregate(pipeline).toArray();
  }

  async createSearchSuggestions(collection, userQuery, suggestionTypes = ['spelling', 'query', 'category']) {
    // Generate search suggestions and corrections
    const suggestions = {
      spelling: [],
      queries: [],
      categories: [],
      authors: []
    };

    // Spelling suggestions using fuzzy search
    if (suggestionTypes.includes('spelling')) {
      const spellingPipeline = [
        {
          $search: {
            index: 'default_search_index',
            text: {
              query: userQuery,
              path: ['title', 'content'],
              fuzzy: {
                maxEdits: 2,
                prefixLength: 0
              }
            }
          }
        },
        { $limit: 5 },
        {
          $project: {
            title: 1,
            score: { $meta: "searchScore" }
          }
        }
      ];

      suggestions.spelling = await this.db.collection(collection).aggregate(spellingPipeline).toArray();
    }

    // Query suggestions from search history
    if (suggestionTypes.includes('query')) {
      suggestions.queries = await this.searchAnalytics.find({
        query: new RegExp(userQuery, 'i'),
        resultCount: { $gt: 0 }
      })
      .sort({ searchCount: -1 })
      .limit(5)
      .project({ query: 1, resultCount: 1 })
      .toArray();
    }

    // Category suggestions
    if (suggestionTypes.includes('category')) {
      const categoryPipeline = [
        {
          $search: {
            index: 'default_search_index',
            text: {
              query: userQuery,
              path: 'category'
            }
          }
        },
        {
          $group: {
            _id: "$category",
            count: { $sum: 1 },
            score: { $max: { $meta: "searchScore" } }
          }
        },
        { $sort: { score: -1, count: -1 } },
        { $limit: 5 }
      ];

      suggestions.categories = await this.db.collection(collection).aggregate(categoryPipeline).toArray();
    }

    return suggestions;
  }

  async logSearchAnalytics(searchParams, results, executionTime) {
    // Track search analytics for optimization
    const analyticsDoc = {
      query: searchParams.query,
      searchType: this.determineSearchType(searchParams),
      filters: searchParams.filters || {},
      resultCount: Array.isArray(results) ? results.length : 
                   (results[0] && results[0].totalCount ? results[0].totalCount[0]?.total : 0),
      executionTime: executionTime,
      timestamp: new Date(),

      // Search quality metrics
      avgScore: this.calculateAverageScore(results),
      scoreDistribution: this.analyzeScoreDistribution(results),

      // User experience metrics
      hasResults: (results && results.length > 0),
      fastResponse: executionTime < 500,

      // Technical metrics
      index: searchParams.index,
      facetsRequested: !!searchParams.facets,
      highlightRequested: searchParams.highlight !== false
    };

    await this.searchAnalytics.insertOne(analyticsDoc);

    // Update search frequency
    await this.searchAnalytics.updateOne(
      { 
        query: searchParams.query,
        searchType: analyticsDoc.searchType 
      },
      { 
        $inc: { searchCount: 1 },
        $set: { lastSearched: new Date() }
      },
      { upsert: true }
    );
  }

  determineSearchType(searchParams) {
    if (searchParams.vectorQuery) return 'vector';
    if (searchParams.phraseSearch) return 'phrase';
    if (searchParams.fuzzy) return 'fuzzy';
    return 'text';
  }

  calculateAverageScore(results) {
    if (!results || !results.length) return 0;

    const scores = results.map(r => r.searchScore || r.relevanceScore || 0);
    return scores.reduce((sum, score) => sum + score, 0) / scores.length;
  }

  analyzeScoreDistribution(results) {
    if (!results || !results.length) return {};

    const scores = results.map(r => r.searchScore || r.relevanceScore || 0);
    const distribution = {
      excellent: scores.filter(s => s >= 10).length,
      good: scores.filter(s => s >= 5 && s < 10).length,
      fair: scores.filter(s => s >= 2 && s < 5).length,
      poor: scores.filter(s => s < 2).length
    };

    return distribution;
  }

  async getSearchAnalytics(dateRange = {}, groupBy = 'day') {
    // Comprehensive search analytics
    const matchStage = {
      timestamp: {
        $gte: dateRange.start || new Date(Date.now() - 30 * 24 * 60 * 60 * 1000),
        $lte: dateRange.end || new Date()
      }
    };

    const pipeline = [
      { $match: matchStage },

      {
        $group: {
          _id: this.getGroupingExpression(groupBy),
          totalSearches: { $sum: 1 },
          uniqueQueries: { $addToSet: "$query" },
          avgExecutionTime: { $avg: "$executionTime" },
          avgResultCount: { $avg: "$resultCount" },
          successfulSearches: {
            $sum: { $cond: [{ $gt: ["$resultCount", 0] }, 1, 0] }
          },
          fastSearches: {
            $sum: { $cond: [{ $lt: ["$executionTime", 500] }, 1, 0] }
          },
          searchTypes: { $push: "$searchType" },
          popularQueries: { $push: "$query" }
        }
      },

      {
        $addFields: {
          uniqueQueryCount: { $size: "$uniqueQueries" },
          successRate: { $divide: ["$successfulSearches", "$totalSearches"] },
          performanceRate: { $divide: ["$fastSearches", "$totalSearches"] },
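          // Note: every query is pushed with count 1, so the sort by count below yields
          // a sample of queries rather than a true frequency ranking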
          topQueries: {
            $slice: [
              {
                $sortArray: {
                  input: {
                    $reduce: {
                      input: "$popularQueries",
                      initialValue: [],
                      in: {
                        $concatArrays: [
                          "$$value",
                          [{ query: "$$this", count: 1 }]
                        ]
                      }
                    }
                  },
                  sortBy: { count: -1 }
                }
              },
              10
            ]
          }
        }
      },

      { $sort: { _id: -1 } }
    ];

    return await this.searchAnalytics.aggregate(pipeline).toArray();
  }

  getGroupingExpression(groupBy) {
    const dateExpressions = {
      hour: {
        year: { $year: "$timestamp" },
        month: { $month: "$timestamp" },
        day: { $dayOfMonth: "$timestamp" },
        hour: { $hour: "$timestamp" }
      },
      day: {
        year: { $year: "$timestamp" },
        month: { $month: "$timestamp" },
        day: { $dayOfMonth: "$timestamp" }
      },
      week: {
        year: { $year: "$timestamp" },
        week: { $week: "$timestamp" }
      },
      month: {
        year: { $year: "$timestamp" },
        month: { $month: "$timestamp" }
      }
    };

    return dateExpressions[groupBy] || dateExpressions.day;
  }

  async optimizeSearchPerformance(collection, analysisRange = 30) {
    // Analyze and optimize search performance
    const analysisDate = new Date(Date.now() - analysisRange * 24 * 60 * 60 * 1000);

    const performanceAnalysis = await this.searchAnalytics.aggregate([
      { $match: { timestamp: { $gte: analysisDate } } },

      {
        $group: {
          _id: null,
          totalSearches: { $sum: 1 },
          avgExecutionTime: { $avg: "$executionTime" },
          slowSearches: {
            $sum: { $cond: [{ $gt: ["$executionTime", 2000] }, 1, 0] }
          },
          emptyResults: {
            $sum: { $cond: [{ $eq: ["$resultCount", 0] }, 1, 0] }
          },
          commonQueries: { $push: "$query" },
          slowQueries: {
            $push: {
              $cond: [
                { $gt: ["$executionTime", 1000] },
                { query: "$query", executionTime: "$executionTime" },
                null
              ]
            }
          }
        }
      }
    ]).toArray();

    const analysis = performanceAnalysis[0];
    const recommendations = [];

    // Performance recommendations
    if (analysis.avgExecutionTime > 1000) {
      recommendations.push({
        type: 'performance',
        issue: 'High average execution time',
        recommendation: 'Consider index optimization or query refinement',
        priority: 'high'
      });
    }

    if (analysis.slowSearches / analysis.totalSearches > 0.1) {
      recommendations.push({
        type: 'performance',
        issue: 'High percentage of slow searches',
        recommendation: 'Review index configuration and query complexity',
        priority: 'high'
      });
    }

    if (analysis.emptyResults / analysis.totalSearches > 0.3) {
      recommendations.push({
        type: 'relevance',
        issue: 'High percentage of searches with no results',
        recommendation: 'Improve fuzzy matching and synonyms configuration',
        priority: 'medium'
      });
    }

    return {
      analysis: analysis,
      recommendations: recommendations,
      generatedAt: new Date()
    };
  }
}
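
For completeness, a brief usage sketch of the manager above follows. The connection string, database, collection, and index names are assumptions, and the referenced search index is expected to already exist.

// Hypothetical usage of AtlasSearchManager (connection details are assumptions)
const { MongoClient } = require('mongodb');

async function runSearchExample() {
  const client = new MongoClient('mongodb+srv://cluster.mongodb.net');
  await client.connect();

  const searchManager = new AtlasSearchManager(client.db('content_platform'));

  // Faceted, filtered text search against an existing Atlas Search index
  const results = await searchManager.performAdvancedSearch('articles', {
    index: 'articles_search_index',
    query: 'machine learning algorithms',
    searchFields: ['title', 'content'],
    filters: { category: 'technology', minViewCount: 100 },
    facets: { category: true, tags: true },
    limit: 20
  });

  // With facets enabled, the pipeline returns a single document containing
  // results, facets, and totalCount arrays
  console.log(JSON.stringify(results[0]?.facets ?? {}, null, 2));

  await client.close();
}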

SQL-Style Search Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Atlas Search operations:

-- QueryLeaf Atlas Search operations with SQL-familiar syntax

-- Create full-text search index
CREATE SEARCH INDEX articles_search_idx ON articles (
  -- Text fields with different analyzers
  title WITH (analyzer='lucene.english', boost=3.0),
  content WITH (analyzer='content_analyzer', store=true),

  -- Faceted fields
  category AS FACET,
  "author.name" AS FACET,
  tags AS FACET,

  -- Numeric and date fields
  publishedDate AS DATE,
  viewCount AS NUMBER,
  likeCount AS NUMBER,

  -- Auto-completion fields
  title AS AUTOCOMPLETE WITH (maxGrams=15, minGrams=2),

  -- Vector field for semantic search
  contentEmbedding AS VECTOR WITH (dimensions=1536, similarity='cosine')
);

-- Advanced text search with ranking
SELECT 
  article_id,
  title,
  author,
  category,
  published_date,
  view_count,

  -- Search relevance scoring
  SEARCH_SCORE() as search_score,
  SEARCH_HIGHLIGHTS('title', 'content') as highlights,

  -- Custom relevance calculation
  (SEARCH_SCORE() + 
   LOG10(GREATEST(1, view_count)) * 0.2 +
   CASE 
     WHEN published_date >= CURRENT_DATE - INTERVAL '30 days' THEN 2.0
     WHEN published_date >= CURRENT_DATE - INTERVAL '90 days' THEN 1.0
     ELSE 0
   END) as final_score

FROM articles
WHERE SEARCH_TEXT('machine learning algorithms', 
  fields => ARRAY['title', 'content'],
  fuzzy => JSON_BUILD_OBJECT('maxEdits', 2, 'prefixLength', 1),
  boost => JSON_BUILD_OBJECT('title', 3.0, 'content', 1.0)
)
AND category IN ('technology', 'science', 'ai')
AND published_date >= '2023-01-01'
AND status != 'draft'

ORDER BY final_score DESC
LIMIT 20;

-- Faceted search with aggregations
WITH search_results AS (
  SELECT *,
    SEARCH_SCORE() as search_score,
    SEARCH_HIGHLIGHTS('title', 'content') as highlights
  FROM articles
  WHERE SEARCH_TEXT('artificial intelligence',
    fields => ARRAY['title', 'content'],
    synonyms => 'tech_synonyms'
  )
)
SELECT 
  -- Main results
  json_build_object(
    'results', json_agg(
      json_build_object(
        'article_id', article_id,
        'title', title,
        'author', author,
        'category', category,
        'search_score', search_score,
        'highlights', highlights
      ) ORDER BY search_score DESC LIMIT 20
    ),

    -- Category facets
    'categoryFacets', (
      SELECT json_agg(
        json_build_object(
          'category', category,
          'count', COUNT(*),
          'avgScore', AVG(search_score)
        )
      )
      FROM (
        SELECT category, search_score
        FROM search_results
        GROUP BY category, search_score
      ) cat_data
      GROUP BY category
      ORDER BY COUNT(*) DESC
    ),

    -- Author facets
    'authorFacets', (
      SELECT json_agg(
        json_build_object(
          'author', author->>'name',
          'count', COUNT(*),
          'expertise', author->>'expertise'
        )
      )
      FROM search_results
      GROUP BY author->>'name', author->>'expertise'
      ORDER BY COUNT(*) DESC
      LIMIT 10
    ),

    -- Search analytics
    'analytics', json_build_object(
      'totalResults', COUNT(*),
      'avgScore', AVG(search_score),
      'maxScore', MAX(search_score),
      'scoreDistribution', json_build_object(
        'excellent', COUNT(*) FILTER (WHERE search_score >= 10),
        'good', COUNT(*) FILTER (WHERE search_score >= 5 AND search_score < 10),
        'fair', COUNT(*) FILTER (WHERE search_score >= 2 AND search_score < 5),
        'poor', COUNT(*) FILTER (WHERE search_score < 2)
      )
    )
  )
FROM search_results;

-- Auto-completion search
SELECT 
  suggestion,
  score,
  frequency
FROM AUTOCOMPLETE_SEARCH('machine lear', 
  field => 'title',
  limit => 10,
  fuzzy => JSON_BUILD_OBJECT('maxEdits', 1)
)
ORDER BY score DESC, frequency DESC;

-- Semantic vector search
SELECT 
  article_id,
  title,
  author,
  category,
  published_date,
  VECTOR_SCORE() as similarity_score,
  ROUND(VECTOR_SCORE() * 100, 2) as similarity_percentage
FROM articles
WHERE VECTOR_SEARCH(@query_embedding,
  field => 'contentEmbedding',
  k => 20,
  filter => JSON_BUILD_OBJECT('category', ARRAY['technology', 'ai'])
)
ORDER BY similarity_score DESC;

-- Combined text and vector search (hybrid search)
WITH text_search AS (
  SELECT article_id, title, author, category, published_date,
    SEARCH_SCORE() as text_score,
    1 as search_type
  FROM articles
  WHERE SEARCH_TEXT('neural networks deep learning')
  ORDER BY SEARCH_SCORE() DESC
  LIMIT 50
),
vector_search AS (
  SELECT article_id, title, author, category, published_date,
    VECTOR_SCORE() as vector_score,
    2 as search_type
  FROM articles
  WHERE VECTOR_SEARCH(@neural_networks_embedding, field => 'contentEmbedding', k => 50)
),
combined_results AS (
  -- Combine and re-rank results
  SELECT 
    COALESCE(t.article_id, v.article_id) as article_id,
    COALESCE(t.title, v.title) as title,
    COALESCE(t.author, v.author) as author,
    COALESCE(t.category, v.category) as category,
    COALESCE(t.published_date, v.published_date) as published_date,

    -- Hybrid scoring
    COALESCE(t.text_score, 0) * 0.6 + COALESCE(v.vector_score, 0) * 0.4 as hybrid_score,

    CASE 
      WHEN t.article_id IS NOT NULL AND v.article_id IS NOT NULL THEN 'both'
      WHEN t.article_id IS NOT NULL THEN 'text_only'
      ELSE 'vector_only'
    END as match_type
  FROM text_search t
  FULL OUTER JOIN vector_search v ON t.article_id = v.article_id
)
SELECT * FROM combined_results
ORDER BY hybrid_score DESC, match_type = 'both' DESC
LIMIT 20;

-- Search with custom scoring and boosting
SELECT 
  article_id,
  title,
  author,
  category,
  published_date,
  view_count,
  like_count,

  -- Multi-factor scoring
  (
    SEARCH_SCORE() * 1.0 +                                    -- Base search relevance
    LOG10(GREATEST(1, view_count)) * 0.3 +                   -- Popularity boost
    LOG10(GREATEST(1, like_count)) * 0.2 +                   -- Engagement boost
    CASE 
      WHEN published_date >= CURRENT_DATE - INTERVAL '7 days' THEN 3.0
      WHEN published_date >= CURRENT_DATE - INTERVAL '30 days' THEN 2.0  
      WHEN published_date >= CURRENT_DATE - INTERVAL '90 days' THEN 1.0
      ELSE 0
    END +                                                     -- Recency boost
    CASE 
      WHEN LENGTH(content) >= 2000 THEN 1.5
      WHEN LENGTH(content) >= 1000 THEN 1.0
      ELSE 0.5
    END                                                       -- Content quality boost
  ) as comprehensive_score

FROM articles
WHERE SEARCH_COMPOUND(
  must => ARRAY[
    SEARCH_TEXT('blockchain cryptocurrency', fields => ARRAY['title', 'content'])
  ],
  should => ARRAY[
    SEARCH_TEXT('blockchain', field => 'title', boost => 3.0),
    SEARCH_PHRASE('blockchain technology', fields => ARRAY['title', 'content'], slop => 2)
  ],
  filter => ARRAY[
    SEARCH_RANGE('published_date', gte => '2022-01-01'),
    SEARCH_TERMS('category', values => ARRAY['technology', 'finance'])
  ],
  must_not => ARRAY[
    SEARCH_TERM('status', value => 'draft')
  ]
)
ORDER BY comprehensive_score DESC;

-- Search analytics and performance monitoring  
SELECT 
  DATE_TRUNC('day', search_timestamp) as search_date,
  search_query,
  COUNT(*) as search_count,
  AVG(execution_time_ms) as avg_execution_time,
  AVG(result_count) as avg_results,

  -- Performance metrics
  COUNT(*) FILTER (WHERE execution_time_ms < 500) as fast_searches,
  COUNT(*) FILTER (WHERE result_count > 0) as successful_searches,
  COUNT(*) FILTER (WHERE result_count = 0) as empty_searches,

  -- Search quality metrics
  AVG(CASE WHEN result_count > 0 THEN avg_search_score END) as avg_relevance,

  -- User behavior indicators
  COUNT(DISTINCT user_id) as unique_searchers,
  AVG(click_through_rate) as avg_ctr

FROM search_analytics
WHERE search_timestamp >= CURRENT_DATE - INTERVAL '30 days'
  AND search_query IS NOT NULL
GROUP BY DATE_TRUNC('day', search_timestamp), search_query
HAVING COUNT(*) >= 10  -- Only frequent searches
ORDER BY search_count DESC, avg_execution_time ASC;

-- Search optimization recommendations
WITH search_performance AS (
  SELECT 
    search_query,
    COUNT(*) as frequency,
    AVG(execution_time_ms) as avg_time,
    AVG(result_count) as avg_results,
    STDDEV(execution_time_ms) as time_variance
  FROM search_analytics
  WHERE search_timestamp >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY search_query
  HAVING COUNT(*) >= 5
),
optimization_analysis AS (
  SELECT *,
    CASE 
      WHEN avg_time > 2000 THEN 'slow_query'
      WHEN avg_results = 0 THEN 'no_results'
      WHEN avg_results < 5 THEN 'few_results'
      WHEN time_variance > avg_time THEN 'inconsistent_performance'
      ELSE 'optimal'
    END as performance_category,

    CASE 
      WHEN avg_time > 2000 THEN 'Add more specific indexes or optimize query complexity'
      WHEN avg_results = 0 THEN 'Improve fuzzy matching and synonym configuration'
      WHEN avg_results < 5 THEN 'Review relevance scoring and boost popular content'
      WHEN time_variance > avg_time THEN 'Investigate index fragmentation or resource contention'
      ELSE 'Query performing well'
    END as recommendation
  FROM search_performance
)
SELECT 
  search_query,
  frequency,
  ROUND(avg_time, 2) as avg_execution_time_ms,
  ROUND(avg_results, 1) as avg_result_count,
  performance_category,
  recommendation,

  -- Priority scoring
  CASE 
    WHEN performance_category = 'slow_query' AND frequency > 100 THEN 1
    WHEN performance_category = 'no_results' AND frequency > 50 THEN 2
    WHEN performance_category = 'inconsistent_performance' AND frequency > 75 THEN 3
    ELSE 4
  END as optimization_priority

FROM optimization_analysis
WHERE performance_category != 'optimal'
ORDER BY optimization_priority, frequency DESC;

-- QueryLeaf provides comprehensive Atlas Search capabilities:
-- 1. SQL-familiar search index creation and management
-- 2. Advanced text search with custom scoring and boosting
-- 3. Faceted search with aggregations and analytics
-- 4. Auto-completion and suggestion generation
-- 5. Vector search for semantic similarity
-- 6. Hybrid search combining text and vector approaches
-- 7. Search analytics and performance monitoring
-- 8. Automated optimization recommendations
-- 9. Real-time search index synchronization
-- 10. Integration with MongoDB's native Atlas Search features

Best Practices for Atlas Search Implementation

Search Index Optimization

Essential practices for optimal search performance:

  1. Index Design Strategy: Design indexes specifically for your search patterns and query types
  2. Field Analysis: Use appropriate analyzers for different content types and languages
  3. Relevance Tuning: Implement custom scoring with business logic and user behavior
  4. Performance Monitoring: Track search analytics and optimize based on real usage patterns
  5. Faceting Strategy: Design facets to support filtering and discovery workflows
  6. Auto-completion Design: Implement sophisticated suggestion systems for user experience

Search Quality and Relevance

Optimize search quality through comprehensive relevance engineering:

  1. Multi-factor Scoring: Combine text relevance with business metrics and user behavior
  2. Semantic Enhancement: Use synonyms and vector search for better understanding
  3. Query Understanding: Implement fuzzy matching and error correction
  4. Content Quality: Factor content quality metrics into relevance scoring
  5. Personalization: Incorporate user preferences and search history
  6. A/B Testing: Continuously test and optimize search relevance algorithms
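
None of the examples above show how competing relevance configurations might be compared in practice. The sketch below is a small, hypothetical illustration of the A/B testing item: it deterministically buckets users into scoring variants and records the variant alongside search outcomes so variant performance can be compared later. The collection and field names are assumptions and are not part of the AtlasSearchManager shown earlier.

// Hypothetical A/B bucketing for relevance experiments (names are assumptions)
const crypto = require('crypto');

function assignScoringVariant(userId) {
  // Deterministic bucketing: the same user always lands in the same variant
  const hash = crypto.createHash('md5').update(String(userId)).digest();
  return hash[0] % 2 === 0 ? 'control' : 'candidate';
}

async function logVariantOutcome(db, userId, query, resultCount, clicked) {
  // Record which scoring variant served the search so success rates and
  // click-through rates can be compared per variant over time
  await db.collection('search_experiments').insertOne({
    userId,
    query,
    variant: assignScoringVariant(userId),
    resultCount,
    clicked,
    timestamp: new Date()
  });
}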

Conclusion

MongoDB Atlas Search provides enterprise-grade search capabilities that eliminate the complexity of external search engines while delivering sophisticated full-text search, semantic understanding, and search analytics. The integration of advanced search features with familiar SQL syntax makes implementing modern search applications both powerful and accessible.

Key Atlas Search benefits include:

  • Native Integration: Built-in search without external dependencies or synchronization
  • Advanced Relevance: Sophisticated scoring with custom business logic
  • Real-time Updates: Automatic search index synchronization with data changes
  • Comprehensive Analytics: Built-in search performance and user behavior tracking
  • Scalable Architecture: Enterprise-grade performance with horizontal scaling
  • Developer Friendly: Familiar query syntax with powerful search capabilities

Whether you're building e-commerce search, content discovery platforms, knowledge bases, or applications requiring sophisticated text analysis, MongoDB Atlas Search with QueryLeaf's familiar SQL interface provides the foundation for modern search experiences. This combination enables you to implement advanced search capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Atlas Search operations while providing SQL-familiar search index creation, query syntax, and analytics. Advanced search features, relevance tuning, and performance optimization are seamlessly handled through familiar SQL patterns, making enterprise-grade search both powerful and accessible.

The integration of native search capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both sophisticated search functionality and familiar database interaction patterns, ensuring your search solutions remain both effective and maintainable as they scale and evolve.

MongoDB Connection Pooling and Performance Optimization: SQL-Style Database Connection Management for High-Throughput Applications

Modern applications require efficient database connection management to handle hundreds or thousands of concurrent users while maintaining optimal performance and resource utilization. Traditional approaches of creating individual database connections for each request quickly exhaust system resources and create performance bottlenecks that severely impact application scalability and user experience.

MongoDB connection pooling provides sophisticated connection management that maintains a pool of persistent database connections, automatically handling connection lifecycle, load balancing, failover scenarios, and performance optimization. Unlike simple connection-per-request models, connection pooling delivers predictable performance characteristics, efficient resource utilization, and robust error handling for production-scale applications.

The Database Connection Challenge

Traditional database connection approaches create significant scalability and performance issues:

// Traditional per-request connection approach - inefficient and unscalable
// Each request creates a new database connection
public class TraditionalDatabaseAccess {
    public List<User> getUsers(String filter) throws SQLException {
        // Create new connection for each request
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/appdb",
            "username", "password"
        );

        try {
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT user_id, username, email, created_at " +
                "FROM users WHERE status = ? ORDER BY created_at DESC LIMIT 100"
            );
            stmt.setString(1, filter);

            ResultSet rs = stmt.executeQuery();
            List<User> users = new ArrayList<>();

            while (rs.next()) {
                users.add(new User(
                    rs.getInt("user_id"),
                    rs.getString("username"), 
                    rs.getString("email"),
                    rs.getTimestamp("created_at")
                ));
            }

            return users;

        } finally {
            // Close connection after each request
            conn.close(); // Expensive cleanup for each request
        }
    }
}

// Problems with per-request connections:
// 1. Connection establishment overhead (100-500ms per connection)
// 2. Resource exhaustion under high concurrent load
// 3. Database server connection limits exceeded quickly
// 4. TCP socket exhaustion on application servers
// 5. Unpredictable performance due to connection timing
// 6. No connection reuse or optimization
// 7. Difficult to implement failover and retry logic
// 8. Memory leaks from improperly closed connections

// Basic connection pooling attempt - still problematic
public class BasicConnectionPool {
    private static final int MAX_CONNECTIONS = 100;
    private Queue<Connection> availableConnections = new LinkedList<>();
    private Set<Connection> usedConnections = new HashSet<>();

    public Connection getConnection() throws SQLException {
        synchronized (this) {
            if (availableConnections.isEmpty()) {
                if (usedConnections.size() < MAX_CONNECTIONS) {
                    availableConnections.add(createNewConnection());
                } else {
                    throw new SQLException("Connection pool exhausted");
                }
            }

            Connection conn = availableConnections.poll();
            usedConnections.add(conn);
            return conn;
        }
    }

    public void releaseConnection(Connection conn) {
        synchronized (this) {
            usedConnections.remove(conn);
            availableConnections.offer(conn);
        }
    }
}

// Problems with basic pooling:
// - No connection validation or health checking
// - No automatic recovery from stale connections
// - Poor load balancing across multiple database servers
// - No monitoring or performance metrics
// - Synchronization bottlenecks under high concurrency
// - No graceful handling of connection failures
// - Fixed pool size regardless of actual demand
// - No integration with application lifecycle management

MongoDB connection pooling with sophisticated management provides comprehensive solutions:

// MongoDB advanced connection pooling - production-ready performance optimization
const { MongoClient, ServerApiVersion } = require('mongodb');

// Advanced connection pool configuration
const mongoClient = new MongoClient('mongodb://localhost:27017/production_db', {
  // Connection pool settings
  maxPoolSize: 100,          // Maximum connections in pool
  minPoolSize: 5,           // Minimum connections to maintain
  maxIdleTimeMS: 300000,    // Close connections after 5 minutes idle

  // Performance optimization
  maxConnecting: 2,         // Max concurrent connection attempts
  connectTimeoutMS: 10000,  // 10 second connection timeout
  socketTimeoutMS: 30000,   // 30 second socket timeout

  // High availability settings
  serverSelectionTimeoutMS: 5000,  // Server selection timeout
  heartbeatFrequencyMS: 10000,     // Health check frequency
  retryWrites: true,               // Automatic retry for write operations
  retryReads: true,                // Automatic retry for read operations

  // Read preference for load balancing
  readPreference: 'secondaryPreferred',
  readConcern: { level: 'majority' },

  // Write concern for durability
  writeConcern: { 
    w: 'majority', 
    j: true,
    wtimeoutMS: 10000 
  },

  // Compression for network efficiency
  compressors: ['zstd', 'zlib', 'snappy'],

  // Server API version for compatibility
  serverApi: {
    version: ServerApiVersion.v1,
    strict: true,
    deprecationErrors: true
  }
});

// Efficient database operations with connection pooling
async function getUsersWithPooling(filter, limit = 100) {
  try {
    // Connection automatically obtained from pool
    const db = mongoClient.db('production_db');
    const users = await db.collection('users').find({
      status: filter,
      deletedAt: { $exists: false }
    })
    .sort({ createdAt: -1 })
    .limit(limit)
    .toArray();

    return {
      users: users,
      count: users.length,
      requestTime: new Date(),
      // Connection automatically returned to pool
      poolStats: await getConnectionPoolStats()
    };

  } catch (error) {
    console.error('Database operation failed:', error);
    // Connection pool handles error recovery automatically
    throw error;
  }
  // No explicit connection cleanup needed - pool manages lifecycle
}

// Benefits of MongoDB connection pooling:
// - Automatic connection lifecycle management
// - Optimal resource utilization with min/max pool sizing
// - Built-in health monitoring and connection validation
// - Automatic failover and recovery handling  
// - Load balancing across replica set members
// - Intelligent connection reuse and optimization
// - Performance monitoring and metrics collection
// - Thread-safe operations without synchronization overhead
// - Graceful handling of network interruptions and timeouts
// - Integration with MongoDB driver performance features
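
The pooled example above calls getConnectionPoolStats(), which is not a driver API. Pool utilization is typically derived from the driver's CMAP monitoring events, so a minimal sketch of that helper, building on the mongoClient created earlier, might look like the following; the counter names are assumptions.

// Minimal sketch of the getConnectionPoolStats() helper referenced above,
// implemented by counting CMAP connection pool events emitted by the client
const poolStats = { created: 0, closed: 0, checkedOut: 0 };

mongoClient.on('connectionCreated', () => { poolStats.created++; });
mongoClient.on('connectionClosed', () => { poolStats.closed++; });
mongoClient.on('connectionCheckedOut', () => { poolStats.checkedOut++; });
mongoClient.on('connectionCheckedIn', () => { poolStats.checkedOut--; });

async function getConnectionPoolStats() {
  return {
    openConnections: poolStats.created - poolStats.closed,
    connectionsInUse: poolStats.checkedOut,
    capturedAt: new Date()
  };
}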

Understanding MongoDB Connection Pool Management

Advanced Connection Pool Configuration and Monitoring

Implement sophisticated connection pool management for production environments:

// Comprehensive connection pool management system
class MongoConnectionPoolManager {
  constructor(config = {}) {
    this.config = {
      // Connection pool configuration
      maxPoolSize: config.maxPoolSize || 100,
      minPoolSize: config.minPoolSize || 5,
      maxIdleTimeMS: config.maxIdleTimeMS || 300000,
      maxConnecting: config.maxConnecting || 2,

      // Performance settings
      connectTimeoutMS: config.connectTimeoutMS || 10000,
      socketTimeoutMS: config.socketTimeoutMS || 30000,
      serverSelectionTimeoutMS: config.serverSelectionTimeoutMS || 5000,
      heartbeatFrequencyMS: config.heartbeatFrequencyMS || 10000,

      // High availability
      retryWrites: config.retryWrites !== false,
      retryReads: config.retryReads !== false,
      readPreference: config.readPreference || 'secondaryPreferred',

      // Monitoring
      enableMonitoring: config.enableMonitoring !== false,
      monitoringInterval: config.monitoringInterval || 30000,

      ...config
    };

    this.clients = new Map();
    this.poolMetrics = new Map();
    this.monitoringInterval = null;
    this.eventListeners = new Map();
  }

  async createClient(connectionString, databaseName, clientOptions = {}) {
    const clientConfig = {
      maxPoolSize: this.config.maxPoolSize,
      minPoolSize: this.config.minPoolSize,
      maxIdleTimeMS: this.config.maxIdleTimeMS,
      maxConnecting: this.config.maxConnecting,
      connectTimeoutMS: this.config.connectTimeoutMS,
      socketTimeoutMS: this.config.socketTimeoutMS,
      serverSelectionTimeoutMS: this.config.serverSelectionTimeoutMS,
      heartbeatFrequencyMS: this.config.heartbeatFrequencyMS,
      retryWrites: this.config.retryWrites,
      retryReads: this.config.retryReads,
      readPreference: this.config.readPreference,
      readConcern: { level: 'majority' },
      writeConcern: { 
        w: 'majority', 
        j: true,
        wtimeoutMS: 10000 
      },
      compressors: ['zstd', 'zlib', 'snappy'],
      serverApi: {
        version: ServerApiVersion.v1,
        strict: true,
        deprecationErrors: true
      },
      monitorCommands: true,  // required for the commandStarted/Succeeded/Failed listeners below
      ...clientOptions
    };

    const client = new MongoClient(connectionString, clientConfig);

    // Set up connection pool event monitoring
    this.setupPoolEventListeners(client, databaseName);

    // Connect and validate
    await client.connect();

    // Store client reference
    this.clients.set(databaseName, {
      client: client,
      db: client.db(databaseName),
      connectionString: connectionString,
      config: clientConfig,
      createdAt: new Date(),
      lastUsed: new Date(),
      operationCount: 0,
      errorCount: 0
    });

    // Initialize metrics tracking
    this.poolMetrics.set(databaseName, {
      connectionsCreated: 0,
      connectionsDestroyed: 0,
      operationsExecuted: 0,
      operationErrors: 0,
      avgOperationTime: 0,
      poolSizeHistory: [],
      errorHistory: [],
      performanceMetrics: {
        p50ResponseTime: 0,
        p95ResponseTime: 0,
        p99ResponseTime: 0,
        errorRate: 0
      }
    });

    console.log(`MongoDB client created for database: ${databaseName}`);

    if (this.config.enableMonitoring) {
      this.startMonitoring();
    }

    return this.clients.get(databaseName);
  }

  setupPoolEventListeners(client, databaseName) {
    // Connection pool created
    client.on('connectionPoolCreated', (event) => {
      console.log(`Connection pool created for ${databaseName}:`, {
        address: event.address,
        options: event.options
      });

      if (this.poolMetrics.has(databaseName)) {
        this.poolMetrics.get(databaseName).poolCreatedAt = new Date();
      }
    });

    // Connection created
    client.on('connectionCreated', (event) => {
      console.log(`New connection created for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.connectionsCreated++;
      }
    });

    // Connection ready
    client.on('connectionReady', (event) => {
      console.log(`Connection ready for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });
    });

    // Connection closed
    client.on('connectionClosed', (event) => {
      console.log(`Connection closed for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address,
        reason: event.reason
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.connectionsDestroyed++;
      }
    });

    // Connection check out started
    client.on('connectionCheckOutStarted', (event) => {
      // Track connection pool usage patterns
      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.checkoutStartTime = Date.now();
      }
    });

    // Connection checked out
    client.on('connectionCheckedOut', (event) => {
      console.log(`Connection checked out for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics && metrics.checkoutStartTime) {
        const checkoutTime = Date.now() - metrics.checkoutStartTime;
        metrics.avgCheckoutTime = (metrics.avgCheckoutTime || 0) * 0.9 + checkoutTime * 0.1;
      }
    });

    // Connection checked in
    client.on('connectionCheckedIn', (event) => {
      console.log(`Connection checked in for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });
    });

    // Connection pool cleared
    client.on('connectionPoolCleared', (event) => {
      console.warn(`Connection pool cleared for ${databaseName}:`, {
        address: event.address,
        interruptInUseConnections: event.interruptInUseConnections
      });
    });

    // Server selection events
    client.on('serverOpening', (event) => {
      console.log(`Server opening for ${databaseName}:`, {
        address: event.address,
        topologyId: event.topologyId
      });
    });

    client.on('serverClosed', (event) => {
      console.log(`Server closed for ${databaseName}:`, {
        address: event.address,
        topologyId: event.topologyId
      });
    });

    client.on('serverDescriptionChanged', (event) => {
      console.log(`Server description changed for ${databaseName}:`, {
        address: event.address,
        newDescription: event.newDescription.type,
        previousDescription: event.previousDescription.type
      });
    });

    // Topology events
    client.on('topologyOpening', (event) => {
      console.log(`Topology opening for ${databaseName}:`, {
        topologyId: event.topologyId
      });
    });

    client.on('topologyClosed', (event) => {
      console.log(`Topology closed for ${databaseName}:`, {
        topologyId: event.topologyId
      });
    });

    client.on('topologyDescriptionChanged', (event) => {
      console.log(`Topology description changed for ${databaseName}:`, {
        topologyId: event.topologyId,
        newDescription: event.newDescription.type,
        previousDescription: event.previousDescription.type
      });
    });

    // Command monitoring for performance tracking
    client.on('commandStarted', (event) => {
      const clientInfo = this.clients.get(databaseName);
      if (clientInfo) {
        clientInfo.lastCommandStart = Date.now();
        clientInfo.lastCommand = {
          commandName: event.commandName,
          requestId: event.requestId,
          databaseName: event.databaseName
        };
      }
    });

    client.on('commandSucceeded', (event) => {
      const clientInfo = this.clients.get(databaseName);
      const metrics = this.poolMetrics.get(databaseName);

      if (clientInfo && metrics) {
        const duration = event.duration || (Date.now() - clientInfo.lastCommandStart);

        // Update operation metrics
        metrics.operationsExecuted++;
        metrics.avgOperationTime = (metrics.avgOperationTime * 0.95) + (duration * 0.05);

        clientInfo.operationCount++;
        clientInfo.lastUsed = new Date();

        // Track performance percentiles (simplified)
        this.updatePerformanceMetrics(databaseName, duration, true);
      }
    });

    client.on('commandFailed', (event) => {
      console.error(`Command failed for ${databaseName}:`, {
        commandName: event.commandName,
        failure: event.failure.message,
        duration: event.duration
      });

      const clientInfo = this.clients.get(databaseName);
      const metrics = this.poolMetrics.get(databaseName);

      if (clientInfo && metrics) {
        clientInfo.errorCount++;
        metrics.operationErrors++;

        metrics.errorHistory.push({
          timestamp: new Date(),
          command: event.commandName,
          error: event.failure.message,
          duration: event.duration
        });

        // Keep only recent error history
        if (metrics.errorHistory.length > 100) {
          metrics.errorHistory.shift();
        }

        this.updatePerformanceMetrics(databaseName, event.duration, false);
      }
    });
  }

  updatePerformanceMetrics(databaseName, duration, success) {
    const metrics = this.poolMetrics.get(databaseName);
    if (!metrics) return;

    // Simple sliding window for performance metrics
    if (!metrics.responseTimeWindow) {
      metrics.responseTimeWindow = [];
    }

    metrics.responseTimeWindow.push({
      timestamp: Date.now(),
      duration: duration,
      success: success
    });

    // Keep only last 1000 operations
    if (metrics.responseTimeWindow.length > 1000) {
      metrics.responseTimeWindow.shift();
    }

    // Calculate percentiles (simplified)
    const successfulOperations = metrics.responseTimeWindow
      .filter(op => op.success)
      .map(op => op.duration)
      .sort((a, b) => a - b);

    if (successfulOperations.length > 0) {
      const p50Index = Math.floor(successfulOperations.length * 0.5);
      const p95Index = Math.floor(successfulOperations.length * 0.95);
      const p99Index = Math.floor(successfulOperations.length * 0.99);

      metrics.performanceMetrics.p50ResponseTime = successfulOperations[p50Index] || 0;
      metrics.performanceMetrics.p95ResponseTime = successfulOperations[p95Index] || 0;
      metrics.performanceMetrics.p99ResponseTime = successfulOperations[p99Index] || 0;
    }

    // Calculate error rate
    const recentOperations = metrics.responseTimeWindow.filter(
      op => Date.now() - op.timestamp < 300000 // Last 5 minutes
    );

    if (recentOperations.length > 0) {
      const errorCount = recentOperations.filter(op => !op.success).length;
      metrics.performanceMetrics.errorRate = (errorCount / recentOperations.length) * 100;
    }
  }

  async getClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client found for database: ${databaseName}`);
    }

    // Check client health
    try {
      await clientInfo.client.db('admin').admin().ping();
      clientInfo.lastUsed = new Date();
      return clientInfo;
    } catch (error) {
      console.error(`Client health check failed for ${databaseName}:`, error);

      // Attempt to reconnect
      try {
        await this.reconnectClient(databaseName);
        return this.clients.get(databaseName);
      } catch (reconnectError) {
        console.error(`Reconnection failed for ${databaseName}:`, reconnectError);
        throw reconnectError;
      }
    }
  }

  async reconnectClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client configuration found for database: ${databaseName}`);
    }

    console.log(`Attempting to reconnect client for ${databaseName}...`);

    try {
      // Close existing client
      await clientInfo.client.close();
    } catch (closeError) {
      console.warn(`Error closing existing client: ${closeError.message}`);
    }

    // Create new client with existing configuration
    await this.createClient(
      clientInfo.connectionString,
      databaseName,
      clientInfo.config
    );

    console.log(`Successfully reconnected client for ${databaseName}`);
  }

  async executeWithPool(databaseName, operation) {
    const startTime = Date.now();
    let success = true;

    try {
      const clientInfo = await this.getClient(databaseName);
      const result = await operation(clientInfo.db, clientInfo.client);

      return result;

    } catch (error) {
      success = false;
      console.error(`Operation failed for ${databaseName}:`, error);
      throw error;

    } finally {
      const duration = Date.now() - startTime;
      this.updatePerformanceMetrics(databaseName, duration, success);
    }
  }

  async getConnectionPoolStats(databaseName) {
    if (!databaseName) {
      // Return stats for all databases
      const allStats = {};
      for (const [dbName, clientInfo] of this.clients.entries()) {
        allStats[dbName] = await this.getSingleDatabaseStats(dbName, clientInfo);
      }
      return allStats;
    }

    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client found for database: ${databaseName}`);
    }

    return await this.getSingleDatabaseStats(databaseName, clientInfo);
  }

  async getSingleDatabaseStats(databaseName, clientInfo) {
    const metrics = this.poolMetrics.get(databaseName);

    try {
      // Get current server status
      const serverStatus = await clientInfo.client.db('admin').admin().serverStatus();
      const connectionPoolStats = serverStatus.connections || {};

      return {
        database: databaseName,

        // Basic connection info
        connectionString: clientInfo.connectionString.replace(/\/\/.*@/, '//***@'), // Hide credentials
        createdAt: clientInfo.createdAt,
        lastUsed: clientInfo.lastUsed,

        // Pool configuration
        poolConfig: {
          maxPoolSize: this.config.maxPoolSize,
          minPoolSize: this.config.minPoolSize,
          maxIdleTimeMS: this.config.maxIdleTimeMS,
          maxConnecting: this.config.maxConnecting
        },

        // Current pool status
        poolStatus: {
          current: connectionPoolStats.current || 0,
          available: connectionPoolStats.available || 0,
          active: connectionPoolStats.active || 0,
          totalCreated: connectionPoolStats.totalCreated || 0
        },

        // Operation metrics
        operations: {
          totalOperations: clientInfo.operationCount,
          totalErrors: clientInfo.errorCount,
          errorRate: clientInfo.operationCount > 0 ? 
            ((clientInfo.errorCount / clientInfo.operationCount) * 100).toFixed(2) + '%' : '0%'
        },

        // Performance metrics
        performance: metrics ? {
          avgOperationTime: Math.round(metrics.avgOperationTime || 0) + 'ms',
          p50ResponseTime: Math.round(metrics.performanceMetrics.p50ResponseTime || 0) + 'ms',
          p95ResponseTime: Math.round(metrics.performanceMetrics.p95ResponseTime || 0) + 'ms',
          p99ResponseTime: Math.round(metrics.performanceMetrics.p99ResponseTime || 0) + 'ms',
          currentErrorRate: (metrics.performanceMetrics.errorRate || 0).toFixed(2) + '%',
          avgCheckoutTime: Math.round(metrics.avgCheckoutTime || 0) + 'ms'
        } : null,

        // Historical data
        history: metrics ? {
          connectionsCreated: metrics.connectionsCreated,
          connectionsDestroyed: metrics.connectionsDestroyed,
          operationsExecuted: metrics.operationsExecuted,
          operationErrors: metrics.operationErrors,
          recentErrors: metrics.errorHistory.slice(-5) // Last 5 errors
        } : null,

        // Health assessment
        health: this.assessConnectionHealth(clientInfo, metrics, connectionPoolStats),

        statsGeneratedAt: new Date()
      };

    } catch (error) {
      console.error(`Error getting stats for ${databaseName}:`, error);

      return {
        database: databaseName,
        error: error.message,
        lastKnownGoodStats: {
          createdAt: clientInfo.createdAt,
          lastUsed: clientInfo.lastUsed,
          operationCount: clientInfo.operationCount,
          errorCount: clientInfo.errorCount
        },
        statsGeneratedAt: new Date()
      };
    }
  }

  assessConnectionHealth(clientInfo, metrics, connectionPoolStats) {
    const health = {
      overall: 'healthy',
      issues: [],
      recommendations: []
    };

    // Check error rate
    if (clientInfo.operationCount > 0) {
      const errorRate = (clientInfo.errorCount / clientInfo.operationCount) * 100;
      if (errorRate > 10) {
        health.issues.push(`High error rate: ${errorRate.toFixed(2)}%`);
        health.overall = 'unhealthy';
      } else if (errorRate > 5) {
        health.issues.push(`Elevated error rate: ${errorRate.toFixed(2)}%`);
        health.overall = 'warning';
      }
    }

    // Check connection pool utilization (serverStatus reports server-wide
    // connection counts, so treat this as an approximation of this client's pool usage)
    const poolUtilization = connectionPoolStats.current / this.config.maxPoolSize;
    if (poolUtilization > 0.9) {
      health.issues.push(`High pool utilization: ${(poolUtilization * 100).toFixed(1)}%`);
      health.recommendations.push('Consider increasing maxPoolSize');
      if (health.overall === 'healthy') health.overall = 'warning';
    }

    // Check average response time
    if (metrics && metrics.avgOperationTime > 5000) {
      health.issues.push(`High average response time: ${metrics.avgOperationTime.toFixed(0)}ms`);
      health.recommendations.push('Investigate query performance and indexing');
      if (health.overall === 'healthy') health.overall = 'warning';
    }

    // Check recent errors
    if (metrics && metrics.errorHistory.length > 0) {
      const recentErrors = metrics.errorHistory.filter(
        error => Date.now() - error.timestamp.getTime() < 300000 // Last 5 minutes
      );

      if (recentErrors.length > 5) {
        health.issues.push(`Multiple recent errors: ${recentErrors.length} in last 5 minutes`);
        health.recommendations.push('Check application logs and network connectivity');
        health.overall = 'unhealthy';
      }
    }

    // Check last usage
    const timeSinceLastUse = Date.now() - clientInfo.lastUsed.getTime();
    if (timeSinceLastUse > 3600000) { // 1 hour
      health.issues.push(`Client unused for ${Math.round(timeSinceLastUse / 60000)} minutes`);
      health.recommendations.push('Consider closing idle connections');
    }

    return health;
  }

  startMonitoring() {
    if (this.monitoringInterval) {
      return; // Already monitoring
    }

    console.log(`Starting connection pool monitoring (interval: ${this.config.monitoringInterval}ms)`);

    this.monitoringInterval = setInterval(async () => {
      try {
        await this.performMonitoringCheck();
      } catch (error) {
        console.error('Monitoring check failed:', error);
      }
    }, this.config.monitoringInterval);
  }

  async performMonitoringCheck() {
    for (const [databaseName, clientInfo] of this.clients.entries()) {
      try {
        const stats = await this.getSingleDatabaseStats(databaseName, clientInfo);

        // Log health issues
        if (stats.health && stats.health.overall !== 'healthy') {
          console.warn(`Health check for ${databaseName}:`, {
            status: stats.health.overall,
            issues: stats.health.issues,
            recommendations: stats.health.recommendations
          });
        }

        // Store historical pool size data
        const metrics = this.poolMetrics.get(databaseName);
        if (metrics && stats.poolStatus) {
          metrics.poolSizeHistory.push({
            timestamp: new Date(),
            current: stats.poolStatus.current,
            available: stats.poolStatus.available,
            active: stats.poolStatus.active
          });

          // Keep only last 24 hours of pool size history
          const oneDayAgo = Date.now() - 24 * 60 * 60 * 1000;
          metrics.poolSizeHistory = metrics.poolSizeHistory.filter(
            entry => entry.timestamp.getTime() > oneDayAgo
          );
        }

        // Emit monitoring event if listeners are registered
        if (this.eventListeners.has('monitoring_check')) {
          this.eventListeners.get('monitoring_check').forEach(listener => {
            listener(databaseName, stats);
          });
        }

      } catch (error) {
        console.error(`Monitoring check failed for ${databaseName}:`, error);
      }
    }
  }

  stopMonitoring() {
    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
      console.log('Connection pool monitoring stopped');
    }
  }

  addEventListener(eventName, listener) {
    if (!this.eventListeners.has(eventName)) {
      this.eventListeners.set(eventName, []);
    }
    this.eventListeners.get(eventName).push(listener);
  }

  removeEventListener(eventName, listener) {
    if (this.eventListeners.has(eventName)) {
      const listeners = this.eventListeners.get(eventName);
      const index = listeners.indexOf(listener);
      if (index > -1) {
        listeners.splice(index, 1);
      }
    }
  }

  async closeClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      console.warn(`No client found for database: ${databaseName}`);
      return;
    }

    try {
      await clientInfo.client.close();
      this.clients.delete(databaseName);
      this.poolMetrics.delete(databaseName);
      console.log(`Client closed for database: ${databaseName}`);
    } catch (error) {
      console.error(`Error closing client for ${databaseName}:`, error);
    }
  }

  async closeAllClients() {
    const closePromises = [];

    for (const databaseName of this.clients.keys()) {
      closePromises.push(this.closeClient(databaseName));
    }

    this.stopMonitoring();

    await Promise.all(closePromises);
    console.log('All MongoDB clients closed');
  }
}
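
A short usage sketch of the manager above; the connection string, database, and collection names are placeholder assumptions, not part of the original example:

// Illustrative usage of MongoConnectionPoolManager with placeholder names.
const poolManager = new MongoConnectionPoolManager({
  maxPoolSize: 50,
  minPoolSize: 5,
  enableMonitoring: true
});

async function runPooledWorkload() {
  await poolManager.createClient('mongodb://localhost:27017', 'production_db');

  // Execute an operation through the managed pool (metrics are tracked automatically)
  const activeUsers = await poolManager.executeWithPool('production_db', async (db) => {
    return db.collection('users').find({ status: 'active' }).limit(10).toArray();
  });
  console.log(`Fetched ${activeUsers.length} active users`);

  // Inspect pool statistics and the built-in health assessment
  const stats = await poolManager.getConnectionPoolStats('production_db');
  console.log('Pool health:', stats.health ? stats.health.overall : 'unknown');

  await poolManager.closeAllClients();
}

runPooledWorkload().catch(console.error);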

High-Performance Connection Pool Patterns

Implement specialized connection pool patterns for different application scenarios:

// Specialized connection pool patterns for different use cases
class SpecializedConnectionPools {
  constructor() {
    this.poolManager = new MongoConnectionPoolManager();
    this.pools = new Map();
  }

  async createReadWritePools(config) {
    // Separate connection pools for read and write operations
    const writePoolConfig = {
      maxPoolSize: config.writeMaxPool || 50,
      minPoolSize: config.writeMinPool || 5,
      readPreference: 'primary',
      writeConcern: { w: 'majority', j: true },
      readConcern: { level: 'majority' },
      retryWrites: true,
      heartbeatFrequencyMS: 5000
    };

    const readPoolConfig = {
      maxPoolSize: config.readMaxPool || 100,
      minPoolSize: config.readMinPool || 10,
      readPreference: 'secondaryPreferred',
      readConcern: { level: 'available' }, // Faster reads
      retryReads: true,
      heartbeatFrequencyMS: 10000,
      maxIdleTimeMS: 600000 // Keep read connections longer
    };

    // Create separate clients for read and write
    const writeClient = await this.poolManager.createClient(
      config.connectionString,
      `${config.databaseName}_write`,
      writePoolConfig
    );

    const readClient = await this.poolManager.createClient(
      config.connectionString,
      `${config.databaseName}_read`,
      readPoolConfig
    );

    this.pools.set(`${config.databaseName}_readwrite`, {
      writeClient: writeClient,
      readClient: readClient,
      createdAt: new Date()
    });

    return {
      writeClient: writeClient,
      readClient: readClient,

      // Convenience methods
      executeWrite: (operation) => this.poolManager.executeWithPool(
        `${config.databaseName}_write`, operation
      ),
      executeRead: (operation) => this.poolManager.executeWithPool(
        `${config.databaseName}_read`, operation
      )
    };
  }

  async createTenantAwarePools(tenantConfigs) {
    // Multi-tenant connection pooling with per-tenant isolation
    const tenantPools = new Map();

    for (const tenantConfig of tenantConfigs) {
      const tenantId = tenantConfig.tenantId;
      const poolConfig = {
        maxPoolSize: tenantConfig.maxPool || 20,
        minPoolSize: tenantConfig.minPool || 2,
        maxIdleTimeMS: tenantConfig.idleTimeout || 300000,

        // Tenant-specific settings
        appName: `app_tenant_${tenantId}`,
        authSource: tenantConfig.authDatabase || 'admin',

        // Resource limits per tenant
        serverSelectionTimeoutMS: 5000,
        connectTimeoutMS: 10000
      };

      const client = await this.poolManager.createClient(
        tenantConfig.connectionString,
        `tenant_${tenantId}`,
        poolConfig
      );

      tenantPools.set(tenantId, {
        client: client,
        config: tenantConfig,
        createdAt: new Date(),
        lastUsed: new Date(),
        operationCount: 0
      });
    }

    this.pools.set('tenant_pools', tenantPools);

    return {
      executeForTenant: async (tenantId, operation) => {
        const tenantPool = tenantPools.get(tenantId);
        if (!tenantPool) {
          throw new Error(`No pool configured for tenant: ${tenantId}`);
        }

        tenantPool.lastUsed = new Date();
        tenantPool.operationCount++;

        return await this.poolManager.executeWithPool(
          `tenant_${tenantId}`,
          operation
        );
      },

      getTenantStats: async (tenantId) => {
        if (tenantId) {
          return await this.poolManager.getConnectionPoolStats(`tenant_${tenantId}`);
        } else {
          // Return stats for all tenants
          const allStats = {};
          for (const [tId, poolInfo] of tenantPools.entries()) {
            allStats[tId] = await this.poolManager.getConnectionPoolStats(`tenant_${tId}`);
          }
          return allStats;
        }
      }
    };
  }

  async createGeographicPools(regionConfigs) {
    // Geographic connection pools for global applications
    const regionPools = new Map();

    for (const regionConfig of regionConfigs) {
      const region = regionConfig.region;
      const poolConfig = {
        maxPoolSize: regionConfig.maxPool || 75,
        minPoolSize: regionConfig.minPool || 10,

        // Region-specific optimizations
        connectTimeoutMS: regionConfig.connectTimeout || 15000,
        serverSelectionTimeoutMS: regionConfig.selectionTimeout || 10000,
        heartbeatFrequencyMS: regionConfig.heartbeatFreq || 10000,

        // Compression for long-distance connections
        compressors: ['zstd', 'zlib'],

        // Read preference based on region
        readPreference: regionConfig.readPreference || 'nearest',

        appName: `app_${region}`
      };

      const client = await this.poolManager.createClient(
        regionConfig.connectionString,
        `region_${region}`,
        poolConfig
      );

      regionPools.set(region, {
        client: client,
        config: regionConfig,
        createdAt: new Date(),
        lastUsed: new Date(),
        latencyMetrics: {
          avgLatency: 0,
          minLatency: Number.MAX_VALUE,
          maxLatency: 0,
          measurements: []
        }
      });
    }

    this.pools.set('region_pools', regionPools);

    return {
      executeInRegion: async (region, operation) => {
        const regionPool = regionPools.get(region);
        if (!regionPool) {
          throw new Error(`No pool configured for region: ${region}`);
        }

        const startTime = Date.now();

        try {
          const result = await this.poolManager.executeWithPool(
            `region_${region}`,
            operation
          );

          // Track latency metrics
          const latency = Date.now() - startTime;
          this.updateRegionLatencyMetrics(region, latency);

          regionPool.lastUsed = new Date();
          return result;

        } catch (error) {
          const latency = Date.now() - startTime;
          this.updateRegionLatencyMetrics(region, latency);
          throw error;
        }
      },

      selectOptimalRegion: async (preferredRegions = []) => {
        // Select region with best performance characteristics
        let bestRegion = null;
        let bestScore = -1;

        for (const region of preferredRegions.length > 0 ? preferredRegions : regionPools.keys()) {
          const regionPool = regionPools.get(region);
          if (!regionPool) continue;

          const stats = await this.poolManager.getConnectionPoolStats(`region_${region}`);
          const latencyMetrics = regionPool.latencyMetrics;

          // Calculate performance score (lower latency + higher availability)
          let score = 100;
          score -= Math.min(latencyMetrics.avgLatency / 10, 50); // Latency penalty
          score -= (parseFloat(stats.operations.errorRate) || 0); // Error rate penalty

          if (stats.health.overall === 'unhealthy') score -= 30;
          else if (stats.health.overall === 'warning') score -= 15;

          if (score > bestScore) {
            bestScore = score;
            bestRegion = region;
          }
        }

        return {
          region: bestRegion,
          score: bestScore,
          metrics: bestRegion ? regionPools.get(bestRegion).latencyMetrics : null
        };
      }
    };
  }

  updateRegionLatencyMetrics(region, latency) {
    const regionPools = this.pools.get('region_pools');
    const regionPool = regionPools?.get(region);

    if (regionPool) {
      const metrics = regionPool.latencyMetrics;

      // Update latency statistics
      metrics.measurements.push({
        timestamp: Date.now(),
        latency: latency
      });

      // Keep only recent measurements (last 1000)
      if (metrics.measurements.length > 1000) {
        metrics.measurements.shift();
      }

      // Calculate running averages
      const recentMeasurements = metrics.measurements.slice(-100); // Last 100 measurements
      metrics.avgLatency = recentMeasurements.reduce((sum, m) => sum + m.latency, 0) / recentMeasurements.length;
      metrics.minLatency = Math.min(metrics.minLatency, latency);
      metrics.maxLatency = Math.max(metrics.maxLatency, latency);
    }
  }

  async createPriorityPools(priorityConfig) {
    // Priority-based connection pooling for different service levels
    const priorityLevels = ['critical', 'high', 'normal', 'low'];
    const priorityPools = new Map();

    for (const priority of priorityLevels) {
      const config = priorityConfig[priority] || {};
      const poolConfig = {
        maxPoolSize: config.maxPool || this.getDefaultPoolSize(priority),
        minPoolSize: config.minPool || this.getDefaultMinPool(priority),
        maxIdleTimeMS: config.idleTimeout || this.getDefaultIdleTimeout(priority),

        // Priority-specific timeouts
        connectTimeoutMS: config.connectTimeout || this.getDefaultConnectTimeout(priority),
        socketTimeoutMS: config.socketTimeout || this.getDefaultSocketTimeout(priority),
        serverSelectionTimeoutMS: config.selectionTimeout || this.getDefaultSelectionTimeout(priority),

        // Quality of service settings
        readConcern: { level: priority === 'critical' ? 'majority' : 'available' },
        writeConcern: priority === 'critical' ? 
          { w: 'majority', j: true, wtimeout: 10000 } : 
          { w: 1, wtimeout: 5000 },

        appName: `app_priority_${priority}`
      };

      const client = await this.poolManager.createClient(
        priorityConfig.connectionString,
        `priority_${priority}`,
        poolConfig
      );

      priorityPools.set(priority, {
        client: client,
        priority: priority,
        config: poolConfig,
        createdAt: new Date(),
        queuedOperations: 0,
        completedOperations: 0,
        rejectedOperations: 0
      });
    }

    this.pools.set('priority_pools', priorityPools);

    return {
      executeWithPriority: async (priority, operation, options = {}) => {
        const priorityPool = priorityPools.get(priority);
        if (!priorityPool) {
          throw new Error(`No pool configured for priority: ${priority}`);
        }

        // Check if pool is overloaded and priority allows rejection
        if (this.shouldRejectLowPriorityOperation(priority, priorityPool)) {
          priorityPool.rejectedOperations++;
          throw new Error(`Operation rejected due to high load (priority: ${priority})`);
        }

        priorityPool.queuedOperations++;

        try {
          const result = await this.poolManager.executeWithPool(
            `priority_${priority}`,
            operation
          );

          priorityPool.completedOperations++;
          priorityPool.queuedOperations--;

          return result;

        } catch (error) {
          priorityPool.queuedOperations--;
          throw error;
        }
      },

      getPriorityStats: async () => {
        const stats = {};

        for (const [priority, poolInfo] of priorityPools.entries()) {
          const poolStats = await this.poolManager.getConnectionPoolStats(`priority_${priority}`);

          stats[priority] = {
            ...poolStats,
            queuedOperations: poolInfo.queuedOperations,
            completedOperations: poolInfo.completedOperations,
            rejectedOperations: poolInfo.rejectedOperations,
            rejectionRate: poolInfo.completedOperations > 0 ? 
              ((poolInfo.rejectedOperations / (poolInfo.completedOperations + poolInfo.rejectedOperations)) * 100).toFixed(2) + '%' : 
              '0%'
          };
        }

        return stats;
      },

      adjustPriorityLimits: async (priority, newLimits) => {
        // Dynamic adjustment of priority pool limits
        const priorityPool = priorityPools.get(priority);
        if (priorityPool) {
          // This would require reconnecting with new pool settings
          console.log(`Adjusting limits for priority ${priority}:`, newLimits);
          // Implementation would depend on specific requirements
        }
      }
    };
  }

  getDefaultPoolSize(priority) {
    const sizes = {
      critical: 100,
      high: 75,
      normal: 50,
      low: 25
    };
    return sizes[priority] || 50;
  }

  getDefaultMinPool(priority) {
    const sizes = {
      critical: 10,
      high: 8,
      normal: 5,
      low: 2
    };
    return sizes[priority] || 5;
  }

  getDefaultIdleTimeout(priority) {
    const timeouts = {
      critical: 60000,   // 1 minute
      high: 120000,      // 2 minutes  
      normal: 300000,    // 5 minutes
      low: 600000        // 10 minutes
    };
    return timeouts[priority] || 300000;
  }

  getDefaultConnectTimeout(priority) {
    const timeouts = {
      critical: 5000,   // 5 seconds
      high: 8000,       // 8 seconds
      normal: 10000,    // 10 seconds
      low: 15000        // 15 seconds
    };
    return timeouts[priority] || 10000;
  }

  getDefaultSocketTimeout(priority) {
    const timeouts = {
      critical: 10000,  // 10 seconds
      high: 20000,      // 20 seconds
      normal: 30000,    // 30 seconds
      low: 60000        // 60 seconds
    };
    return timeouts[priority] || 30000;
  }

  getDefaultSelectionTimeout(priority) {
    const timeouts = {
      critical: 3000,   // 3 seconds
      high: 5000,       // 5 seconds
      normal: 8000,     // 8 seconds
      low: 15000        // 15 seconds
    };
    return timeouts[priority] || 8000;
  }

  shouldRejectLowPriorityOperation(priority, priorityPool) {
    // Simple load-based rejection for low priority operations
    if (priority === 'low' && priorityPool.queuedOperations > 10) {
      return true;
    }

    if (priority === 'normal' && priorityPool.queuedOperations > 25) {
      return true;
    }

    return false;
  }

  async createBatchProcessingPool(config) {
    // Specialized pool for batch processing operations
    const batchPoolConfig = {
      maxPoolSize: config.maxPool || 200,
      minPoolSize: config.minPool || 20,
      maxConnecting: config.maxConnecting || 10,

      // Longer timeouts for batch operations
      connectTimeoutMS: config.connectTimeout || 30000,
      socketTimeoutMS: config.socketTimeout || 120000,
      serverSelectionTimeoutMS: config.selectionTimeout || 15000,

      // Optimized for bulk operations
      maxIdleTimeMS: config.idleTimeout || 1800000, // 30 minutes
      heartbeatFrequencyMS: config.heartbeatFreq || 30000,

      // Bulk operation settings
      readPreference: 'secondary',
      readConcern: { level: 'available' },
      writeConcern: { w: 1 }, // Faster writes for bulk operations

      // Compression for large data transfers
      compressors: ['zstd', 'zlib'],

      appName: 'batch_processor'
    };

    const client = await this.poolManager.createClient(
      config.connectionString,
      'batch_processing',
      batchPoolConfig
    );

    this.pools.set('batch_pool', {
      client: client,
      config: batchPoolConfig,
      createdAt: new Date(),
      batchesProcessed: 0,
      documentsProcessed: 0,
      avgBatchSize: 0
    });

    return {
      executeBatch: async (operation, batchSize = 1000) => {
        const batchPool = this.pools.get('batch_pool');
        const startTime = Date.now();

        try {
          const result = await this.poolManager.executeWithPool(
            'batch_processing',
            async (db, client) => {
              // Configure batch operation settings
              const options = {
                ordered: false,        // Allow partial success
                bypassDocumentValidation: true, // Skip validation for performance
                writeConcern: { w: 1 } // Fast acknowledgment
              };

              return await operation(db, client, options);
            }
          );

          // Update batch statistics
          batchPool.batchesProcessed++;
          batchPool.documentsProcessed += batchSize;
          batchPool.avgBatchSize = (batchPool.avgBatchSize * 0.9) + (batchSize * 0.1);

          const duration = Date.now() - startTime;
          console.log(`Batch operation completed: ${batchSize} documents in ${duration}ms`);

          return result;

        } catch (error) {
          console.error('Batch operation failed:', error);
          throw error;
        }
      },

      getBatchStats: async () => {
        const poolStats = await this.poolManager.getConnectionPoolStats('batch_processing');
        const batchPool = this.pools.get('batch_pool');

        return {
          ...poolStats,
          batchStatistics: {
            batchesProcessed: batchPool.batchesProcessed,
            documentsProcessed: batchPool.documentsProcessed,
            avgBatchSize: Math.round(batchPool.avgBatchSize),
            documentsPerBatch: batchPool.batchesProcessed > 0 ? 
              Math.round(batchPool.documentsProcessed / batchPool.batchesProcessed) : 0
          }
        };
      }
    };
  }

  async getOverallStats() {
    // Get comprehensive statistics across all pool types
    const overallStats = {
      pools: {},
      summary: {
        totalPools: 0,
        totalConnections: 0,
        totalOperations: 0,
        overallHealth: 'healthy',
        generatedAt: new Date()
      }
    };

    // Get stats from pool manager
    const allPoolStats = await this.poolManager.getConnectionPoolStats();

    for (const [poolName, stats] of Object.entries(allPoolStats)) {
      overallStats.pools[poolName] = stats;
      overallStats.summary.totalPools++;
      overallStats.summary.totalConnections += stats.poolStatus?.current || 0;
      overallStats.summary.totalOperations += stats.operations?.totalOperations || 0;

      // Aggregate health status
      if (stats.health?.overall === 'unhealthy') {
        overallStats.summary.overallHealth = 'unhealthy';
      } else if (stats.health?.overall === 'warning' && overallStats.summary.overallHealth !== 'unhealthy') {
        overallStats.summary.overallHealth = 'warning';
      }
    }

    return overallStats;
  }

  async shutdown() {
    // Graceful shutdown of all pools
    console.log('Shutting down all connection pools...');

    await this.poolManager.closeAllClients();
    this.pools.clear();

    console.log('All connection pools shut down successfully');
  }
}
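
A sketch of how the read/write split above might be used; the connection string and collections are placeholders, and the callbacks use the client handle to address the shared application database, since the pool keys (`production_db_read` / `production_db_write`) only identify the pools:

// Illustrative usage of the read/write pool split with placeholder names.
const specializedPools = new SpecializedConnectionPools();

async function runSplitWorkload() {
  const { executeRead, executeWrite } = await specializedPools.createReadWritePools({
    connectionString: 'mongodb://localhost:27017',
    databaseName: 'production_db',
    readMaxPool: 100,
    writeMaxPool: 50
  });

  // Reads go through the secondary-preferred pool
  const recentOrders = await executeRead(async (db, client) => {
    // The pool key only names the pool; target the shared database via the client
    return client.db('production_db').collection('orders')
      .find({ status: 'completed' })
      .sort({ createdAt: -1 })
      .limit(20)
      .toArray();
  });

  // Writes go through the primary-only pool with majority write concern
  await executeWrite(async (db, client) => {
    return client.db('production_db').collection('audit_log').insertOne({
      event: 'orders_report_generated',
      orderCount: recentOrders.length,
      generatedAt: new Date()
    });
  });

  await specializedPools.shutdown();
}

runSplitWorkload().catch(console.error);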

SQL-Style Connection Pool Management with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB connection pool configuration and monitoring:

-- QueryLeaf connection pool management with SQL-familiar syntax

-- Configure connection pool settings
SET CONNECTION_POOL_OPTIONS = JSON_BUILD_OBJECT(
  'maxPoolSize', 100,
  'minPoolSize', 5,
  'maxIdleTimeMS', 300000,
  'maxConnecting', 2,
  'connectTimeoutMS', 10000,
  'socketTimeoutMS', 30000,
  'serverSelectionTimeoutMS', 5000,
  'heartbeatFrequencyMS', 10000,
  'retryWrites', true,
  'retryReads', true,
  'readPreference', 'secondaryPreferred',
  'writeConcern', JSON_BUILD_OBJECT('w', 'majority', 'j', true),
  'compressors', ARRAY['zstd', 'zlib', 'snappy']
);

-- Create specialized connection pools for different workloads
CREATE CONNECTION_POOL read_pool 
WITH (
  maxPoolSize = 150,
  minPoolSize = 10,
  readPreference = 'secondaryPreferred',
  readConcern = JSON_BUILD_OBJECT('level', 'available'),
  maxIdleTimeMS = 600000
);

CREATE CONNECTION_POOL write_pool
WITH (
  maxPoolSize = 75,
  minPoolSize = 5,
  readPreference = 'primary',
  writeConcern = JSON_BUILD_OBJECT('w', 'majority', 'j', true),
  retryWrites = true,
  maxIdleTimeMS = 300000
);

CREATE CONNECTION_POOL batch_pool
WITH (
  maxPoolSize = 200,
  minPoolSize = 20,
  maxConnecting = 10,
  socketTimeoutMS = 120000,
  maxIdleTimeMS = 1800000,
  compressors = ARRAY['zstd', 'zlib']
);

-- Monitor connection pool performance and health
SELECT 
  CONNECTION_POOL_NAME() as pool_name,
  CONNECTION_POOL_MAX_SIZE() as max_connections,
  CONNECTION_POOL_CURRENT_SIZE() as current_connections,
  CONNECTION_POOL_AVAILABLE() as available_connections,
  CONNECTION_POOL_ACTIVE() as active_connections,

  -- Utilization metrics
  ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct,
  ROUND((CONNECTION_POOL_ACTIVE()::float / CONNECTION_POOL_CURRENT_SIZE()) * 100, 2) as connection_active_pct,

  -- Performance metrics
  CONNECTION_POOL_TOTAL_CREATED() as total_connections_created,
  CONNECTION_POOL_TOTAL_DESTROYED() as total_connections_destroyed,
  CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms,
  CONNECTION_POOL_OPERATION_COUNT() as total_operations,
  CONNECTION_POOL_ERROR_COUNT() as total_errors,
  ROUND((CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0)) * 100, 2) as error_rate_pct,

  -- Health assessment
  CASE 
    WHEN CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0) > 0.1 THEN 'UNHEALTHY'
    WHEN CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE() > 0.9 THEN 'WARNING'
    WHEN CONNECTION_POOL_AVG_CHECKOUT_TIME() > 1000 THEN 'WARNING'
    ELSE 'HEALTHY'
  END as health_status,

  CONNECTION_POOL_LAST_USED() as last_used,
  CONNECTION_POOL_CREATED_AT() as created_at

FROM CONNECTION_POOLS()
ORDER BY pool_utilization_pct DESC;

-- High-performance database operations using connection pools
-- Read operations using read pool
SELECT @read_pool := USE_CONNECTION_POOL('read_pool');

WITH user_analytics AS (
  SELECT 
    u.user_id,
    u.username,
    u.email,
    u.created_at,
    u.last_login,
    u.subscription_type,

    -- Calculate user engagement metrics
    COUNT(a.activity_id) as total_activities,
    MAX(a.activity_date) as last_activity,
    AVG(a.session_duration) as avg_session_duration,
    SUM(a.page_views) as total_page_views,

    -- User value calculation  
    COUNT(DISTINCT o.order_id) as total_orders,
    SUM(o.order_total) as lifetime_value,
    AVG(o.order_total) as avg_order_value

  FROM users u
  LEFT JOIN user_activities a ON u.user_id = a.user_id 
    AND a.activity_date >= CURRENT_DATE - INTERVAL '90 days'
  LEFT JOIN orders o ON u.user_id = o.customer_id
    AND o.status = 'completed'

  WHERE u.status = 'active'
    AND u.created_at >= CURRENT_DATE - INTERVAL '1 year'

  GROUP BY u.user_id, u.username, u.email, u.created_at, u.last_login, u.subscription_type
)
SELECT 
  user_id,
  username,
  email,
  subscription_type,
  total_activities,
  last_activity,
  ROUND(avg_session_duration / 60, 2) as avg_session_minutes,
  total_page_views,
  total_orders,
  COALESCE(lifetime_value, 0) as lifetime_value,
  ROUND(COALESCE(avg_order_value, 0), 2) as avg_order_value,

  -- User segmentation
  CASE 
    WHEN total_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP'
    WHEN total_orders >= 5 AND lifetime_value >= 500 THEN 'LOYAL'  
    WHEN total_orders >= 1 THEN 'CUSTOMER'
    WHEN total_activities >= 10 THEN 'ENGAGED'
    ELSE 'NEW'
  END as user_segment,

  -- Engagement score
  ROUND(
    (COALESCE(total_activities, 0) * 0.3) + 
    (COALESCE(total_orders, 0) * 0.4) + 
    (LEAST(COALESCE(total_page_views, 0) / 100, 10) * 0.3), 
    2
  ) as engagement_score

FROM user_analytics
WHERE total_activities > 0 OR total_orders > 0
ORDER BY engagement_score DESC, lifetime_value DESC
LIMIT 1000;

-- Write operations using write pool
SELECT @write_pool := USE_CONNECTION_POOL('write_pool');

-- Bulk insert with optimized connection pool
INSERT INTO user_events (
  user_id,
  event_type,
  event_data,
  session_id,
  timestamp,
  ip_address,
  user_agent
)
SELECT 
  user_session.user_id,
  event_batch.event_type,
  event_batch.event_data,
  user_session.session_id,
  event_batch.timestamp,
  user_session.ip_address,
  user_session.user_agent
FROM UNNEST(@event_batch_array) AS event_batch(event_type, event_data, timestamp, user_id)
JOIN user_sessions user_session ON event_batch.user_id = user_session.user_id
WHERE user_session.is_active = true
  AND event_batch.timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour';

-- Batch processing using specialized batch pool
SELECT @batch_pool := USE_CONNECTION_POOL('batch_pool');

-- Process large dataset with batch operations  
WITH batch_processing AS (
  UPDATE user_statistics 
  SET 
    monthly_page_views = monthly_stats.page_views,
    monthly_session_time = monthly_stats.session_time,
    monthly_orders = monthly_stats.orders,
    monthly_revenue = monthly_stats.revenue,
    last_calculated = CURRENT_TIMESTAMP,
    calculation_version = calculation_version + 1
  FROM (
    SELECT 
      u.user_id,
      COUNT(a.activity_id) as page_views,
      SUM(a.session_duration) as session_time,
      COUNT(DISTINCT o.order_id) as orders,
      SUM(o.order_total) as revenue
    FROM users u
    LEFT JOIN user_activities a ON u.user_id = a.user_id 
      AND a.activity_date >= DATE_TRUNC('month', CURRENT_DATE)
    LEFT JOIN orders o ON u.user_id = o.customer_id 
      AND o.order_date >= DATE_TRUNC('month', CURRENT_DATE)
      AND o.status = 'completed'
    WHERE u.status = 'active'
    GROUP BY u.user_id
  ) AS monthly_stats
  WHERE user_statistics.user_id = monthly_stats.user_id
  RETURNING user_statistics.user_id, user_statistics.monthly_revenue
)
SELECT 
  'batch_update_completed' as operation_type,
  COUNT(*) as users_updated,
  SUM(monthly_revenue) as total_monthly_revenue,
  AVG(monthly_revenue) as avg_monthly_revenue,
  MIN(monthly_revenue) as min_monthly_revenue,
  MAX(monthly_revenue) as max_monthly_revenue,
  CURRENT_TIMESTAMP as completed_at
FROM batch_processing;

-- Connection pool performance analysis and optimization
WITH pool_performance_analysis AS (
  SELECT 
    pool_name,
    current_connections,
    max_connections,
    active_connections,
    available_connections,
    total_operations,
    total_errors,
    avg_checkout_time_ms,
    error_rate_pct,
    pool_utilization_pct,

    -- Performance indicators
    CASE 
      WHEN pool_utilization_pct > 90 THEN 'HIGH_UTILIZATION'
      WHEN pool_utilization_pct < 20 THEN 'UNDERUTILIZED'  
      ELSE 'OPTIMAL_UTILIZATION'
    END as utilization_status,

    CASE
      WHEN error_rate_pct > 5 THEN 'HIGH_ERROR_RATE'
      WHEN error_rate_pct > 1 THEN 'ELEVATED_ERROR_RATE'
      ELSE 'NORMAL_ERROR_RATE'  
    END as error_status,

    CASE 
      WHEN avg_checkout_time_ms > 1000 THEN 'SLOW_CHECKOUT'
      WHEN avg_checkout_time_ms > 500 THEN 'MODERATE_CHECKOUT'
      ELSE 'FAST_CHECKOUT'
    END as checkout_performance,

    -- Optimization recommendations
    CASE 
      WHEN pool_utilization_pct > 90 AND error_rate_pct > 2 THEN 'INCREASE_POOL_SIZE'
      WHEN pool_utilization_pct < 20 AND total_operations < 100 THEN 'DECREASE_POOL_SIZE'
      WHEN avg_checkout_time_ms > 1000 THEN 'CHECK_CONNECTION_HEALTH'
      WHEN error_rate_pct > 5 THEN 'INVESTIGATE_CONNECTION_ERRORS'
      ELSE 'POOL_OPTIMALLY_CONFIGURED'
    END as optimization_recommendation

  FROM (
    SELECT 
      CONNECTION_POOL_NAME() as pool_name,
      CONNECTION_POOL_CURRENT_SIZE() as current_connections,
      CONNECTION_POOL_MAX_SIZE() as max_connections,
      CONNECTION_POOL_ACTIVE() as active_connections,
      CONNECTION_POOL_AVAILABLE() as available_connections,
      CONNECTION_POOL_OPERATION_COUNT() as total_operations,
      CONNECTION_POOL_ERROR_COUNT() as total_errors,
      CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms,
      ROUND((CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0)) * 100, 2) as error_rate_pct,
      ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct
    FROM CONNECTION_POOLS()
  ) pool_metrics
)
SELECT 
  pool_name,
  current_connections || '/' || max_connections as pool_size,
  pool_utilization_pct || '%' as utilization,
  active_connections || ' active' as activity,
  available_connections || ' available' as availability,
  total_operations || ' ops' as operations,
  error_rate_pct || '%' as error_rate,
  avg_checkout_time_ms || 'ms' as checkout_time,

  -- Status indicators
  utilization_status,
  error_status, 
  checkout_performance,
  optimization_recommendation,

  -- Priority scoring for optimization efforts
  CASE 
    WHEN optimization_recommendation = 'INCREASE_POOL_SIZE' THEN 1
    WHEN optimization_recommendation = 'INVESTIGATE_CONNECTION_ERRORS' THEN 2
    WHEN optimization_recommendation = 'CHECK_CONNECTION_HEALTH' THEN 3
    WHEN optimization_recommendation = 'DECREASE_POOL_SIZE' THEN 4
    ELSE 5
  END as optimization_priority

FROM pool_performance_analysis
ORDER BY optimization_priority, error_rate_pct DESC, pool_utilization_pct DESC;

-- Real-time connection pool monitoring and alerting
SELECT 
  pool_name,
  health_status,
  current_connections,
  pool_utilization_pct,
  error_rate_pct,
  avg_checkout_time_ms,

  -- Generate alerts based on thresholds
  CASE 
    WHEN error_rate_pct > 10 THEN 
      'CRITICAL: High error rate (' || error_rate_pct || '%) - immediate investigation required'
    WHEN pool_utilization_pct > 95 THEN 
      'CRITICAL: Pool exhaustion (' || pool_utilization_pct || '%) - increase pool size immediately'
    WHEN avg_checkout_time_ms > 2000 THEN 
      'WARNING: Slow connection checkout (' || avg_checkout_time_ms || 'ms) - check connection health'
    WHEN error_rate_pct > 5 THEN 
      'WARNING: Elevated error rate (' || error_rate_pct || '%) - monitor closely'
    WHEN pool_utilization_pct > 85 THEN 
      'WARNING: High pool utilization (' || pool_utilization_pct || '%) - consider scaling'
    ELSE 'INFO: Pool operating normally'
  END as alert_message,

  CASE 
    WHEN error_rate_pct > 10 OR pool_utilization_pct > 95 THEN 'CRITICAL'
    WHEN error_rate_pct > 5 OR pool_utilization_pct > 85 OR avg_checkout_time_ms > 2000 THEN 'WARNING'
    ELSE 'INFO'
  END as alert_severity,

  CURRENT_TIMESTAMP as alert_timestamp

FROM (
  SELECT 
    CONNECTION_POOL_NAME() as pool_name,
    CASE 
      WHEN CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0) > 0.1 THEN 'UNHEALTHY'
      WHEN CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE() > 0.9 THEN 'WARNING'
      WHEN CONNECTION_POOL_AVG_CHECKOUT_TIME() > 1000 THEN 'WARNING'
      ELSE 'HEALTHY'
    END as health_status,
    CONNECTION_POOL_CURRENT_SIZE() as current_connections,
    ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct,
    ROUND((CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0)) * 100, 2) as error_rate_pct,
    CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms
  FROM CONNECTION_POOLS()
) pool_health_check
WHERE error_rate_pct > 5 OR pool_utilization_pct > 85 OR avg_checkout_time_ms > 2000  -- Only show alerts that need attention (matches the WARNING and CRITICAL thresholds)
ORDER BY 
  CASE alert_severity 
    WHEN 'CRITICAL' THEN 1
    WHEN 'WARNING' THEN 2
    ELSE 3
  END,
  error_rate_pct DESC,
  pool_utilization_pct DESC;

-- QueryLeaf provides comprehensive connection pool management:
-- 1. SQL-familiar connection pool configuration and creation
-- 2. Automatic connection lifecycle management and optimization  
-- 3. Built-in performance monitoring and health assessment
-- 4. Specialized pools for different workload patterns (read/write/batch)
-- 5. Real-time alerting and anomaly detection
-- 6. Load balancing and failover handling
-- 7. Resource utilization optimization and auto-scaling recommendations
-- 8. Integration with MongoDB driver performance features  
-- 9. Connection pool statistics and performance analytics
-- 10. Production-ready error handling and recovery mechanisms

Best Practices for Connection Pool Optimization

Design Guidelines

Essential practices for optimal connection pool configuration:

  1. Pool Sizing Strategy: Size pools based on application concurrency patterns and database server capacity (see the sizing sketch after this list)
  2. Workload Separation: Use separate pools for read/write operations to optimize for different performance characteristics
  3. Health Monitoring: Implement comprehensive monitoring and alerting for pool health and performance
  4. Timeout Configuration: Set appropriate timeouts for connection establishment, operations, and idle connections
  5. Error Handling: Implement robust error handling with automatic retry and recovery mechanisms
  6. Resource Management: Monitor resource utilization and implement auto-scaling strategies
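
For the pool sizing guideline, a minimal sketch of deriving maxPoolSize from expected concurrency; the throughput and latency figures are illustrative assumptions, not measured values:

// Rough Little's Law estimate of required pool size.
// All inputs are illustrative assumptions for a hypothetical service.
function estimatePoolSize({ requestsPerSecond, dbOpsPerRequest, avgOpLatencyMs, headroom = 1.5 }) {
  const dbOpsPerSecond = requestsPerSecond * dbOpsPerRequest;
  const concurrentOps = dbOpsPerSecond * (avgOpLatencyMs / 1000); // operations in flight at any moment
  return Math.ceil(concurrentOps * headroom);                     // headroom for traffic spikes
}

const suggestedMaxPoolSize = estimatePoolSize({
  requestsPerSecond: 400, // assumed peak application throughput
  dbOpsPerRequest: 2,     // assumed database calls per request
  avgOpLatencyMs: 20      // assumed average operation latency
});

console.log(`Suggested maxPoolSize: ${suggestedMaxPoolSize}`); // 400 * 2 * 0.02 * 1.5 = 24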

Performance Optimization

Optimize connection pools for maximum throughput and efficiency:

  1. Connection Reuse: Maximize connection reuse through appropriate idle timeout configuration
  2. Load Balancing: Distribute load across replica set members using read preferences
  3. Compression: Enable compression for improved network efficiency and reduced bandwidth usage
  4. Batch Operations: Use specialized batch processing pools for high-volume data operations (a bulk write sketch follows this list)
  5. Resource Pooling: Pool not just connections but also prepared statements and query plans
  6. Performance Monitoring: Continuously monitor and optimize based on real-world usage patterns
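
For the batch operations guideline, a minimal unordered bulkWrite sketch; the collection name and event documents are illustrative, and `db` is assumed to come from a batch-oriented pool like the one shown earlier:

// Illustrative unordered bulk insert; `db` is a database handle from a
// batch-oriented pool and `events` is a hypothetical workload array.
async function writeEventBatch(db, events) {
  const operations = events.map(event => ({
    insertOne: { document: { ...event, ingestedAt: new Date() } }
  }));

  // ordered: false lets independent inserts continue even if some fail,
  // which suits high-volume ingestion better than all-or-nothing batches
  const result = await db.collection('user_events').bulkWrite(operations, {
    ordered: false,
    writeConcern: { w: 1 } // fast acknowledgment, matching the batch pool settings
  });

  return { inserted: result.insertedCount };
}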

Conclusion

MongoDB connection pooling provides essential infrastructure for scalable, high-performance database applications. By implementing sophisticated connection management with automatic lifecycle handling, load balancing, and performance optimization, connection pools eliminate the overhead and complexity of per-request connection management while delivering predictable performance characteristics.

Key connection pooling benefits include:

  • Resource Efficiency: Optimal utilization of database connections and system resources
  • Predictable Performance: Consistent response times regardless of concurrent load
  • Automatic Management: Built-in connection lifecycle, health monitoring, and recovery
  • High Availability: Automatic failover and retry mechanisms for robust error handling
  • Scalable Architecture: Support for various deployment patterns from single-instance to globally distributed

Whether you're building high-traffic web applications, batch processing systems, multi-tenant SaaS platforms, or globally distributed services, MongoDB connection pooling with QueryLeaf's familiar SQL interface provides the foundation for robust database connectivity. This combination enables you to implement sophisticated connection management strategies while preserving familiar development patterns and operational approaches.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB connection pool configuration, monitoring, and optimization while providing SQL-familiar connection management syntax. Complex pooling strategies, performance monitoring, and resource optimization are seamlessly handled through familiar SQL patterns, making high-performance database connectivity both powerful and accessible.

The integration of advanced connection pooling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both high-performance database connectivity and familiar interaction patterns, ensuring your database infrastructure remains both efficient and maintainable as it scales and evolves.

MongoDB Aggregation Pipeline Optimization: SQL-Style Performance Tuning for Complex Data Analytics

Modern applications generate vast amounts of data requiring complex analytical processing - real-time reporting, business intelligence, data transformation, and advanced analytics. Traditional SQL databases handle complex queries through sophisticated query planners and optimization engines, but often struggle with unstructured data and horizontal scaling requirements.

MongoDB's aggregation pipeline provides powerful data processing capabilities that can handle complex analytics workloads at scale, but requires careful optimization to achieve optimal performance. Unlike traditional SQL query optimization that relies heavily on automatic query planning, MongoDB aggregation pipeline optimization requires understanding pipeline stage execution order, memory management, and strategic indexing approaches.
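
For instance, the server's view of a pipeline can be inspected with explain before any tuning is attempted; the collection name and filter values below are placeholders.

// mongosh: inspect how the server executes a pipeline before optimizing it
// (collection name and filter values are placeholders)
db.orders.explain('executionStats').aggregate([
  { $match: { status: 'completed', orderDate: { $gte: ISODate('2024-01-01') } } },
  { $group: { _id: '$customerId', total: { $sum: '$totalAmount' } } }
]);

// Look for an IXSCAN in the winning plan of the leading $match and compare
// documents examined versus returned for each stage.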

The Complex Analytics Performance Challenge

Traditional SQL analytics approaches face scalability and flexibility limitations:

-- Traditional SQL complex analytics - performance challenges at scale
WITH regional_sales AS (
  SELECT 
    r.region_name,
    p.category,
    p.subcategory,
    DATE_TRUNC('month', o.order_date) as month,
    SUM(oi.quantity * oi.unit_price) as gross_revenue,
    SUM(oi.quantity * p.cost_basis) as cost_of_goods,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    COUNT(o.order_id) as total_orders
  FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p ON oi.product_id = p.product_id
  JOIN customers c ON o.customer_id = c.customer_id
  JOIN regions r ON c.region_id = r.region_id
  WHERE o.order_date >= '2024-01-01'
    AND o.status IN ('completed', 'shipped')
  GROUP BY r.region_name, p.category, p.subcategory, DATE_TRUNC('month', o.order_date)
),
monthly_trends AS (
  SELECT 
    region_name,
    category,
    month,
    SUM(gross_revenue) as monthly_revenue,
    SUM(cost_of_goods) as monthly_costs,
    (SUM(gross_revenue) - SUM(cost_of_goods)) as monthly_profit,
    SUM(unique_customers) as monthly_customers,
    SUM(total_orders) as monthly_orders,

    -- Window function for trend analysis
    LAG(SUM(gross_revenue), 1) OVER (
      PARTITION BY region_name, category 
      ORDER BY month
    ) as previous_month_revenue
  FROM regional_sales
  GROUP BY region_name, category, month
),
category_medians AS (
  -- PERCENTILE_CONT is an ordered-set aggregate and cannot be used as a
  -- window function, so the per-category median is computed separately
  SELECT 
    region_name,
    category,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY monthly_revenue) as median_monthly_revenue
  FROM monthly_trends
  GROUP BY region_name, category
)
SELECT 
  mt.region_name,
  mt.category,
  mt.month,
  mt.monthly_revenue,
  mt.monthly_profit,
  mt.monthly_customers,

  -- Growth calculations
  ROUND(
    ((mt.monthly_revenue - mt.previous_month_revenue) 
      / NULLIF(mt.previous_month_revenue, 0)) * 100, 2
  ) as revenue_growth_percent,

  -- Performance vs median
  ROUND(
    (mt.monthly_revenue / NULLIF(cm.median_monthly_revenue, 0))::numeric * 100, 2
  ) as performance_vs_median,

  -- Customer metrics
  ROUND(mt.monthly_revenue / NULLIF(mt.monthly_customers, 0), 2) as revenue_per_customer,
  ROUND(mt.monthly_orders::numeric / NULLIF(mt.monthly_customers, 0), 2) as orders_per_customer

FROM monthly_trends mt
JOIN category_medians cm 
  ON cm.region_name = mt.region_name 
 AND cm.category = mt.category
WHERE mt.month >= '2024-06-01'
ORDER BY mt.region_name, mt.category, mt.month;

-- Problems with traditional approaches:
-- - Complex joins across multiple large tables
-- - Window functions require full data scanning
-- - Memory intensive for large datasets
-- - Limited horizontal scaling capabilities
-- - Rigid schema requirements
-- - Poor performance with nested/dynamic data structures
-- - Difficult to optimize for distributed processing

MongoDB aggregation pipelines provide optimized analytics processing:

// MongoDB optimized aggregation pipeline - high performance analytics
db.orders.aggregate([
  // Stage 1: Early filtering with index support
  {
    $match: {
      orderDate: { $gte: ISODate('2024-01-01') },
      status: { $in: ['completed', 'shipped'] }
    }
  },

  // Stage 2: Efficient lookup with optimized joins
  {
    $lookup: {
      from: 'customers',
      localField: 'customerId',
      foreignField: '_id',
      as: 'customer',
      pipeline: [
        { $project: { regionId: 1, _id: 0 } } // Project only needed fields
      ]
    }
  },

  // Stage 3: Flatten the joined customer document and the embedded order items
  { $unwind: '$customer' },
  { $unwind: '$items' },

  // Stage 4: Second lookup for product data
  {
    $lookup: {
      from: 'products',
      localField: 'items.productId',
      foreignField: '_id',
      as: 'product',
      pipeline: [
        { $project: { category: 1, subcategory: 1, costBasis: 1, _id: 0 } }
      ]
    }
  },

  { $unwind: '$product' },

  // Stage 5: Third lookup for region data
  {
    $lookup: {
      from: 'regions',
      localField: 'customer.regionId',
      foreignField: '_id',
      as: 'region',
      pipeline: [
        { $project: { regionName: 1, _id: 0 } }
      ]
    }
  },

  { $unwind: '$region' },

  // Stage 6: Add computed fields efficiently
  {
    $addFields: {
      month: { 
        $dateFromParts: {
          year: { $year: '$orderDate' },
          month: { $month: '$orderDate' },
          day: 1
        }
      },
      itemRevenue: { $multiply: ['$items.quantity', '$items.unitPrice'] },
      itemCost: { $multiply: ['$items.quantity', '$product.costBasis'] }
    }
  },

  // Stage 7: Group for initial aggregation
  {
    $group: {
      _id: {
        region: '$region.regionName',
        category: '$product.category',
        subcategory: '$product.subcategory',
        month: '$month'
      },
      grossRevenue: { $sum: '$itemRevenue' },
      costOfGoods: { $sum: '$itemCost' },
      uniqueCustomers: { $addToSet: '$customerId' },
      totalOrders: { $sum: 1 }
    }
  },

  // Stage 8: Transform unique customers to count
  {
    $addFields: {
      uniqueCustomerCount: { $size: '$uniqueCustomers' }
    }
  },

  // Stage 9: Project final structure
  {
    $project: {
      region: '$_id.region',
      category: '$_id.category',
      subcategory: '$_id.subcategory',
      month: '$_id.month',
      grossRevenue: 1,
      costOfGoods: 1,
      profit: { $subtract: ['$grossRevenue', '$costOfGoods'] },
      uniqueCustomerCount: 1,
      totalOrders: 1,
      revenuePerCustomer: {
        $round: [
          { $divide: ['$grossRevenue', '$uniqueCustomerCount'] },
          2
        ]
      },
      ordersPerCustomer: {
        $round: [
          { $divide: ['$totalOrders', '$uniqueCustomerCount'] },
          2
        ]
      },
      _id: 0
    }
  },

  // Stage 10: Sort for consistent output
  {
    $sort: {
      region: 1,
      category: 1,
      month: 1
    }
  },

  // Stage 11: Add window functions for trend analysis
  {
    $setWindowFields: {
      partitionBy: { region: '$region', category: '$category' },
      sortBy: { month: 1 },
      output: {
        previousMonthRevenue: {
          $shift: {
            output: '$grossRevenue',
            by: -1
          }
        },
        medianMonthlyRevenue: {
          $median: {
            input: '$grossRevenue',
            method: 'approximate'
          },
          window: {
            documents: ['unbounded', 'unbounded']
          }
        }
      }
    }
  },

  // Stage 12: Derive growth metrics
  // (window outputs cannot be referenced inside the same $setWindowFields stage)
  {
    $addFields: {
      revenueGrowthPercent: {
        $cond: {
          if: { $gt: ['$previousMonthRevenue', 0] },
          then: {
            $round: [
              {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$grossRevenue', '$previousMonthRevenue'] },
                      '$previousMonthRevenue'
                    ]
                  },
                  100
                ]
              },
              2
            ]
          },
          else: null
        }
      }
    }
  },

  // Stage 13: Final filtering for recent months
  {
    $match: {
      month: { $gte: ISODate('2024-06-01') }
    }
  }
], {
  // Pipeline options for optimization
  allowDiskUse: true,        // Allow spilling to disk for large datasets
  maxTimeMS: 300000,         // 5 minute timeout
  hint: { orderDate: 1, status: 1 }, // Suggest index usage
  readConcern: { level: 'majority' }  // Consistency level
});

// Benefits of optimized aggregation pipelines:
// - Early filtering reduces data volume through pipeline
// - Efficient $lookup stages with projected fields
// - Strategic index utilization
// - Memory-efficient processing with disk spilling
// - Native support for complex analytical operations
// - Horizontal scaling across shards
// - Flexible handling of nested/dynamic data
// - Built-in window functions for trend analysis
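
The hint option in the pipeline above presumes a matching compound index already exists on the collection; a short sketch of creating it (the index name is an assumption):

// Supporting compound index for the early $match stage and the hint above
// (index name is an assumption)
db.orders.createIndex(
  { orderDate: 1, status: 1 },
  { name: 'idx_orders_orderDate_status' }
);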

Understanding MongoDB Aggregation Pipeline Performance

Pipeline Stage Optimization and Ordering

Implement strategic pipeline stage ordering for optimal performance:

// Advanced aggregation pipeline optimization patterns
class AggregationOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexRecommendations = [];
  }

  async optimizeEarlyFiltering(collection, pipeline) {
    // Move filtering stages as early as possible
    const optimizedPipeline = [];
    const filterStages = [];
    const nonFilterStages = [];

    // Separate filter stages from other stages
    pipeline.forEach(stage => {
      const stageType = Object.keys(stage)[0];
      if (stageType === '$match' || stageType === '$limit') {
        filterStages.push(stage);
      } else {
        nonFilterStages.push(stage);
      }
    });

    // Early filtering reduces document flow through the pipeline.
    // NOTE: hoisting is only safe when the $match/$limit stages reference raw
    // document fields and do not depend on values computed by later stages.
    optimizedPipeline.push(...filterStages);
    optimizedPipeline.push(...nonFilterStages);

    return optimizedPipeline;
  }

  async createProjectionOptimizedPipeline(baseCollection, lookupCollections, projections) {
    // Optimize projections and lookups for minimal data transfer
    return [
      // Stage 1: Early projection to reduce document size
      {
        $project: {
          // Only include fields needed for subsequent stages
          ...projections.baseFields,
          // Include fields needed for lookups
          ...projections.lookupKeys
        }
      },

      // Stage 2: Optimized lookups with sub-pipelines
      ...lookupCollections.map(lookup => ({
        $lookup: {
          from: lookup.collection,
          localField: lookup.localField,
          foreignField: lookup.foreignField,
          as: lookup.as,
          pipeline: [
            // Project only needed fields in lookup
            { $project: lookup.projection },
            // Add filters within lookup when possible
            ...(lookup.filters ? [{ $match: lookup.filters }] : [])
          ]
        }
      })),

      // Stage 3: Unwind with null preservation for performance
      ...lookupCollections.map(lookup => ({
        $unwind: {
          path: `$${lookup.as}`,
          preserveNullAndEmptyArrays: lookup.preserveNulls || false
        }
      })),

      // Stage 4: Final projection after all joins
      {
        $project: projections.finalFields
      }
    ];
  }

  async analyzeIndexUsage(collection, pipeline) {
    // Analyze pipeline for index optimization opportunities
    const explanation = await this.db.collection(collection).aggregate(
      pipeline,
      { explain: true }
    ).toArray();

    const indexAnalysis = {
      stagesAnalyzed: [],
      indexesUsed: [],
      indexesRecommended: [],
      performanceIssues: []
    };

    // Analyze each stage for index usage
    explanation.forEach((stage, index) => {
      const stageType = Object.keys(pipeline[index])[0];
      const stageAnalysis = {
        stage: index,
        type: stageType,
        indexUsed: false,
        collectionScanned: false,
        documentsExamined: 0,
        documentsReturned: 0
      };

      if (stage.executionStats) {
        stageAnalysis.indexUsed = stage.executionStats.inputStage?.stage === 'IXSCAN';
        stageAnalysis.documentsExamined = stage.executionStats.totalDocsExamined;
        stageAnalysis.documentsReturned = stage.executionStats.totalDocsReturned;

        // Identify inefficient stages
        if (stageAnalysis.documentsExamined > stageAnalysis.documentsReturned * 10) {
          indexAnalysis.performanceIssues.push({
            stage: index,
            issue: 'high_document_examination_ratio',
            ratio: stageAnalysis.documentsExamined / stageAnalysis.documentsReturned,
            recommendation: 'Consider adding index for this stage'
          });
        }
      }

      indexAnalysis.stagesAnalyzed.push(stageAnalysis);
    });

    return indexAnalysis;
  }

  async createPerformanceOptimizedPipeline(collection, analyticsQuery) {
    // Create comprehensive performance-optimized pipeline
    const pipeline = [
      // Stage 1: Efficient date range filtering with index
      {
        $match: {
          [analyticsQuery.dateField]: {
            $gte: analyticsQuery.startDate,
            $lte: analyticsQuery.endDate
          },
          // Add compound index filters
          ...analyticsQuery.filters
        }
      },

      // Stage 2: Early sampling for large datasets (if needed)
      ...(analyticsQuery.sampleSize ? [{
        $sample: { size: analyticsQuery.sampleSize }
      }] : []),

      // Stage 3: Efficient faceted search
      {
        $facet: {
          // Main aggregation pipeline
          data: [
            // Lookup with optimized sub-pipeline
            {
              $lookup: {
                from: analyticsQuery.lookupCollection,
                localField: analyticsQuery.localField,
                foreignField: analyticsQuery.foreignField,
                as: 'lookupData',
                pipeline: [
                  { $project: analyticsQuery.lookupProjection },
                  { $limit: 1 } // Limit lookup results when appropriate
                ]
              }
            },

            { $unwind: '$lookupData' },

            // Grouping with efficient accumulators
            {
              $group: {
                _id: analyticsQuery.groupBy,

                // Use $sum for counting instead of $addToSet when possible
                totalCount: { $sum: 1 },
                totalValue: { $sum: analyticsQuery.valueField },
                averageValue: { $avg: analyticsQuery.valueField },

                // Efficient min/max calculations
                minValue: { $min: analyticsQuery.valueField },
                maxValue: { $max: analyticsQuery.valueField },

                // Use $push only when needed for arrays
                ...(analyticsQuery.collectArrays ? {
                  samples: { $push: analyticsQuery.sampleField }
                } : {})
              }
            },

            // Add calculated fields
            {
              $addFields: {
                efficiency: {
                  $round: [
                    { $divide: ['$totalValue', '$totalCount'] },
                    2
                  ]
                },
                valueRange: { $subtract: ['$maxValue', '$minValue'] }
              }
            },

            // Sort for consistent results
            { $sort: { totalValue: -1 } },

            // Limit results to prevent memory issues
            { $limit: analyticsQuery.maxResults || 1000 }
          ],

          // Metadata pipeline for counts and statistics
          metadata: [
            {
              $group: {
                _id: null,
                totalDocuments: { $sum: 1 },
                totalValue: { $sum: analyticsQuery.valueField },
                avgValue: { $avg: analyticsQuery.valueField }
              }
            }
          ]
        }
      },

      // Stage 4: Combine faceted results
      {
        $project: {
          data: 1,
          metadata: { $arrayElemAt: ['$metadata', 0] },
          processingTimestamp: new Date()
        }
      }
    ];

    return pipeline;
  }

  async benchmarkPipeline(collection, pipeline, options = {}) {
    // Comprehensive pipeline performance benchmarking
    const benchmarkResults = {
      pipelineName: options.name || 'unnamed_pipeline',
      startTime: new Date(),
      stages: [],
      totalExecutionTime: 0,
      documentsProcessed: 0,
      memoryUsage: 0,
      indexesUsed: [],
      recommendations: []
    };

    try {
      // Get execution statistics
      const startTime = Date.now();
      const explanation = await this.db.collection(collection).aggregate(
        pipeline,
        { 
          explain: true,
          allowDiskUse: true,
          ...options
        }
      ).toArray();

      // Analyze execution plan
      explanation.forEach((stageExplan, index) => {
        const stageBenchmark = {
          stageIndex: index,
          stageType: Object.keys(pipeline[index])[0],
          executionTimeMs: stageExplan.executionStats?.executionTimeMillisEstimate || 0,
          documentsIn: stageExplan.executionStats?.totalDocsExamined || 0,
          documentsOut: stageExplan.executionStats?.totalDocsReturned || 0,
          indexUsed: stageExplan.executionStats?.inputStage?.stage === 'IXSCAN',
          memoryUsageBytes: stageExplan.executionStats?.memUsage || 0
        };

        benchmarkResults.stages.push(stageBenchmark);
        benchmarkResults.totalExecutionTime += stageBenchmark.executionTimeMs;
        benchmarkResults.memoryUsage += stageBenchmark.memoryUsageBytes;
      });

      // Run actual pipeline for real-world timing
      const realStartTime = Date.now();
      const results = await this.db.collection(collection).aggregate(
        pipeline,
        { allowDiskUse: true, ...options }
      ).toArray();

      const realExecutionTime = Date.now() - realStartTime;
      benchmarkResults.realExecutionTime = realExecutionTime;
      benchmarkResults.documentsProcessed = results.length;

      // Generate recommendations
      benchmarkResults.recommendations = this.generateOptimizationRecommendations(
        benchmarkResults
      );

    } catch (error) {
      benchmarkResults.error = error.message;
    } finally {
      benchmarkResults.endTime = new Date();
    }

    // Store benchmark results for comparison
    this.performanceMetrics.set(benchmarkResults.pipelineName, benchmarkResults);

    return benchmarkResults;
  }

  generateOptimizationRecommendations(benchmarkResults) {
    const recommendations = [];

    // Check for stages without index usage
    benchmarkResults.stages.forEach((stage, index) => {
      if (!stage.indexUsed && stage.documentsIn > 1000) {
        recommendations.push({
          type: 'index_recommendation',
          stage: index,
          message: `Consider adding index for stage ${index} (${stage.stageType})`,
          priority: 'high',
          potentialImprovement: 'significant'
        });
      }

      if (stage.documentsIn > stage.documentsOut * 100) {
        recommendations.push({
          type: 'filtering_recommendation',
          stage: index,
          message: `Move filtering earlier in pipeline for stage ${index}`,
          priority: 'medium',
          potentialImprovement: 'moderate'
        });
      }
    });

    // Memory usage recommendations
    if (benchmarkResults.memoryUsage > 100 * 1024 * 1024) { // 100MB
      recommendations.push({
        type: 'memory_optimization',
        message: 'High memory usage detected - consider using allowDiskUse: true',
        priority: 'medium',
        potentialImprovement: 'prevents memory errors'
      });
    }

    // Execution time recommendations
    if (benchmarkResults.totalExecutionTime > 30000) { // 30 seconds
      recommendations.push({
        type: 'performance_optimization',
        message: 'Long execution time - review pipeline optimization opportunities',
        priority: 'high',
        potentialImprovement: 'significant'
      });
    }

    return recommendations;
  }

  async createIndexRecommendations(collection, commonPipelines) {
    // Generate index recommendations based on common pipeline patterns
    const recommendations = [];

    for (const pipeline of commonPipelines) {
      const analysis = await this.analyzeIndexUsage(collection, pipeline.stages);

      pipeline.stages.forEach((stage, index) => {
        const stageType = Object.keys(stage)[0];

        switch (stageType) {
          case '$match':
            const matchFields = Object.keys(stage.$match);
            if (matchFields.length > 0) {
              recommendations.push({
                type: 'compound_index',
                collection: collection,
                fields: matchFields,
                reason: `Optimize $match stage ${index}`,
                estimatedImprovement: 'high'
              });
            }
            break;

          case '$sort':
            const sortFields = Object.keys(stage.$sort);
            recommendations.push({
              type: 'sort_index',
              collection: collection,
              fields: sortFields,
              reason: `Optimize $sort stage ${index}`,
              estimatedImprovement: 'high'
            });
            break;

          case '$group':
            const groupField = stage.$group._id;
            if (typeof groupField === 'string' && groupField.startsWith('$')) {
              recommendations.push({
                type: 'grouping_index',
                collection: collection,
                fields: [groupField.substring(1)],
                reason: `Optimize $group stage ${index}`,
                estimatedImprovement: 'medium'
              });
            }
            break;
        }
      });
    }

    // Deduplicate and prioritize recommendations
    return this.prioritizeIndexRecommendations(recommendations);
  }

  prioritizeIndexRecommendations(recommendations) {
    // Remove duplicates and prioritize by impact
    const uniqueRecommendations = new Map();

    recommendations.forEach(rec => {
      const key = `${rec.collection}_${rec.fields.join('_')}`;
      const existing = uniqueRecommendations.get(key);

      if (!existing || this.getImpactScore(rec) > this.getImpactScore(existing)) {
        uniqueRecommendations.set(key, rec);
      }
    });

    return Array.from(uniqueRecommendations.values())
      .sort((a, b) => this.getImpactScore(b) - this.getImpactScore(a));
  }

  getImpactScore(recommendation) {
    const impactScores = {
      high: 3,
      medium: 2,
      low: 1
    };
    return impactScores[recommendation.estimatedImprovement] || 0;
  }

  async generatePerformanceReport() {
    // Generate comprehensive performance analysis report
    const report = {
      generatedAt: new Date(),
      totalPipelinesAnalyzed: this.performanceMetrics.size,
      performanceSummary: {
        fastPipelines: 0,      // < 1 second
        moderatePipelines: 0,  // 1-10 seconds
        slowPipelines: 0       // > 10 seconds
      },
      topPerformers: [],
      performanceIssues: [],
      indexRecommendations: [],
      overallRecommendations: []
    };

    // Analyze all benchmarked pipelines
    for (const [name, metrics] of this.performanceMetrics.entries()) {
      const executionTime = metrics.realExecutionTime || metrics.totalExecutionTime;

      if (executionTime < 1000) {
        report.performanceSummary.fastPipelines++;
      } else if (executionTime < 10000) {
        report.performanceSummary.moderatePipelines++;
      } else {
        report.performanceSummary.slowPipelines++;
      }

      // Identify top performers and issues
      if (executionTime < 500 && metrics.documentsProcessed > 1000) {
        report.topPerformers.push({
          name: name,
          executionTime: executionTime,
          documentsProcessed: metrics.documentsProcessed,
          efficiency: metrics.documentsProcessed / executionTime
        });
      }

      if (executionTime > 30000 || metrics.memoryUsage > 500 * 1024 * 1024) {
        report.performanceIssues.push({
          name: name,
          executionTime: executionTime,
          memoryUsage: metrics.memoryUsage,
          recommendations: metrics.recommendations
        });
      }
    }

    // Sort top performers by efficiency
    report.topPerformers.sort((a, b) => b.efficiency - a.efficiency);

    // Generate overall recommendations
    if (report.performanceSummary.slowPipelines > 0) {
      report.overallRecommendations.push(
        'Multiple slow pipelines detected - prioritize optimization efforts'
      );
    }

    if (this.indexRecommendations.length > 5) {
      report.overallRecommendations.push(
        'Consider implementing recommended indexes to improve query performance'
      );
    }

    return report;
  }
}
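
A hedged usage sketch of the optimizer class above; the connection string, database name, and sample pipeline are placeholders, not part of the original design.

// Example usage of AggregationOptimizer (connection string and pipeline are placeholders)
const { MongoClient } = require('mongodb');

async function runOptimizationPass() {
  const client = new MongoClient(process.env.MONGODB_URI);
  await client.connect();

  const optimizer = new AggregationOptimizer(client.db('analytics'));

  const salesPipeline = [
    { $match: { status: 'completed', orderDate: { $gte: new Date('2024-01-01') } } },
    { $group: { _id: '$region', revenue: { $sum: '$totalAmount' } } },
    { $sort: { revenue: -1 } }
  ];

  // Benchmark the pipeline and collect optimization recommendations
  const benchmark = await optimizer.benchmarkPipeline('orders', salesPipeline, {
    name: 'regional_revenue'
  });
  console.log('Recommendations:', benchmark.recommendations);

  // Summarize all benchmarked pipelines
  const report = await optimizer.generatePerformanceReport();
  console.log('Slow pipelines:', report.performanceSummary.slowPipelines);

  await client.close();
}

runOptimizationPass().catch(console.error);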

Memory Management and Disk Spilling

Implement efficient memory management for large aggregations:

// Advanced memory management and optimization strategies
class AggregationMemoryManager {
  constructor(db) {
    this.db = db;
    this.memoryThresholds = {
      warning: 100 * 1024 * 1024,    // 100MB
      critical: 500 * 1024 * 1024,   // 500MB
      maximum: 1024 * 1024 * 1024    // 1GB
    };
  }

  async createMemoryEfficientPipeline(collection, aggregationConfig) {
    // Design pipeline with memory efficiency in mind
    const memoryOptimizedPipeline = [
      // Stage 1: Early filtering to reduce dataset size
      {
        $match: {
          ...aggregationConfig.filters,
          // Add indexed filters first
          [aggregationConfig.dateField]: {
            $gte: aggregationConfig.startDate,
            $lte: aggregationConfig.endDate
          }
        }
      },

      // Stage 2: Project only necessary fields early
      // (MongoDB does not allow mixing inclusion and exclusion in one $project;
      //  an inclusion projection already drops large unneeded fields)
      {
        $project: {
          // Include only fields needed for processing
          ...aggregationConfig.requiredFields
        }
      },

      // Stage 3: Use streaming-friendly operations
      {
        $group: {
          _id: aggregationConfig.groupBy,

          // Use memory-efficient accumulators
          count: { $sum: 1 },
          totalValue: { $sum: aggregationConfig.valueField },

          // Avoid $addToSet on high-cardinality fields; only collect distinct values when the set is expected to stay small
          ...(aggregationConfig.collectSets && aggregationConfig.expectedSetSize < 1000 ? {
            uniqueValues: { $addToSet: aggregationConfig.setField }
          } : {}),

          // Use $first/$last instead of $push for single values
          firstValue: { $first: aggregationConfig.valueField },
          lastValue: { $last: aggregationConfig.valueField },

          // Calculated fields at group level to avoid later processing
          averageValue: { $avg: aggregationConfig.valueField }
        }
      },

      // Stage 4: Add computed fields efficiently
      {
        $addFields: {
          efficiency: {
            $cond: {
              if: { $gt: ['$count', 0] },
              then: { $divide: ['$totalValue', '$count'] },
              else: 0
            }
          },

          // Avoid complex calculations on large arrays
          setSize: {
            $cond: {
              if: { $isArray: '$uniqueValues' },
              then: { $size: '$uniqueValues' },
              else: 0
            }
          }
        }
      },

      // Stage 5: Sort with limit to prevent large result sets
      { $sort: { totalValue: -1 } },
      { $limit: aggregationConfig.maxResults || 10000 },

      // Stage 6: Final projection to minimize output size
      {
        $project: {
          groupKey: '$_id',
          metrics: {
            count: '$count',
            totalValue: '$totalValue',
            averageValue: '$averageValue',
            efficiency: '$efficiency'
          },
          _id: 0
        }
      }
    ];

    return memoryOptimizedPipeline;
  }

  async processLargeDatasetWithBatching(collection, pipeline, batchConfig) {
    // Process large datasets in batches to manage memory
    const results = [];
    const batchSize = batchConfig.batchSize || 10000;
    const totalBatches = Math.ceil(batchConfig.totalDocuments / batchSize);

    console.log(`Processing ${batchConfig.totalDocuments} documents in ${totalBatches} batches`);

    for (let batch = 0; batch < totalBatches; batch++) {
      const skip = batch * batchSize;

      const batchPipeline = [
        // Add skip and limit for batching
        // NOTE: assumes a stable upstream order and results that can be safely
        // combined across batches (no single global $group over all documents)
        { $skip: skip },
        { $limit: batchSize },

        // Original pipeline stages
        ...pipeline
      ];

      try {
        const batchResults = await this.db.collection(collection).aggregate(
          batchPipeline,
          {
            allowDiskUse: true,
            maxTimeMS: 60000, // 1 minute per batch
            readConcern: { level: 'available' } // Use available for better performance
          }
        ).toArray();

        results.push(...batchResults);

        console.log(`Completed batch ${batch + 1}/${totalBatches} (${batchResults.length} results)`);

        // Optional: Add delay between batches to reduce load
        if (batchConfig.delayMs && batch < totalBatches - 1) {
          await new Promise(resolve => setTimeout(resolve, batchConfig.delayMs));
        }

      } catch (error) {
        console.error(`Batch ${batch + 1} failed:`, error.message);

        // Optionally continue with remaining batches
        if (batchConfig.continueOnError) {
          continue;
        } else {
          throw error;
        }
      }
    }

    return results;
  }

  async createStreamingAggregation(collection, pipeline, outputHandler) {
    // Create streaming aggregation for real-time processing
    const cursor = this.db.collection(collection).aggregate(pipeline, {
      allowDiskUse: true,
      batchSize: 1000, // Small batch size for streaming
      readConcern: { level: 'available' }
    });

    const streamingStats = {
      documentsProcessed: 0,
      startTime: new Date(),
      memoryPeakUsage: 0,
      batchesProcessed: 0
    };

    try {
      while (await cursor.hasNext()) {
        const document = await cursor.next();

        // Process document through handler
        await outputHandler(document, streamingStats);

        streamingStats.documentsProcessed++;

        // Monitor memory usage (approximate)
        if (streamingStats.documentsProcessed % 1000 === 0) {
          const memoryUsage = process.memoryUsage();
          streamingStats.memoryPeakUsage = Math.max(
            streamingStats.memoryPeakUsage,
            memoryUsage.heapUsed
          );

          console.log(`Processed ${streamingStats.documentsProcessed} documents, Memory: ${Math.round(memoryUsage.heapUsed / 1024 / 1024)}MB`);
        }
      }

    } finally {
      await cursor.close();
      streamingStats.endTime = new Date();
      streamingStats.totalProcessingTime = streamingStats.endTime - streamingStats.startTime;
    }

    return streamingStats;
  }

  async optimizePipelineForLargeArrays(collection, pipeline, arrayOptimizations) {
    // Optimize pipelines that work with large arrays
    const optimizedPipeline = [];

    pipeline.forEach((stage, index) => {
      const stageType = Object.keys(stage)[0];

      switch (stageType) {
        case '$unwind':
          // Add preserveNullAndEmptyArrays and includeArrayIndex for efficiency
          optimizedPipeline.push({
            $unwind: {
              path: stage.$unwind.path || stage.$unwind,
              preserveNullAndEmptyArrays: true,
              includeArrayIndex: `${stage.$unwind.path || stage.$unwind}_index`
            }
          });
          break;

        case '$group':
          // Optimize group operations for array handling
          const groupStage = { ...stage };

            // Optionally replace $addToSet with $push: cheaper per element, but duplicates are kept
          Object.keys(groupStage.$group).forEach(key => {
            if (key !== '_id') {
              const accumulator = groupStage.$group[key];

              if (accumulator.$addToSet && arrayOptimizations.convertAddToSetToMerge) {
                // $push skips the per-element uniqueness check that makes $addToSet expensive
                groupStage.$group[key] = { $push: accumulator.$addToSet };
              }
            }
          });

          optimizedPipeline.push(groupStage);
          break;

        case '$project':
          // Optimize array operations in projection
          const projectStage = { ...stage };

          Object.keys(projectStage.$project).forEach(key => {
            const projection = projectStage.$project[key];

            // Replace array operations with more efficient alternatives
            if (projection && typeof projection === 'object' && projection.$size) {
              // $size can be expensive on very large arrays
              if (arrayOptimizations.approximateArraySizes) {
                projectStage.$project[`${key}_approx`] = {
                  $cond: {
                    if: { $isArray: projection.$size },
                    then: { $min: [{ $size: projection.$size }, 10000] }, // Cap at 10k
                    else: 0
                  }
                };
              }
            }
          });

          optimizedPipeline.push(projectStage);
          break;

        default:
          optimizedPipeline.push(stage);
      }
    });

    // Add array-specific optimizations
    if (arrayOptimizations.limitArrayProcessing) {
      // Add $limit stages after $unwind to prevent processing too many array elements
      // Iterate backwards so inserting stages does not shift unvisited indexes
      for (let index = optimizedPipeline.length - 1; index >= 0; index--) {
        const stage = optimizedPipeline[index];
        if (stage.$unwind && index < optimizedPipeline.length - 1) {
          optimizedPipeline.splice(index + 1, 0, {
            $limit: arrayOptimizations.maxArrayElements || 100000
          });
        }
      }
    }

    return optimizedPipeline;
  }

  async monitorAggregationPerformance(collection, pipeline, options = {}) {
    // Comprehensive performance monitoring for aggregations
    const performanceMonitor = {
      startTime: new Date(),
      memorySnapshots: [],
      stageTimings: [],
      resourceUsage: {
        cpuStart: process.cpuUsage(),
        memoryStart: process.memoryUsage()
      }
    };

    // Function to take memory snapshots
    const takeMemorySnapshot = () => {
      const memoryUsage = process.memoryUsage();
      performanceMonitor.memorySnapshots.push({
        timestamp: new Date(),
        heapUsed: memoryUsage.heapUsed,
        heapTotal: memoryUsage.heapTotal,
        external: memoryUsage.external,
        rss: memoryUsage.rss
      });
    };

    // Take initial snapshot
    takeMemorySnapshot();

    try {
      let results;

      if (options.explain) {
        // Get execution plan with timing
        results = await this.db.collection(collection).aggregate(
          pipeline,
          { 
            explain: true,
            allowDiskUse: options.allowDiskUse ?? true,
            maxTimeMS: options.maxTimeMS || 300000
          }
        ).toArray();

        // Analyze execution plan
        results.forEach((stageExplan, index) => {
          performanceMonitor.stageTimings.push({
            stage: index,
            type: Object.keys(pipeline[index])[0],
            executionTimeMs: stageExplan.executionStats?.executionTimeMillisEstimate || 0,
            documentsIn: stageExplan.executionStats?.totalDocsExamined || 0,
            documentsOut: stageExplan.executionStats?.totalDocsReturned || 0
          });
        });

      } else {
        // Execute actual pipeline with monitoring
        const monitoringInterval = setInterval(takeMemorySnapshot, 5000); // Every 5 seconds

        try {
          results = await this.db.collection(collection).aggregate(
            pipeline,
            {
              allowDiskUse: options.allowDiskUse ?? true,
              maxTimeMS: options.maxTimeMS || 300000,
              batchSize: options.batchSize || 1000
            }
          ).toArray();

        } finally {
          clearInterval(monitoringInterval);
        }
      }

      // Take final snapshot
      takeMemorySnapshot();

      // Calculate performance metrics
      const endTime = new Date();
      const totalTime = endTime - performanceMonitor.startTime;
      const finalCpuUsage = process.cpuUsage(performanceMonitor.resourceUsage.cpuStart);
      const finalMemoryUsage = process.memoryUsage();

      performanceMonitor.summary = {
        totalExecutionTime: totalTime,
        documentsReturned: results.length,
        avgMemoryUsage: performanceMonitor.memorySnapshots.reduce(
          (sum, snapshot) => sum + snapshot.heapUsed, 0
        ) / performanceMonitor.memorySnapshots.length,
        peakMemoryUsage: Math.max(
          ...performanceMonitor.memorySnapshots.map(s => s.heapUsed)
        ),
        cpuUserTime: finalCpuUsage.user / 1000, // Convert to milliseconds
        cpuSystemTime: finalCpuUsage.system / 1000,
        memoryDifference: finalMemoryUsage.heapUsed - performanceMonitor.resourceUsage.memoryStart.heapUsed
      };

      return {
        results: results,
        performanceData: performanceMonitor
      };

    } catch (error) {
      performanceMonitor.error = error.message;
      throw error;
    }
  }

  async optimizeForShardedCollection(collection, pipeline, shardingConfig) {
    // Optimize pipeline for sharded collections
    const shardOptimizedPipeline = [];

    // Add shard key filtering early if possible
    if (shardingConfig.shardKey && shardingConfig.shardKeyValues) {
      shardOptimizedPipeline.push({
        $match: {
          [shardingConfig.shardKey]: {
            $in: shardingConfig.shardKeyValues
          }
        }
      });
    }

    pipeline.forEach((stage, index) => {
      const stageType = Object.keys(stage)[0];

      switch (stageType) {
        case '$group':
          // Ensure group operations can be parallelized across shards
          const groupStage = { ...stage };

          // Add shard key to group _id when possible for better parallelization
          if (groupStage.$group._id && typeof groupStage.$group._id === 'object' && shardingConfig.includeShardKeyInGroup) {
            groupStage.$group._id[shardingConfig.shardKey] = `$${shardingConfig.shardKey}`;
          }

          shardOptimizedPipeline.push(groupStage);
          break;

        case '$sort':
          // Optimize sort for sharded collections
          const sortStage = { ...stage };

          // Include shard key in sort to prevent scatter-gather when possible
          if (shardingConfig.includeShardKeyInSort) {
            sortStage.$sort = {
              [shardingConfig.shardKey]: 1,
              ...sortStage.$sort
            };
          }

          shardOptimizedPipeline.push(sortStage);
          break;

        case '$lookup':
          // Optimize lookups for sharded collections
          const lookupStage = { ...stage };

          // Add hint to use shard key when doing lookups
          if (shardingConfig.optimizeLookups) {
            lookupStage.$lookup.pipeline = lookupStage.$lookup.pipeline || [];
            lookupStage.$lookup.pipeline.unshift({
              $match: {
                // Add efficient filters in lookup pipeline
              }
            });
          }

          shardOptimizedPipeline.push(lookupStage);
          break;

        default:
          shardOptimizedPipeline.push(stage);
      }
    });

    return shardOptimizedPipeline;
  }
}
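
A brief usage sketch of the streaming helper above; the db handle, collection name, pipeline, and handler are placeholders and assume an async context.

// Example: stream a large aggregation through a handler instead of buffering it
// (db handle, collection name, and pipeline are placeholders)
const memoryManager = new AggregationMemoryManager(db);

const dailyTotalsPipeline = [
  { $match: { eventDate: { $gte: new Date('2024-01-01') } } },
  { $group: {
      _id: { $dateTrunc: { date: '$eventDate', unit: 'day' } },
      total: { $sum: '$amount' }
  } }
];

const stats = await memoryManager.createStreamingAggregation(
  'events',
  dailyTotalsPipeline,
  async (doc, streamingStats) => {
    // Write each result incrementally (e.g., to a file, queue, or downstream store)
    console.log(doc._id, doc.total);
  }
);

console.log(`Streamed ${stats.documentsProcessed} documents in ${stats.totalProcessingTime} ms`);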

Advanced Aggregation Patterns and Optimizations

Complex Analytics with Window Functions

Implement sophisticated analytics using MongoDB's window functions:

// Advanced analytics patterns with window functions and time-series analysis
class AdvancedAnalyticsEngine {
  constructor(db) {
    this.db = db;
    this.analysisCache = new Map();
  }

  async createTimeSeriesAnalysisPipeline(collection, timeSeriesConfig) {
    // Advanced time-series analysis with window functions
    return [
      // Stage 1: Filter and prepare time series data
      {
        $match: {
          [timeSeriesConfig.timestampField]: {
            $gte: timeSeriesConfig.startDate,
            $lte: timeSeriesConfig.endDate
          },
          ...timeSeriesConfig.filters
        }
      },

      // Stage 2: Add time bucket fields for grouping
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: `$${timeSeriesConfig.timestampField}`,
              unit: timeSeriesConfig.timeUnit, // 'hour', 'day', 'week', 'month'
              binSize: timeSeriesConfig.binSize || 1
            }
          },

          // Extract time components for analysis
          hour: { $hour: `$${timeSeriesConfig.timestampField}` },
          dayOfWeek: { $dayOfWeek: `$${timeSeriesConfig.timestampField}` },
          dayOfMonth: { $dayOfMonth: `$${timeSeriesConfig.timestampField}` },
          month: { $month: `$${timeSeriesConfig.timestampField}` },
          year: { $year: `$${timeSeriesConfig.timestampField}` }
        }
      },

      // Stage 3: Group by time bucket and dimensions
      {
        $group: {
          _id: {
            timeBucket: '$timeBucket',
            // Add dimensional grouping
            ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
              acc[dim] = `$${dim}`;
              return acc;
            }, {})
          },

          // Aggregate metrics
          totalValue: { $sum: `$${timeSeriesConfig.valueField}` },
          count: { $sum: 1 },
          averageValue: { $avg: `$${timeSeriesConfig.valueField}` },
          minValue: { $min: `$${timeSeriesConfig.valueField}` },
          maxValue: { $max: `$${timeSeriesConfig.valueField}` },

          // Collect samples for percentile calculations
          values: { $push: `$${timeSeriesConfig.valueField}` },

          // Time pattern analysis
          hourDistribution: {
            $push: {
              hour: '$hour',
              value: `$${timeSeriesConfig.valueField}`
            }
          }
        }
      },

      // Stage 4: Add calculated fields and percentiles
      {
        $addFields: {
          // Calculate percentiles from collected values
          p50: { $percentile: { input: '$values', p: [0.5], method: 'approximate' } },
          p90: { $percentile: { input: '$values', p: [0.9], method: 'approximate' } },
          p95: { $percentile: { input: '$values', p: [0.95], method: 'approximate' } },
          p99: { $percentile: { input: '$values', p: [0.99], method: 'approximate' } },

          // Calculate population standard deviation of the collected values
          stdDev: { $stdDevPop: '$values' },

          // Calculate value range
          valueRange: { $subtract: ['$maxValue', '$minValue'] },

          // Calculate coefficient of variation
          coefficientOfVariation: {
            $cond: {
              if: { $gt: ['$averageValue', 0] },
              then: { 
                $divide: [
                  { $stdDevPop: '$values' },
                  '$averageValue'
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 5: Sort by time for window function processing
      {
        $sort: {
          '_id.timeBucket': 1,
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[`_id.${dim}`] = 1;
            return acc;
          }, {})
        }
      },

      // Stage 6: Apply window functions for trend analysis
      {
        $setWindowFields: {
          partitionBy: timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = `$_id.${dim}`;
            return acc;
          }, {}),
          sortBy: { '_id.timeBucket': 1 },
          output: {
            // Moving averages
            movingAvg7: {
              $avg: '$totalValue',
              window: {
                documents: [-6, 0] // 7-period moving average
              }
            },
            movingAvg30: {
              $avg: '$totalValue',
              window: {
                documents: [-29, 0] // 30-period moving average
              }
            },

            // Growth calculations
            previousPeriodValue: {
              $shift: {
                output: '$totalValue',
                by: -1
              }
            },

            // Cumulative calculations
            cumulativeSum: {
              $sum: '$totalValue',
              window: {
                documents: ['unbounded', 'current']
              }
            },

            // Rank within the partition (rank-style operators take no window specification)
            valueRank: {
              $rank: {}
            },

            // Min/Max within window
            windowMin: {
              $min: '$totalValue',
              window: {
                documents: [-6, 6] // 13-period window
              }
            },
            windowMax: {
              $max: '$totalValue',
              window: {
                documents: [-6, 6] // 13-period window
              }
            },

            // (Period-over-period change is derived in the next stage from
            //  previousPeriodValue; arbitrary expressions such as $subtract
            //  are not valid $setWindowFields outputs.)

            // Volatility measures
            volatility: {
              $stdDevPop: '$totalValue',
              window: {
                documents: [-29, 0] // 30-period volatility
              }
            }
          }
        }
      },

      // Stage 7: Calculate derived metrics
      {
        $addFields: {
          // Period-over-period change and growth rates
          // (computed from previousPeriodValue; fields added in the same
          //  $addFields stage cannot reference each other)
          periodChange: {
            $subtract: ['$totalValue', '$previousPeriodValue']
          },

          periodGrowthRate: {
            $cond: {
              if: { $gt: ['$previousPeriodValue', 0] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$totalValue', '$previousPeriodValue'] },
                      '$previousPeriodValue'
                    ]
                  },
                  100
                ]
              },
              else: null
            }
          },

          // Trend indicators
          trendDirection: {
            $cond: {
              if: { $gt: ['$totalValue', '$movingAvg7'] },
              then: 'up',
              else: {
                $cond: {
                  if: { $lt: ['$totalValue', '$movingAvg7'] },
                  then: 'down',
                  else: 'stable'
                }
              }
            }
          },

          // Anomaly detection (simple z-score based)
          zScore: {
            $cond: {
              if: { $gt: ['$volatility', 0] },
              then: {
                $divide: [
                  { $subtract: ['$totalValue', '$movingAvg30'] },
                  '$volatility'
                ]
              },
              else: 0
            }
          },

          // Position within window range
          positionInRange: {
            $cond: {
              if: { $gt: [{ $subtract: ['$windowMax', '$windowMin'] }, 0] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$totalValue', '$windowMin'] },
                      { $subtract: ['$windowMax', '$windowMin'] }
                    ]
                  },
                  100
                ]
              },
              else: 50
            }
          }
        }
      },

      // Stage 8: Add anomaly flags
      {
        $addFields: {
          isAnomaly: {
            $or: [
              { $gt: ['$zScore', 2.5] }, // High anomaly
              { $lt: ['$zScore', -2.5] } // Low anomaly
            ]
          },
          anomalyLevel: {
            $cond: {
              if: { $gt: [{ $abs: '$zScore' }, 3] },
              then: 'extreme',
              else: {
                $cond: {
                  if: { $gt: [{ $abs: '$zScore' }, 2] },
                  then: 'high',
                  else: 'normal'
                }
              }
            }
          }
        }
      },

      // Stage 9: Final projection with clean structure
      {
        $project: {
          // Time dimension
          timeBucket: '$_id.timeBucket',

          // Other dimensions
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = `$_id.${dim}`;
            return acc;
          }, {}),

          // Core metrics
          metrics: {
            totalValue: { $round: ['$totalValue', 2] },
            count: '$count',
            averageValue: { $round: ['$averageValue', 2] },
            minValue: '$minValue',
            maxValue: '$maxValue',
            valueRange: '$valueRange'
          },

          // Statistical measures
          statistics: {
            p50: { $arrayElemAt: ['$p50', 0] },
            p90: { $arrayElemAt: ['$p90', 0] },
            p95: { $arrayElemAt: ['$p95', 0] },
            p99: { $arrayElemAt: ['$p99', 0] },
            stdDev: { $round: ['$stdDev', 2] },
            coefficientOfVariation: { $round: ['$coefficientOfVariation', 4] }
          },

          // Trend analysis
          trends: {
            movingAvg7: { $round: ['$movingAvg7', 2] },
            movingAvg30: { $round: ['$movingAvg30', 2] },
            periodChange: { $round: ['$periodChange', 2] },
            periodGrowthRate: { $round: ['$periodGrowthRate', 2] },
            trendDirection: '$trendDirection',
            cumulativeSum: { $round: ['$cumulativeSum', 2] }
          },

          // Anomaly detection
          anomalies: {
            zScore: { $round: ['$zScore', 3] },
            isAnomaly: '$isAnomaly',
            anomalyLevel: '$anomalyLevel',
            positionInRange: { $round: ['$positionInRange', 1] }
          },

          // Rankings
          rankings: {
            valueRank: '$valueRank',
            volatility: { $round: ['$volatility', 2] }
          },

          _id: 0
        }
      },

      // Stage 10: Sort final results
      {
        $sort: {
          timeBucket: 1,
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = 1;
            return acc;
          }, {})
        }
      }
    ];
  }

  async createCohortAnalysisPipeline(collection, cohortConfig) {
    // Advanced cohort analysis for user behavior tracking
    return [
      // Stage 1: Filter and prepare user event data
      {
        $match: {
          [cohortConfig.eventDateField]: {
            $gte: cohortConfig.startDate,
            $lte: cohortConfig.endDate
          },
          [cohortConfig.eventTypeField]: { $in: cohortConfig.eventTypes }
        }
      },

      // Stage 2: Determine cohort assignment based on first event
      {
        $group: {
          _id: `$${cohortConfig.userIdField}`,
          firstEventDate: { $min: `$${cohortConfig.eventDateField}` },
          allEvents: {
            $push: {
              eventDate: `$${cohortConfig.eventDateField}`,
              eventType: `$${cohortConfig.eventTypeField}`,
              eventValue: `$${cohortConfig.valueField}`
            }
          }
        }
      },

      // Stage 3: Add cohort period (week/month of first event)
      {
        $addFields: {
          cohortPeriod: {
            $dateTrunc: {
              date: '$firstEventDate',
              unit: cohortConfig.cohortTimeUnit, // 'week' or 'month'
              binSize: 1
            }
          }
        }
      },

      // Stage 4: Unwind events for period analysis
      { $unwind: '$allEvents' },

      // Stage 5: Calculate periods since cohort start
      {
        $addFields: {
          periodsSinceCohort: {
            $floor: {
              $divide: [
                { $subtract: ['$allEvents.eventDate', '$firstEventDate'] },
                cohortConfig.cohortTimeUnit === 'week' ? 604800000 : 2629746000 // ms in week/month
              ]
            }
          }
        }
      },

      // Stage 6: Group by cohort and period for retention analysis
      {
        $group: {
          _id: {
            cohortPeriod: '$cohortPeriod',
            periodNumber: '$periodsSinceCohort'
          },

          // Cohort metrics
          activeUsers: { $addToSet: '$_id' }, // Unique users active in this period
          totalEvents: { $sum: 1 },
          totalValue: { $sum: '$allEvents.eventValue' },

          // Event type breakdown
          eventTypeBreakdown: {
            $push: {
              eventType: '$allEvents.eventType',
              value: '$allEvents.eventValue'
            }
          }
        }
      },

      // Stage 7: Calculate active user counts
      {
        $addFields: {
          activeUserCount: { $size: '$activeUsers' }
        }
      },

      // Stage 8: Derive cohort size (period 0 active users) with a window function.
      // ($lookup runs against the stored collection and cannot see these
      //  intermediate pipeline results, so the earliest period's active user
      //  count is carried across each cohort partition instead.)
      {
        $setWindowFields: {
          partitionBy: '$_id.cohortPeriod',
          sortBy: { '_id.periodNumber': 1 },
          output: {
            cohortSize: {
              $first: '$activeUserCount',
              window: {
                documents: ['unbounded', 'unbounded']
              }
            }
          }
        }
      },

      // Stage 9: Calculate retention rates

      {
        $addFields: {
          retentionRate: {
            $cond: {
              if: { $gt: ['$cohortSize', 0] },
              then: {
                $round: [
                  { $multiply: [{ $divide: ['$activeUserCount', '$cohortSize'] }, 100] },
                  2
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 10: Add cohort analysis metrics
      {
        $addFields: {
          // Average events per user
          eventsPerUser: {
            $cond: {
              if: { $gt: ['$activeUserCount', 0] },
              then: { $round: [{ $divide: ['$totalEvents', '$activeUserCount'] }, 2] },
              else: 0
            }
          },

          // Average value per user
          valuePerUser: {
            $cond: {
              if: { $gt: ['$activeUserCount', 0] },
              then: { $round: [{ $divide: ['$totalValue', '$activeUserCount'] }, 2] },
              else: 0
            }
          },

          // Average value per event
          valuePerEvent: {
            $cond: {
              if: { $gt: ['$totalEvents', 0] },
              then: { $round: [{ $divide: ['$totalValue', '$totalEvents'] }, 2] },
              else: 0
            }
          }
        }
      },

      // Stage 11: Group event types for analysis
      {
        $addFields: {
          eventTypeSummary: {
            $reduce: {
              input: '$eventTypeBreakdown',
              initialValue: {},
              in: {
                $mergeObjects: [
                  '$$value',
                  {
                    $arrayToObject: [{
                      k: '$$this.eventType',
                      v: {
                        $add: [
                          { $ifNull: [{ $getField: { field: '$$this.eventType', input: '$$value' } }, 0] },
                          '$$this.value'
                        ]
                      }
                    }]
                  }
                ]
              }
            }
          }
        }
      },

      // Stage 12: Final projection
      {
        $project: {
          cohortPeriod: '$_id.cohortPeriod',
          periodNumber: '$_id.periodNumber',
          cohortSize: '$cohortSize',
          activeUsers: '$activeUserCount',
          retentionRate: '$retentionRate',

          engagement: {
            totalEvents: '$totalEvents',
            eventsPerUser: '$eventsPerUser',
            totalValue: { $round: ['$totalValue', 2] },
            valuePerUser: '$valuePerUser',
            valuePerEvent: '$valuePerEvent'
          },

          eventBreakdown: '$eventTypeSummary',

          // Cohort health indicators
          healthIndicators: {
            isHealthyCohort: { $gte: ['$retentionRate', cohortConfig.healthyRetentionThreshold || 20] },
            engagementLevel: {
              $cond: {
                if: { $gte: ['$eventsPerUser', cohortConfig.highEngagementThreshold || 5] },
                then: 'high',
                else: {
                  $cond: {
                    if: { $gte: ['$eventsPerUser', cohortConfig.mediumEngagementThreshold || 2] },
                    then: 'medium',
                    else: 'low'
                  }
                }
              }
            }
          },

          _id: 0
        }
      },

      // Stage 13: Sort results
      {
        $sort: {
          cohortPeriod: 1,
          periodNumber: 1
        }
      }
    ];
  }

  async createAdvancedRFMAnalysis(collection, rfmConfig) {
    // RFM (Recency, Frequency, Monetary) analysis for customer segmentation
    return [
      // Stage 1: Filter customer transactions
      {
        $match: {
          [rfmConfig.transactionDateField]: {
            $gte: rfmConfig.analysisStartDate,
            $lte: rfmConfig.analysisEndDate
          },
          [rfmConfig.amountField]: { $gt: 0 }
        }
      },

      // Stage 2: Calculate RFM metrics per customer
      {
        $group: {
          _id: `$${rfmConfig.customerIdField}`,

          // Recency: Days since last transaction
          lastTransactionDate: { $max: `$${rfmConfig.transactionDateField}` },

          // Frequency: Number of transactions
          transactionCount: { $sum: 1 },

          // Monetary: Total transaction value
          totalSpent: { $sum: `$${rfmConfig.amountField}` },

          // Additional metrics
          averageTransactionValue: { $avg: `$${rfmConfig.amountField}` },
          firstTransactionDate: { $min: `$${rfmConfig.transactionDateField}` },

          // Transaction patterns
          transactions: {
            $push: {
              date: `$${rfmConfig.transactionDateField}`,
              amount: `$${rfmConfig.amountField}`
            }
          }
        }
      },

      // Stage 3: Calculate recency in days
      {
        $addFields: {
          recencyDays: {
            $floor: {
              $divide: [
                { $subtract: [rfmConfig.currentDate, '$lastTransactionDate'] },
                86400000 // milliseconds in a day
              ]
            }
          },

          customerLifetimeDays: {
            $floor: {
              $divide: [
                { $subtract: ['$lastTransactionDate', '$firstTransactionDate'] },
                86400000
              ]
            }
          }
        }
      },

      // Stage 4: Calculate percentile ranks for scoring using window functions.
      // Each metric needs its own $setWindowFields stage because a stage accepts
      // only one sortBy; the percentile is derived below as (rank - 1) / (n - 1).
      {
        $setWindowFields: {
          sortBy: { recencyDays: 1 },
          output: {
            recencyRank: { $rank: {} },
            customerCount: {
              $count: {},
              window: {
                documents: ['unbounded', 'unbounded']
              }
            }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { transactionCount: 1 },
          output: {
            frequencyRank: { $rank: {} }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { totalSpent: 1 },
          output: {
            monetaryRank: { $rank: {} }
          }
        }
      },

      // Convert ranks to percentile ranks in the 0-1 range
      {
        $addFields: {
          recencyPercentile: {
            $cond: {
              if: { $gt: ['$customerCount', 1] },
              then: { $divide: [{ $subtract: ['$recencyRank', 1] }, { $subtract: ['$customerCount', 1] }] },
              else: 0
            }
          },
          frequencyPercentile: {
            $cond: {
              if: { $gt: ['$customerCount', 1] },
              then: { $divide: [{ $subtract: ['$frequencyRank', 1] }, { $subtract: ['$customerCount', 1] }] },
              else: 0
            }
          },
          monetaryPercentile: {
            $cond: {
              if: { $gt: ['$customerCount', 1] },
              then: { $divide: [{ $subtract: ['$monetaryRank', 1] }, { $subtract: ['$customerCount', 1] }] },
              else: 0
            }
          }
        }
      },

      // Stage 5: Calculate RFM scores (1-5 scale)
      {
        $addFields: {
          recencyScore: {
            $cond: {
              if: { $lte: ['$recencyPercentile', 0.2] },
              then: 5, // Most recent customers get highest score
              else: {
                $cond: {
                  if: { $lte: ['$recencyPercentile', 0.4] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $lte: ['$recencyPercentile', 0.6] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $lte: ['$recencyPercentile', 0.8] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          },

          frequencyScore: {
            $cond: {
              if: { $gte: ['$frequencyPercentile', 0.8] },
              then: 5,
              else: {
                $cond: {
                  if: { $gte: ['$frequencyPercentile', 0.6] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $gte: ['$frequencyPercentile', 0.4] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $gte: ['$frequencyPercentile', 0.2] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          },

          monetaryScore: {
            $cond: {
              if: { $gte: ['$monetaryPercentile', 0.8] },
              then: 5,
              else: {
                $cond: {
                  if: { $gte: ['$monetaryPercentile', 0.6] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $gte: ['$monetaryPercentile', 0.4] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $gte: ['$monetaryPercentile', 0.2] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },

      // Stage 6: Create combined RFM score and segment
      {
        $addFields: {
          rfmScore: {
            $concat: [
              { $toString: '$recencyScore' },
              { $toString: '$frequencyScore' },
              { $toString: '$monetaryScore' }
            ]
          },

          // Calculate overall customer value score
          customerValueScore: {
            $round: [
              {
                $add: [
                  { $multiply: ['$recencyScore', rfmConfig.recencyWeight || 0.3] },
                  { $multiply: ['$frequencyScore', rfmConfig.frequencyWeight || 0.3] },
                  { $multiply: ['$monetaryScore', rfmConfig.monetaryWeight || 0.4] }
                ]
              },
              2
            ]
          }
        }
      },

      // Stage 7: Assign customer segments
      {
        $addFields: {
          customerSegment: {
            $switch: {
              branches: [
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $gte: ['$frequencyScore', 4] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Champions'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 3] },
                      { $gte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Loyal Customers'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $lte: ['$frequencyScore', 2] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Potential Loyalists'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $lte: ['$frequencyScore', 1] },
                      { $lte: ['$monetaryScore', 1] }
                    ]
                  },
                  then: 'New Customers'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 3] },
                      { $lte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Promising'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $gte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Need Attention'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $lte: ['$frequencyScore', 2] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'About to Sleep'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $gte: ['$frequencyScore', 4] },
                      { $lte: ['$monetaryScore', 2] }
                    ]
                  },
                  then: 'At Risk'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 1] },
                      { $gte: ['$frequencyScore', 4] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Cannot Lose Them'
                },
                {
                  case: { 
                    $and: [
                      { $eq: ['$recencyScore', 3] },
                      { $eq: ['$frequencyScore', 1] },
                      { $eq: ['$monetaryScore', 1] }
                    ]
                  },
                  then: 'Hibernating'
                }
              ],
              default: 'Lost Customers'
            }
          }
        }
      },

      // Stage 8: Add customer insights and recommendations
      {
        $addFields: {
          insights: {
            daysSinceLastPurchase: '$recencyDays',
            lifetimeValue: { $round: ['$totalSpent', 2] },
            averageOrderValue: { $round: ['$averageTransactionValue', 2] },
            purchaseFrequency: {
              $cond: {
                if: { $gt: ['$customerLifetimeDays', 0] },
                then: { 
                  $round: [
                    { $divide: ['$transactionCount', { $divide: ['$customerLifetimeDays', 30] }] },
                    2
                  ]
                },
                else: 0
              }
            },

            // Customer lifecycle stage
            lifecycleStage: {
              $cond: {
                if: { $lte: ['$customerLifetimeDays', 30] },
                then: 'New',
                else: {
                  $cond: {
                    if: { $lte: ['$customerLifetimeDays', 180] },
                    then: 'Developing',
                    else: {
                      $cond: {
                        if: { $lte: ['$recencyDays', 90] },
                        then: 'Established',
                        else: 'Declining'
                      }
                    }
                  }
                }
              }
            }
          },

          // Marketing recommendations
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$customerSegment', 'Champions'] },
                  then: ['Reward loyalty', 'VIP treatment', 'Brand advocacy program']
                },
                {
                  case: { $eq: ['$customerSegment', 'New Customers'] },
                  then: ['Onboarding campaign', 'Product education', 'Early engagement']
                },
                {
                  case: { $eq: ['$customerSegment', 'At Risk'] },
                  then: ['Win-back campaign', 'Special offers', 'Survey for feedback']
                },
                {
                  case: { $eq: ['$customerSegment', 'Lost Customers'] },
                  then: ['Aggressive win-back offers', 'Product updates', 'Reactivation campaign']
                }
              ],
              default: ['Standard marketing', 'Regular engagement']
            }
          }
        }
      },

      // Stage 9: Final projection
      {
        $project: {
          customerId: '$_id',
          rfmScores: {
            recency: '$recencyScore',
            frequency: '$frequencyScore',
            monetary: '$monetaryScore',
            combined: '$rfmScore',
            customerValue: '$customerValueScore'
          },
          segment: '$customerSegment',
          insights: '$insights',
          recommendations: '$recommendations',
          rawMetrics: {
            recencyDays: '$recencyDays',
            transactionCount: '$transactionCount',
            totalSpent: { $round: ['$totalSpent', 2] },
            averageTransactionValue: { $round: ['$averageTransactionValue', 2] },
            customerLifetimeDays: '$customerLifetimeDays'
          },
          _id: 0
        }
      },

      // Stage 10: Sort by customer value score
      {
        $sort: {
          'rfmScores.customerValue': -1,
          'rawMetrics.totalSpent': -1
        }
      }
    ];
  }
}
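
For context, here is a minimal usage sketch for the RFM pipeline builder above. The connection string, database, collection, and enclosing class name (AnalyticsPipelineBuilder here) are assumptions for illustration; only createAdvancedRFMAnalysis and its rfmConfig fields come from the code shown.

// Minimal usage sketch (assumed names: AnalyticsPipelineBuilder, db 'analytics', collection 'transactions')
const { MongoClient } = require('mongodb');

async function runRfmAnalysis() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  try {
    const collection = client.db('analytics').collection('transactions');
    const builder = new AnalyticsPipelineBuilder(); // assumed name of the class defined above

    const pipeline = await builder.createAdvancedRFMAnalysis(collection, {
      customerIdField: 'customerId',
      transactionDateField: 'transactionDate',
      amountField: 'amount',
      analysisStartDate: new Date('2024-01-01'),
      analysisEndDate: new Date('2024-12-31'),
      currentDate: new Date(),
      recencyWeight: 0.3,
      frequencyWeight: 0.3,
      monetaryWeight: 0.4
    });

    // allowDiskUse lets the large $group and $setWindowFields stages spill to disk
    const segments = await collection
      .aggregate(pipeline, { allowDiskUse: true, maxTimeMS: 300000 })
      .toArray();

    console.log(`Scored ${segments.length} customers`);
    return segments;
  } finally {
    await client.close();
  }
}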

SQL-Style Aggregation Optimization with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB aggregation optimization:

-- QueryLeaf aggregation optimization with SQL-style syntax

-- Optimized complex analytics query with early filtering
WITH filtered_data AS (
  SELECT *
  FROM orders 
  WHERE order_date >= '2024-01-01'
    AND order_date <= '2024-12-31'
    AND status IN ('completed', 'shipped')
  -- QueryLeaf optimizes this to use compound index on (order_date, status)
),

enriched_data AS (
  SELECT 
    o.*,
    c.region_id,
    c.customer_segment,
    r.region_name,
    oi.product_id,
    oi.quantity,
    oi.unit_price,
    p.category,
    p.subcategory,
    p.cost_basis,

    -- Calculate metrics early in pipeline
    (oi.quantity * oi.unit_price) as item_revenue,
    (oi.quantity * p.cost_basis) as item_cost

  FROM filtered_data o
  -- QueryLeaf optimizes joins with $lookup sub-pipelines
  JOIN customers c ON o.customer_id = c.customer_id
  JOIN regions r ON c.region_id = r.region_id
  CROSS JOIN UNNEST(o.items) AS oi
  JOIN products p ON oi.product_id = p.product_id
),

monthly_aggregates AS (
  SELECT 
    DATE_TRUNC('month', order_date) as month,
    region_name,
    category,
    subcategory,
    customer_segment,

    -- Standard aggregations
    COUNT(*) as order_count,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(item_revenue) as total_revenue,
    SUM(item_cost) as total_cost,
    (SUM(item_revenue) - SUM(item_cost)) as profit,
    AVG(item_revenue) as avg_item_revenue,

    -- Statistical measures  
    STDDEV_POP(item_revenue) as revenue_stddev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY item_revenue) as median_revenue,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY item_revenue) as p95_revenue,

    -- Collect sample for detailed analysis
    ARRAY_AGG(item_revenue ORDER BY item_revenue DESC LIMIT 100) as top_revenues

  FROM enriched_data
  GROUP BY 
    DATE_TRUNC('month', order_date),
    region_name,
    category, 
    subcategory,
    customer_segment
  -- QueryLeaf creates efficient $group stage with proper field projections
)

-- Advanced window functions for trend analysis
SELECT 
  month,
  region_name,
  category,
  subcategory,
  customer_segment,

  -- Core metrics
  order_count,
  unique_customers,
  total_revenue,
  profit,
  ROUND(profit / total_revenue * 100, 2) as profit_margin_pct,
  ROUND(total_revenue / unique_customers, 2) as revenue_per_customer,

  -- Trend analysis using window functions
  LAG(total_revenue, 1) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month
  ) as previous_month_revenue,

  -- Growth calculations
  ROUND(
    ((total_revenue - LAG(total_revenue, 1) OVER (
      PARTITION BY region_name, category, customer_segment 
      ORDER BY month
    )) / LAG(total_revenue, 1) OVER (
      PARTITION BY region_name, category, customer_segment 
      ORDER BY month
    )) * 100, 2
  ) as month_over_month_growth,

  -- Moving averages
  AVG(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) as moving_avg_3month,

  AVG(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
  ) as moving_avg_6month,

  -- Cumulative totals
  SUM(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS UNBOUNDED PRECEDING
  ) as cumulative_revenue,

  -- Rankings and percentiles
  RANK() OVER (
    PARTITION BY month 
    ORDER BY total_revenue DESC
  ) as revenue_rank,

  PERCENT_RANK() OVER (
    PARTITION BY month 
    ORDER BY total_revenue
  ) as revenue_percentile,

  -- Volatility measures
  STDDEV(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
  ) as revenue_volatility,

  -- Min/Max within window
  MIN(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
  ) as window_min,

  MAX(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING  
  ) as window_max,

  -- Position within range
  CASE
    WHEN MAX(total_revenue) OVER (...) - MIN(total_revenue) OVER (...) > 0
    THEN ROUND(
      ((total_revenue - MIN(total_revenue) OVER (...)) / 
       (MAX(total_revenue) OVER (...) - MIN(total_revenue) OVER (...)
      )) * 100, 1
    )
    ELSE 50.0
  END as position_in_range_pct

FROM monthly_aggregates
WHERE month >= '2024-06-01' -- Filter for recent months
ORDER BY month, region_name, category, total_revenue DESC

-- QueryLeaf optimization features:
-- ALLOW_DISK_USE for large aggregations
-- MAX_TIME_MS for timeout control  
-- HINT for index suggestions
-- READ_CONCERN for consistency control
WITH AGGREGATION_OPTIONS (
  ALLOW_DISK_USE = true,
  MAX_TIME_MS = 300000,
  HINT = 'order_date_status_idx',
  READ_CONCERN = 'majority'
);

-- Performance monitoring and optimization
SELECT 
  stage_name,
  execution_time_ms,
  documents_examined,
  documents_returned,
  index_used,
  memory_usage_mb,

  -- Efficiency metrics
  ROUND(documents_returned::FLOAT / documents_examined, 4) as selectivity,
  ROUND(documents_returned / (execution_time_ms / 1000.0), 0) as docs_per_second,

  -- Performance flags
  CASE 
    WHEN execution_time_ms > 30000 THEN 'SLOW_STAGE'
    WHEN documents_examined > documents_returned * 100 THEN 'INEFFICIENT_FILTERING' 
    WHEN NOT index_used AND documents_examined > 10000 THEN 'MISSING_INDEX'
    ELSE 'OPTIMAL'
  END as performance_flag,

  -- Optimization recommendations
  CASE
    WHEN NOT index_used AND documents_examined > 10000 
      THEN 'Add index for this stage'
    WHEN documents_examined > documents_returned * 100 
      THEN 'Move filtering earlier in pipeline'
    WHEN memory_usage_mb > 100 
      THEN 'Consider using allowDiskUse'
    ELSE 'No optimization needed'
  END as recommendation

FROM EXPLAIN_AGGREGATION_PIPELINE('orders', @pipeline_query)
ORDER BY execution_time_ms DESC;

-- Index recommendations based on aggregation patterns
WITH pipeline_analysis AS (
  SELECT 
    collection_name,
    stage_type,
    stage_index,
    field_name,
    operation_type,
    estimated_improvement
  FROM ANALYZE_AGGREGATION_INDEXES(@common_pipelines)
),

index_recommendations AS (
  SELECT 
    collection_name,
    STRING_AGG(field_name, ', ' ORDER BY stage_index) as compound_index_fields,
    COUNT(*) as stages_optimized,
    MAX(estimated_improvement) as max_improvement,
    STRING_AGG(DISTINCT operation_type, ', ') as optimization_types
  FROM pipeline_analysis
  GROUP BY collection_name
)

SELECT 
  collection_name,
  'CREATE INDEX idx_' || REPLACE(compound_index_fields, ', ', '_') || 
  ' ON ' || collection_name || ' (' || compound_index_fields || ')' as create_index_statement,
  stages_optimized,
  max_improvement as estimated_improvement,
  optimization_types,

  -- Priority scoring
  CASE 
    WHEN max_improvement = 'high' AND stages_optimized >= 3 THEN 1
    WHEN max_improvement = 'high' AND stages_optimized >= 2 THEN 2
    WHEN max_improvement = 'medium' AND stages_optimized >= 3 THEN 3
    ELSE 4
  END as priority_rank

FROM index_recommendations
ORDER BY priority_rank, stages_optimized DESC;

-- Memory usage optimization strategies
SELECT 
  pipeline_name,
  total_memory_mb,
  peak_memory_mb,
  documents_processed,

  -- Memory efficiency metrics
  ROUND(peak_memory_mb / (documents_processed / 1000.0), 2) as mb_per_1k_docs,

  -- Memory optimization recommendations
  CASE
    WHEN peak_memory_mb > 500 THEN 'Use allowDiskUse: true'
    WHEN mb_per_1k_docs > 10 THEN 'Reduce projection fields early'
    WHEN documents_processed > 1000000 THEN 'Consider batch processing'
    ELSE 'Memory usage optimal'
  END as memory_recommendation,

  -- Suggested batch size for large datasets
  CASE
    WHEN peak_memory_mb > 1000 THEN 10000
    WHEN peak_memory_mb > 500 THEN 25000  
    WHEN peak_memory_mb > 100 THEN 50000
    ELSE NULL
  END as suggested_batch_size

FROM PIPELINE_PERFORMANCE_METRICS()
WHERE total_memory_mb > 50 -- Focus on memory-intensive pipelines
ORDER BY peak_memory_mb DESC;

-- QueryLeaf aggregation optimization provides:
-- 1. Automatic pipeline stage reordering for optimal performance
-- 2. Index usage hints and recommendations
-- 3. Memory management with disk spilling controls
-- 4. Window function optimization with efficient partitioning
-- 5. Early filtering and projection optimization
-- 6. Compound index recommendations based on pipeline analysis
-- 7. Performance monitoring and bottleneck identification
-- 8. Batch processing strategies for large datasets
-- 9. SQL-familiar syntax for complex analytical operations
-- 10. Integration with MongoDB's native aggregation performance features
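
The execution options in the WITH AGGREGATION_OPTIONS clause above correspond to options on MongoDB's native aggregate() call. The sketch below shows an assumed equivalent using the Node.js driver; the option names are the driver's, and the index name is the one used in the HINT example.

// Rough driver-level equivalent of ALLOW_DISK_USE / MAX_TIME_MS / HINT / READ_CONCERN
async function runWithOptions(collection, pipeline) {
  return collection
    .aggregate(pipeline, {
      allowDiskUse: true,                 // ALLOW_DISK_USE: spill large stages to disk
      maxTimeMS: 300000,                  // MAX_TIME_MS: abort server-side after 5 minutes
      hint: 'order_date_status_idx',      // HINT: index name from the example above
      readConcern: { level: 'majority' }  // READ_CONCERN: read majority-committed data
    })
    .toArray();
}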

Best Practices for Aggregation Pipeline Optimization

Performance Design Guidelines

Essential practices for high-performance aggregation pipelines (a brief pipeline sketch illustrating practices 1, 5, and 6 follows the list):

  1. Early Filtering: Move $match stages as early as possible to reduce data volume
  2. Index Utilization: Design compound indexes specifically for aggregation patterns
  3. Memory Management: Use allowDiskUse: true for large datasets
  4. Stage Ordering: Optimize stage sequence to minimize document flow
  5. Projection Optimization: Project only necessary fields at each stage
  6. Lookup Efficiency: Use sub-pipelines in $lookup to reduce data transfer
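
A minimal pipeline sketch of practices 1, 5, and 6; the collection and field names are illustrative assumptions:

const pipeline = [
  // 1. Early filtering: $match first so a compound index on { orderDate: 1, status: 1 } can be used
  { $match: { orderDate: { $gte: new Date('2024-01-01') }, status: 'completed' } },

  // 5. Projection optimization: carry only the fields later stages need
  { $project: { customerId: 1, orderDate: 1, total: 1 } },

  // 6. Lookup efficiency: filter and project inside the $lookup sub-pipeline
  {
    $lookup: {
      from: 'customers',
      let: { cid: '$customerId' },
      pipeline: [
        { $match: { $expr: { $eq: ['$_id', '$$cid'] } } },
        { $project: { segment: 1, region: 1 } }
      ],
      as: 'customer'
    }
  },
  { $unwind: '$customer' },

  // Aggregate over the reduced document stream
  { $group: { _id: '$customer.segment', revenue: { $sum: '$total' }, orders: { $sum: 1 } } }
];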

Monitoring and Optimization

Implement comprehensive performance monitoring (a short explain() sketch follows the list):

  1. Execution Analysis: Use explain() to identify bottlenecks and inefficiencies
  2. Memory Tracking: Monitor memory usage patterns and disk spilling
  3. Index Usage: Verify optimal index utilization across pipeline stages
  4. Performance Metrics: Track execution times and document processing rates
  5. Resource Utilization: Monitor CPU, memory, and I/O during aggregations
  6. Benchmark Comparison: Establish performance baselines and track improvements
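
A short sketch of the first practice, running explain() on an aggregation with the Node.js driver. The 'orders' collection and the pipeline argument are assumptions, and the exact shape of the explain output varies by server version and topology:

async function analyzePipeline(db, pipeline) {
  const plan = await db
    .collection('orders')
    .aggregate(pipeline, { allowDiskUse: true })
    .explain('executionStats');

  // Pull out the headline numbers where the standard single-node layout applies
  const stats =
    (plan.stages && plan.stages[0] && plan.stages[0].$cursor && plan.stages[0].$cursor.executionStats) ||
    plan.executionStats ||
    {};

  console.log({
    totalDocsExamined: stats.totalDocsExamined,
    totalKeysExamined: stats.totalKeysExamined,
    nReturned: stats.nReturned,
    executionTimeMillis: stats.executionTimeMillis
  });

  return plan;
}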

Conclusion

MongoDB aggregation pipeline optimization requires a strategic approach to stage ordering, memory management, and index design. Unlike traditional SQL query optimization, which relies on an automated query planner, MongoDB aggregation optimization demands an understanding of pipeline execution, data flow patterns, and resource utilization characteristics.

Key optimization benefits include:

  • Predictable Performance: Optimized pipelines deliver consistent execution times regardless of data growth
  • Efficient Resource Usage: Strategic memory management and disk spilling prevent resource exhaustion
  • Scalable Analytics: Proper optimization enables complex analytics on large datasets
  • Index Integration: Strategic indexing dramatically improves pipeline performance
  • Flexible Processing: Support for complex analytical operations with optimal resource usage

Whether you're building real-time analytics platforms, business intelligence systems, or complex data transformation pipelines, MongoDB aggregation optimization with QueryLeaf's familiar SQL interface provides the foundation for high-performance analytical processing. This combination enables you to implement sophisticated analytics solutions while preserving familiar query patterns and optimization approaches.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipeline execution through intelligent stage reordering, index recommendations, and memory management while providing SQL-familiar syntax for complex analytical operations. Advanced window functions, statistical calculations, and performance monitoring are seamlessly handled through familiar SQL patterns, making high-performance analytics both powerful and accessible.

The integration of sophisticated aggregation optimization with SQL-style analytics makes MongoDB an ideal platform for applications requiring both complex analytical processing and familiar database interaction patterns, ensuring your analytics solutions remain both performant and maintainable as they scale and evolve.