Skip to content

MongoDB Embedded Documents vs References: Data Modeling Patterns and Performance Optimization for Enterprise Applications

Modern applications require sophisticated data modeling strategies that balance query performance, data consistency, and schema flexibility across complex relationships and evolving business requirements. Traditional relational databases force all relationships through normalized foreign key structures that often create performance bottlenecks, complex joins, and rigid schemas that resist change as applications evolve and business requirements shift.

MongoDB's document-oriented architecture provides powerful flexibility in how relationships are modeled, offering both embedded document patterns that co-locate related data within single documents and reference patterns that maintain relationships through document identifiers. Understanding when to embed versus when to reference is crucial for designing scalable, performant applications that can adapt to changing requirements while maintaining optimal query performance and data consistency.

The Traditional Relational Normalization Challenge

Conventional relational database modeling relies heavily on normalization principles that create complex join-heavy queries and performance challenges:

-- Traditional PostgreSQL normalized schema with complex relationship management overhead

-- User profile management with multiple related entities requiring joins
CREATE TABLE users (
    user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    username VARCHAR(100) UNIQUE NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    -- Basic user metadata
    date_of_birth DATE,
    phone_number VARCHAR(20),
    status VARCHAR(20) DEFAULT 'active',

    CONSTRAINT valid_status CHECK (status IN ('active', 'inactive', 'suspended', 'deleted'))
);

-- User addresses requiring separate table and joins for access
CREATE TABLE user_addresses (
    address_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    address_type VARCHAR(20) NOT NULL DEFAULT 'home',

    -- Address components
    street_address VARCHAR(500) NOT NULL,
    apartment_unit VARCHAR(100),
    city VARCHAR(100) NOT NULL,
    state_province VARCHAR(100),
    postal_code VARCHAR(20) NOT NULL,
    country VARCHAR(3) NOT NULL DEFAULT 'USA',

    -- Address metadata
    is_primary BOOLEAN DEFAULT FALSE,
    is_billing BOOLEAN DEFAULT FALSE,
    is_shipping BOOLEAN DEFAULT FALSE,

    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT valid_address_type CHECK (address_type IN ('home', 'work', 'billing', 'shipping', 'other'))
);

-- User preferences requiring separate storage and complex queries
CREATE TABLE user_preferences (
    preference_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    preference_category VARCHAR(50) NOT NULL,
    preference_key VARCHAR(100) NOT NULL,
    preference_value JSONB NOT NULL,

    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    UNIQUE (user_id, preference_category, preference_key),
    CONSTRAINT valid_category CHECK (preference_category IN (
        'notifications', 'display', 'privacy', 'content', 'accessibility'
    ))
);

-- User social connections with bidirectional relationship complexity
CREATE TABLE user_connections (
    connection_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    requester_user_id UUID NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    requested_user_id UUID NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    connection_type VARCHAR(30) NOT NULL DEFAULT 'friend',
    connection_status VARCHAR(20) NOT NULL DEFAULT 'pending',

    -- Connection metadata
    requested_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    responded_at TIMESTAMP WITH TIME ZONE,
    last_interaction_at TIMESTAMP WITH TIME ZONE,

    -- Connection details
    connection_strength INTEGER DEFAULT 1 CHECK (connection_strength BETWEEN 1 AND 10),
    mutual_connections INTEGER DEFAULT 0,
    shared_interests TEXT[],

    CONSTRAINT no_self_connection CHECK (requester_user_id != requested_user_id),
    CONSTRAINT valid_connection_type CHECK (connection_type IN (
        'friend', 'family', 'colleague', 'acquaintance', 'blocked'
    )),
    CONSTRAINT valid_status CHECK (connection_status IN (
        'pending', 'accepted', 'declined', 'blocked', 'removed'
    )),

    UNIQUE (requester_user_id, requested_user_id, connection_type)
);

-- User activity tracking requiring separate table with heavy join overhead
CREATE TABLE user_activities (
    activity_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    activity_type VARCHAR(50) NOT NULL,
    activity_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    -- Activity details
    activity_data JSONB NOT NULL DEFAULT '{}',
    activity_source VARCHAR(50) DEFAULT 'web',
    ip_address INET,
    user_agent TEXT,

    -- Context information
    session_id VARCHAR(100),
    page_url TEXT,
    referrer_url TEXT,

    -- Performance tracking
    response_time_ms INTEGER,
    error_occurred BOOLEAN DEFAULT FALSE,
    error_details JSONB,

    CONSTRAINT valid_activity_type CHECK (activity_type IN (
        'login', 'logout', 'page_view', 'action_performed', 'data_modified', 'error_occurred'
    )),
    CONSTRAINT valid_source CHECK (activity_source IN ('web', 'mobile', 'api', 'system'))
);

-- Complex query requiring multiple joins for complete user profile
CREATE OR REPLACE VIEW complete_user_profiles AS
SELECT 
    u.user_id,
    u.email,
    u.username,
    u.first_name,
    u.last_name,
    u.date_of_birth,
    u.phone_number,
    u.status,
    u.created_at,

    -- Primary address information (requires join)
    primary_addr.street_address as primary_street,
    primary_addr.city as primary_city,
    primary_addr.state_province as primary_state,
    primary_addr.postal_code as primary_postal,
    primary_addr.country as primary_country,

    -- Aggregated address count
    COALESCE(addr_counts.total_addresses, 0) as total_addresses,

    -- Connection statistics (expensive aggregation)
    COALESCE(conn_stats.total_connections, 0) as total_connections,
    COALESCE(conn_stats.pending_requests, 0) as pending_requests,
    COALESCE(conn_stats.accepted_connections, 0) as accepted_connections,

    -- Recent activity summary (expensive aggregation with time windows)
    COALESCE(activity_stats.total_activities_7d, 0) as activities_last_7_days,
    COALESCE(activity_stats.last_login, null) as last_login_time,
    COALESCE(activity_stats.last_activity, null) as last_activity_time,

    -- Preference counts (requires additional join)
    COALESCE(pref_counts.total_preferences, 0) as total_preferences

FROM users u

-- Left join for primary address (performance impact)
LEFT JOIN user_addresses primary_addr ON u.user_id = primary_addr.user_id 
    AND primary_addr.is_primary = TRUE

-- Subquery for address counts (additional performance overhead)
LEFT JOIN (
    SELECT user_id, COUNT(*) as total_addresses
    FROM user_addresses
    GROUP BY user_id
) addr_counts ON u.user_id = addr_counts.user_id

-- Complex subquery for connection statistics
LEFT JOIN (
    SELECT 
        COALESCE(requester_user_id, requested_user_id) as user_id,
        COUNT(*) as total_connections,
        COUNT(*) FILTER (WHERE connection_status = 'pending') as pending_requests,
        COUNT(*) FILTER (WHERE connection_status = 'accepted') as accepted_connections
    FROM user_connections
    GROUP BY COALESCE(requester_user_id, requested_user_id)
) conn_stats ON u.user_id = conn_stats.user_id

-- Time-based activity aggregation (expensive computation)
LEFT JOIN (
    SELECT 
        user_id,
        COUNT(*) FILTER (WHERE activity_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days') as total_activities_7d,
        MAX(activity_timestamp) FILTER (WHERE activity_type = 'login') as last_login,
        MAX(activity_timestamp) as last_activity
    FROM user_activities
    GROUP BY user_id
) activity_stats ON u.user_id = activity_stats.user_id

-- Preference aggregation
LEFT JOIN (
    SELECT user_id, COUNT(*) as total_preferences
    FROM user_preferences
    GROUP BY user_id
) pref_counts ON u.user_id = pref_counts.user_id;

-- Performance analysis of complex join-heavy queries
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM complete_user_profiles 
WHERE status = 'active' 
AND total_connections > 10 
ORDER BY last_activity_time DESC NULLS LAST
LIMIT 20;

-- Complex friend recommendation query with multiple joins and aggregations
WITH friend_recommendations AS (
    SELECT DISTINCT
        u1.user_id as target_user_id,
        u2.user_id as recommended_user_id,
        u2.first_name,
        u2.last_name,
        u2.username,

        -- Mutual connections calculation (expensive)
        mutual_stats.mutual_count,

        -- Shared interests analysis
        CASE 
            WHEN EXISTS (
                SELECT 1 FROM user_connections uc1
                JOIN user_connections uc2 ON uc1.requested_user_id = uc2.requester_user_id
                WHERE uc1.requester_user_id = u1.user_id 
                AND uc2.requested_user_id = u2.user_id
                AND uc1.shared_interests && uc2.shared_interests
            ) THEN TRUE ELSE FALSE
        END as has_shared_interests,

        -- Activity similarity
        CASE 
            WHEN EXISTS (
                SELECT 1 FROM user_activities ua1
                JOIN user_activities ua2 ON ua1.activity_type = ua2.activity_type
                WHERE ua1.user_id = u1.user_id 
                AND ua2.user_id = u2.user_id
                AND ua1.activity_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
                AND ua2.activity_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
                GROUP BY ua1.activity_type 
                HAVING COUNT(*) > 5
            ) THEN TRUE ELSE FALSE
        END as similar_activity_patterns,

        -- Geographic proximity (if addresses available)
        CASE 
            WHEN addr1.city = addr2.city AND addr1.state_province = addr2.state_province 
            THEN TRUE ELSE FALSE
        END as same_geographic_area,

        -- Recommendation score calculation
        (
            COALESCE(mutual_stats.mutual_count, 0) * 3 +
            CASE WHEN has_shared_interests THEN 2 ELSE 0 END +
            CASE WHEN similar_activity_patterns THEN 2 ELSE 0 END +
            CASE WHEN same_geographic_area THEN 1 ELSE 0 END
        ) as recommendation_score

    FROM users u1
    CROSS JOIN users u2

    -- Ensure not already connected
    LEFT JOIN user_connections existing_conn ON (
        (existing_conn.requester_user_id = u1.user_id AND existing_conn.requested_user_id = u2.user_id) OR
        (existing_conn.requester_user_id = u2.user_id AND existing_conn.requested_user_id = u1.user_id)
    )

    -- Mutual connections calculation (very expensive subquery)
    LEFT JOIN (
        SELECT 
            uc1.requester_user_id as user1_id,
            uc2.requester_user_id as user2_id,
            COUNT(*) as mutual_count
        FROM user_connections uc1
        JOIN user_connections uc2 ON uc1.requested_user_id = uc2.requested_user_id
        WHERE uc1.connection_status = 'accepted' 
        AND uc2.connection_status = 'accepted'
        AND uc1.requester_user_id != uc2.requester_user_id
        GROUP BY uc1.requester_user_id, uc2.requester_user_id
    ) mutual_stats ON mutual_stats.user1_id = u1.user_id AND mutual_stats.user2_id = u2.user_id

    -- Address proximity joins
    LEFT JOIN user_addresses addr1 ON u1.user_id = addr1.user_id AND addr1.is_primary = TRUE
    LEFT JOIN user_addresses addr2 ON u2.user_id = addr2.user_id AND addr2.is_primary = TRUE

    WHERE u1.user_id != u2.user_id
    AND u1.status = 'active'
    AND u2.status = 'active'
    AND existing_conn.connection_id IS NULL -- Not already connected
)

SELECT 
    target_user_id,
    recommended_user_id,
    first_name,
    last_name,
    username,
    mutual_count,
    recommendation_score,
    has_shared_interests,
    similar_activity_patterns,
    same_geographic_area,

    -- Ranking within recommendations for this user
    ROW_NUMBER() OVER (
        PARTITION BY target_user_id 
        ORDER BY recommendation_score DESC, mutual_count DESC
    ) as recommendation_rank

FROM friend_recommendations
WHERE recommendation_score > 0
ORDER BY target_user_id, recommendation_score DESC;

-- Problems with traditional normalized relational modeling:
-- 1. Complex multi-table joins required for basic user profile queries affecting performance
-- 2. Expensive aggregation queries across multiple related tables with poor scalability  
-- 3. Rigid schema structure requiring ALTER TABLE operations for new fields
-- 4. Foreign key constraint management overhead affecting insert/update performance
-- 5. Complex query optimization challenges with multiple join paths and aggregations
-- 6. Difficulty modeling variable or optional relationship structures
-- 7. Performance degradation as related data volume increases due to join complexity
-- 8. Complex application code required to reconstruct related objects from multiple tables
-- 9. Limited ability to co-locate frequently accessed related data for optimal performance
-- 10. Expensive view materialization and maintenance for denormalized query patterns

MongoDB provides flexible document modeling patterns that optimize for query performance and data access patterns:

// MongoDB Document Modeling - Flexible embedded and reference patterns for optimal performance
const { MongoClient, ObjectId } = require('mongodb');

// Advanced MongoDB Document Modeling Manager for Enterprise Data Relationship Optimization
class AdvancedDocumentModelingManager {
  constructor(client, config = {}) {
    this.client = client;
    this.db = client.db(config.database || 'enterprise_application');

    this.config = {
      // Modeling configuration
      enableEmbeddedOptimization: config.enableEmbeddedOptimization !== false,
      enableReferenceOptimization: config.enableReferenceOptimization !== false,
      enableHybridModeling: config.enableHybridModeling !== false,

      // Performance optimization
      enableQueryOptimization: config.enableQueryOptimization !== false,
      enableIndexOptimization: config.enableIndexOptimization !== false,
      enableAggregationOptimization: config.enableAggregationOptimization !== false,

      // Data consistency
      enableConsistencyValidation: config.enableConsistencyValidation !== false,
      enableReferentialIntegrity: config.enableReferentialIntegrity !== false,
      enableDataSynchronization: config.enableDataSynchronization !== false,

      // Monitoring and analytics
      enablePerformanceMonitoring: config.enablePerformanceMonitoring !== false,
      enableQueryAnalytics: config.enableQueryAnalytics !== false,
      enableDocumentSizeMonitoring: config.enableDocumentSizeMonitoring !== false
    };

    // Modeling strategy tracking
    this.modelingStrategies = new Map();
    this.performanceMetrics = new Map();
    this.queryPatterns = new Map();

    this.initializeModelingManager();
  }

  async initializeModelingManager() {
    console.log('Initializing Advanced Document Modeling Manager...');

    try {
      // Setup embedded document patterns
      await this.setupEmbeddedDocumentPatterns();

      // Setup reference patterns
      await this.setupReferencePatterns();

      // Setup hybrid modeling patterns
      await this.setupHybridModelingPatterns();

      // Initialize performance monitoring
      if (this.config.enablePerformanceMonitoring) {
        await this.initializePerformanceMonitoring();
      }

      console.log('Document Modeling Manager initialized successfully');

    } catch (error) {
      console.error('Error initializing document modeling manager:', error);
      throw error;
    }
  }

  async setupEmbeddedDocumentPatterns() {
    console.log('Setting up embedded document modeling patterns...');

    try {
      // User profile with embedded addresses and preferences - optimal for frequent co-access
      const userProfilesCollection = this.db.collection('user_profiles_embedded');

      // Create optimized indexes for embedded document queries
      await userProfilesCollection.createIndexes([
        { key: { email: 1 }, unique: true, background: true },
        { key: { username: 1 }, unique: true, background: true },
        { key: { 'addresses.type': 1, 'addresses.isPrimary': 1 }, background: true },
        { key: { 'preferences.category': 1, 'preferences.key': 1 }, background: true },
        { key: { status: 1, lastActivityAt: -1 }, background: true }
      ]);

      this.modelingStrategies.set('user_profiles_embedded', {
        collection: userProfilesCollection,
        pattern: 'embedded_documents',
        useCase: 'frequently_accessed_related_data',
        benefits: [
          'Single query for complete user profile',
          'Atomic updates for user and related data',
          'No joins required for common queries',
          'Optimal performance for read-heavy workloads'
        ],
        considerations: [
          'Document size growth with related data',
          'Potential for data duplication',
          'Complex update operations for nested data'
        ],
        queryOptimization: {
          primaryQueries: ['find_by_user_id', 'find_by_email', 'find_with_addresses'],
          indexStrategy: 'compound_indexes_for_embedded_fields',
          projectionStrategy: 'selective_field_projection'
        }
      });

      // Order documents with embedded line items - transactional consistency
      const ordersCollection = this.db.collection('orders_embedded');

      await ordersCollection.createIndexes([
        { key: { customerId: 1, orderDate: -1 }, background: true },
        { key: { orderStatus: 1, orderDate: -1 }, background: true },
        { key: { 'items.productId': 1 }, background: true },
        { key: { 'items.category': 1, orderDate: -1 }, background: true },
        { key: { totalAmount: 1 }, background: true }
      ]);

      this.modelingStrategies.set('orders_embedded', {
        collection: ordersCollection,
        pattern: 'embedded_array_documents',
        useCase: 'transactional_consistency_required',
        benefits: [
          'ACID guarantees for order and line items',
          'Single document queries for complete orders',
          'Efficient aggregation across order items',
          'Simplified application logic'
        ],
        considerations: [
          'Document size with many line items',
          'Array index performance for large arrays',
          'Memory usage for large embedded arrays'
        ]
      });

      console.log('Embedded document patterns configured successfully');

    } catch (error) {
      console.error('Error setting up embedded document patterns:', error);
      throw error;
    }
  }

  async setupReferencePatterns() {
    console.log('Setting up reference modeling patterns...');

    try {
      // User collection with references to separate related collections
      const usersCollection = this.db.collection('users_referenced');
      const addressesCollection = this.db.collection('user_addresses_referenced');
      const activitiesCollection = this.db.collection('user_activities_referenced');

      // User collection indexes
      await usersCollection.createIndexes([
        { key: { email: 1 }, unique: true, background: true },
        { key: { username: 1 }, unique: true, background: true },
        { key: { status: 1, createdAt: -1 }, background: true }
      ]);

      // Address collection with user references
      await addressesCollection.createIndexes([
        { key: { userId: 1, type: 1 }, background: true },
        { key: { userId: 1, isPrimary: 1 }, background: true },
        { key: { city: 1, stateProvince: 1 }, background: true }
      ]);

      // Activity collection with user references and time-based queries
      await activitiesCollection.createIndexes([
        { key: { userId: 1, timestamp: -1 }, background: true },
        { key: { activityType: 1, timestamp: -1 }, background: true },
        { key: { timestamp: -1 }, background: true }
      ]);

      this.modelingStrategies.set('users_referenced', {
        collections: {
          users: usersCollection,
          addresses: addressesCollection,  
          activities: activitiesCollection
        },
        pattern: 'normalized_references',
        useCase: 'independent_entity_management',
        benefits: [
          'Normalized data structure reduces duplication',
          'Independent scaling of related collections',
          'Flexible querying of individual entity types',
          'Efficient updates to specific data types'
        ],
        considerations: [
          'Multiple queries required for complete data',
          'Application-level join complexity',
          'Potential consistency challenges',
          'Network round-trips for related data'
        ],
        queryOptimization: {
          primaryQueries: ['find_user_with_addresses', 'find_user_activities', 'aggregate_user_data'],
          joinStrategy: 'application_level_population',
          indexStrategy: 'reference_field_optimization'
        }
      });

      console.log('Reference patterns configured successfully');

    } catch (error) {
      console.error('Error setting up reference patterns:', error);
      throw error;
    }
  }

  async setupHybridModelingPatterns() {
    console.log('Setting up hybrid modeling patterns...');

    try {
      // Blog posts with embedded metadata and referenced comments
      const blogPostsCollection = this.db.collection('blog_posts_hybrid');
      const commentsCollection = this.db.collection('blog_comments_hybrid');

      await blogPostsCollection.createIndexes([
        { key: { authorId: 1, publishedAt: -1 }, background: true },
        { key: { 'tags.name': 1, publishedAt: -1 }, background: true },
        { key: { status: 1, publishedAt: -1 }, background: true },
        { key: { 'metadata.category': 1 }, background: true }
      ]);

      await commentsCollection.createIndexes([
        { key: { postId: 1, createdAt: -1 }, background: true },
        { key: { authorId: 1, createdAt: -1 }, background: true },
        { key: { status: 1, createdAt: -1 }, background: true }
      ]);

      this.modelingStrategies.set('blog_posts_hybrid', {
        collections: {
          posts: blogPostsCollection,
          comments: commentsCollection
        },
        pattern: 'hybrid_embedded_and_referenced',
        useCase: 'mixed_access_patterns',
        benefits: [
          'Optimized for different query patterns',
          'Embedded data for frequent access',
          'Referenced data for independent management',
          'Balanced performance and flexibility'
        ],
        considerations: [
          'Complex modeling decisions',
          'Mixed query strategies required',
          'Potential data consistency complexity'
        ]
      });

      console.log('Hybrid modeling patterns configured successfully');

    } catch (error) {
      console.error('Error setting up hybrid patterns:', error);
      throw error;
    }
  }

  async createEmbeddedUserProfile(userData) {
    console.log('Creating user profile with embedded document pattern...');

    try {
      const userProfilesCollection = this.modelingStrategies.get('user_profiles_embedded').collection;

      const embeddedProfile = {
        _id: new ObjectId(),
        email: userData.email,
        username: userData.username,
        firstName: userData.firstName,
        lastName: userData.lastName,
        phoneNumber: userData.phoneNumber,
        dateOfBirth: userData.dateOfBirth,
        status: 'active',

        // Embedded addresses for optimal co-access
        addresses: userData.addresses?.map(addr => ({
          _id: new ObjectId(),
          type: addr.type,
          streetAddress: addr.streetAddress,
          apartmentUnit: addr.apartmentUnit,
          city: addr.city,
          stateProvince: addr.stateProvince,
          postalCode: addr.postalCode,
          country: addr.country,
          isPrimary: addr.isPrimary || false,
          isBilling: addr.isBilling || false,
          isShipping: addr.isShipping || false,
          createdAt: new Date(),
          updatedAt: new Date()
        })) || [],

        // Embedded preferences for atomic updates
        preferences: userData.preferences?.map(pref => ({
          _id: new ObjectId(),
          category: pref.category,
          key: pref.key,
          value: pref.value,
          dataType: pref.dataType || 'string',
          createdAt: new Date(),
          updatedAt: new Date()
        })) || [],

        // Embedded profile metadata
        profileMetadata: {
          theme: userData.theme || 'light',
          language: userData.language || 'en',
          timezone: userData.timezone || 'UTC',
          notificationSettings: {
            email: userData.emailNotifications !== false,
            push: userData.pushNotifications !== false,
            sms: userData.smsNotifications || false
          },
          privacySettings: {
            profileVisibility: userData.profileVisibility || 'public',
            allowDirectMessages: userData.allowDirectMessages !== false,
            shareActivityStatus: userData.shareActivityStatus !== false
          }
        },

        // Activity summary (embedded for performance)
        activitySummary: {
          totalLogins: 0,
          lastLoginAt: null,
          lastActivityAt: new Date(),
          accountCreatedAt: new Date(),
          profileCompletionScore: this.calculateProfileCompleteness(userData)
        },

        // Audit information
        createdAt: new Date(),
        updatedAt: new Date(),
        version: 1
      };

      const result = await userProfilesCollection.insertOne(embeddedProfile);

      // Update performance metrics
      await this.updateModelingMetrics('user_profiles_embedded', 'create', embeddedProfile);

      console.log(`Embedded user profile created: ${result.insertedId}`);

      return {
        userId: result.insertedId,
        modelingPattern: 'embedded_documents',
        documentsCreated: 1,
        queryOptimized: true,
        atomicUpdates: true
      };

    } catch (error) {
      console.error('Error creating embedded user profile:', error);
      throw error;
    }
  }

  async createReferencedUserProfile(userData) {
    console.log('Creating user profile with reference pattern...');

    try {
      const usersCollection = this.modelingStrategies.get('users_referenced').collections.users;
      const addressesCollection = this.modelingStrategies.get('users_referenced').collections.addresses;

      // Create main user document
      const userDocument = {
        _id: new ObjectId(),
        email: userData.email,
        username: userData.username,
        firstName: userData.firstName,
        lastName: userData.lastName,
        phoneNumber: userData.phoneNumber,
        dateOfBirth: userData.dateOfBirth,
        status: 'active',

        // Basic profile information
        profileMetadata: {
          theme: userData.theme || 'light',
          language: userData.language || 'en',
          timezone: userData.timezone || 'UTC'
        },

        createdAt: new Date(),
        updatedAt: new Date(),
        version: 1
      };

      const userResult = await usersCollection.insertOne(userDocument);
      const userId = userResult.insertedId;

      // Create referenced address documents
      const addressDocuments = userData.addresses?.map(addr => ({
        _id: new ObjectId(),
        userId: userId,
        type: addr.type,
        streetAddress: addr.streetAddress,
        apartmentUnit: addr.apartmentUnit,
        city: addr.city,
        stateProvince: addr.stateProvince,
        postalCode: addr.postalCode,
        country: addr.country,
        isPrimary: addr.isPrimary || false,
        isBilling: addr.isBilling || false,
        isShipping: addr.isShipping || false,
        createdAt: new Date(),
        updatedAt: new Date()
      })) || [];

      let addressResults = null;
      if (addressDocuments.length > 0) {
        addressResults = await addressesCollection.insertMany(addressDocuments);
      }

      // Update performance metrics
      await this.updateModelingMetrics('users_referenced', 'create', {
        mainDocument: userDocument,
        referencedDocuments: addressDocuments
      });

      console.log(`Referenced user profile created: ${userId} with ${addressDocuments.length} addresses`);

      return {
        userId: userId,
        modelingPattern: 'normalized_references',
        documentsCreated: 1 + addressDocuments.length,
        addressIds: addressResults ? Object.values(addressResults.insertedIds) : [],
        queryOptimized: false, // Requires joins
        normalizedStructure: true
      };

    } catch (error) {
      console.error('Error creating referenced user profile:', error);
      throw error;
    }
  }

  async getUserProfileEmbedded(userId, options = {}) {
    console.log(`Retrieving embedded user profile: ${userId}`);

    try {
      const userProfilesCollection = this.modelingStrategies.get('user_profiles_embedded').collection;

      // Single query for complete profile - optimal performance
      const projection = options.fields ? this.buildProjection(options.fields) : {};

      const profile = await userProfilesCollection.findOne(
        { _id: new ObjectId(userId) },
        { projection }
      );

      if (!profile) {
        throw new Error(`User profile not found: ${userId}`);
      }

      // Update query metrics
      await this.updateQueryMetrics('user_profiles_embedded', 'single_document_query', {
        documentsReturned: 1,
        queryTime: Date.now(),
        projectionUsed: Object.keys(projection).length > 0
      });

      console.log(`Embedded profile retrieved: ${userId} (single query)`);

      return {
        profile: profile,
        modelingPattern: 'embedded_documents',
        queriesExecuted: 1,
        performanceOptimized: true,
        dataConsistency: 'guaranteed'
      };

    } catch (error) {
      console.error(`Error retrieving embedded profile ${userId}:`, error);
      throw error;
    }
  }

  async getUserProfileReferenced(userId, options = {}) {
    console.log(`Retrieving referenced user profile: ${userId}`);

    try {
      const collections = this.modelingStrategies.get('users_referenced').collections;

      // Multiple queries required for complete profile
      const queries = [];

      // Main user query
      queries.push(
        collections.users.findOne({ _id: new ObjectId(userId) })
      );

      // Related data queries
      if (!options.userOnly) {
        queries.push(
          collections.addresses.find({ userId: new ObjectId(userId) }).toArray()
        );
      }

      const [userDoc, addressDocs] = await Promise.all(queries);

      if (!userDoc) {
        throw new Error(`User not found: ${userId}`);
      }

      // Construct complete profile from multiple documents
      const completeProfile = {
        ...userDoc,
        addresses: addressDocs || [],

        // Derived fields
        primaryAddress: addressDocs?.find(addr => addr.isPrimary),
        addressCount: addressDocs?.length || 0
      };

      // Update query metrics
      await this.updateQueryMetrics('users_referenced', 'multi_document_query', {
        documentsReturned: 1 + (addressDocs?.length || 0),
        queriesExecuted: queries.length,
        queryTime: Date.now()
      });

      console.log(`Referenced profile retrieved: ${userId} (${queries.length} queries)`);

      return {
        profile: completeProfile,
        modelingPattern: 'normalized_references', 
        queriesExecuted: queries.length,
        performanceOptimized: false,
        dataConsistency: 'eventual'
      };

    } catch (error) {
      console.error(`Error retrieving referenced profile ${userId}:`, error);
      throw error;
    }
  }

  async updateEmbeddedUserAddress(userId, addressId, updateData) {
    console.log(`Updating embedded user address: ${userId}, ${addressId}`);

    try {
      const userProfilesCollection = this.modelingStrategies.get('user_profiles_embedded').collection;

      // Atomic update of embedded address document
      const updateFields = {};
      Object.keys(updateData).forEach(key => {
        updateFields[`addresses.$.${key}`] = updateData[key];
      });
      updateFields['addresses.$.updatedAt'] = new Date();
      updateFields['updatedAt'] = new Date();

      const result = await userProfilesCollection.updateOne(
        { 
          _id: new ObjectId(userId), 
          'addresses._id': new ObjectId(addressId) 
        },
        { 
          $set: updateFields,
          $inc: { version: 1 }
        }
      );

      if (result.matchedCount === 0) {
        throw new Error(`Address not found: ${addressId} for user ${userId}`);
      }

      console.log(`Embedded address updated: ${addressId} (atomic operation)`);

      return {
        addressId: addressId,
        modelingPattern: 'embedded_documents',
        atomicUpdate: true,
        documentsModified: result.modifiedCount,
        consistencyGuaranteed: true
      };

    } catch (error) {
      console.error(`Error updating embedded address:`, error);
      throw error;
    }
  }

  async updateReferencedUserAddress(userId, addressId, updateData) {
    console.log(`Updating referenced user address: ${userId}, ${addressId}`);

    try {
      const addressesCollection = this.modelingStrategies.get('users_referenced').collections.addresses;

      // Update referenced address document
      const result = await addressesCollection.updateOne(
        { 
          _id: new ObjectId(addressId),
          userId: new ObjectId(userId) 
        },
        { 
          $set: {
            ...updateData,
            updatedAt: new Date()
          }
        }
      );

      if (result.matchedCount === 0) {
        throw new Error(`Address not found: ${addressId} for user ${userId}`);
      }

      // Potentially update user document timestamp (separate operation)
      const usersCollection = this.modelingStrategies.get('users_referenced').collections.users;
      await usersCollection.updateOne(
        { _id: new ObjectId(userId) },
        { $set: { updatedAt: new Date() } }
      );

      console.log(`Referenced address updated: ${addressId} (separate operations)`);

      return {
        addressId: addressId,
        modelingPattern: 'normalized_references',
        atomicUpdate: false,
        documentsModified: result.modifiedCount,
        consistencyGuaranteed: false
      };

    } catch (error) {
      console.error(`Error updating referenced address:`, error);
      throw error;
    }
  }

  async performComplexAggregation(pattern, aggregationQuery) {
    console.log(`Performing complex aggregation with ${pattern} pattern`);

    try {
      let result;
      const startTime = Date.now();

      if (pattern === 'embedded') {
        const collection = this.modelingStrategies.get('user_profiles_embedded').collection;

        // Single collection aggregation pipeline
        const pipeline = [
          { $match: aggregationQuery.match || {} },

          // Unwind embedded arrays for aggregation
          ...(aggregationQuery.unwindAddresses ? [{ $unwind: '$addresses' }] : []),
          ...(aggregationQuery.unwindPreferences ? [{ $unwind: '$preferences' }] : []),

          // Group and aggregate
          {
            $group: {
              _id: aggregationQuery.groupBy || null,
              userCount: { $sum: 1 },
              avgProfileScore: { $avg: '$activitySummary.profileCompletionScore' },
              totalAddresses: { $sum: { $size: '$addresses' } },
              activeUsers: { 
                $sum: { $cond: [{ $eq: ['$status', 'active'] }, 1, 0] } 
              }
            }
          },

          { $sort: { userCount: -1 } },
          { $limit: aggregationQuery.limit || 100 }
        ];

        result = await collection.aggregate(pipeline).toArray();

      } else if (pattern === 'referenced') {
        // Multi-collection aggregation with $lookup
        const usersCollection = this.modelingStrategies.get('users_referenced').collections.users;

        const pipeline = [
          { $match: aggregationQuery.match || {} },

          // Lookup addresses
          {
            $lookup: {
              from: 'user_addresses_referenced',
              localField: '_id',
              foreignField: 'userId',
              as: 'addresses'
            }
          },

          // Lookup activities
          {
            $lookup: {
              from: 'user_activities_referenced', 
              localField: '_id',
              foreignField: 'userId',
              as: 'activities'
            }
          },

          // Group and aggregate
          {
            $group: {
              _id: aggregationQuery.groupBy || null,
              userCount: { $sum: 1 },
              totalAddresses: { $sum: { $size: '$addresses' } },
              totalActivities: { $sum: { $size: '$activities' } },
              activeUsers: { 
                $sum: { $cond: [{ $eq: ['$status', 'active'] }, 1, 0] } 
              }
            }
          },

          { $sort: { userCount: -1 } },
          { $limit: aggregationQuery.limit || 100 }
        ];

        result = await usersCollection.aggregate(pipeline).toArray();
      }

      const executionTime = Date.now() - startTime;

      // Update aggregation metrics
      await this.updateQueryMetrics(`${pattern}_aggregation`, 'complex_aggregation', {
        executionTime: executionTime,
        documentsProcessed: result.length,
        pipelineStages: aggregationQuery.pipelineStages || 0
      });

      console.log(`${pattern} aggregation completed in ${executionTime}ms`);

      return {
        results: result,
        modelingPattern: pattern,
        executionTime: executionTime,
        performanceProfile: executionTime < 100 ? 'optimal' : executionTime < 500 ? 'acceptable' : 'needs_optimization'
      };

    } catch (error) {
      console.error(`Error performing ${pattern} aggregation:`, error);
      throw error;
    }
  }

  // Utility methods for document modeling optimization

  calculateProfileCompleteness(userData) {
    let score = 0;

    // Basic information (50 points)
    if (userData.firstName) score += 10;
    if (userData.lastName) score += 10;
    if (userData.email) score += 10;
    if (userData.phoneNumber) score += 10;
    if (userData.dateOfBirth) score += 10;

    // Addresses (25 points)
    if (userData.addresses?.length > 0) score += 25;

    // Preferences (25 points)
    if (userData.preferences?.length > 0) score += 25;

    return Math.min(score, 100);
  }

  buildProjection(fields) {
    const projection = {};
    fields.forEach(field => {
      projection[field] = 1;
    });
    return projection;
  }

  async updateModelingMetrics(strategy, operation, metadata) {
    if (!this.config.enablePerformanceMonitoring) return;

    const metrics = this.performanceMetrics.get(strategy) || {
      totalOperations: 0,
      operationTypes: {},
      averageDocumentSize: 0,
      performanceProfile: 'unknown'
    };

    metrics.totalOperations++;
    metrics.operationTypes[operation] = (metrics.operationTypes[operation] || 0) + 1;
    metrics.lastOperation = new Date();

    if (metadata.documentsCreated) {
      metrics.documentsCreated = (metrics.documentsCreated || 0) + metadata.documentsCreated;
    }

    this.performanceMetrics.set(strategy, metrics);
  }

  async updateQueryMetrics(strategy, queryType, metadata) {
    if (!this.config.enableQueryAnalytics) return;

    const queryMetrics = this.queryPatterns.get(strategy) || {
      totalQueries: 0,
      queryTypes: {},
      averageQueryTime: 0,
      performanceProfile: {}
    };

    queryMetrics.totalQueries++;
    queryMetrics.queryTypes[queryType] = (queryMetrics.queryTypes[queryType] || 0) + 1;

    if (metadata.queryTime) {
      const currentAvg = queryMetrics.averageQueryTime || 0;
      queryMetrics.averageQueryTime = (currentAvg + metadata.queryTime) / 2;
    }

    if (metadata.executionTime) {
      queryMetrics.performanceProfile[queryType] = metadata.executionTime;
    }

    this.queryPatterns.set(strategy, queryMetrics);
  }

  async getModelingRecommendations(collectionName, queryPatterns) {
    console.log(`Generating modeling recommendations for: ${collectionName}`);

    const recommendations = {
      currentPattern: 'unknown',
      recommendedPattern: 'unknown',
      reasoning: [],
      tradeoffs: {},
      migrationComplexity: 'unknown'
    };

    // Analyze query patterns
    const embeddedQueries = queryPatterns.filter(q => q.type === 'find_complete_document').length;
    const partialQueries = queryPatterns.filter(q => q.type === 'find_partial_data').length;
    const updateFrequency = queryPatterns.filter(q => q.type === 'update_operation').length;
    const aggregationComplexity = queryPatterns.filter(q => q.type === 'aggregation').length;

    // Analyze data characteristics
    const avgDocumentSize = queryPatterns.reduce((sum, q) => sum + (q.documentSize || 0), 0) / queryPatterns.length;
    const dataGrowthRate = queryPatterns.reduce((sum, q) => sum + (q.growthRate || 0), 0) / queryPatterns.length;

    // Generate recommendations based on patterns
    if (embeddedQueries > partialQueries * 2 && avgDocumentSize < 16 * 1024 * 1024) {
      recommendations.recommendedPattern = 'embedded_documents';
      recommendations.reasoning.push('High frequency of complete document queries');
      recommendations.reasoning.push('Document size within MongoDB limits');

      if (updateFrequency > embeddedQueries * 0.3) {
        recommendations.reasoning.push('Consider hybrid pattern due to high update frequency');
      }

    } else if (partialQueries > embeddedQueries && dataGrowthRate > 0.1) {
      recommendations.recommendedPattern = 'normalized_references';
      recommendations.reasoning.push('High frequency of partial data queries');
      recommendations.reasoning.push('High data growth rate favors normalization');

    } else if (aggregationComplexity > queryPatterns.length * 0.2) {
      recommendations.recommendedPattern = 'hybrid_pattern';
      recommendations.reasoning.push('Complex aggregation requirements');
      recommendations.reasoning.push('Mixed access patterns detected');
    }

    // Define tradeoffs
    recommendations.tradeoffs = {
      embedded_documents: {
        benefits: ['Single query performance', 'Atomic updates', 'Data locality'],
        drawbacks: ['Document size growth', 'Potential duplication', 'Complex nested updates']
      },
      normalized_references: {
        benefits: ['Data normalization', 'Independent scaling', 'Flexible querying'],
        drawbacks: ['Multiple queries required', 'Application complexity', 'Consistency challenges']
      },
      hybrid_pattern: {
        benefits: ['Optimized for mixed patterns', 'Balanced performance'],
        drawbacks: ['Increased complexity', 'Mixed consistency models']
      }
    };

    return recommendations;
  }

  async getPerformanceAnalysis() {
    console.log('Generating performance analysis for modeling patterns...');

    const analysis = {
      embeddedPatterns: {},
      referencedPatterns: {},
      hybridPatterns: {},
      recommendations: []
    };

    // Analyze embedded pattern performance
    for (const [strategy, metrics] of this.performanceMetrics) {
      if (strategy.includes('embedded')) {
        analysis.embeddedPatterns[strategy] = {
          totalOperations: metrics.totalOperations,
          operationBreakdown: metrics.operationTypes,
          averagePerformance: metrics.averageQueryTime || 0,
          performanceRating: this.ratePerformance(metrics.averageQueryTime || 0)
        };
      } else if (strategy.includes('referenced')) {
        analysis.referencedPatterns[strategy] = {
          totalOperations: metrics.totalOperations,
          operationBreakdown: metrics.operationTypes,
          averagePerformance: metrics.averageQueryTime || 0,
          performanceRating: this.ratePerformance(metrics.averageQueryTime || 0)
        };
      }
    }

    // Generate global recommendations
    analysis.recommendations = [
      'Use embedded documents for frequently co-accessed data',
      'Use references for large or independently managed entities',
      'Consider hybrid patterns for complex applications',
      'Monitor document sizes to avoid 16MB limit',
      'Optimize indexes based on query patterns'
    ];

    return analysis;
  }

  ratePerformance(avgTime) {
    if (avgTime < 10) return 'excellent';
    if (avgTime < 50) return 'good';
    if (avgTime < 200) return 'acceptable';
    return 'needs_optimization';
  }

  async cleanup() {
    console.log('Cleaning up Document Modeling Manager...');

    this.modelingStrategies.clear();
    this.performanceMetrics.clear();
    this.queryPatterns.clear();

    console.log('Document Modeling Manager cleanup completed');
  }
}

// Example usage demonstrating embedded vs referenced patterns
async function demonstrateDocumentModelingPatterns() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const modelingManager = new AdvancedDocumentModelingManager(client, {
    database: 'document_modeling_demo',
    enablePerformanceMonitoring: true,
    enableQueryAnalytics: true
  });

  try {
    // Sample user data for demonstration
    const sampleUserData = {
      email: '[email protected]',
      username: 'johndoe123',
      firstName: 'John',
      lastName: 'Doe',
      phoneNumber: '+1-555-0123',
      dateOfBirth: new Date('1990-05-15'),

      addresses: [
        {
          type: 'home',
          streetAddress: '123 Main Street',
          apartmentUnit: 'Apt 4B',
          city: 'New York',
          stateProvince: 'NY',
          postalCode: '10001',
          country: 'USA',
          isPrimary: true,
          isShipping: true
        },
        {
          type: 'work',
          streetAddress: '456 Corporate Blvd',
          city: 'New York',
          stateProvince: 'NY',
          postalCode: '10002',
          country: 'USA',
          isBilling: true
        }
      ],

      preferences: [
        {
          category: 'notifications',
          key: 'email_frequency',
          value: 'daily',
          dataType: 'string'
        },
        {
          category: 'display',
          key: 'theme',
          value: 'dark',
          dataType: 'string'
        }
      ]
    };

    // Demonstrate embedded document pattern
    console.log('Creating embedded user profile...');
    const embeddedResult = await modelingManager.createEmbeddedUserProfile(sampleUserData);
    console.log('Embedded Result:', embeddedResult);

    // Demonstrate referenced pattern
    console.log('Creating referenced user profile...');
    const referencedResult = await modelingManager.createReferencedUserProfile(sampleUserData);
    console.log('Referenced Result:', referencedResult);

    // Demonstrate query performance differences
    console.log('Comparing query performance...');

    const embeddedQuery = await modelingManager.getUserProfileEmbedded(embeddedResult.userId);
    console.log('Embedded Query Result:', {
      pattern: embeddedQuery.modelingPattern,
      queries: embeddedQuery.queriesExecuted,
      optimized: embeddedQuery.performanceOptimized
    });

    const referencedQuery = await modelingManager.getUserProfileReferenced(referencedResult.userId);
    console.log('Referenced Query Result:', {
      pattern: referencedQuery.modelingPattern,
      queries: referencedQuery.queriesExecuted,
      optimized: referencedQuery.performanceOptimized
    });

    // Demonstrate update operations
    console.log('Comparing update operations...');

    const addressId = embeddedQuery.profile.addresses[0]._id;
    const referencedAddressId = referencedResult.addressIds[0];

    const embeddedUpdate = await modelingManager.updateEmbeddedUserAddress(
      embeddedResult.userId,
      addressId,
      { streetAddress: '789 Updated Street' }
    );
    console.log('Embedded Update:', embeddedUpdate);

    const referencedUpdate = await modelingManager.updateReferencedUserAddress(
      referencedResult.userId,
      referencedAddressId,
      { streetAddress: '789 Updated Street' }
    );
    console.log('Referenced Update:', referencedUpdate);

    // Demonstrate aggregation performance
    console.log('Comparing aggregation performance...');

    const embeddedAggregation = await modelingManager.performComplexAggregation('embedded', {
      match: { status: 'active' },
      groupBy: '$profileMetadata.theme',
      limit: 10
    });

    const referencedAggregation = await modelingManager.performComplexAggregation('referenced', {
      match: { status: 'active' },
      groupBy: '$profileMetadata.theme',
      limit: 10
    });

    console.log('Aggregation Comparison:', {
      embedded: {
        time: embeddedAggregation.executionTime,
        profile: embeddedAggregation.performanceProfile
      },
      referenced: {
        time: referencedAggregation.executionTime,
        profile: referencedAggregation.performanceProfile
      }
    });

    // Get performance analysis
    const performanceAnalysis = await modelingManager.getPerformanceAnalysis();
    console.log('Performance Analysis:', performanceAnalysis);

    return {
      embeddedResult,
      referencedResult,
      queryComparison: {
        embedded: embeddedQuery,
        referenced: referencedQuery
      },
      updateComparison: {
        embedded: embeddedUpdate,
        referenced: referencedUpdate
      },
      aggregationComparison: {
        embedded: embeddedAggregation,
        referenced: referencedAggregation
      },
      performanceAnalysis
    };

  } catch (error) {
    console.error('Error demonstrating document modeling patterns:', error);
    throw error;
  } finally {
    await modelingManager.cleanup();
    await client.close();
  }
}

// Benefits of MongoDB Flexible Document Modeling:
// - Embedded documents provide optimal query performance for frequently co-accessed data
// - Reference patterns enable normalized data structures and independent entity management
// - Hybrid patterns optimize for mixed access patterns and complex application requirements
// - Flexible schema evolution accommodates changing business requirements without migrations
// - Query optimization strategies can be tailored to specific data access patterns
// - Atomic operations available for embedded documents ensure data consistency
// - Application-level joins provide flexibility while maintaining performance where needed
// - Document size management enables balanced approaches between embedding and referencing

module.exports = {
  AdvancedDocumentModelingManager,
  demonstrateDocumentModelingPatterns
};

SQL-Style Document Modeling with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB document relationship management and modeling pattern optimization:

-- QueryLeaf document modeling with SQL-familiar embedded and reference pattern syntax

-- Configure document modeling optimization settings
SET enable_embedded_optimization = true;
SET enable_reference_optimization = true;
SET enable_hybrid_modeling = true;
SET document_size_monitoring = true;
SET query_pattern_analysis = true;
SET performance_monitoring = true;

-- Create embedded document pattern for frequently co-accessed data
WITH embedded_user_profiles AS (
  INSERT INTO user_profiles_embedded
  SELECT 
    GENERATE_UUID() as user_id,
    'user' || generate_series(1, 1000) || '@example.com' as email,
    'user' || generate_series(1, 1000) as username,
    (ARRAY['John', 'Jane', 'Mike', 'Sarah', 'David'])[1 + floor(random() * 5)] as first_name,
    (ARRAY['Smith', 'Johnson', 'Williams', 'Brown', 'Jones'])[1 + floor(random() * 5)] as last_name,
    '+1-555-' || LPAD(floor(random() * 10000)::text, 4, '0') as phone_number,
    CURRENT_DATE - (random() * 365 * 30 + 18 * 365)::int as date_of_birth,
    'active' as status,

    -- Embedded addresses array for optimal co-access
    JSON_BUILD_ARRAY(
      JSON_BUILD_OBJECT(
        '_id', GENERATE_UUID(),
        'type', 'home',
        'streetAddress', floor(random() * 9999 + 1) || ' ' || 
          (ARRAY['Main St', 'Oak Ave', 'First St', 'Second Ave', 'Third St'])[1 + floor(random() * 5)],
        'apartmentUnit', CASE WHEN random() > 0.6 THEN 'Apt ' || (1 + floor(random() * 50))::text ELSE NULL END,
        'city', (ARRAY['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'])[1 + floor(random() * 5)],
        'stateProvince', (ARRAY['NY', 'CA', 'IL', 'TX', 'AZ'])[1 + floor(random() * 5)],
        'postalCode', LPAD(floor(random() * 100000)::text, 5, '0'),
        'country', 'USA',
        'isPrimary', true,
        'isBilling', random() > 0.5,
        'isShipping', random() > 0.3,
        'createdAt', CURRENT_TIMESTAMP,
        'updatedAt', CURRENT_TIMESTAMP
      ),
      -- Additional address if random condition met
      CASE WHEN random() > 0.7 THEN
        JSON_BUILD_OBJECT(
          '_id', GENERATE_UUID(),
          'type', 'work',
          'streetAddress', floor(random() * 999 + 100) || ' Business Blvd',
          'city', (ARRAY['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'])[1 + floor(random() * 5)],
          'stateProvince', (ARRAY['NY', 'CA', 'IL', 'TX', 'AZ'])[1 + floor(random() * 5)],
          'postalCode', LPAD(floor(random() * 100000)::text, 5, '0'),
          'country', 'USA',
          'isPrimary', false,
          'isBilling', true,
          'isShipping', false,
          'createdAt', CURRENT_TIMESTAMP,
          'updatedAt', CURRENT_TIMESTAMP
        )
      ELSE NULL END
    ) FILTER (WHERE JSON_BUILD_OBJECT IS NOT NULL) as addresses,

    -- Embedded preferences for atomic updates
    JSON_BUILD_ARRAY(
      JSON_BUILD_OBJECT(
        '_id', GENERATE_UUID(),
        'category', 'notifications',
        'key', 'email_frequency', 
        'value', (ARRAY['immediate', 'daily', 'weekly', 'never'])[1 + floor(random() * 4)],
        'dataType', 'string',
        'createdAt', CURRENT_TIMESTAMP,
        'updatedAt', CURRENT_TIMESTAMP
      ),
      JSON_BUILD_OBJECT(
        '_id', GENERATE_UUID(),
        'category', 'display',
        'key', 'theme',
        'value', (ARRAY['light', 'dark', 'auto'])[1 + floor(random() * 3)],
        'dataType', 'string',
        'createdAt', CURRENT_TIMESTAMP,
        'updatedAt', CURRENT_TIMESTAMP
      ),
      JSON_BUILD_OBJECT(
        '_id', GENERATE_UUID(),
        'category', 'privacy',
        'key', 'profile_visibility',
        'value', (ARRAY['public', 'friends', 'private'])[1 + floor(random() * 3)],
        'dataType', 'string',
        'createdAt', CURRENT_TIMESTAMP,
        'updatedAt', CURRENT_TIMESTAMP
      )
    ) as preferences,

    -- Embedded profile metadata for single-query access
    JSON_BUILD_OBJECT(
      'theme', (ARRAY['light', 'dark', 'auto'])[1 + floor(random() * 3)],
      'language', (ARRAY['en', 'es', 'fr', 'de'])[1 + floor(random() * 4)],
      'timezone', (ARRAY['UTC', 'EST', 'PST', 'CST', 'MST'])[1 + floor(random() * 5)],
      'notificationSettings', JSON_BUILD_OBJECT(
        'email', random() > 0.2,
        'push', random() > 0.3,
        'sms', random() > 0.8
      ),
      'privacySettings', JSON_BUILD_OBJECT(
        'profileVisibility', (ARRAY['public', 'friends', 'private'])[1 + floor(random() * 3)],
        'allowDirectMessages', random() > 0.1,
        'shareActivityStatus', random() > 0.4
      )
    ) as profile_metadata,

    -- Embedded activity summary for performance
    JSON_BUILD_OBJECT(
      'totalLogins', floor(random() * 100),
      'lastLoginAt', CURRENT_TIMESTAMP - (random() * INTERVAL '30 days'),
      'lastActivityAt', CURRENT_TIMESTAMP - (random() * INTERVAL '7 days'),
      'accountCreatedAt', CURRENT_TIMESTAMP - (random() * 365 + 30) * INTERVAL '1 day',
      'profileCompletionScore', 70 + floor(random() * 30) -- 70-100%
    ) as activity_summary,

    CURRENT_TIMESTAMP as created_at,
    CURRENT_TIMESTAMP as updated_at,
    1 as version
  RETURNING user_id, email, username
),

-- Create normalized reference pattern for independent entity management  
users_referenced AS (
  INSERT INTO users_referenced
  SELECT 
    GENERATE_UUID() as user_id,
    'ref_user' || generate_series(1, 1000) || '@example.com' as email,
    'ref_user' || generate_series(1, 1000) as username,
    (ARRAY['Alice', 'Bob', 'Carol', 'David', 'Eve'])[1 + floor(random() * 5)] as first_name,
    (ARRAY['Wilson', 'Davis', 'Miller', 'Moore', 'Taylor'])[1 + floor(random() * 5)] as last_name,
    '+1-555-' || LPAD(floor(random() * 10000)::text, 4, '0') as phone_number,
    CURRENT_DATE - (random() * 365 * 30 + 18 * 365)::int as date_of_birth,
    'active' as status,

    -- Basic profile metadata only (normalized approach)
    JSON_BUILD_OBJECT(
      'theme', (ARRAY['light', 'dark', 'auto'])[1 + floor(random() * 3)],
      'language', (ARRAY['en', 'es', 'fr', 'de'])[1 + floor(random() * 4)],
      'timezone', (ARRAY['UTC', 'EST', 'PST', 'CST', 'MST'])[1 + floor(random() * 5)]
    ) as profile_metadata,

    CURRENT_TIMESTAMP as created_at,
    CURRENT_TIMESTAMP as updated_at,
    1 as version
  RETURNING user_id, email, username
),

-- Create separate referenced address documents
user_addresses_referenced AS (
  INSERT INTO user_addresses_referenced
  SELECT 
    GENERATE_UUID() as address_id,
    ur.user_id,

    -- Address type and details
    (ARRAY['home', 'work', 'billing', 'shipping'])[1 + floor(random() * 4)] as type,
    floor(random() * 9999 + 1) || ' ' || 
      (ARRAY['Broadway', 'Park Ave', 'Wall St', 'Madison Ave', 'Fifth Ave'])[1 + floor(random() * 5)] as street_address,
    CASE WHEN random() > 0.7 THEN 'Unit ' || (1 + floor(random() * 100))::text ELSE NULL END as apartment_unit,
    (ARRAY['Boston', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas'])[1 + floor(random() * 5)] as city,
    (ARRAY['MA', 'PA', 'TX', 'CA', 'TX'])[1 + floor(random() * 5)] as state_province,
    LPAD(floor(random() * 100000)::text, 5, '0') as postal_code,
    'USA' as country,

    -- Address flags
    row_number() OVER (PARTITION BY ur.user_id) = 1 as is_primary, -- First address is primary
    random() > 0.6 as is_billing,
    random() > 0.4 as is_shipping,

    CURRENT_TIMESTAMP as created_at,
    CURRENT_TIMESTAMP as updated_at

  FROM users_referenced ur
  CROSS JOIN generate_series(1, 1 + floor(random() * 2)::int) -- 1-3 addresses per user
  RETURNING address_id, user_id, type
),

-- Create referenced user activities for independent tracking
user_activities_referenced AS (
  INSERT INTO user_activities_referenced  
  SELECT 
    GENERATE_UUID() as activity_id,
    ur.user_id,

    -- Activity classification
    (ARRAY['login', 'logout', 'page_view', 'action_performed', 'data_modified', 'error_occurred'])
      [1 + floor(random() * 6)] as activity_type,
    CURRENT_TIMESTAMP - (random() * INTERVAL '90 days') as activity_timestamp,

    -- Activity details
    JSON_BUILD_OBJECT(
      'page', (ARRAY['/dashboard', '/profile', '/settings', '/reports', '/help'])[1 + floor(random() * 5)],
      'action', (ARRAY['click', 'view', 'edit', 'save', 'delete'])[1 + floor(random() * 5)],
      'duration', floor(random() * 300 + 5), -- 5-305 seconds
      'userAgent', 'Mozilla/5.0 (Enterprise Browser)',
      'ipAddress', '192.168.' || (1 + floor(random() * 254)) || '.' || (1 + floor(random() * 254))
    ) as activity_data,

    (ARRAY['web', 'mobile', 'api', 'system'])[1 + floor(random() * 4)] as activity_source,
    ('192.168.' || (1 + floor(random() * 254)) || '.' || (1 + floor(random() * 254)))::inet as ip_address,
    'Mozilla/5.0 (compatible; Enterprise App)' as user_agent,

    -- Session and tracking
    'session_' || floor(random() * 10000) as session_id,
    'https://app.example.com' || (ARRAY['/dashboard', '/profile', '/settings'])[1 + floor(random() * 3)] as page_url,

    -- Performance tracking  
    floor(random() * 500 + 50) as response_time_ms,
    random() > 0.95 as error_occurred, -- 5% error rate
    CASE WHEN random() > 0.95 THEN
      JSON_BUILD_OBJECT('error', 'timeout', 'code', '500', 'message', 'Request timeout')
    ELSE NULL END as error_details

  FROM users_referenced ur
  CROSS JOIN generate_series(1, floor(random() * 50 + 10)::int) -- 10-60 activities per user
  RETURNING activity_id, user_id, activity_type, activity_timestamp
)

-- Query performance comparison between embedded and referenced patterns
SELECT 
  'EMBEDDED_PATTERN' as modeling_approach,
  'Single document query for complete profile' as query_description,
  1 as queries_required,
  'Optimal - all data co-located' as performance_profile,
  'Guaranteed - single document ACID' as consistency_model,
  'Atomic updates possible' as update_characteristics,
  'Potential 16MB limit concern' as scalability_considerations

UNION ALL

SELECT 
  'REFERENCED_PATTERN' as modeling_approach,
  'Multiple queries required for complete profile' as query_description,
  3 as queries_required,
  'Moderate - requires joins/lookups' as performance_profile,
  'Eventual - across multiple documents' as consistency_model,
  'Independent entity updates' as update_characteristics,
  'Unlimited growth potential' as scalability_considerations;

-- Demonstrate embedded document queries (single collection access)
WITH embedded_query_patterns AS (
  -- Single query retrieves complete user profile with all related data
  SELECT 
    user_id,
    email,
    first_name,
    last_name,

    -- Extract embedded address information
    JSON_ARRAY_LENGTH(addresses) as total_addresses,
    JSON_EXTRACT_PATH_TEXT(addresses, '0', 'city') as primary_city,
    JSON_EXTRACT_PATH_TEXT(addresses, '0', 'stateProvince') as primary_state,

    -- Extract embedded preferences
    JSON_ARRAY_LENGTH(preferences) as total_preferences,

    -- Extract activity summary (embedded for performance)
    CAST(JSON_EXTRACT_PATH_TEXT(activity_summary, 'totalLogins') AS INTEGER) as total_logins,
    TO_TIMESTAMP(JSON_EXTRACT_PATH_TEXT(activity_summary, 'lastLoginAt'), 'YYYY-MM-DD"T"HH24:MI:SS.MS"Z"') as last_login,
    CAST(JSON_EXTRACT_PATH_TEXT(activity_summary, 'profileCompletionScore') AS INTEGER) as completion_score,

    -- Performance metrics
    1 as documents_accessed,
    0 as join_operations_required,
    'immediate' as consistency_guarantee,

    -- Query classification
    'embedded_single_document' as query_pattern,
    'optimal_performance' as performance_classification

  FROM user_profiles_embedded
  WHERE status = 'active'
  AND JSON_EXTRACT_PATH_TEXT(profile_metadata, 'theme') = 'dark'
  LIMIT 100
),

-- Demonstrate referenced pattern queries (multiple collection access required)
referenced_query_patterns AS (
  -- Multiple queries required to reconstruct complete user profile
  SELECT 
    u.user_id,
    u.email,
    u.first_name,
    u.last_name,

    -- Address information requires separate query/join
    COUNT(addr.address_id) as total_addresses,
    addr_primary.city as primary_city,
    addr_primary.state_province as primary_state,

    -- Activity summary requires aggregation from separate collection
    COUNT(act.activity_id) as total_activities,
    MAX(act.activity_timestamp) FILTER (WHERE act.activity_type = 'login') as last_login,
    COUNT(act.activity_id) FILTER (WHERE act.activity_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days') as recent_activities,

    -- Performance metrics
    3 as documents_accessed, -- Users + Addresses + Activities
    2 as join_operations_required,
    'eventual' as consistency_guarantee,

    -- Query classification
    'referenced_multi_document' as query_pattern,
    'moderate_performance' as performance_classification

  FROM users_referenced u
  LEFT JOIN user_addresses_referenced addr ON u.user_id = addr.user_id
  LEFT JOIN user_addresses_referenced addr_primary ON u.user_id = addr_primary.user_id AND addr_primary.is_primary = true
  LEFT JOIN user_activities_referenced act ON u.user_id = act.user_id

  WHERE u.status = 'active'
  AND JSON_EXTRACT_PATH_TEXT(u.profile_metadata, 'theme') = 'dark'

  GROUP BY u.user_id, u.email, u.first_name, u.last_name, addr_primary.city, addr_primary.state_province
  LIMIT 100
),

-- Performance analysis and comparison
modeling_performance_analysis AS (
  SELECT 
    query_pattern,
    performance_classification,
    AVG(documents_accessed) as avg_documents_per_query,
    AVG(join_operations_required) as avg_joins_per_query,
    COUNT(*) as total_queries_analyzed,

    -- Performance scoring
    CASE 
      WHEN AVG(documents_accessed) = 1 AND AVG(join_operations_required) = 0 THEN 'excellent'
      WHEN AVG(documents_accessed) <= 3 AND AVG(join_operations_required) <= 2 THEN 'good'
      WHEN AVG(documents_accessed) <= 5 AND AVG(join_operations_required) <= 4 THEN 'acceptable'
      ELSE 'needs_optimization'
    END as overall_performance_rating,

    -- Consistency analysis
    MODE() WITHIN GROUP (ORDER BY consistency_guarantee) as primary_consistency_model,

    -- Scalability assessment
    CASE 
      WHEN query_pattern = 'embedded_single_document' THEN 'Limited by 16MB document size'
      WHEN query_pattern = 'referenced_multi_document' THEN 'Unlimited horizontal scaling'
      ELSE 'Hybrid scaling characteristics'
    END as scalability_profile

  FROM (
    SELECT * FROM embedded_query_patterns
    UNION ALL
    SELECT * FROM referenced_query_patterns
  ) combined_patterns
  GROUP BY query_pattern, performance_classification
),

-- Document modeling recommendations based on query patterns
modeling_recommendations AS (
  SELECT 
    mpa.query_pattern,
    mpa.overall_performance_rating,
    mpa.scalability_profile,
    mpa.primary_consistency_model,

    -- Use case recommendations
    CASE 
      WHEN mpa.query_pattern = 'embedded_single_document' THEN
        JSON_BUILD_ARRAY(
          'Optimal for frequently co-accessed related data',
          'Best for read-heavy workloads with complete document queries',
          'Ideal for maintaining ACID guarantees across related entities',
          'Suitable for moderate data growth with stable relationships'
        )
      WHEN mpa.query_pattern = 'referenced_multi_document' THEN
        JSON_BUILD_ARRAY(
          'Best for large datasets with independent entity management',
          'Optimal for write-heavy workloads with frequent partial updates',
          'Ideal for applications requiring flexible schema evolution',
          'Suitable for unlimited horizontal scaling requirements'
        )
      ELSE
        JSON_BUILD_ARRAY(
          'Consider hybrid approach for mixed access patterns',
          'Evaluate specific query requirements for optimization',
          'Balance performance and scalability based on use case'
        )
    END as use_case_recommendations,

    -- Performance optimization strategies
    CASE mpa.overall_performance_rating
      WHEN 'excellent' THEN 'Continue current approach with monitoring'
      WHEN 'good' THEN 'Minor optimizations possible through indexing'
      WHEN 'acceptable' THEN 'Consider query pattern optimization or hybrid approach'
      ELSE 'Significant architectural changes recommended'
    END as optimization_strategy,

    -- Specific implementation guidance
    JSON_BUILD_OBJECT(
      'indexing_strategy', 
        CASE 
          WHEN mpa.query_pattern = 'embedded_single_document' THEN 'Compound indexes on embedded fields'
          ELSE 'Reference field optimization with lookup performance'
        END,
      'consistency_approach',
        CASE mpa.primary_consistency_model
          WHEN 'immediate' THEN 'Single document transactions available'
          ELSE 'Application-level consistency management required'
        END,
      'scaling_considerations',
        CASE 
          WHEN mpa.scalability_profile LIKE '%16MB%' THEN 'Monitor document sizes and consider archival strategies'
          ELSE 'Plan for horizontal scaling and sharding strategies'
        END
    ) as implementation_guidance

  FROM modeling_performance_analysis mpa
)

-- Comprehensive document modeling strategy dashboard
SELECT 
  mr.query_pattern,
  mr.overall_performance_rating,
  mr.primary_consistency_model,
  mr.optimization_strategy,

  -- Performance characteristics
  mpa.avg_documents_per_query as avg_docs_per_query,
  mpa.avg_joins_per_query as avg_joins_required,
  mpa.total_queries_analyzed,

  -- Architectural guidance
  mr.use_case_recommendations,
  mr.implementation_guidance,

  -- Decision matrix
  CASE 
    WHEN mr.query_pattern = 'embedded_single_document' AND mr.overall_performance_rating = 'excellent' THEN
      'RECOMMENDED: Use embedded documents for this use case'
    WHEN mr.query_pattern = 'referenced_multi_document' AND mr.scalability_profile LIKE '%Unlimited%' THEN
      'RECOMMENDED: Use referenced pattern for scalability requirements'
    ELSE
      'EVALUATE: Consider hybrid approach or further analysis'
  END as architectural_recommendation,

  -- Implementation priorities
  JSON_BUILD_OBJECT(
    'immediate_actions', 
      CASE mr.overall_performance_rating
        WHEN 'needs_optimization' THEN JSON_BUILD_ARRAY('Review query patterns', 'Optimize indexing', 'Consider architectural changes')
        WHEN 'acceptable' THEN JSON_BUILD_ARRAY('Monitor performance trends', 'Optimize critical queries')
        ELSE JSON_BUILD_ARRAY('Continue monitoring', 'Plan for growth')
      END,
    'monitoring_focus',
      CASE 
        WHEN mr.query_pattern = 'embedded_single_document' THEN 'Document size growth and query performance'
        ELSE 'Join performance and data consistency'
      END,
    'success_metrics',
      JSON_BUILD_OBJECT(
        'performance_target', CASE mr.overall_performance_rating WHEN 'excellent' THEN 'maintain' ELSE 'improve' END,
        'consistency_requirement', mr.primary_consistency_model,
        'scalability_readiness', 
          CASE WHEN mr.scalability_profile LIKE '%Unlimited%' THEN 'high' ELSE 'moderate' END
      )
  ) as implementation_roadmap

FROM modeling_recommendations mr
JOIN modeling_performance_analysis mpa ON mr.query_pattern = mpa.query_pattern
ORDER BY 
  CASE mr.overall_performance_rating
    WHEN 'excellent' THEN 1
    WHEN 'good' THEN 2
    WHEN 'acceptable' THEN 3
    ELSE 4
  END,
  mpa.avg_documents_per_query ASC;

-- QueryLeaf provides comprehensive MongoDB document modeling capabilities:
-- 1. Embedded document patterns for optimal query performance and data locality
-- 2. Referenced patterns for normalized structures and independent entity scaling
-- 3. Hybrid modeling strategies combining embedding and referencing for complex requirements
-- 4. Performance analysis and optimization recommendations based on query patterns
-- 5. SQL-familiar syntax for document relationship management and pattern selection
-- 6. Comprehensive modeling analytics with performance profiling and scalability assessment
-- 7. Automated recommendations for optimal modeling patterns based on access requirements
-- 8. Enterprise-grade consistency and performance monitoring for production deployments
-- 9. Flexible schema evolution support with minimal application impact
-- 10. Advanced query optimization techniques tailored to document modeling patterns

Best Practices for MongoDB Document Modeling Implementation

Strategic Modeling Decisions

Essential practices for making optimal embedded vs referenced modeling decisions:

  1. Query Pattern Analysis: Design document structure based on actual application query patterns and data access requirements
  2. Data Growth Assessment: Evaluate data growth patterns to prevent document size issues with embedded patterns
  3. Update Frequency Analysis: Consider update patterns when deciding between atomic embedded updates and independent referenced updates
  4. Consistency Requirements: Choose modeling patterns based on consistency requirements and transaction scope needs
  5. Performance Baseline Establishment: Establish performance baselines for different modeling approaches with realistic data volumes
  6. Scalability Planning: Design modeling strategies that accommodate expected growth in data volume and query complexity

Production Optimization and Management

Optimize document modeling for enterprise-scale applications:

  1. Index Strategy Optimization: Design indexes that support both embedded field queries and reference lookups efficiently
  2. Document Size Monitoring: Implement monitoring for document sizes to prevent 16MB limit issues with embedded patterns
  3. Query Performance Analysis: Continuously analyze query performance across different modeling patterns for optimization opportunities
  4. Migration Planning: Plan for potential modeling pattern changes as application requirements evolve
  5. Consistency Management: Implement appropriate consistency management strategies for referenced patterns
  6. Monitoring and Alerting: Establish comprehensive monitoring for performance, consistency, and scalability metrics

Conclusion

MongoDB's flexible document modeling provides powerful options for optimizing data relationships through embedded documents, references, or hybrid approaches. The choice between embedding and referencing depends on specific query patterns, consistency requirements, scalability needs, and performance objectives. Understanding these tradeoffs enables architects to design optimal data models that balance performance, scalability, and maintainability.

Key MongoDB Document Modeling benefits include:

  • Performance Optimization: Choose modeling patterns that optimize for specific query patterns and data access requirements
  • Flexible Relationships: Model relationships using the approach that best fits application needs rather than rigid normalization rules
  • ACID Guarantees: Leverage single-document ACID properties for embedded patterns or manage consistency for referenced patterns
  • Scalability Options: Scale using approaches appropriate to data growth patterns and access requirements
  • Schema Evolution: Evolve document structures as requirements change without expensive migration procedures
  • SQL Accessibility: Manage document relationships using familiar SQL-style syntax and optimization techniques

Whether you're building user management systems, content platforms, e-commerce applications, or analytics systems, MongoDB's document modeling flexibility with QueryLeaf's familiar SQL interface provides the foundation for scalable, performant, and maintainable data architectures.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB document relationships while providing SQL-familiar syntax for embedded and referenced pattern management. Advanced modeling strategies, performance analysis, and optimization recommendations are seamlessly accessible through familiar SQL constructs, making sophisticated document relationship management both powerful and approachable for SQL-oriented development teams.

The combination of MongoDB's flexible document modeling with SQL-style relationship management makes it an ideal platform for applications requiring both optimal query performance and familiar operational patterns, ensuring your data architecture can adapt to changing requirements while maintaining performance excellence and development productivity.