MongoDB GridFS File Storage Management: Advanced Strategies for Large File Handling, Streaming, and Content Distribution with SQL-Style File Operations

Modern applications need file storage that can handle large media files, document repositories, streaming content, and complex file management workflows while staying performant, scalable, and reliable across distributed systems. Traditional approaches struggle with file size limits, fragmented metadata management, and the difficulty of coordinating file operations with database state, leading to performance bottlenecks, storage inefficiencies, and operational complexity in production.

MongoDB GridFS provides large file storage through automatic file chunking, rich metadata management, and tight integration with MongoDB's document database features, letting applications store, retrieve, and stream files well beyond the 16 MB BSON document limit while inheriting MongoDB's replication and distributed-system reliability. Unlike traditional file systems that keep file metadata separate from database operations, GridFS stores files and their metadata as ordinary documents, so file storage participates directly in MongoDB's query engine, indexing capabilities, and replication features.
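
Under the hood, GridFS stores each file as one document in a `<bucket>.files` collection plus one document per chunk in a `<bucket>.chunks` collection. The sketch below inspects those two collections directly; it assumes a local MongoDB instance, a database named mediaStorage, and the default fs bucket prefix, and the printed values depend on what has already been uploaded.

// Sketch: inspecting what GridFS actually stores (assumes a local MongoDB
// instance, a 'mediaStorage' database, and the default 'fs' bucket prefix)
const { MongoClient } = require('mongodb');

async function inspectGridFSCollections() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mediaStorage');

  // fs.files: one document per file, holding length, chunkSize, uploadDate,
  // filename, and the free-form metadata document used throughout this post
  const fileDoc = await db.collection('fs.files').findOne({});

  if (fileDoc) {
    // fs.chunks: the binary payload, one document per chunk, keyed by
    // files_id (the parent file's _id) and n (zero-based chunk number)
    const chunkCount = await db.collection('fs.chunks')
      .countDocuments({ files_id: fileDoc._id });

    console.log(`${fileDoc.filename}: ${fileDoc.length} bytes in ${chunkCount} chunk(s) of ${fileDoc.chunkSize} bytes`);
  }

  await client.close();
}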

The Traditional File Storage Challenge

Conventional approaches to large file storage in enterprise applications face significant limitations in scalability and integration:

-- Traditional PostgreSQL file storage - limited and fragmented approach

-- Basic file metadata table (limited capabilities)
CREATE TABLE file_metadata (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    filename VARCHAR(255) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    file_size BIGINT NOT NULL,

    -- Basic file information
    mime_type VARCHAR(100),
    file_extension VARCHAR(10),
    original_filename VARCHAR(255),

    -- Simple metadata (limited structure)
    file_description TEXT,
    file_category VARCHAR(50),
    tags TEXT[], -- Basic array support

    -- Upload information
    uploaded_by UUID,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Storage information (filesystem dependent)
    storage_location VARCHAR(100) DEFAULT 'local', -- local, s3, azure, gcs
    storage_path VARCHAR(500),
    storage_bucket VARCHAR(100),

    -- Basic versioning (very limited)
    version_number INTEGER DEFAULT 1,
    is_current_version BOOLEAN DEFAULT TRUE,
    parent_file_id UUID REFERENCES file_metadata(file_id),

    -- Simple access control
    is_public BOOLEAN DEFAULT FALSE,
    access_permissions JSONB,

    -- Basic status tracking
    processing_status VARCHAR(20) DEFAULT 'uploaded', -- uploaded, processing, ready, error

    -- Audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- File chunks table for large file handling (manual implementation)
CREATE TABLE file_chunks (
    chunk_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_metadata(file_id) ON DELETE CASCADE,
    chunk_number INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,

    -- Chunk data (limited by database constraints)
    chunk_data BYTEA, -- Limited to ~1GB in PostgreSQL

    -- Chunk integrity
    chunk_checksum VARCHAR(64), -- MD5 or SHA256 hash

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE (file_id, chunk_number)
);

-- File access log (basic tracking)
CREATE TABLE file_access_log (
    access_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID NOT NULL REFERENCES file_metadata(file_id),

    -- Access information
    accessed_by UUID,
    access_type VARCHAR(20), -- read, write, delete, stream
    access_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Request details
    client_ip INET,
    user_agent TEXT,
    request_method VARCHAR(10),

    -- Response information
    bytes_transferred BIGINT,
    response_status INTEGER,
    response_time_ms INTEGER,

    -- Streaming information (limited)
    stream_start_position BIGINT DEFAULT 0,
    stream_end_position BIGINT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex query to manage file operations (expensive and limited)
WITH file_statistics AS (
    SELECT 
        fm.file_id,
        fm.filename,
        fm.file_size,
        fm.mime_type,
        fm.storage_location,
        fm.processing_status,

        -- Calculate chunk information (expensive operation)
        COUNT(fc.chunk_id) as total_chunks,
        SUM(fc.chunk_size) as total_chunk_size,

        -- Basic integrity check
        CASE 
            WHEN fm.file_size = SUM(fc.chunk_size) THEN 'intact'
            WHEN SUM(fc.chunk_size) IS NULL THEN 'no_chunks'
            ELSE 'corrupted'
        END as file_integrity,

        -- Recent access statistics (limited analysis)
        COUNT(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                   THEN 1 END) as daily_access_count,
        COUNT(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days' 
                   THEN 1 END) as weekly_access_count,

        -- Data transfer statistics
        SUM(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                 THEN fal.bytes_transferred ELSE 0 END) as daily_bytes_transferred,

        -- Performance metrics
        AVG(CASE WHEN fal.access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' 
                 THEN fal.response_time_ms END) as avg_response_time_ms

    FROM file_metadata fm
    LEFT JOIN file_chunks fc ON fm.file_id = fc.file_id
    LEFT JOIN file_access_log fal ON fm.file_id = fal.file_id
    WHERE fm.is_current_version = TRUE
    GROUP BY fm.file_id, fm.filename, fm.file_size, fm.mime_type, 
             fm.storage_location, fm.processing_status
),

storage_analysis AS (
    SELECT 
        storage_location,
        COUNT(*) as file_count,
        SUM(file_size) as total_storage_bytes,
        AVG(file_size) as avg_file_size,

        -- Storage health indicators
        COUNT(CASE WHEN file_integrity = 'corrupted' THEN 1 END) as corrupted_files,
        COUNT(CASE WHEN processing_status = 'error' THEN 1 END) as error_files,

        -- Access patterns
        AVG(daily_access_count) as avg_daily_access,
        SUM(daily_bytes_transferred) as total_daily_transfer,

        -- Performance indicators
        AVG(avg_response_time_ms) as avg_response_time

    FROM file_statistics
    GROUP BY storage_location
)

SELECT 
    fs.filename,
    fs.file_size,
    fs.mime_type,
    fs.storage_location,
    fs.total_chunks,
    fs.file_integrity,
    fs.processing_status,

    -- Access metrics
    fs.daily_access_count,
    fs.weekly_access_count,
    fs.avg_response_time_ms,

    -- Data transfer
    ROUND(fs.daily_bytes_transferred / 1024.0 / 1024.0, 2) as daily_mb_transferred,

    -- Storage efficiency (limited calculation)
    ROUND((fs.total_chunk_size::DECIMAL / fs.file_size) * 100, 2) as storage_efficiency_percent,

    -- Health indicators
    CASE 
        WHEN fs.file_integrity = 'corrupted' THEN 'Critical - File Corrupted'
        WHEN fs.processing_status = 'error' THEN 'Error - Processing Failed'
        WHEN fs.avg_response_time_ms > 5000 THEN 'Warning - Slow Response'
        WHEN fs.daily_access_count > 1000 THEN 'High Usage'
        ELSE 'Normal'
    END as file_status

FROM file_statistics fs
ORDER BY fs.daily_access_count DESC, fs.file_size DESC
LIMIT 100;

-- Problems with traditional file storage approach:
-- 1. Database size limitations prevent storing large files
-- 2. Manual chunking implementation is complex and error-prone
-- 3. Limited integration between file operations and database transactions
-- 4. Poor performance for streaming and partial file access
-- 5. Complex metadata management across multiple tables
-- 6. Limited support for file versioning and content management
-- 7. Expensive joins required for file operations
-- 8. No built-in support for distributed file storage
-- 9. Manual implementation of file integrity and consistency checks
-- 10. Limited indexing and query capabilities for file metadata

MongoDB GridFS eliminates these limitations with intelligent file management:

// MongoDB GridFS - comprehensive large file storage and management
const { MongoClient, GridFSBucket } = require('mongodb');
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Advanced GridFS file management system
class MongoGridFSManager {
  constructor(client, databaseName, bucketName = 'files') {
    this.client = client;
    this.db = client.db(databaseName);
    this.bucketName = bucketName;
    this.chunkSizeBytes = 1024 * 1024; // 1MB chunks (driver default is 255KB)
    this.bucket = new GridFSBucket(this.db, {
      bucketName: this.bucketName,
      chunkSizeBytes: this.chunkSizeBytes
    });

    this.fileMetrics = {
      totalUploads: 0,
      totalDownloads: 0,
      totalStreams: 0,
      bytesUploaded: 0,
      bytesDownloaded: 0,
      averageUploadTime: 0,
      averageDownloadTime: 0,
      errorCount: 0
    };
  }

  // Upload large files with comprehensive metadata and progress tracking
  async uploadFile(filePath, options = {}) {
    const startTime = Date.now();

    try {
      // Generate comprehensive file metadata
      const fileStats = fs.statSync(filePath);
      const filename = options.filename || path.basename(filePath);

      // Create file hash for integrity checking
      const fileHash = await this.generateFileHash(filePath);

      // Comprehensive metadata for advanced file management
      const metadata = {
        // Basic file information
        originalName: filename,
        uploadedAt: new Date(),
        fileSize: fileStats.size,
        mimeType: options.mimeType || this.detectMimeType(filename),

        // File integrity and versioning
        md5Hash: fileHash.md5,
        sha256Hash: fileHash.sha256,
        version: options.version || 1,
        parentFileId: options.parentFileId || null,

        // Content management
        description: options.description || '',
        category: options.category || 'general',
        tags: options.tags || [],

        // Access control and permissions
        uploadedBy: options.uploadedBy || 'system',
        isPublic: options.isPublic || false,
        accessPermissions: options.accessPermissions || { read: ['authenticated'] },

        // Processing and workflow
        processingStatus: 'uploaded',
        processingMetadata: {},

        // Content-specific metadata
        contentMetadata: options.contentMetadata || {},

        // Storage and performance optimization
        compressionType: options.compression || 'none',
        encryptionStatus: options.encrypted || false,
        storageClass: options.storageClass || 'standard', // standard, archival, frequent_access

        // Business context
        projectId: options.projectId,
        customFields: options.customFields || {},

        // Audit and compliance
        retentionPolicy: options.retentionPolicy || 'standard',
        complianceFlags: options.complianceFlags || [],

        // Performance tracking
        uploadDuration: null, // Will be set after upload completes
        lastAccessedAt: new Date(),
        accessCount: 0,
        totalBytesTransferred: 0,

        // Allow callers (e.g. createFileVersion) to merge in extra metadata
        ...(options.metadata || {})
      };

      return new Promise((resolve, reject) => {
        // Create upload stream with progress tracking
        const uploadStream = this.bucket.openUploadStream(filename, {
          metadata: metadata,
          chunkSizeBytes: options.chunkSize || (1024 * 1024) // 1MB default chunks
        });

        // Progress tracking variables
        let bytesUploaded = 0;
        const totalBytes = fileStats.size;

        // Create read stream from file
        const readStream = fs.createReadStream(filePath);

        // Progress tracking
        readStream.on('data', (chunk) => {
          bytesUploaded += chunk.length;

          if (options.onProgress) {
            const progress = {
              bytesUploaded: bytesUploaded,
              totalBytes: totalBytes,
              percentage: (bytesUploaded / totalBytes) * 100,
              remainingBytes: totalBytes - bytesUploaded,
              elapsedTime: Date.now() - startTime
            };
            options.onProgress(progress);
          }
        });

        // Handle upload completion
        uploadStream.on('finish', async () => {
          const uploadDuration = Date.now() - startTime;

          // Update file metadata with final upload information
          await this.db.collection(`${this.bucketName}.files`).updateOne(
            { _id: uploadStream.id },
            { 
              $set: { 
                'metadata.uploadDuration': uploadDuration,
                'metadata.uploadCompletedAt': new Date()
              }
            }
          );

          // Update metrics
          this.fileMetrics.totalUploads++;
          this.fileMetrics.bytesUploaded += totalBytes;
          this.fileMetrics.averageUploadTime +=
            (uploadDuration - this.fileMetrics.averageUploadTime) / this.fileMetrics.totalUploads;

          console.log(`File uploaded successfully: ${filename} (${totalBytes} bytes, ${uploadDuration}ms)`);

          resolve({
            fileId: uploadStream.id,
            filename: filename,
            size: totalBytes,
            uploadDuration: uploadDuration,
            metadata: metadata,
            chunksCount: Math.ceil(totalBytes / (options.chunkSize || (1024 * 1024)))
          });
        });

        // Handle upload errors from either side of the pipe
        uploadStream.on('error', (error) => {
          this.fileMetrics.errorCount++;
          console.error('Upload error:', error);
          reject(error);
        });

        readStream.on('error', (error) => {
          this.fileMetrics.errorCount++;
          console.error('File read error:', error);
          // Clean up any chunks already written before rejecting
          uploadStream.abort().catch(() => {});
          reject(error);
        });

        // Start the upload
        readStream.pipe(uploadStream);
      });

    } catch (error) {
      this.fileMetrics.errorCount++;
      console.error('File upload error:', error);
      throw error;
    }
  }

  // Advanced file streaming with range support and performance optimization
  async streamFile(fileId, options = {}) {
    const startTime = Date.now();

    try {
      // Get file information for streaming optimization
      const fileInfo = await this.getFileInfo(fileId);
      if (!fileInfo) {
        throw new Error('File not found');
      }

      // Update access metrics
      await this.updateAccessMetrics(fileId);

      // Create download stream with optional range support
      const downloadOptions = {};

      // Support for range requests (partial content). The driver treats
      // `end` as an exclusive 0-based byte offset (streaming stops before it)
      if (options.start !== undefined || options.end !== undefined) {
        downloadOptions.start = options.start ?? 0;
        downloadOptions.end = options.end ?? fileInfo.length;

        console.log(`Streaming file range: ${downloadOptions.start}-${downloadOptions.end}/${fileInfo.length}`);
      }

      const downloadStream = this.bucket.openDownloadStream(fileId, downloadOptions);

      // Track streaming metrics
      let bytesStreamed = 0;

      downloadStream.on('data', (chunk) => {
        bytesStreamed += chunk.length;

        if (options.onProgress) {
          const progress = {
            bytesStreamed: bytesStreamed,
            totalBytes: fileInfo.length,
            percentage: (bytesStreamed / fileInfo.length) * 100,
            elapsedTime: Date.now() - startTime
          };
          options.onProgress(progress);
        }
      });

      downloadStream.on('end', () => {
        const streamDuration = Date.now() - startTime;

        // Update metrics
        this.fileMetrics.totalStreams++;
        this.fileMetrics.bytesDownloaded += bytesStreamed;
        this.fileMetrics.averageDownloadTime +=
          (streamDuration - this.fileMetrics.averageDownloadTime) / this.fileMetrics.totalStreams;

        console.log(`File streamed: ${fileInfo.filename} (${bytesStreamed} bytes, ${streamDuration}ms)`);
      });

      downloadStream.on('error', (error) => {
        this.fileMetrics.errorCount++;
        console.error('Streaming error:', error);
      });

      return downloadStream;

    } catch (error) {
      this.fileMetrics.errorCount++;
      console.error('File streaming error:', error);
      throw error;
    }
  }

  // Comprehensive file search and metadata querying
  async searchFiles(query = {}, options = {}) {
    try {
      const startTime = Date.now();
      const searchCriteria = {};

      // Build comprehensive search query
      if (query.filename) {
        searchCriteria.filename = new RegExp(query.filename, 'i');
      }

      if (query.mimeType) {
        searchCriteria['metadata.mimeType'] = query.mimeType;
      }

      if (query.category) {
        searchCriteria['metadata.category'] = query.category;
      }

      if (query.tags && query.tags.length > 0) {
        searchCriteria['metadata.tags'] = { $in: query.tags };
      }

      if (query.uploadedBy) {
        searchCriteria['metadata.uploadedBy'] = query.uploadedBy;
      }

      if (query.dateRange) {
        searchCriteria.uploadDate = {};
        if (query.dateRange.from) {
          searchCriteria.uploadDate.$gte = new Date(query.dateRange.from);
        }
        if (query.dateRange.to) {
          searchCriteria.uploadDate.$lte = new Date(query.dateRange.to);
        }
      }

      if (query.sizeRange) {
        searchCriteria.length = {};
        if (query.sizeRange.min) {
          searchCriteria.length.$gte = query.sizeRange.min;
        }
        if (query.sizeRange.max) {
          searchCriteria.length.$lte = query.sizeRange.max;
        }
      }

      if (query.isPublic !== undefined) {
        searchCriteria['metadata.isPublic'] = query.isPublic;
      }

      // Full-text search in description and custom fields
      if (query.textSearch) {
        searchCriteria.$or = [
          { 'metadata.description': new RegExp(query.textSearch, 'i') },
          { 'metadata.customFields': new RegExp(query.textSearch, 'i') }
        ];
      }

      // Execute search with aggregation pipeline for advanced features
      const pipeline = [
        { $match: searchCriteria },

        // Add computed fields for enhanced results
        {
          $addFields: {
            fileSizeMB: { $divide: ['$length', 1024 * 1024] },
            uploadAge: { 
              $divide: [
                { $subtract: [new Date(), '$uploadDate'] },
                1000 * 60 * 60 * 24 // Convert to days
              ]
            }
          }
        },

        // Sort by relevance and recency
        {
          $sort: options.sortBy === 'size' ? { length: -1 } :
                 options.sortBy === 'name' ? { filename: 1 } :
                 { uploadDate: -1 } // Default: newest first
        },

        // Pagination
        { $skip: options.skip || 0 },
        { $limit: options.limit || 50 },

        // Project only needed fields for performance
        {
          $project: {
            _id: 1,
            filename: 1,
            length: 1,
            fileSizeMB: 1,
            uploadDate: 1,
            uploadAge: 1,
            md5: 1,
            'metadata.mimeType': 1,
            'metadata.category': 1,
            'metadata.tags': 1,
            'metadata.description': 1,
            'metadata.uploadedBy': 1,
            'metadata.isPublic': 1,
            'metadata.accessCount': 1,
            'metadata.lastAccessedAt': 1,
            'metadata.processingStatus': 1
          }
        }
      ];

      const files = await this.db.collection(`${this.bucketName}.files`)
        .aggregate(pipeline)
        .toArray();

      // Get total count for pagination
      const totalCount = await this.db.collection(`${this.bucketName}.files`)
        .countDocuments(searchCriteria);

      return {
        files: files,
        totalCount: totalCount,
        hasMore: (options.skip || 0) + files.length < totalCount,
        searchCriteria: searchCriteria,
        executionTimeMs: Date.now() - startTime
      };

    } catch (error) {
      console.error('File search error:', error);
      throw error;
    }
  }

  // File versioning and content management
  async createFileVersion(originalFileId, newFilePath, versionOptions = {}) {
    try {
      // Get original file information
      const originalFile = await this.getFileInfo(originalFileId);
      if (!originalFile) {
        throw new Error('Original file not found');
      }

      // Create new version with inherited metadata
      const versionMetadata = {
        ...originalFile.metadata,
        version: (originalFile.metadata.version || 1) + 1,
        parentFileId: originalFileId,
        versionDescription: versionOptions.description || '',
        versionCreatedAt: new Date(),
        versionCreatedBy: versionOptions.createdBy || 'system',
        changeLog: versionOptions.changeLog || []
      };

      // Upload new version
      const uploadResult = await this.uploadFile(newFilePath, {
        filename: originalFile.filename,
        metadata: versionMetadata,
        ...versionOptions
      });

      // Update version tracking
      await this.updateVersionHistory(originalFileId, uploadResult.fileId);

      return {
        newVersionId: uploadResult.fileId,
        versionNumber: versionMetadata.version,
        originalFileId: originalFileId,
        uploadResult: uploadResult
      };

    } catch (error) {
      console.error('File versioning error:', error);
      throw error;
    }
  }

  // Advanced file analytics and reporting
  async getFileAnalytics(timeRange = '30d') {
    try {
      const now = new Date();
      const timeRanges = {
        '1d': 1,
        '7d': 7,
        '30d': 30,
        '90d': 90,
        '365d': 365
      };

      const days = timeRanges[timeRange] || 30;
      const startDate = new Date(now.getTime() - (days * 24 * 60 * 60 * 1000));

      // Comprehensive analytics aggregation
      const analyticsResults = await Promise.all([

        // Storage analytics
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: null,
              totalFiles: { $sum: 1 },
              totalStorageBytes: { $sum: '$length' },
              averageFileSize: { $avg: '$length' },
              largestFile: { $max: '$length' },
              smallestFile: { $min: '$length' }
            }
          }
        ]).toArray(),

        // Upload trends
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $match: {
              uploadDate: { $gte: startDate }
            }
          },
          {
            $group: {
              _id: {
                year: { $year: '$uploadDate' },
                month: { $month: '$uploadDate' },
                day: { $dayOfMonth: '$uploadDate' }
              },
              dailyUploads: { $sum: 1 },
              dailyStorageAdded: { $sum: '$length' }
            }
          },
          {
            $sort: { '_id.year': 1, '_id.month': 1, '_id.day': 1 }
          }
        ]).toArray(),

        // File type distribution
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: '$metadata.mimeType',
              fileCount: { $sum: 1 },
              totalSize: { $sum: '$length' },
              averageSize: { $avg: '$length' }
            }
          },
          {
            $sort: { fileCount: -1 }
          },
          {
            $limit: 20
          }
        ]).toArray(),

        // Category analysis
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $group: {
              _id: '$metadata.category',
              fileCount: { $sum: 1 },
              totalSize: { $sum: '$length' }
            }
          },
          {
            $sort: { fileCount: -1 }
          }
        ]).toArray(),

        // Access patterns
        this.db.collection(`${this.bucketName}.files`).aggregate([
          {
            $match: {
              'metadata.lastAccessedAt': { $gte: startDate }
            }
          },
          {
            $group: {
              _id: null,
              averageAccessCount: { $avg: '$metadata.accessCount' },
              totalBytesTransferred: { $sum: '$metadata.totalBytesTransferred' },
              mostAccessedFiles: { $push: {
                filename: '$filename',
                accessCount: '$metadata.accessCount'
              }}
            }
          }
        ]).toArray()
      ]);

      // Compile comprehensive analytics report
      const [storageStats, uploadTrends, fileTypeStats, categoryStats, accessStats] = analyticsResults;

      const analytics = {
        reportGeneratedAt: new Date(),
        timeRange: timeRange,

        // Storage overview
        storage: storageStats[0] || {
          totalFiles: 0,
          totalStorageBytes: 0,
          averageFileSize: 0,
          largestFile: 0,
          smallestFile: 0
        },

        // Upload trends
        uploadTrends: uploadTrends,

        // File type distribution
        fileTypes: fileTypeStats,

        // Category distribution
        categories: categoryStats,

        // Access patterns
        accessPatterns: accessStats[0] || {
          averageAccessCount: 0,
          totalBytesTransferred: 0,
          mostAccessedFiles: []
        },

        // Performance metrics
        performanceMetrics: {
          ...this.fileMetrics,
          reportedAt: new Date()
        },

        // Storage efficiency calculations
        efficiency: {
          storageUtilizationMB: Math.round((storageStats[0]?.totalStorageBytes || 0) / (1024 * 1024)),
          averageFileSizeMB: Math.round((storageStats[0]?.averageFileSize || 0) / (1024 * 1024)),
          chunksPerFile: Math.ceil((storageStats[0]?.averageFileSize || 0) / (1024 * 1024)), // Assumes 1MB chunks
          compressionRatio: 1.0 // Would be calculated from actual compression data
        }
      };

      return analytics;

    } catch (error) {
      console.error('Analytics generation error:', error);
      throw error;
    }
  }

  // Utility methods for file operations
  async getFileInfo(fileId) {
    try {
      const fileInfo = await this.db.collection(`${this.bucketName}.files`)
        .findOne({ _id: fileId });
      return fileInfo;
    } catch (error) {
      console.error('Get file info error:', error);
      return null;
    }
  }

  async updateAccessMetrics(fileId) {
    try {
      await this.db.collection(`${this.bucketName}.files`).updateOne(
        { _id: fileId },
        {
          $inc: { 'metadata.accessCount': 1 },
          $set: { 'metadata.lastAccessedAt': new Date() }
        }
      );
    } catch (error) {
      console.error('Access metrics update error:', error);
    }
  }

  async generateFileHash(filePath) {
    return new Promise((resolve, reject) => {
      const md5Hash = crypto.createHash('md5');
      const sha256Hash = crypto.createHash('sha256');
      const stream = fs.createReadStream(filePath);

      stream.on('data', (data) => {
        md5Hash.update(data);
        sha256Hash.update(data);
      });

      stream.on('end', () => {
        resolve({
          md5: md5Hash.digest('hex'),
          sha256: sha256Hash.digest('hex')
        });
      });

      stream.on('error', reject);
    });
  }

  detectMimeType(filename) {
    const extension = filename.toLowerCase().split('.').pop();
    const mimeTypes = {
      'jpg': 'image/jpeg',
      'jpeg': 'image/jpeg',
      'png': 'image/png',
      'gif': 'image/gif',
      'pdf': 'application/pdf',
      'mp4': 'video/mp4',
      'mp3': 'audio/mpeg',
      'txt': 'text/plain',
      'json': 'application/json',
      'zip': 'application/zip'
    };
    return mimeTypes[extension] || 'application/octet-stream';
  }

  async updateVersionHistory(originalFileId, newVersionId) {
    // Implementation for version history tracking
    await this.db.collection('file_versions').insertOne({
      originalFileId: originalFileId,
      versionId: newVersionId,
      createdAt: new Date()
    });
  }

  // File cleanup and maintenance
  async deleteFile(fileId) {
    try {
      await this.bucket.delete(fileId);
      console.log(`File deleted: ${fileId}`);
      return { success: true, deletedAt: new Date() };
    } catch (error) {
      console.error('File deletion error:', error);
      throw error;
    }
  }

  // Get comprehensive system metrics
  getSystemMetrics() {
    return {
      ...this.fileMetrics,
      timestamp: new Date(),
      bucketName: this.bucketName,
      chunkSize: this.chunkSizeBytes
    };
  }
}

// Example usage demonstrating comprehensive GridFS functionality
async function demonstrateGridFSOperations() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const gridFSManager = new MongoGridFSManager(client, 'mediaStorage', 'uploads');

  try {
    console.log('Demonstrating MongoDB GridFS advanced file management...');

    // Upload a large file with comprehensive metadata
    console.log('Uploading large file...');
    const uploadResult = await gridFSManager.uploadFile('/path/to/large-video.mp4', {
      description: 'Corporate training video',
      category: 'training',
      tags: ['corporate', 'training', 'hr'],
      uploadedBy: 'admin',
      isPublic: false,
      contentMetadata: {
        duration: 3600, // seconds
        resolution: '1920x1080',
        codec: 'h264'
      },
      onProgress: (progress) => {
        console.log(`Upload progress: ${progress.percentage.toFixed(1)}%`);
      }
    });

    console.log('Upload completed:', uploadResult);

    // Search for files with comprehensive criteria
    console.log('Searching for video files...');
    const searchResults = await gridFSManager.searchFiles({
      mimeType: 'video/mp4',
      category: 'training',
      tags: ['corporate'],
      sizeRange: { min: 100 * 1024 * 1024 } // Files larger than 100MB
    }, {
      sortBy: 'size',
      limit: 10
    });

    console.log(`Found ${searchResults.totalCount} matching files`);
    searchResults.files.forEach(file => {
      console.log(`- ${file.filename} (${file.fileSizeMB.toFixed(1)} MB)`);
    });

    // Stream file with range support
    if (searchResults.files.length > 0) {
      const fileToStream = searchResults.files[0];
      console.log(`Streaming file: ${fileToStream.filename}`);

      const streamOptions = {
        start: 0,
        end: 1024 * 1024, // First 1MB
        onProgress: (progress) => {
          console.log(`Streaming progress: ${progress.percentage.toFixed(1)}%`);
        }
      };

      const stream = await gridFSManager.streamFile(fileToStream._id, streamOptions);

      // In a real application, you would pipe this to a response or file
      stream.on('end', () => {
        console.log('Streaming completed');
      });
    }

    // Generate comprehensive analytics
    console.log('Generating file analytics...');
    const analytics = await gridFSManager.getFileAnalytics('30d');

    console.log('Storage Analytics:');
    console.log(`Total Files: ${analytics.storage.totalFiles}`);
    console.log(`Total Storage: ${(analytics.storage.totalStorageBytes / (1024 * 1024 * 1024)).toFixed(2)} GB`);
    console.log(`Average File Size: ${(analytics.storage.averageFileSize / (1024 * 1024)).toFixed(2)} MB`);

    console.log('File Type Distribution:');
    analytics.fileTypes.slice(0, 5).forEach(type => {
      console.log(`- ${type._id}: ${type.fileCount} files (${(type.totalSize / (1024 * 1024)).toFixed(1)} MB)`);
    });

    // Get system metrics
    const metrics = gridFSManager.getSystemMetrics();
    console.log('System Performance Metrics:', metrics);

  } catch (error) {
    console.error('GridFS demonstration error:', error);
  } finally {
    await client.close();
  }
}

// Benefits of MongoDB GridFS:
// - Large file storage well beyond the 16MB BSON document limit
// - Automatic file chunking with optimal performance
// - Comprehensive metadata management with flexible schemas
// - Built-in streaming support with range request capabilities
// - Integration with MongoDB's query engine and indexing
// - Files and chunks stored as regular documents covered by MongoDB durability guarantees
// - Advanced search and analytics capabilities
// - Automatic replication and distributed storage
// - File versioning and content management features
// - High-performance concurrent access and streaming
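
To show how the streaming helper above could be wired into a delivery path, here is a minimal HTTP endpoint sketch. It assumes Express and an already-connected MongoGridFSManager instance; the route path, createStreamingRouter name, and range parsing are illustrative, and the `end + 1` accounts for the driver's exclusive end offset.

// Sketch: serving GridFS content over HTTP with Range support (assumes
// Express and a connected MongoGridFSManager instance from above)
const express = require('express');
const { ObjectId } = require('mongodb');

function createStreamingRouter(gridFSManager) {
  const router = express.Router();

  router.get('/api/files/stream/:id', async (req, res) => {
    try {
      const fileId = new ObjectId(req.params.id);
      const fileInfo = await gridFSManager.getFileInfo(fileId);
      if (!fileInfo) return res.status(404).end();

      // Parse an optional "Range: bytes=start-end" header (end is inclusive)
      let start = 0;
      let end = fileInfo.length - 1;
      if (req.headers.range) {
        const [rawStart, rawEnd] = req.headers.range.replace(/bytes=/, '').split('-');
        start = parseInt(rawStart, 10) || 0;
        if (rawEnd) end = parseInt(rawEnd, 10);
        res.status(206).set('Content-Range', `bytes ${start}-${end}/${fileInfo.length}`);
      }

      res.set({
        'Accept-Ranges': 'bytes',
        'Content-Length': end - start + 1,
        'Content-Type': (fileInfo.metadata && fileInfo.metadata.mimeType) || 'application/octet-stream'
      });

      // streamFile passes `end` straight through to the driver, which treats
      // it as exclusive, so convert the inclusive HTTP range end with +1
      const stream = await gridFSManager.streamFile(fileId, { start, end: end + 1 });
      stream.pipe(res);
    } catch (error) {
      res.status(500).json({ error: error.message });
    }
  });

  return router;
}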

SQL-Style File Operations with QueryLeaf

QueryLeaf provides SQL-familiar syntax for MongoDB GridFS file management and operations:

-- QueryLeaf GridFS file management with SQL-familiar syntax

-- Upload file with comprehensive metadata
INSERT INTO FILES (filename, file_data, metadata) VALUES (
  'corporate-training.mp4',
  UPLOAD_FILE('/path/to/video.mp4'),
  JSON_OBJECT(
    'description', 'Corporate training video on data security',
    'category', 'training',
    'tags', JSON_ARRAY('corporate', 'security', 'training'),
    'uploadedBy', '[email protected]',
    'isPublic', false,
    'contentMetadata', JSON_OBJECT(
      'duration', 3600,
      'resolution', '1920x1080',
      'codec', 'h264',
      'bitrate', '2000kbps'
    ),
    'processingStatus', 'uploaded',
    'retentionPolicy', 'business-7years',
    'complianceFlags', JSON_ARRAY('gdpr', 'sox')
  )
);

-- Search and query files with comprehensive criteria
SELECT 
  file_id,
  filename,
  file_size,
  ROUND(file_size / 1024.0 / 1024.0, 2) as file_size_mb,
  upload_date,

  -- Extract metadata fields
  JSON_EXTRACT(metadata, '$.description') as description,
  JSON_EXTRACT(metadata, '$.category') as category,
  JSON_EXTRACT(metadata, '$.tags') as tags,
  JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
  JSON_EXTRACT(metadata, '$.contentMetadata.duration') as duration_seconds,
  JSON_EXTRACT(metadata, '$.processingStatus') as processing_status,

  -- Access metrics
  JSON_EXTRACT(metadata, '$.accessCount') as access_count,
  JSON_EXTRACT(metadata, '$.lastAccessedAt') as last_accessed,
  JSON_EXTRACT(metadata, '$.totalBytesTransferred') as total_bytes_transferred,

  -- File integrity
  md5_hash,
  chunk_count,

  -- Computed fields
  CASE 
    WHEN file_size > 1024*1024*1024 THEN 'Large (>1GB)'
    WHEN file_size > 100*1024*1024 THEN 'Medium (>100MB)'
    ELSE 'Small (<100MB)'
  END as size_category,

  DATEDIFF(CURRENT_DATE(), upload_date) as days_since_upload

FROM GRIDFS_FILES()
WHERE 
  -- File type filtering
  JSON_EXTRACT(metadata, '$.mimeType') LIKE 'video/%'

  -- Category and tag filtering
  AND JSON_EXTRACT(metadata, '$.category') = 'training'
  AND JSON_CONTAINS(JSON_EXTRACT(metadata, '$.tags'), '"corporate"')

  -- Size filtering
  AND file_size > 100 * 1024 * 1024  -- Files larger than 100MB

  -- Date range filtering
  AND upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)

  -- Access pattern filtering
  AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 5

  -- Processing status filtering
  AND JSON_EXTRACT(metadata, '$.processingStatus') = 'ready'

ORDER BY 
  CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) DESC,
  upload_date DESC
LIMIT 50;

-- File analytics and usage patterns
WITH file_analytics AS (
  SELECT 
    DATE_FORMAT(upload_date, '%Y-%m') as upload_month,
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.mimeType') as mime_type,

    -- File metrics
    COUNT(*) as file_count,
    SUM(file_size) as total_size_bytes,
    AVG(file_size) as avg_file_size,
    MIN(file_size) as min_file_size,
    MAX(file_size) as max_file_size,

    -- Access metrics
    SUM(CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED)) as total_access_count,
    AVG(CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED)) as avg_access_count,
    SUM(CAST(JSON_EXTRACT(metadata, '$.totalBytesTransferred') AS UNSIGNED)) as total_bytes_transferred,

    -- Performance metrics
    AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) as avg_upload_time_ms

  FROM GRIDFS_FILES()
  WHERE upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
  GROUP BY 
    DATE_FORMAT(upload_date, '%Y-%m'),
    JSON_EXTRACT(metadata, '$.category'),
    JSON_EXTRACT(metadata, '$.mimeType')
),

category_summary AS (
  SELECT 
    category,

    -- Volume metrics
    SUM(file_count) as total_files,
    SUM(total_size_bytes) as category_total_size,
    ROUND(SUM(total_size_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as category_total_gb,

    -- Access patterns
    SUM(total_access_count) as category_total_accesses,
    ROUND(AVG(avg_access_count), 2) as category_avg_access_per_file,

    -- Performance indicators
    ROUND(AVG(avg_upload_time_ms), 2) as category_avg_upload_time,

    -- Growth trends
    COUNT(DISTINCT upload_month) as active_months,

    -- Storage efficiency
    ROUND(AVG(avg_file_size) / 1024.0 / 1024.0, 2) as avg_file_size_mb,
    ROUND(SUM(total_bytes_transferred) / 1024.0 / 1024.0 / 1024.0, 2) as total_transfer_gb

  FROM file_analytics
  GROUP BY category
)

SELECT 
  category,
  total_files,
  category_total_gb,
  category_avg_access_per_file,
  avg_file_size_mb,
  total_transfer_gb,

  -- Storage cost estimation (example rates)
  ROUND(category_total_gb * 0.023, 2) as estimated_monthly_storage_cost_usd,
  ROUND(total_transfer_gb * 0.09, 2) as estimated_transfer_cost_usd,

  -- Performance assessment
  CASE 
    WHEN category_avg_upload_time < 1000 THEN 'Excellent'
    WHEN category_avg_upload_time < 5000 THEN 'Good'
    WHEN category_avg_upload_time < 15000 THEN 'Fair'
    ELSE 'Needs Optimization'
  END as upload_performance,

  -- Usage classification
  CASE 
    WHEN category_avg_access_per_file > 100 THEN 'High Usage'
    WHEN category_avg_access_per_file > 20 THEN 'Medium Usage'
    WHEN category_avg_access_per_file > 5 THEN 'Low Usage'
    ELSE 'Archived/Inactive'
  END as usage_pattern

FROM category_summary
ORDER BY category_total_gb DESC, category_avg_access_per_file DESC;

-- File streaming and download operations
SELECT 
  file_id,
  filename,
  file_size,

  -- Create streaming URLs with range support
  CONCAT('/api/files/stream/', file_id) as stream_url,
  CONCAT('/api/files/stream/', file_id, '?range=0-1048576') as preview_stream_url,
  CONCAT('/api/files/download/', file_id) as download_url,

  -- Content delivery optimization
  CASE 
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'video/%' THEN 'streaming'
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'audio/%' THEN 'streaming'
    WHEN JSON_EXTRACT(metadata, '$.mimeType') LIKE 'image/%' THEN 'direct'
    ELSE 'download'
  END as recommended_delivery_method,

  -- CDN configuration suggestions
  CASE 
    WHEN CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 1000 THEN 'edge-cache'
    WHEN CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) > 100 THEN 'regional-cache'
    ELSE 'origin-only'
  END as cdn_strategy,

  -- Access control
  JSON_EXTRACT(metadata, '$.isPublic') as is_public,
  JSON_EXTRACT(metadata, '$.accessPermissions') as access_permissions

FROM GRIDFS_FILES()
WHERE JSON_EXTRACT(metadata, '$.processingStatus') = 'ready'
ORDER BY CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) DESC;

-- File maintenance and cleanup operations
WITH file_maintenance AS (
  SELECT 
    file_id,
    filename,
    file_size,
    upload_date,

    -- Metadata analysis
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.retentionPolicy') as retention_policy,
    JSON_EXTRACT(metadata, '$.lastAccessedAt') as last_accessed,
    CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) as access_count,

    -- Age calculations
    DATEDIFF(CURRENT_DATE(), upload_date) as days_since_upload,
    DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) as days_since_access,

    -- Maintenance flags
    CASE 
      WHEN JSON_EXTRACT(metadata, '$.retentionPolicy') = 'business-7years' AND 
           DATEDIFF(CURRENT_DATE(), upload_date) > 2555 THEN 'DELETE'
      WHEN JSON_EXTRACT(metadata, '$.retentionPolicy') = 'business-3years' AND 
           DATEDIFF(CURRENT_DATE(), upload_date) > 1095 THEN 'DELETE'
      WHEN DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) > 365
           AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) = 0 THEN 'ARCHIVE'
      WHEN DATEDIFF(CURRENT_DATE(), STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d')) > 180
           AND CAST(JSON_EXTRACT(metadata, '$.accessCount') AS UNSIGNED) < 5 THEN 'COLD_STORAGE'
      ELSE 'ACTIVE'
    END as maintenance_action

  FROM GRIDFS_FILES()
)

SELECT 
  maintenance_action,
  COUNT(*) as file_count,
  ROUND(SUM(file_size) / 1024.0 / 1024.0 / 1024.0, 2) as total_size_gb,

  -- Cost impact analysis
  ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.023, 2) as current_monthly_cost_usd,

  -- Storage class optimization
  CASE maintenance_action
    WHEN 'COLD_STORAGE' THEN ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.004, 2)
    WHEN 'ARCHIVE' THEN ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.001, 2)
    WHEN 'DELETE' THEN 0
    ELSE ROUND((SUM(file_size) / 1024.0 / 1024.0 / 1024.0) * 0.023, 2)
  END as optimized_monthly_cost_usd,

  -- Sample files for review
  GROUP_CONCAT(
    CONCAT(filename, ' (', ROUND(file_size/1024/1024, 1), 'MB)')
    ORDER BY file_size DESC
    SEPARATOR '; '
  ) as sample_files

FROM file_maintenance
GROUP BY maintenance_action
ORDER BY total_size_gb DESC;

-- Real-time file system monitoring
CREATE VIEW file_system_health AS
SELECT 
  -- Current system status
  COUNT(*) as total_files,
  ROUND(SUM(file_size) / 1024.0 / 1024.0 / 1024.0, 2) as total_storage_gb,
  COUNT(CASE WHEN upload_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR) THEN 1 END) as files_uploaded_24h,
  COUNT(CASE WHEN STR_TO_DATE(JSON_EXTRACT(metadata, '$.lastAccessedAt'), '%Y-%m-%d %H:%i:%s') >= DATE_SUB(NOW(), INTERVAL 1 HOUR) THEN 1 END) as files_accessed_1h,

  -- Performance indicators
  AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) as avg_upload_time_ms,
  COUNT(CASE WHEN JSON_EXTRACT(metadata, '$.processingStatus') = 'error' THEN 1 END) as files_with_errors,
  COUNT(CASE WHEN chunk_count != CEIL(file_size / 1048576.0) THEN 1 END) as files_with_integrity_issues,

  -- Storage distribution
  COUNT(CASE WHEN file_size > 1024*1024*1024 THEN 1 END) as large_files_1gb_plus,
  COUNT(CASE WHEN file_size BETWEEN 100*1024*1024 AND 1024*1024*1024 THEN 1 END) as medium_files_100mb_1gb,
  COUNT(CASE WHEN file_size < 100*1024*1024 THEN 1 END) as small_files_under_100mb,

  -- Health assessment
  CASE 
    WHEN COUNT(CASE WHEN JSON_EXTRACT(metadata, '$.processingStatus') = 'error' THEN 1 END) > 
         COUNT(*) * 0.05 THEN 'Critical - High Error Rate'
    WHEN AVG(CAST(JSON_EXTRACT(metadata, '$.uploadDuration') AS UNSIGNED)) > 30000 THEN 'Warning - Slow Uploads'
    WHEN COUNT(CASE WHEN chunk_count != CEIL(file_size / 1048576.0) THEN 1 END) > 0 THEN 'Warning - Integrity Issues'
    ELSE 'Healthy'
  END as system_health_status,

  NOW() as report_timestamp

FROM GRIDFS_FILES()
WHERE upload_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- QueryLeaf GridFS provides:
-- 1. SQL-familiar file upload and management operations
-- 2. Comprehensive file search and filtering capabilities
-- 3. Advanced analytics and usage pattern analysis
-- 4. Intelligent file lifecycle management and cleanup
-- 5. Real-time system health monitoring and alerting
-- 6. Cost optimization and storage class recommendations
-- 7. Integration with MongoDB's GridFS streaming capabilities
-- 8. Metadata-driven content management and organization
-- 9. Performance monitoring and optimization insights
-- 10. Enterprise-grade file operations backed by MongoDB durability guarantees

Best Practices for MongoDB GridFS

File Storage Strategy

Optimal GridFS configuration varies by application type (a configuration sketch follows this list):

  1. Media Streaming Applications: Large chunk sizes for optimal streaming performance
  2. Document Management Systems: Metadata-rich storage with comprehensive indexing
  3. Content Distribution Networks: Integration with CDN and caching strategies
  4. Backup and Archival Systems: Compression and long-term storage optimization
  5. Real-time Applications: Fast upload/download with minimal latency
  6. Multi-tenant Systems: Secure isolation and access control patterns
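
A minimal configuration sketch is below; the chunk sizes are illustrative starting points rather than universal recommendations (the driver's default is 255 KB), and the bucket names are assumptions.

// Sketch: per-workload GridFS bucket configuration (chunk sizes and bucket
// names are illustrative; the driver default chunk size is 255 KB)
const { GridFSBucket } = require('mongodb');

function createWorkloadBuckets(db) {
  return {
    // Media streaming: larger chunks reduce per-chunk round trips for
    // long sequential reads of video/audio content
    media: new GridFSBucket(db, {
      bucketName: 'media',
      chunkSizeBytes: 4 * 1024 * 1024
    }),

    // Document management: smaller chunks keep partial reads of office
    // documents and PDFs cheap while metadata carries the search burden
    documents: new GridFSBucket(db, {
      bucketName: 'documents',
      chunkSizeBytes: 255 * 1024
    }),

    // Archival/backup: isolated in a dedicated bucket so lifecycle and
    // retention policies can target it independently
    archive: new GridFSBucket(db, {
      bucketName: 'archive',
      chunkSizeBytes: 1024 * 1024
    })
  };
}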

Performance Optimization Guidelines

Essential considerations for production GridFS deployments (an indexing and connection-pool sketch follows this list):

  1. Chunk Size Optimization: Balance between storage efficiency and streaming performance
  2. Index Strategy: Create appropriate indexes on metadata fields for fast queries
  3. Replication Configuration: Optimize replica set configuration for file operations
  4. Connection Pooling: Configure connection pools for concurrent file operations
  5. Monitoring Integration: Implement comprehensive file operation monitoring
  6. Storage Management: Plan for growth and implement lifecycle management
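
The sketch below covers the indexing and connection-pool points from this list; the collection and metadata field names follow the metadata document used earlier in this post, and the pool sizes are assumptions to tune per workload.

// Sketch: supporting indexes and connection pooling for a production GridFS
// deployment (pool sizes are illustrative; tune them per workload)
const { MongoClient } = require('mongodb');

async function prepareGridFSDeployment(uri, dbName, bucketName = 'uploads') {
  const client = new MongoClient(uri, {
    maxPoolSize: 50,    // room for concurrent upload/download streams
    minPoolSize: 5,
    maxIdleTimeMS: 60000
  });
  await client.connect();

  const files = client.db(dbName).collection(`${bucketName}.files`);

  // The driver creates the core GridFS indexes (files_id + n on chunks,
  // filename + uploadDate on files) on first write; these extra indexes
  // back the metadata-driven search patterns shown earlier
  await files.createIndex({ 'metadata.category': 1, uploadDate: -1 });
  await files.createIndex({ 'metadata.tags': 1 });
  await files.createIndex({ 'metadata.uploadedBy': 1, uploadDate: -1 });

  return client;
}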

Conclusion

MongoDB GridFS provides large file storage and management that integrates directly with MongoDB's document database features, handling files far beyond the 16 MB document limit with streaming support and rich, queryable metadata. By applying sound file management patterns, streaming optimization, and usage analytics, applications can meet complex file storage requirements while maintaining high performance and operational efficiency.

Key GridFS benefits include:

  • Large File Storage: Files far beyond the 16 MB document limit, with automatic chunking and distribution
  • Seamless Integration: Native integration with MongoDB queries, indexes, and replication
  • Intelligent Streaming: High-performance streaming with range request support
  • Comprehensive Metadata: Flexible, searchable metadata with rich query capabilities
  • High Availability: Automatic replication and distributed storage across replica sets
  • Advanced Analytics: Built-in analytics and reporting for file usage and performance

Whether you're building media streaming platforms, document management systems, content distribution networks, or file-intensive applications, MongoDB GridFS with QueryLeaf's familiar file operation interface provides the foundation for scalable, efficient large file management. This combination enables you to leverage advanced file storage capabilities while maintaining familiar database administration patterns and SQL-style file operations.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar file operations into optimal MongoDB GridFS commands while providing comprehensive file management and analytics through SQL-style queries. Advanced file storage patterns, streaming optimization, and lifecycle management are seamlessly handled through familiar database administration interfaces, making sophisticated file storage both powerful and accessible.

The integration of intelligent file storage with SQL-style file operations makes MongoDB an ideal platform for applications requiring both scalable file management and familiar database administration patterns, ensuring your files remain both accessible and efficiently managed as they scale to meet demanding production requirements.